Filtering GFF3 file
mictadlo ▴ 10
@mictadlo-10885
Last seen 4.2 years ago

Hi, I have the GFF3 file below.

NbV1Ch08        AUGUSTUS        gene    60876   63944   0.03    +       .       ID=g2
NbV1Ch08        AUGUSTUS        mRNA    60876   63944   0.03    +       .       ID=g2.t1;Note=B3 domain-containing protein Os03g0120900;Parent=g2
NbV1Ch08        AUGUSTUS        transcription_start_site        60876   60876   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  60876   61072   0.19    +       .       ID=g2.t1.5UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    60876   61072   .       +       .       ID=g2.t1.exon1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  61673   61732   0.37    +       .       ID=g2.t1.5UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    61673   63449   .       +       .       ID=g2.t1.exon2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        start_codon     61733   61735   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        CDS     61733   62974   0.54    +       0       ID=g2.t1.CDS1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        stop_codon      62972   62974   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 62975   63449   1       +       .       ID=g2.t1.3UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 63565   63944   0.27    +       .       ID=g2.t1.3UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    63565   63944   .       +       .       ID=g2.t1.exon3;Parent=g2.t1
NbV1Ch08        AUGUSTUS        transcription_end_site  63944   63944   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        gene    64722   65524   0.32    -       .       ID=g3
NbV1Ch08        AUGUSTUS        mRNA    64722   65524   0.32    -       .       ID=g3.t1;Parent=g3
NbV1Ch08        AUGUSTUS        transcription_end_site  64722   64722   .       -       .       Parent=g3.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 64722   64792   0.77    -       .       ID=g3.t1.3UTR1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        exon    64722   65524   .       -       .       ID=g3.t1.exon1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        stop_codon      64793   64795   .       -       0       Parent=g3.t1
NbV1Ch08        AUGUSTUS        CDS     64793   65494   0.44    -       0       ID=g3.t1.CDS1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        start_codon     65492   65494   .       -       0       Parent=g3.t1

I would like to keep the following features, because their mRNA contains a Note attribute, and write them to a keep.gff3 file:

NbV1Ch08        AUGUSTUS        gene    60876   63944   0.03    +       .       ID=g2
NbV1Ch08        AUGUSTUS        mRNA    60876   63944   0.03    +       .       ID=g2.t1;Note=B3 domain-containing protein Os03g0120900;Parent=g2
NbV1Ch08        AUGUSTUS        transcription_start_site        60876   60876   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  60876   61072   0.19    +       .       ID=g2.t1.5UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    60876   61072   .       +       .       ID=g2.t1.exon1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  61673   61732   0.37    +       .       ID=g2.t1.5UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    61673   63449   .       +       .       ID=g2.t1.exon2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        start_codon     61733   61735   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        CDS     61733   62974   0.54    +       0       ID=g2.t1.CDS1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        stop_codon      62972   62974   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 62975   63449   1       +       .       ID=g2.t1.3UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 63565   63944   0.27    +       .       ID=g2.t1.3UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    63565   63944   .       +       .       ID=g2.t1.exon3;Parent=g2.t1
NbV1Ch08        AUGUSTUS        transcription_end_site  63944   63944   .       +       .       Parent=g2.t1

On the other hand, I would like to reject the following features, because their mRNA does not contain a Note attribute, and write them to a reject.gff3 file:

NbV1Ch08        AUGUSTUS        gene    64722   65524   0.32    -       .       ID=g3
NbV1Ch08        AUGUSTUS        mRNA    64722   65524   0.32    -       .       ID=g3.t1;Parent=g3
NbV1Ch08        AUGUSTUS        transcription_end_site  64722   64722   .       -       .       Parent=g3.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 64722   64792   0.77    -       .       ID=g3.t1.3UTR1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        exon    64722   65524   .       -       .       ID=g3.t1.exon1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        stop_codon      64793   64795   .       -       0       Parent=g3.t1
NbV1Ch08        AUGUSTUS        CDS     64793   65494   0.44    -       0       ID=g3.t1.CDS1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        start_codon     65492   65494   .       -       0       Parent=g3.t1

Is there an existing tool that can already do this?

Thank you in advance

annotation gff3 gff genome • 1.9k views
Malcolm Cook ★ 1.6k
@malcolm-cook-6293
Last seen 1 day ago
United States

Three passes should do it easily.

First: Use perl to find lines matching "Note=" and to build new regular expressions based on the value of their ID.

Then: Use those regular expressions to grep for lines containing (or not) the new regular expressions.

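# Patterns such as (ID|Parent)=g2\b also match the bare gene line (ID=g2) and children that only carry Parent=g2.t1
perl -n -e 'print "(ID|Parent)=$1\\b\n" if m/ID=(\w+).*;Note=/' t.gff > IDmatch.txt
grep -E -f IDmatch.txt t.gff > keep.gff
grep -E -v -f IDmatch.txt t.gff > reject.gff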

You could do it in two passes using perl but this is pretty simple and quick and done.

This approach depends on the apparent naming convention of your ID attributes, in which every transcript and sub-feature ID starts with its gene ID (g2, g2.t1, g2.t1.exon1).
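
For files whose IDs do not follow that naming convention, a two-pass variant that follows the Parent attributes instead could look like the sketch below. It is only an illustration under stated assumptions: the function name split_gff3_by_note is hypothetical, the file is tab-delimited, the hierarchy is exactly gene, mRNA, sub-feature, and each feature has a single Parent. Comment and directive lines are dropped from both outputs.

```shell
# Hypothetical helper (not from any existing tool):
#   split_gff3_by_note INPUT KEEP REJECT
# Assumes tab-delimited GFF3, a gene -> mRNA -> sub-feature hierarchy,
# and at most one Parent per feature.
split_gff3_by_note() {
  : > "$2"; : > "$3"                   # make sure both outputs exist
  awk -F'\t' -v keep="$2" -v reject="$3" '
    # Return the value of attribute "key" in attribute string s, or "".
    function attr(s, key,    n, i, parts) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], key "=") == 1)
          return substr(parts[i], length(key) + 2)
      return ""
    }
    /^#/ || NF < 9 { next }            # skip comments, directives, blank lines
    NR == FNR {                        # pass 1: find mRNAs that carry a Note
      if ($3 == "mRNA" && attr($9, "Note") != "") {
        id = attr($9, "ID"); parent = attr($9, "Parent")
        if (id != "")     tx_ok[id] = 1        # the annotated mRNA
        if (parent != "") gene_ok[parent] = 1  # its parent gene
      }
      next
    }
    {                                  # pass 2: route every feature line
      id = attr($9, "ID"); parent = attr($9, "Parent")
      if ((id in gene_ok) || (id in tx_ok) || (parent in tx_ok))
        print > keep
      else
        print > reject
    }
  ' "$1" "$1"                          # same file twice = two passes
}
```

It would be called as split_gff3_by_note t.gff keep.gff reject.gff. Because the decision is made per mRNA, an unannotated sibling transcript of a kept gene ends up in the reject file together with its sub-features.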

@michael-lawrence-3846
Last seen 2.4 years ago
United States

Assuming you want to do this with Bioconductor, you could do something like:

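library(rtracklayer)
gr <- import("t.gff3")
# TRUE for features (here, the mRNAs) that carry a Note attribute
haveNotes <- !is.na(drop(gr$Note))
parent <- drop(gr$Parent)
# Keep the annotated mRNAs, their parent genes, and their child features
keep <- haveNotes | gr$ID %in% parent[haveNotes] | parent %in% gr$ID[haveNotes]
export(gr[keep], "keep.gff3")
export(gr[!keep], "reject.gff3")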