Question

Does BSgenome.Dmelanogaster.UCSC.dm2 maintain non-coding RNAs?

Patrick Schorderet ▴ 20

@patrick-schorderet-6081

Last seen 10.7 years ago

United States

Does anyone know whether BSgenome.Dmelanogaster.UCSC.dm2 contains non-coding RNAs? If not, does any other Drosophila BSgenome package contain them, or perhaps TxDb.Dmelanogaster.UCSC.dm3.ensGene?

Thanks

dm2 bsgenome noncoding RNA • 2.8k views

ADD COMMENT • link updated 10.8 years ago by Hervé Pagès 16k • written 10.8 years ago by Patrick Schorderet ▴ 20

The BSgenome.Dmelanogaster.UCSC.dm2 doesn't contain any RNAs. It contains the genomic sequence for that species. There are ways to get non-coding RNAs, but you will first need to tell us exactly what you want.

In other words, 'non-coding RNAs' encompasses a lot of different things. In addition, there are several things you could be interested in (genomic sequence, RNA sequence, genomic location, etc).
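To illustrate the point that a BSgenome package holds only genomic sequence, here is a minimal sketch. It assumes BSgenome.Dmelanogaster.UCSC.dm2 is installed; the chromosome name chr2L and the 1-50 window are arbitrary choices made for illustration, not something taken from the question.

```r
library(BSgenome.Dmelanogaster.UCSC.dm2)

genome <- BSgenome.Dmelanogaster.UCSC.dm2

# The object lists chromosome sequences only; it carries no gene or
# transcript annotation.
seqnames(genome)

# Extract the first 50 bases of chr2L (an arbitrary window for illustration).
first50 <- getSeq(genome, "chr2L", start=1, end=50)
first50
```

Annotation such as transcript locations and types has to come from a separate resource (for example a TxDb object), which can then be used to pull sequence out of the matching genome build.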

ADD REPLY • link 10.8 years ago James W. MacDonald 68k

Yes, sorry, you are right. Here is what I do: I count RNA-seq reads using the TxDb.Dmelanogaster.UCSC.dm3.ensGene database to identify differentially expressed genes (DEGs). I would also like to check whether ncRNAs (specifically lincRNAs) are up- or down-regulated.

I hope this makes more sense. Thanks.

ADD REPLY • link 10.8 years ago Patrick Schorderet ▴ 20

score 1 · Answer 1 · 2015-04-08

Hi Patrick,

lincRNAs for Fly are annotated at Ensembl:

library(GenomicFeatures)
txdb <- makeTxDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
tx <- transcripts(txdb, columns=c("tx_name", "gene_id", "tx_type"))
table(mcols(tx)$tx_type)
#  lincRNA       miRNA   pre_miRNA  protein_coding  pseudogene 
#     2776         304         238           30353         289 
#     rRNA      snoRNA       snRNA            tRNA 
#      147         288          31             314

For example, to extract lincRNA FBtr0345927:

tx[mcols(tx)$tx_name %in% "FBtr0345927"]
# GRanges object with 1 range and 3 metadata columns:
#       seqnames               ranges strand |     tx_name         gene_id
#          <Rle>            <IRanges>  <Rle> | <character> <CharacterList>
#   [1]        X [22514453, 22514891]      - | FBtr0345927     FBgn0264677
#           tx_type
#       <character>
#   [1]     lincRNA
#   -------
#   seqinfo: 1870 sequences from an unspecified genome

To extract the other lincRNAs linked to the same "gene" as FBtr0345927:

tx[as.logical(mcols(tx)$gene_id %in% "FBgn0264677")]
# GRanges object with 2 ranges and 3 metadata columns:
#       seqnames               ranges strand |     tx_name         gene_id
#          <Rle>            <IRanges>  <Rle> | <character> <CharacterList>
#   [1]        X [22514453, 22514891]      - | FBtr0345927     FBgn0264677
#   [2]        X [22514522, 22514891]      - | FBtr0333773     FBgn0264677
#           tx_type
#       <character>
#   [1]     lincRNA
#   [2]     lincRNA
#   -------
#   seqinfo: 1870 sequences from an unspecified genome

Note that tx_type is a new column in BioC 3.1 (our upcoming release, based on R 3.2, and scheduled for April 17) so make sure you use that version of BioC (just install R 3.2 and proceed as usual).
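Since the goal is to count reads on lincRNAs, the tx_type column can also be used to pull out all lincRNA transcripts at once together with their exons. The following is only a sketch under a few assumptions: it needs BioC 3.1 or later and a working connection to the Ensembl BioMart, and the "lincRNA" label is the one shown in the table above (other Ensembl releases may use a different label, so check table(mcols(tx)$tx_type) first). The variable names lincs and linc_exons are just illustrative.

```r
library(GenomicFeatures)

txdb <- makeTxDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id", "tx_type"))

# Keep only the transcripts annotated as lincRNA.
lincs <- tx[mcols(tx)$tx_type %in% "lincRNA"]

# Exons grouped by transcript; the list names are the internal transcript ids,
# so the lincRNA transcripts are selected via their tx_id.
ex_by_tx <- exonsBy(txdb, by="tx")
linc_exons <- ex_by_tx[as.character(mcols(lincs)$tx_id)]
```

linc_exons is a GRangesList with one element per lincRNA transcript, which could then be passed as the features to a read counter such as summarizeOverlaps() from the GenomicAlignments package.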

H.

score 0 · Answer 2 · 2015-04-08

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 22 hours ago

United States

That might be a tough one. A quick Google search indicates that people are working on lincRNAs for Drosophila, but I don't know whether there is a comprehensive source. I certainly don't see anything at UCSC. Maybe you can dig up a GFF or BED file somewhere and convert it to a GRangesList.
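A rough sketch of that conversion, assuming the annotation comes as a GTF file with gene_id and transcript_id attributes. The three records below are made up and written to a temporary file purely so that the example is self-contained; a real file would be imported the same way.

```r
library(rtracklayer)

# Hypothetical GTF records (two transcripts of one made-up gene).
gtf_lines <- c(
  'X\texample\texon\t100\t200\t.\t-\t.\tgene_id "geneA"; transcript_id "txA1";',
  'X\texample\texon\t300\t400\t.\t-\t.\tgene_id "geneA"; transcript_id "txA1";',
  'X\texample\texon\t150\t400\t.\t-\t.\tgene_id "geneA"; transcript_id "txA2";'
)
gtf_file <- tempfile(fileext=".gtf")
writeLines(gtf_lines, gtf_file)

# Import as a GRanges, keep the exon records, and group them by transcript.
gr <- import(gtf_file, format="gtf")
exons <- gr[gr$type == "exon"]
grl <- split(exons, exons$transcript_id)
grl
```

Splitting on gene_id instead of transcript_id would give one list element per gene, which may be the more convenient unit for gene-level counting.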

ADD COMMENT • link 10.8 years ago James W. MacDonald 68k

As an example, you could use this.

ADD REPLY • link 10.8 years ago James W. MacDonald 68k

score 0 · Answer 3 · 2015-04-09

Patrick Schorderet ▴ 20

@patrick-schorderet-6081

Last seen 10.7 years ago

United States

OK, great. Thanks for this info, I'll check it out.

Patrick

ADD COMMENT • link 10.8 years ago Patrick Schorderet ▴ 20

score 0 · Answer 4 · 2015-04-10

Hi Patrick,

With the next BioC release you can also use the ensembldb package to build EnsDb annotation packages (similar to TxDb packages, but tailored to annotations from Ensembl) for Drosophila, based on the GTF files provided by Ensembl. I'm also working on adding the EnsDb classes to AnnotationHub, which would make it much easier to generate such packages.

cheers, jo

score 0 · Answer 5 · 2015-04-10

Thanks Hervé and Johannes,

I just tried the old function (makeTranscriptDbFromBiomart) and something goes wrong (error message pasted below). I guess the easiest option is to wait for the new BioC release. Should this work with the update?

Thanks for the help

Patrick

txdb <- makeTranscriptDbFromBiomart(dataset="dmelanogaster_gene_ensembl")

Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error in .stopWithBioMartDataAnomalyReport(bm_result, idx[bad_idx2], id_prefix, :
BioMart data anomaly: in the following transcripts,
located on the minus strand, the start of some 3' UTRs
(3_utr_start) doesn't match the start of the exon
(exon_chrom_start).
(Showing only the first 6 out of 9 transcripts.)
1. Transcript FBtr0084081:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21368380 21369062 FBtr0084081-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084081-E3
4 -1 4 21376602 21376741 FBtr0084081-E4
5 -1 5 21375060 21375912 FBtr0084081-E5
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21368871 21369062 21368380 21368870 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 192 521
2 21377399 193 304 521
3 21377076 305 521 521
4 NA NA NA 521
5 NA NA NA 521
2. Transcript FBtr0084084:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21366004 21366338 FBtr0084084-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084081-E3
4 -1 4 21376602 21376741 FBtr0084081-E4
5 -1 5 21375060 21375912 FBtr0084081-E5
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21366294 21366338 21366004 21366293 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 45 374
2 21377399 46 157 374
3 21377076 158 374 374
4 NA NA NA 374
5 NA NA NA 374
3. Transcript FBtr0084085:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21361398 21361610 FBtr0084085-E2
2 -1 2 21361670 21362138 FBtr0084085-E4
3 -1 3 21377288 21377399 FBtr0084081-E1
4 -1 4 21376819 21377076 FBtr0084085-E3
5 -1 5 21376602 21376741 FBtr0084085-E5
6 -1 6 21375060 21375912 FBtr0084085-E6
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 NA NA 21361398 21361610 NA
2 21361825 21362138 21361670 21361824 NA
3 NA NA NA NA 21377288
4 21376819 21377035 NA NA 21377036
5 21376602 21376741 NA NA NA
6 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA NA NA 643
2 NA 1 314 643
3 21377399 315 426 643
4 21377076 427 643 643
5 NA NA NA 643
6 NA NA NA 643
4. Transcript FBtr0084082:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21367910 21368238 FBtr0084082-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084081-E3
4 -1 4 21376602 21376741 FBtr0084081-E4
5 -1 5 21375060 21375912 FBtr0084081-E5
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21368215 21368238 21367910 21368214 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 24 353
2 21377399 25 136 353
3 21377076 137 353 353
4 NA NA NA 353
5 NA NA NA 353
5. Transcript FBtr0084083:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21366450 21366744 FBtr0084083-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084085-E3
4 -1 4 21376602 21376741 FBtr0084085-E5
5 -1 5 21375060 21375912 FBtr0084085-E6
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21366710 21366744 21366450 21366709 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 35 364
2 21377399 36 147 364
3 21377076 148 364 364
4 NA NA NA 364
5 NA NA NA 364
6. Transcript FBtr0307759:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21367363 21367688 FBtr0307759-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0307759-E3
4 -1 4 21376602 21376741 FBtr0114359-E3
5 -1 5 21375060 21375912 FBtr0114359-E4
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21367679 21367688 21367363 21367678 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 10 339
2 21377399 11 122 339
3 21377076 123 339 339
4 NA NA NA 339
5 NA NA NA
In addition: Warning messages:
1: In assignProvIdsForSuperGroup(seqlevels, "") :
inaccurate integer conversion in coercion
2: In 3L * nb_ints : NAs produced by integer overflow
3: In matchCircularity(chromlengths$name, circ_seqs) :
None of the strings in your circ_seqs argument match your seqnames.

score 0 · Answer 6 · 2015-04-10

Hi Patrick,

Yes, the dmelanogaster_gene_ensembl dataset in the latest release of Ensembl (v79) contains some transcripts that are misrepresented (wrong exon ranking and strand, wrong UTRs). These are detected by the sanity checks that makeTranscriptDbFromBiomart() applies to the incoming data. See the thread "makeTranscriptDbFromBiomart failure from Data Anomaly" for the long version of this story.

Anyway, a patch was applied a couple of weeks ago to GenomicFeatures (1.18.5 in BioC release, and 1.19.35 in BioC devel) to address the issue. With the patch, makeTranscriptDbFromBiomart() (renamed makeTxDbFromBiomart() in BioC devel) drops these problematic transcripts with a warning instead of failing. So please make sure your packages are up to date (run biocLite() with no arguments for that).
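A quick way to check whether an installed GenomicFeatures already includes the patch is to compare its version against the two numbers quoted above. The helper below is hypothetical (has_biomart_patch is not part of any package) and only encodes those two thresholds, treating anything from 1.19.0 upward as the devel line.

```r
# Hypothetical helper: TRUE if version v is at least 1.18.5 (release line)
# or at least 1.19.35 (devel line and later).
has_biomart_patch <- function(v = packageVersion("GenomicFeatures")) {
  if (is.character(v)) v <- package_version(v)
  if (v >= "1.19.0") v >= "1.19.35" else v >= "1.18.5"
}

has_biomart_patch("1.18.4")
has_biomart_patch("1.18.5")
```

Called with no argument, it checks the version of GenomicFeatures that is currently installed.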

Thanks,

H.