I was wondering whether anyone knows if BSgenome.Dmelanogaster.UCSC.dm2 contains non-coding RNAs? Or does any other Drosophila BSgenome contain non-coding RNAs? Maybe TxDb.Dmelanogaster.UCSC.dm3.ensGene?
Thanks
Hi Patrick,
lincRNAs for Fly are annotated at Ensembl:
library(GenomicFeatures)
txdb <- makeTxDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
tx <- transcripts(txdb, columns=c("tx_name", "gene_id", "tx_type"))
table(mcols(tx)$tx_type)
#        lincRNA          miRNA      pre_miRNA protein_coding     pseudogene
#           2776            304            238          30353            289
#           rRNA         snoRNA          snRNA           tRNA
#            147            288             31            314
For example, to extract lincRNA FBtr0345927:
tx[mcols(tx)$tx_name %in% "FBtr0345927"]
# GRanges object with 1 range and 3 metadata columns:
#       seqnames               ranges strand |     tx_name         gene_id
#          <Rle>            <IRanges>  <Rle> | <character> <CharacterList>
#   [1]        X [22514453, 22514891]      - | FBtr0345927     FBgn0264677
#           tx_type
#       <character>
#   [1]     lincRNA
#   -------
#   seqinfo: 1870 sequences from an unspecified genome
To extract the other lincRNAs linked to the same "gene" as FBtr0345927:
tx[as.logical(mcols(tx)$gene_id %in% "FBgn0264677")]
# GRanges object with 2 ranges and 3 metadata columns:
#       seqnames               ranges strand |     tx_name         gene_id
#          <Rle>            <IRanges>  <Rle> | <character> <CharacterList>
#   [1]        X [22514453, 22514891]      - | FBtr0345927     FBgn0264677
#   [2]        X [22514522, 22514891]      - | FBtr0333773     FBgn0264677
#           tx_type
#       <character>
#   [1]     lincRNA
#   [2]     lincRNA
#   -------
#   seqinfo: 1870 sequences from an unspecified genome
Note that tx_type is a new column in BioC 3.1 (our upcoming release, based on R 3.2, and scheduled for April 17) so make sure you use that version of BioC (just install R 3.2 and proceed as usual).
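For the differential-expression use case raised in this thread, a minimal sketch of pulling every lincRNA transcript (rather than a single one) together with its exons might look like the following. The helper name lincrna_exons_by_tx is hypothetical and not part of GenomicFeatures. It assumes a TxDb built as above, with the tx_type column populated and with unique transcript names.

```r
library(GenomicFeatures)

# Hypothetical helper (not part of GenomicFeatures): returns the exons of all
# lincRNA transcripts, grouped by transcript, from a TxDb that has a tx_type
# column (BioC >= 3.1 according to the note above).
lincrna_exons_by_tx <- function(txdb) {
  tx <- transcripts(txdb, columns = c("tx_name", "tx_type"))
  # Names of the transcripts annotated as lincRNA (e.g. FBtr0345927).
  linc_names <- mcols(tx)$tx_name[mcols(tx)$tx_type %in% "lincRNA"]
  # Exons grouped by transcript; the list is named by transcript name.
  ex_by_tx <- exonsBy(txdb, by = "tx", use.names = TRUE)
  ex_by_tx[names(ex_by_tx) %in% linc_names]
}
```

Calling lincrna_exons_by_tx(txdb) gives a GRangesList that could be passed as the features argument of summarizeOverlaps() from GenomicAlignments. Two things are worth checking before counting: Ensembl names the chromosome X while UCSC-style alignments use chrX (seqlevelsStyle() can convert between them), and the Ensembl annotation may be on a different assembly than dm3, in which case the coordinates will not match the aligned reads.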
H.
Hey Hervé,
Is makeTxDbFromBiomart a function that only works on R 3.2? I tried it and it gives me an error :-(
txdb <- makeTxDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
Error: could not find function "makeTxDbFromBiomart"
and my sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.2 (Yosemite)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods
[9] base
other attached packages:
[1] GenomicFeatures_1.18.7 AnnotationDbi_1.28.2 Biobase_2.26.0
[4] GenomicRanges_1.18.4 GenomeInfoDb_1.2.5 IRanges_2.0.1
[7] S4Vectors_0.4.0 BiocGenerics_0.12.1 BiocInstaller_1.16.2
loaded via a namespace (and not attached):
[1] base64enc_0.1-2 BatchJobs_1.6 BBmisc_1.9
[4] BiocParallel_1.0.3 biomaRt_2.22.0 Biostrings_2.34.1
[7] bitops_1.0-6 brew_1.0-6 checkmate_1.5.2
[10] codetools_0.2-11 DBI_0.3.1 digest_0.6.8
[13] fail_1.2 foreach_1.4.2 GenomicAlignments_1.2.2
[16] iterators_1.0.7 RCurl_1.95-4.5 Rsamtools_1.18.3
[19] RSQLite_1.0.0 rtracklayer_1.26.3 sendmailR_1.2-1
[22] stringr_0.6.2 tools_3.1.3 XML_3.98-1.1
[25] XVector_0.6.0 zlibbioc_1.12.0
Thanks!
That might be a tough one. A quick Google search indicates that people are working on lincRNAs for Drosophila, but I don't know if there is a comprehensive source. Certainly I don't see anything on UCSC. Maybe you can dig up a GFF or BED file somewhere, and convert it to a GRangesList.
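If such a file turns up, the conversion could be sketched roughly as below. This is only an illustration: the helper name annotation_to_grl and the default grouping column transcript_id are assumptions (a GTF-style file with one transcript ID per feature), and the right column depends on the file. For a BED file the name column would be the natural choice.

```r
library(rtracklayer)  # also attaches GenomicRanges

# Hypothetical helper: import an annotation file and group its ranges into a
# GRangesList, using one metadata column as the grouping key.
annotation_to_grl <- function(path, group_col = "transcript_id") {
  gr <- import(path)  # GRanges; the format is guessed from the file extension
  if (!group_col %in% colnames(mcols(gr))) {
    stop("column '", group_col, "' not found in ", path)
  }
  keys <- mcols(gr)[[group_col]]
  keep <- !is.na(keys)  # drop features that have no grouping key
  split(gr[keep], as.character(keys[keep]))
}
```

For example, annotation_to_grl("lincRNAs.gtf") would return one list element per transcript ID, where "lincRNAs.gtf" is a placeholder file name.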
hi Patrick,
with the next Bioc release you can also use the ensembldb package to build EnsDb annotation packages (similar to the TxDb, just tailored for annotations from Ensembl) for Drosophila based on the GTF files provided by Ensembl. I'm also working on adding the EnsDb classes to AnnotationHub, which would make it much easier to generate such packages.
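As a rough sketch of that route, assuming a recent ensembldb in which TxBiotypeFilter() is available (filter class names have changed between ensembldb versions) and a GTF file downloaded from Ensembl, it might look like this. The helper name lincrna_from_gtf is hypothetical, and the biotype label "lincRNA" is simply the one used in this thread. It may differ in other Ensembl releases.

```r
library(ensembldb)

# Hypothetical helper: build an EnsDb SQLite database from an Ensembl GTF file
# and return the transcripts whose biotype is "lincRNA".
lincrna_from_gtf <- function(gtf_file) {
  # ensDbFromGtf() writes an SQLite file and returns its path. Organism,
  # genome version and Ensembl version are taken from the GTF file name when
  # it follows Ensembl's naming scheme; otherwise pass them explicitly.
  db_file <- ensDbFromGtf(gtf = gtf_file)
  edb <- EnsDb(db_file)
  transcripts(edb, filter = TxBiotypeFilter("lincRNA"))
}
```

The result is a GRanges of transcripts, comparable to what transcripts() returns for a TxDb.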
cheers, jo
Thanks Hervé and Johannes,
Just tried to use the old function (makeTranscriptDbFromBiomart) and it looks like something is going wrong (pasting the error message below). I guess the easiest will be to wait for the new BioC update. Should this work well with the update?
Thanks for the help
Patrick
txdb <- makeTranscriptDbFromBiomart(dataset="dmelanogaster_gene_ensembl")
Download and preprocess the 'transcripts' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Download and preprocess the 'splicings' data frame ... Error in .stopWithBioMartDataAnomalyReport(bm_result, idx[bad_idx2], id_prefix, :
BioMart data anomaly: in the following transcripts,
located on the minus strand, the start of some 3' UTRs
(3_utr_start) doesn't match the start of the exon
(exon_chrom_start).
(Showing only the first 6 out of 9 transcripts.)
1. Transcript FBtr0084081:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21368380 21369062 FBtr0084081-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084081-E3
4 -1 4 21376602 21376741 FBtr0084081-E4
5 -1 5 21375060 21375912 FBtr0084081-E5
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21368871 21369062 21368380 21368870 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 192 521
2 21377399 193 304 521
3 21377076 305 521 521
4 NA NA NA 521
5 NA NA NA 521
2. Transcript FBtr0084084:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21366004 21366338 FBtr0084084-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084081-E3
4 -1 4 21376602 21376741 FBtr0084081-E4
5 -1 5 21375060 21375912 FBtr0084081-E5
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21366294 21366338 21366004 21366293 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 45 374
2 21377399 46 157 374
3 21377076 158 374 374
4 NA NA NA 374
5 NA NA NA 374
3. Transcript FBtr0084085:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21361398 21361610 FBtr0084085-E2
2 -1 2 21361670 21362138 FBtr0084085-E4
3 -1 3 21377288 21377399 FBtr0084081-E1
4 -1 4 21376819 21377076 FBtr0084085-E3
5 -1 5 21376602 21376741 FBtr0084085-E5
6 -1 6 21375060 21375912 FBtr0084085-E6
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 NA NA 21361398 21361610 NA
2 21361825 21362138 21361670 21361824 NA
3 NA NA NA NA 21377288
4 21376819 21377035 NA NA 21377036
5 21376602 21376741 NA NA NA
6 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA NA NA 643
2 NA 1 314 643
3 21377399 315 426 643
4 21377076 427 643 643
5 NA NA NA 643
6 NA NA NA 643
4. Transcript FBtr0084082:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21367910 21368238 FBtr0084082-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084081-E3
4 -1 4 21376602 21376741 FBtr0084081-E4
5 -1 5 21375060 21375912 FBtr0084081-E5
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21368215 21368238 21367910 21368214 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 24 353
2 21377399 25 136 353
3 21377076 137 353 353
4 NA NA NA 353
5 NA NA NA 353
5. Transcript FBtr0084083:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21366450 21366744 FBtr0084083-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0084085-E3
4 -1 4 21376602 21376741 FBtr0084085-E5
5 -1 5 21375060 21375912 FBtr0084085-E6
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21366710 21366744 21366450 21366709 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 35 364
2 21377399 36 147 364
3 21377076 148 364 364
4 NA NA NA 364
5 NA NA NA 364
6. Transcript FBtr0307759:
strand rank exon_chrom_start exon_chrom_end ensembl_exon_id
1 -1 1 21367363 21367688 FBtr0307759-E2
2 -1 2 21377288 21377399 FBtr0084081-E1
3 -1 3 21376819 21377076 FBtr0307759-E3
4 -1 4 21376602 21376741 FBtr0114359-E3
5 -1 5 21375060 21375912 FBtr0114359-E4
genomic_coding_start genomic_coding_end 5_utr_start 5_utr_end 3_utr_start
1 21367679 21367688 21367363 21367678 NA
2 NA NA NA NA 21377288
3 21376819 21377035 NA NA 21377036
4 21376602 21376741 NA NA NA
5 21375060 21375912 NA NA NA
3_utr_end cds_start cds_end cds_length
1 NA 1 10 339
2 21377399 11 122 339
3 21377076 123 339 339
4 NA NA NA 339
5 NA NA NA
In addition: Warning messages:
1: In assignProvIdsForSuperGroup(seqlevels, "") :
inaccurate integer conversion in coercion
2: In 3L * nb_ints : NAs produced by integer overflow
3: In matchCircularity(chromlengths$name, circ_seqs) :
None of the strings in your circ_seqs argument match your seqnames.
Hi Patrick,
Yes the dmelanogaster_gene_ensembl dataset in the latest release of Ensembl (v79) contains some transcripts that are mis-represented (wrong exon ranking and strand, wrong UTRs). These are detected by the sanity checks that makeTranscriptDbFromBiomart() applies to the incoming data. See "makeTranscriptDbFromBiomart failure from Data Anomaly" for a long version of this story.
Anyway, a patch was applied a couple of weeks ago to GenomicFeatures (1.18.5 in BioC release, and 1.19.35 in BioC devel) to address the issue. The new behavior is that makeTranscriptDbFromBiomart() (renamed makeTxDbFromBiomart() in BioC devel) now drops these problematic transcripts with a warning instead of failing. So please make sure your packages are up-to-date (run biocLite() with no arguments for that).
Thanks,
H.
The BSgenome.Dmelanogaster.UCSC.dm2 doesn't contain any RNAs. It contains the genomic sequence for that species. There are ways to get non-coding RNAs, but you will first need to tell us exactly what you want.
In other words, 'non-coding RNAs' encompasses a lot of different things. In addition, there are several things you could be interested in (genomic sequence, RNA sequence, genomic location, etc).
Yes, sorry, you are right. Here is what I do: I count RNAseq reads using the TxDb.Dmelanogaster.UCSC.dm3.ensGene database to compute DEGs. However, I would also be interested in looking at whether ncRNAs (lincRNAs) are up or down regulated.
I hope this makes more sense. Thanks.