What populates makeTranscriptDbFromBiomart?

0

Entering edit mode

Ravi Karra ▴ 140

@ravi-karra-4463

Last seen 9.6 years ago

Hi, Just starting to learn how to look at RNA Seq data, so apologies in advance. I ran my RNA-Seq experiment on a GAII and aligned to the zebrafish genome using Bowtie2/Tophat2. I downloaded the current zebrafish genome (Zv9) and transcript gtf file from Ensembl for the reference indices. I am trying to use edgeR to look at differential expression, but am a little hung up on getting the count data. As you can see from the code below, I input 8835090 mapped reads, but only 5380643 are overlapped with known transcripts. It seems that I am losing reads in summarizing the count data and I can't really figure out why. Is the transcript information that results from makeTranscriptDbFromBiomart identical to the transcript information in the gtf files that can be downloaded via Ensembl? Thanks again, Ravi zv9txs = makeTranscriptDbFromBiomart(biomart="ensembl",dataset="drerio _gene_ensembl") tx_by_gene=transcriptsBy(zv9txs,'gene') #read in the mapped bam hits file reads=readBamGappedAlignments("accepted_hits.bam") #check compatibility of names txs = names(seqlengths(tx_by_gene)) r = as.character(unique(rname(reads))) table (r %in% txs) #all reads mapped reads have the same chr names TRUE 752 #make a genomic Ranges object and remove strand ambiguity reads=GRanges(seqnames=rname(reads),ranges=IRanges(start=start(reads), end=end(reads)), strand=rep("*",length(reads))) #summarize counts data counts=countOverlaps(tx_by_gene,reads) sum (counts) [1] 5380643 length (reads) [1] 8835090 > sessionInfo() R version 2.14.1 (2011-12-22) Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.2.1 Rsamtools_1.6.3 Biostrings_2.22.0 biomaRt_2.10.0 [5] GenomicFeatures_1.6.9 AnnotationDbi_1.16.19 Biobase_2.14.0 GenomicRanges_1.6.7 [9] IRanges_1.12.6 loaded via a namespace (and not attached): [1] bitops_1.0-4.1 BSgenome_1.22.0 DBI_0.2-5 grid_2.14.1 hwriter_1.3 [6] lattice_0.20-6 RCurl_1.91-1 RSQLite_0.11.1 rtracklayer_1.14.4 ShortRead_1.12.4 [11] tools_2.14.1 XM

edgeR edgeR • 1.3k views

ADD COMMENT • link updated 12.0 years ago by Steve Lianoglou ★ 13k • written 12.0 years ago by Ravi Karra ▴ 140

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 13 months ago

United States

Hi, On Sat, Apr 14, 2012 at 4:40 PM, Ravi Karra <ravi.karra at="" gmail.com=""> wrote: > Hi, > > Just starting to learn how to look at RNA Seq data, so apologies in advance. ?I ran my RNA-Seq experiment on a GAII and aligned to the zebrafish genome using Bowtie2/Tophat2. ?I downloaded the current zebrafish genome (Zv9) and transcript gtf file from Ensembl for the reference indices. ? I am trying to use edgeR to look at differential expression, but am a little hung up on getting the count data. > > As you can see from the code below, I input 8835090 mapped reads, but only 5380643 are overlapped with known transcripts. ?It seems that I am losing reads in summarizing the count data and I can't really figure out why. ? Is the transcript information that results from makeTranscriptDbFromBiomart identical to the transcript information in the gtf files that can be downloaded via Ensembl? Assume for the moment that it is identical -- you will (for sure) still have reads to regions where no transcripts are annotated. This still happens in organisms with "better" annotations than zebrafish, such as fruit fly, mouse, and human. The limit of our knowledge about what, where, and why regions of the genome are transcribed can be equally exciting as it is frustrating depending on which side of the fence you happen to be standing on a particular day. -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology ?| Memorial Sloan-Kettering Cancer Center ?| Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact

ADD COMMENT • link 12.0 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Ravi, I think part of your question is about whether or not we can trust ensembl to be internally consistent between what they put into their gtf files and what they expose via biomaRt. That's not really a bioconductor question since we really only present what is available at the resource in question, but we can still use bioconductor to ask questions about it. You can for example use the import() method from rtracklayer to bring the information in from the gtf file and compare that to the information that makeTranscriptDbFromBiomart() assembles from biomaRt. I would encourage you to make comparisons if you feel motivated, (but bear in mind that some kinds of data may not be present in the GTF file). And if you should find any legitimate discrepancies, the people at ensembl are usually quite responsive at explaining or correcting them (depending on what is appropriate). But usually, there are no real problems with this resource. The folks at ensembl are highly reliable. But as Steve was pointing out, even if everything is the same you will have reads that are just not part of the known transcriptome. So some proportion of your reads are not going to match up to anything that is well characterized. Of the reads that don't match up, some of them are likely to be from unknown transcripts, and some will be noise, but both are to be expected. Marc On 04/16/2012 06:41 AM, Steve Lianoglou wrote: > Hi, > > On Sat, Apr 14, 2012 at 4:40 PM, Ravi Karra<ravi.karra at="" gmail.com=""> wrote: >> Hi, >> >> Just starting to learn how to look at RNA Seq data, so apologies in advance. I ran my RNA-Seq experiment on a GAII and aligned to the zebrafish genome using Bowtie2/Tophat2. I downloaded the current zebrafish genome (Zv9) and transcript gtf file from Ensembl for the reference indices. I am trying to use edgeR to look at differential expression, but am a little hung up on getting the count data. >> >> As you can see from the code below, I input 8835090 mapped reads, but only 5380643 are overlapped with known transcripts. It seems that I am losing reads in summarizing the count data and I can't really figure out why. Is the transcript information that results from makeTranscriptDbFromBiomart identical to the transcript information in the gtf files that can be downloaded via Ensembl? > Assume for the moment that it is identical -- you will (for sure) > still have reads to regions where no transcripts are annotated. This > still happens in organisms with "better" annotations than zebrafish, > such as fruit fly, mouse, and human. > > The limit of our knowledge about what, where, and why regions of the > genome are transcribed can be equally exciting as it is frustrating > depending on which side of the fence you happen to be standing on a > particular day. > > -steve >

ADD REPLY • link 12.0 years ago Marc Carlson ★ 7.2k

0

Entering edit mode

One more thing that I neglected to mention is that you can learn details about your transcriptDb object by just looking at the output of it's show method. So for example if I have loaded: library(TxDb.Hsapiens.UCSC.hg19.knownGene) ## then I can just look at the object like: TxDb.Hsapiens.UCSC.hg19.knownGene ## And I see an output like this: TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: UCSC | Genome: hg19 | Genus and Species: Homo sapiens | UCSC Table: knownGene | Resource URL: http://genome.ucsc.edu/ | Type of Gene ID: Entrez Gene ID | Full dataset: yes | miRBase build ID: GRCh37 | transcript_nrow: 80922 | exon_nrow: 286852 | cds_nrow: 235842 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-03-12 21:45:23 -0700 (Mon, 12 Mar 2012) | GenomicFeatures version at creation time: 1.7.30 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 Which tells me a whole lot about where this data came from, what build of the genome it was based upon etc. Now if I had looked at a package based on biomaRt it would look similar. For example: library("TxDb.Athaliana.BioMart.plantsmart12") TxDb.Athaliana.BioMart.plantsmart12 ## Shows me this: TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: BioMart | Genus and Species: Arabidopsis thaliana | Resource URL: www.biomart.org:80 | BioMart database: plants_mart_12 | BioMart database version: ENSEMBL PLANTS 12 (EBI UK) | BioMart dataset: athaliana_eg_gene | BioMart dataset description: Arabidopsis thaliana genes (TAIR10) | BioMart dataset version: TAIR10 | Full dataset: yes | miRBase build ID: NA | transcript_nrow: 41671 | exon_nrow: 171013 | cds_nrow: 0 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2012-03-13 09:54:23 -0700 (Tue, 13 Mar 2012) | GenomicFeatures version at creation time: 1.7.30 | RSQLite version at creation time: 0.11.1 | DBSCHEMAVERSION: 1.0 Which tells me exactly which biomaRt data sources were used for constructing the database. Marc On 04/16/2012 10:50 AM, Marc Carlson wrote: > Hi Ravi, > > I think part of your question is about whether or not we can trust > ensembl to be internally consistent between what they put into their > gtf files and what they expose via biomaRt. That's not really a > bioconductor question since we really only present what is available > at the resource in question, but we can still use bioconductor to ask > questions about it. You can for example use the import() method from > rtracklayer to bring the information in from the gtf file and compare > that to the information that makeTranscriptDbFromBiomart() assembles > from biomaRt. I would encourage you to make comparisons if you feel > motivated, (but bear in mind that some kinds of data may not be > present in the GTF file). And if you should find any legitimate > discrepancies, the people at ensembl are usually quite responsive at > explaining or correcting them (depending on what is appropriate). But > usually, there are no real problems with this resource. The folks at > ensembl are highly reliable. > > But as Steve was pointing out, even if everything is the same you will > have reads that are just not part of the known transcriptome. So some > proportion of your reads are not going to match up to anything that is > well characterized. Of the reads that don't match up, some of them > are likely to be from unknown transcripts, and some will be noise, but > both are to be expected. > > > Marc > > > > > On 04/16/2012 06:41 AM, Steve Lianoglou wrote: >> Hi, >> >> On Sat, Apr 14, 2012 at 4:40 PM, Ravi Karra<ravi.karra at="" gmail.com=""> >> wrote: >>> Hi, >>> >>> Just starting to learn how to look at RNA Seq data, so apologies in >>> advance. I ran my RNA-Seq experiment on a GAII and aligned to the >>> zebrafish genome using Bowtie2/Tophat2. I downloaded the current >>> zebrafish genome (Zv9) and transcript gtf file from Ensembl for the >>> reference indices. I am trying to use edgeR to look at >>> differential expression, but am a little hung up on getting the >>> count data. >>> >>> As you can see from the code below, I input 8835090 mapped reads, >>> but only 5380643 are overlapped with known transcripts. It seems >>> that I am losing reads in summarizing the count data and I can't >>> really figure out why. Is the transcript information that results >>> from makeTranscriptDbFromBiomart identical to the transcript >>> information in the gtf files that can be downloaded via Ensembl? >> Assume for the moment that it is identical -- you will (for sure) >> still have reads to regions where no transcripts are annotated. This >> still happens in organisms with "better" annotations than zebrafish, >> such as fruit fly, mouse, and human. >> >> The limit of our knowledge about what, where, and why regions of the >> genome are transcribed can be equally exciting as it is frustrating >> depending on which side of the fence you happen to be standing on a >> particular day. >> >> -steve >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.0 years ago Marc Carlson ★ 7.2k

Login before adding your answer.