How to retrieve exon ID and gene ID from exon coordinates?

0

Entering edit mode

ying chen ▴ 340

@ying-chen-5085

Last seen 9.6 years ago

Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help! Ying [[alternative HTML version deleted]]

RNASeq DEXSeq RNASeq DEXSeq • 2.9k views

ADD COMMENT • link updated 11.6 years ago by James W. MacDonald 65k • written 11.6 years ago by ying chen ▴ 340

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

Hi Ying, On 9/10/2012 4:54 PM, ying chen wrote: > > > Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help! You don't give much to go on. Assuming you are working with a common species, it is simple. Let's assume you are working with mice. Something like this should work: yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE) library(TxDb.Mmusculus.UCSC.mm9.knownGene) ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns = c("exon_id","gene_id")) yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start, end=yourdata$end)) elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),] If you are planning on doing this sort of stuff, do yourself a favor and read the GenomicFeatures and GenomicRanges vignettes. They are chock full of info that you will need. Best, Jim > Ying > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD COMMENT • link 11.6 years ago James W. MacDonald 65k

0

Entering edit mode

Hi, I read the vignettes and tried the code Jim suggested, but got an error message I could not figure out why. Here is what I did: > library(GenomicFeatures) > library(GenomicRanges) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > TCGA_CRC <- read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",s tringsAsFactors=FALSE)> ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("exon_id","gene_id","tx_id")) > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom, ranges=IRanges(start=TC GA_CRC$Start,end=TCGA_CRC$End),strand=TCGA_CRC$Strand,name=TCGA_CRC$Ex onID)> elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC, ex),] Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] : selecting rows: subscript contains NAs or out of bounds indices > I cannot tell what is subscript it mentioned. Any suggestion? Thanks a lot fo the help! Ying The following are the some info: > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-mingw32/x64 (64-bit)locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 GenomicFeatures_1.8.3 AnnotationDbi_1.18.3 [4] Biobase_2.16.0 GenomicRanges_1.8.13 IRanges_1.14.4 [7] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6 [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1 tools_2.15.1 XML_3.9-4.1 zlibbioc_1.2.0 > > TCGA_CRC GRanges with 239886 ranges and 1 elementMetadata col: seqnames ranges strand | name <rle> <iranges> <rle> | <character> [1] chr10 [100003848, 100004653] + | chr10:100003848-100004653:+ [2] chr10 [100007443, 100008748] - | chr10:100008748-100007443:- [3] chr10 [100010822, 100010933] - | chr10:100010933-100010822:- [4] chr10 [100011323, 100011459] - | chr10:100011459-100011323:- [5] chr10 [100012110, 100012225] - | chr10:100012225-100012110:- [6] chr10 [100013310, 100013553] - | chr10:100013553-100013310:- [7] chr10 [100015334, 100015496] - | chr10:100015496-100015334:- [8] chr10 [100016537, 100016704] - | chr10:100016704-100016537:- [9] chr10 [100017407, 100017561] - | chr10:100017561-100017407:- ... ... ... ... ... ... [239878] chrY [9611654, 9611928] - | chrY:9611928-9611654:- [239879] chrY [9638762, 9638916] + | chrY:9638762-9638916:+ [239880] chrY [9642383, 9642494] + | chrY:9642383-9642494:+ [239881] chrY [9646920, 9646994] + | chrY:9646920-9646994:+ [239882] chrY [9647680, 9647718] + | chrY:9647680-9647718:+ [239883] chrY [9650809, 9650854] + | chrY:9650809-9650854:+ [239884] chrY [9748407, 9748463] + | chrY:9748407-9748463:+ [239885] chrY [9748577, 9748722] + | chrY:9748577-9748722:+ [239886] chrY [9749263, 9749571] + | chrY:9749263-9749571:+ --- seqlengths: chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > > ex GRanges with 286852 ranges and 3 elementMetadata cols: seqnames ranges strand | exon_id gene_id tx_id <rle> <iranges> <rle> | <integer> <compressedcharacterlist> <compressedintegerlist> [1] chr1 [ 11874, 12227] + | 1 NA 1,2,3 [2] chr1 [ 12595, 12721] + | 5 NA 3 [3] chr1 [ 12613, 12721] + | 2 NA 1 [4] chr1 [ 12646, 12697] + | 4 NA 2 [5] chr1 [ 13221, 14409] + | 3 NA 1,2 [6] chr1 [ 13403, 14409] + | 6 NA 3 [7] chr1 [ 69091, 70008] + | 37 79501 25 [8] chr1 [321084, 321115] + | 44 NA 28 [9] chr1 [321146, 321207] + | 45 NA 29 ... ... ... ... ... ... ... ... [286844] chrY [27603458, 27603499] - | 142741 NA 40070 [286845] chrY [27604352, 27604379] - | 142742 NA 40071 [286846] chrY [27605645, 27605678] - | 142743 NA 40072 [286847] chrY [27606394, 27606421] - | 142744 NA 40073 [286848] chrY [27607404, 27607432] - | 142745 NA 40074 [286849] chrY [27635919, 27635954] - | 142750 NA 40078 [286850] chrY [59358329, 59359508] - | 142802 NA 40099 [286851] chrY [59360007, 59360115] - | 142803 NA 40099 [286852] chrY [59360501, 59360854] - | 142804 NA 40099 --- seqlengths: chr1 chr2 chr3 ... chrM chrUn_gl000226 chr18_gl000207_random 249250621 243199373 198022430 ... 16571 15008 4262 > > Date: Mon, 10 Sep 2012 17:29:24 -0400 > From: jmacdon@uw.edu > To: ying_chen@live.com > CC: bioconductor@r-project.org > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon coordinates? > > Hi Ying, > > On 9/10/2012 4:54 PM, ying chen wrote: > > > > > > Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help! > > You don't give much to go on. Assuming you are working with a common > species, it is simple. Let's assume you are working with mice. > > Something like this should work: > > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns = > c("exon_id","gene_id")) > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start, > end=yourdata$end)) > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),] > > If you are planning on doing this sort of stuff, do yourself a favor and > read the GenomicFeatures and GenomicRanges vignettes. They are chock > full of info that you will need. > > Best, > > Jim > > > > > Ying > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > [[alternative HTML version deleted]]

ADD REPLY • link 11.6 years ago ying chen ▴ 340

0

Entering edit mode

Do not do this. You will go moderately insane. Use the TCGA GAF against which the reads were aligned to get the symbols and work on those instead. Katie Hoadley and her people at UNC put a lot of effort into describing this, e.g. https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anony mous/tumor/coad/cgcc/unc.edu/illuminaga_rnaseq/rnaseq/unc.edu_COAD.Ill uminaGA_RNASeq.mage-tab.1.7.0/DESCRIPTION.txt The corresponding GAF, it is therein noted, can be found at https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anony mous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf The reason you want to do this (IMHO) is that some TCGA projects were aligned to hg18 and some to hg19. Alhamdulillah, the RNAseq runs for COAD were aligned against hg19, as you will see in the DESCRIPTION file above. (No seriously, read it. They did a very good job describing the experiment.) But if you don't pay attention to what GAF is being used then, one of these days when you can least afford the time, you will be screwed, simple as that. More on GAF: https://tcga-data.nci.nih.gov/docs/GAF/GAF_index.html GAF3, which actually is sort of sensible, yet still provokes arguments: https://tcga-data.nci.nih.gov/docs/GAF/GAF3.0/ I will respond separately with a copy to Marc Carlson regarding this issue, since it is Yet Another Job For A SummarizedExperiment. Hopefully you will be one of the last non-TCGA people to have to do this. I can assure you, however, that you were not the first person to be puzzled by the format. --t On Tue, Sep 11, 2012 at 1:06 PM, ying chen <ying_chen@live.com> wrote: > Hi, I read the vignettes and tried the code Jim suggested, but got an > error message I could not figure out why. Here is what I did: > > library(GenomicFeatures) > > library(GenomicRanges) > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > TCGA_CRC <- > read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",stringsAsFact ors=FALSE)> > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = > c("exon_id","gene_id","tx_id")) > > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom, > ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand=TCGA_CR C$Strand,name=TCGA_CRC$ExonID)> > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC, ex),] > Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] : > selecting rows: subscript contains NAs or out of bounds indices > > I cannot tell what is subscript it mentioned. Any suggestion? Thanks a > lot fo the help! Ying The following are the some info: > sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-pc-mingw32/x64 (64-bit)locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 attached base packages: > [1] stats graphics grDevices utils datasets methods base > other attached packages: > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 GenomicFeatures_1.8.3 > AnnotationDbi_1.18.3 > [4] Biobase_2.16.0 GenomicRanges_1.8.13 > IRanges_1.14.4 > [7] BiocGenerics_0.2.0 loaded via a namespace (and not > attached): > [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 > BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6 > [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1 tools_2.15.1 > XML_3.9-4.1 zlibbioc_1.2.0 > > > TCGA_CRC > GRanges with 239886 ranges and 1 elementMetadata col: > seqnames ranges strand | > name > <rle> <iranges> <rle> | > <character> > [1] chr10 [100003848, 100004653] + | > chr10:100003848-100004653:+ > [2] chr10 [100007443, 100008748] - | > chr10:100008748-100007443:- > [3] chr10 [100010822, 100010933] - | > chr10:100010933-100010822:- > [4] chr10 [100011323, 100011459] - | > chr10:100011459-100011323:- > [5] chr10 [100012110, 100012225] - | > chr10:100012225-100012110:- > [6] chr10 [100013310, 100013553] - | > chr10:100013553-100013310:- > [7] chr10 [100015334, 100015496] - | > chr10:100015496-100015334:- > [8] chr10 [100016537, 100016704] - | > chr10:100016704-100016537:- > [9] chr10 [100017407, 100017561] - | > chr10:100017561-100017407:- > ... ... ... ... ... > ... > [239878] chrY [9611654, 9611928] - | > chrY:9611928-9611654:- > [239879] chrY [9638762, 9638916] + | > chrY:9638762-9638916:+ > [239880] chrY [9642383, 9642494] + | > chrY:9642383-9642494:+ > [239881] chrY [9646920, 9646994] + | > chrY:9646920-9646994:+ > [239882] chrY [9647680, 9647718] + | > chrY:9647680-9647718:+ > [239883] chrY [9650809, 9650854] + | > chrY:9650809-9650854:+ > [239884] chrY [9748407, 9748463] + | > chrY:9748407-9748463:+ > [239885] chrY [9748577, 9748722] + | > chrY:9748577-9748722:+ > [239886] chrY [9749263, 9749571] + | > chrY:9749263-9749571:+ > --- > seqlengths: > chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 > chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX > chrY > NA NA NA NA NA NA NA NA NA NA NA NA > NA NA NA NA NA NA NA NA NA NA NA NA > NA > > > > ex > GRanges with 286852 ranges and 3 elementMetadata cols: > seqnames ranges strand | exon_id > gene_id tx_id > <rle> <iranges> <rle> | <integer> > <compressedcharacterlist> <compressedintegerlist> > [1] chr1 [ 11874, 12227] + | 1 > NA 1,2,3 > [2] chr1 [ 12595, 12721] + | 5 > NA 3 > [3] chr1 [ 12613, 12721] + | 2 > NA 1 > [4] chr1 [ 12646, 12697] + | 4 > NA 2 > [5] chr1 [ 13221, 14409] + | 3 > NA 1,2 > [6] chr1 [ 13403, 14409] + | 6 > NA 3 > [7] chr1 [ 69091, 70008] + | 37 > 79501 25 > [8] chr1 [321084, 321115] + | 44 > NA 28 > [9] chr1 [321146, 321207] + | 45 > NA 29 > ... ... ... ... ... ... > ... ... > [286844] chrY [27603458, 27603499] - | 142741 > NA 40070 > [286845] chrY [27604352, 27604379] - | 142742 > NA 40071 > [286846] chrY [27605645, 27605678] - | 142743 > NA 40072 > [286847] chrY [27606394, 27606421] - | 142744 > NA 40073 > [286848] chrY [27607404, 27607432] - | 142745 > NA 40074 > [286849] chrY [27635919, 27635954] - | 142750 > NA 40078 > [286850] chrY [59358329, 59359508] - | 142802 > NA 40099 > [286851] chrY [59360007, 59360115] - | 142803 > NA 40099 > [286852] chrY [59360501, 59360854] - | 142804 > NA 40099 > --- > seqlengths: > chr1 chr2 chr3 ... > chrM chrUn_gl000226 chr18_gl000207_random > 249250621 243199373 198022430 ... > 16571 15008 4262 > > > > Date: Mon, 10 Sep 2012 17:29:24 -0400 > > From: jmacdon@uw.edu > > To: ying_chen@live.com > > CC: bioconductor@r-project.org > > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon > coordinates? > > > > Hi Ying, > > > > On 9/10/2012 4:54 PM, ying chen wrote: > > > > > > > > > Hi guys, I have a RNASeq data table which has exon cooridinates > (chrom, start. end) and raw count. I want to use DEXseq to see differential > transcripts. To do it I need to get geneIDs and exonIDs from corresponding > exon cooridinates. Any suggestion how to do it? Thanks a lot for the help! > > > > You don't give much to go on. Assuming you are working with a common > > species, it is simple. Let's assume you are working with mice. > > > > Something like this should work: > > > > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE) > > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns = > > c("exon_id","gene_id")) > > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start, > > end=yourdata$end)) > > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),] > > > > If you are planning on doing this sort of stuff, do yourself a favor and > > read the GenomicFeatures and GenomicRanges vignettes. They are chock > > full of info that you will need. > > > > Best, > > > > Jim > > > > > > > > > Ying > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor@r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > University of Washington > > Environmental and Occupational Health Sciences > > 4225 Roosevelt Way NE, # 100 > > Seattle WA 98105-6099 > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- *A model is a lie that helps you see the truth.* * * Howard Skipper<http: cancerres.aacrjournals.org="" content="" 31="" 9="" 1173.full.pdf=""> [[alternative HTML version deleted]]

ADD REPLY • link 11.6 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Ying, On 9/11/2012 4:06 PM, ying chen wrote: > Hi, > > I read the vignettes and tried the code Jim suggested, but got an > error message I could not figure out why. > > Here is what I did: > > > library(GenomicFeatures) > > library(GenomicRanges) > > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > TCGA_CRC <- > read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",stringsAs Factors=FALSE) > > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = > c("exon_id","gene_id","tx_id")) > > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom, > ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand=TCG A_CRC$Strand,name=TCGA_CRC$ExonID) > > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC, > ex),] > Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] : > selecting rows: subscript contains NAs or out of bounds indices > You will get that error if there are ranges in TCGA_CRC that are not in ex. So the assumption here is that the exons from your RNA-seq experiment are exons that match up with what is in the knownGene table from UCSC. The fact that you have counts from exons that don't align to the expected ranges from the version of the knownGene table you are using is problematic, and implies that something is amiss. A first check is to ensure that your RNA-seq data (and counts) refer to the hg19 build. You can find out which ranges are different between the two using the %in% operator: x <- which(!TCGA_CRC %in% ex) Which will at least tell you how far off you are. Hypothetically, if there are only a few mismatching ranges, you could subset the TCGA_CRC GRanges object, but personally I would want to know why things aren't agreeing. Alternatively you could use the HTSeq python scripts to get the counts, in which case you won't need to bother with all of this, or you could use the summarizeOverlaps() function in GenomicRanges to do the counting, which will again ensure that the counts you have align to the exons from the knownGene table. Best, Jim > > > > I cannot tell what is subscript it mentioned. > > Any suggestion? > > Thanks a lot fo the help! > > Ying > > The following are the some info: > > > sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: x86_64-pc-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C LC_TIME=English_United > States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 > GenomicFeatures_1.8.3 AnnotationDbi_1.18.3 > [4] Biobase_2.16.0 > GenomicRanges_1.8.13 IRanges_1.14.4 > [7] BiocGenerics_0.2.0 > > loaded via a namespace (and not attached): > [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 > BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6 > [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1 > tools_2.15.1 XML_3.9-4.1 zlibbioc_1.2.0 > > > > > TCGA_CRC > GRanges with 239886 ranges and 1 elementMetadata col: > seqnames ranges strand > | name > <rle> <iranges> <rle> | <character> > [1] chr10 [100003848, 100004653] + | > chr10:100003848-100004653:+ > [2] chr10 [100007443, 100008748] - | > chr10:100008748-100007443:- > [3] chr10 [100010822, 100010933] - | > chr10:100010933-100010822:- > [4] chr10 [100011323, 100011459] - | > chr10:100011459-100011323:- > [5] chr10 [100012110, 100012225] - | > chr10:100012225-100012110:- > [6] chr10 [100013310, 100013553] - | > chr10:100013553-100013310:- > [7] chr10 [100015334, 100015496] - | > chr10:100015496-100015334:- > [8] chr10 [100016537, 100016704] - | > chr10:100016704-100016537:- > [9] chr10 [100017407, 100017561] - | > chr10:100017561-100017407:- > ... ... ... ... > ... ... > [239878] chrY [9611654, 9611928] - | > chrY:9611928-9611654:- > [239879] chrY [9638762, 9638916] + | > chrY:9638762-9638916:+ > [239880] chrY [9642383, 9642494] + | > chrY:9642383-9642494:+ > [239881] chrY [9646920, 9646994] + | > chrY:9646920-9646994:+ > [239882] chrY [9647680, 9647718] + | > chrY:9647680-9647718:+ > [239883] chrY [9650809, 9650854] + | > chrY:9650809-9650854:+ > [239884] chrY [9748407, 9748463] + | > chrY:9748407-9748463:+ > [239885] chrY [9748577, 9748722] + | > chrY:9748577-9748722:+ > [239886] chrY [9749263, 9749571] + | > chrY:9749263-9749571:+ > --- > seqlengths: > chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 > chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 > chrM chrX chrY > NA NA NA NA NA NA NA NA NA NA > NA NA NA NA NA NA NA NA NA NA NA > NA NA NA NA > > > > ex > GRanges with 286852 ranges and 3 elementMetadata cols: > seqnames ranges strand | > exon_id gene_id tx_id > <rle> <iranges> <rle> | <integer> <compressedcharacterlist> > <compressedintegerlist> > [1] chr1 [ 11874, 12227] + | > 1 NA 1,2,3 > [2] chr1 [ 12595, 12721] + | > 5 NA 3 > [3] chr1 [ 12613, 12721] + | > 2 NA 1 > [4] chr1 [ 12646, 12697] + | > 4 NA 2 > [5] chr1 [ 13221, 14409] + | > 3 NA 1,2 > [6] chr1 [ 13403, 14409] + | > 6 NA 3 > [7] chr1 [ 69091, 70008] + | > 37 79501 25 > [8] chr1 [321084, 321115] + | > 44 NA 28 > [9] chr1 [321146, 321207] + | > 45 NA 29 > ... ... ... ... ... > ... ... ... > [286844] chrY [27603458, 27603499] - | > 142741 NA 40070 > [286845] chrY [27604352, 27604379] - | > 142742 NA 40071 > [286846] chrY [27605645, 27605678] - | > 142743 NA 40072 > [286847] chrY [27606394, 27606421] - | > 142744 NA 40073 > [286848] chrY [27607404, 27607432] - | > 142745 NA 40074 > [286849] chrY [27635919, 27635954] - | > 142750 NA 40078 > [286850] chrY [59358329, 59359508] - | > 142802 NA 40099 > [286851] chrY [59360007, 59360115] - | > 142803 NA 40099 > [286852] chrY [59360501, 59360854] - | > 142804 NA 40099 > --- > seqlengths: > chr1 chr2 chr3 > ... chrM chrUn_gl000226 chr18_gl000207_random > 249250621 243199373 198022430 > ... 16571 15008 4262 > > > > > Date: Mon, 10 Sep 2012 17:29:24 -0400 > > From: jmacdon at uw.edu > > To: ying_chen at live.com > > CC: bioconductor at r-project.org > > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon > coordinates? > > > > Hi Ying, > > > > On 9/10/2012 4:54 PM, ying chen wrote: > > > > > > > > > Hi guys, I have a RNASeq data table which has exon cooridinates > (chrom, start. end) and raw count. I want to use DEXseq to see > differential transcripts. To do it I need to get geneIDs and exonIDs > from corresponding exon cooridinates. Any suggestion how to do it? > Thanks a lot for the help! > > > > You don't give much to go on. Assuming you are working with a common > > species, it is simple. Let's assume you are working with mice. > > > > Something like this should work: > > > > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE) > > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns = > > c("exon_id","gene_id")) > > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start, > > end=yourdata$end)) > > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),] > > > > If you are planning on doing this sort of stuff, do yourself a favor > and > > read the GenomicFeatures and GenomicRanges vignettes. They are chock > > full of info that you will need. > > > > Best, > > > > Jim > > > > > > > > > Ying > > > [[alternative HTML version deleted]] > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > Bioconductor at r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > University of Washington > > Environmental and Occupational Health Sciences > > 4225 Roosevelt Way NE, # 100 > > Seattle WA 98105-6099 > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099

ADD REPLY • link 11.6 years ago James W. MacDonald 65k

0

Entering edit mode

TCGA BAMs aren't usually available to the public because they contain personally identifiable information about actual humans (not cell lines). BED and WIG files are not produced, as they are for (say) ENCODE or the Roadmap project. Instead, a file format specific to TCGA is used, and it typically looks like this, e.g. for exons: (from https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers /anonymous/tumor/coad/cgcc/unc.edu/illuminaga_rnaseq/rnaseq/unc.edu_CO AD.IlluminaGA_RNASeq.Level_3.3.4.0/UNCID_278446.TCGA-AA-3860-01A-02R-0 905-07.100908_UNC6-RDR300211_00030_FC_62EL4AAXX.1.trimmed.annotated.tr anslated_to_genomic.exon.quantification.txt) exon raw_counts median_length_normalized RPKM chr1:11874-12227:+ 0 0 0 ... chrM:14762-15888:+ 9917181 8799.62821650399 4252.73411539598 chrM:15999-16571:+ 287037 500.937172774869 242.095751310737 Since all of the ranges in the first column will be the same for a given tumor type, those are mapped to the GAF, where their annotations are kept, e.g. (from https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anony mous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf) #EntryNumber FeatureID FeatureType FeatureDBSource FeatureDBVersion FeatureDBDate FeatureSeqFileName Composite CompositeType CompositeDBSource CompositeDBVersion CompositeDBDate AlignmentType FeatureCoordinates CompositeCoordinates Gene GeneLocus FeatureAliases FeatureInfo 1 CPA1|1357 gene calculated genomic GRCh37 genome NCBI GRCh37 pairwise 1-137, 138-219,220-453,454-555,556-657,658-1064,1065-1155,1156-1355,1356-1440 ,1441-1724 chr7:130020290-130020426,130020939-130021020,130021471-130 021704,130021949-130022050,130023232-130023333,130023525-130023931,130 024377-130024467,130024987-130025186,130025680-130025764,130027665-130 027948:+ CPA1|1357 chr7:130020290-130027948:+ 2 GUCY2D|3000 gene calculated genomic GRCh37 genome NCBI GRCh37 pairwise 1-65,6 6-795,796-1100,1101-1452,1453-1537,1538-1640,1641-1742,1743-1823,1824- 2030,2031-2187,2188-2337,2338-2486,2487-2650,2651-2843,2844-3018,3019- 3117,3118-3212,3213-3298,3299-3410,3411-3623 chr17:7905988-7906052,79 06357-7907086,7907170-7907474,7909681-7910032,7910378-7910462,7910744- 7910846,7911249-7911350,7912824-7912904,7915462-7915668,7915768-791592 4,7916421-7916570,7917198-7917346,7917919-7918082,7918177-7918369,7918 646-7918820,7919061-7919159,7919245-7919339,7919523-7919608,7919761-79 19872,7923446-7923658:+ It is a drag to deal with this so I am going to cc: Marc Carlson on this with a suggestion that we (possibly me, but better someone at UNC or UBC) upload FDb.* packages with the GAF ranges and their metadata contained, as annotations for the release cycle. I won't have time until this weekend to look. Best, --t On Tue, Sep 11, 2012 at 1:24 PM, James W. MacDonald <jmacdon at="" uw.edu=""> wrote: > > Hi Ying, > > > On 9/11/2012 4:06 PM, ying chen wrote: >> >> Hi, >> >> I read the vignettes and tried the code Jim suggested, but got an error message I could not figure out why. >> >> Here is what I did: >> >> > library(GenomicFeatures) >> > library(GenomicRanges) >> > library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> > TCGA_CRC <- >> read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",stringsA sFactors=FALSE) >> > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = >> c("exon_id","gene_id","tx_id")) >> > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom, >> ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand=TC GA_CRC$Strand,name=TCGA_CRC$ExonID) >> > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC, >> ex),] >> Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] : >> selecting rows: subscript contains NAs or out of bounds indices >> > > You will get that error if there are ranges in TCGA_CRC that are not in ex. So the assumption here is that the exons from your RNA-seq experiment are exons that match up with what is in the knownGene table from UCSC. > > The fact that you have counts from exons that don't align to the expected ranges from the version of the knownGene table you are using is problematic, and implies that something is amiss. A first check is to ensure that your RNA-seq data (and counts) refer to the hg19 build. > > You can find out which ranges are different between the two using the %in% operator: > > x <- which(!TCGA_CRC %in% ex) > > Which will at least tell you how far off you are. Hypothetically, if there are only a few mismatching ranges, you could subset the TCGA_CRC GRanges object, but personally I would want to know why things aren't agreeing. > > Alternatively you could use the HTSeq python scripts to get the counts, in which case you won't need to bother with all of this, or you could use the summarizeOverlaps() function in GenomicRanges to do the counting, which will again ensure that the counts you have align to the exons from the knownGene table. > > Best, > > Jim > > > > > >> > >> >> I cannot tell what is subscript it mentioned. >> >> Any suggestion? >> >> Thanks a lot fo the help! >> >> Ying >> >> The following are the some info: >> >> > sessionInfo() >> R version 2.15.1 (2012-06-22) >> Platform: x86_64-pc-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 GenomicFeatures_1.8.3 AnnotationDbi_1.18.3 >> [4] Biobase_2.16.0 GenomicRanges_1.8.13 IRanges_1.14.4 >> [7] BiocGenerics_0.2.0 >> >> loaded via a namespace (and not attached): >> [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6 >> [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1 tools_2.15.1 XML_3.9-4.1 zlibbioc_1.2.0 >> > >> >> > TCGA_CRC >> GRanges with 239886 ranges and 1 elementMetadata col: >> seqnames ranges strand | name >> <rle> <iranges> <rle> | <character> >> [1] chr10 [100003848, 100004653] + | chr10:100003848-100004653:+ >> [2] chr10 [100007443, 100008748] - | chr10:100008748-100007443:- >> [3] chr10 [100010822, 100010933] - | chr10:100010933-100010822:- >> [4] chr10 [100011323, 100011459] - | chr10:100011459-100011323:- >> [5] chr10 [100012110, 100012225] - | chr10:100012225-100012110:- >> [6] chr10 [100013310, 100013553] - | chr10:100013553-100013310:- >> [7] chr10 [100015334, 100015496] - | chr10:100015496-100015334:- >> [8] chr10 [100016537, 100016704] - | chr10:100016704-100016537:- >> [9] chr10 [100017407, 100017561] - | chr10:100017561-100017407:- >> ... ... ... ... ... ... >> [239878] chrY [9611654, 9611928] - | chrY:9611928-9611654:- >> [239879] chrY [9638762, 9638916] + | chrY:9638762-9638916:+ >> [239880] chrY [9642383, 9642494] + | chrY:9642383-9642494:+ >> [239881] chrY [9646920, 9646994] + | chrY:9646920-9646994:+ >> [239882] chrY [9647680, 9647718] + | chrY:9647680-9647718:+ >> [239883] chrY [9650809, 9650854] + | chrY:9650809-9650854:+ >> [239884] chrY [9748407, 9748463] + | chrY:9748407-9748463:+ >> [239885] chrY [9748577, 9748722] + | chrY:9748577-9748722:+ >> [239886] chrY [9749263, 9749571] + | chrY:9749263-9749571:+ >> --- >> seqlengths: >> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY >> NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA >> > >> > ex >> GRanges with 286852 ranges and 3 elementMetadata cols: >> seqnames ranges strand | exon_id gene_id tx_id >> <rle> <iranges> <rle> | <integer> <compressedcharacterlist> <compressedintegerlist> >> [1] chr1 [ 11874, 12227] + | 1 NA 1,2,3 >> [2] chr1 [ 12595, 12721] + | 5 NA 3 >> [3] chr1 [ 12613, 12721] + | 2 NA 1 >> [4] chr1 [ 12646, 12697] + | 4 NA 2 >> [5] chr1 [ 13221, 14409] + | 3 NA 1,2 >> [6] chr1 [ 13403, 14409] + | 6 NA 3 >> [7] chr1 [ 69091, 70008] + | 37 79501 25 >> [8] chr1 [321084, 321115] + | 44 NA 28 >> [9] chr1 [321146, 321207] + | 45 NA 29 >> ... ... ... ... ... ... ... ... >> [286844] chrY [27603458, 27603499] - | 142741 NA 40070 >> [286845] chrY [27604352, 27604379] - | 142742 NA 40071 >> [286846] chrY [27605645, 27605678] - | 142743 NA 40072 >> [286847] chrY [27606394, 27606421] - | 142744 NA 40073 >> [286848] chrY [27607404, 27607432] - | 142745 NA 40074 >> [286849] chrY [27635919, 27635954] - | 142750 NA 40078 >> [286850] chrY [59358329, 59359508] - | 142802 NA 40099 >> [286851] chrY [59360007, 59360115] - | 142803 NA 40099 >> [286852] chrY [59360501, 59360854] - | 142804 NA 40099 >> --- >> seqlengths: >> chr1 chr2 chr3 ... chrM chrUn_gl000226 chr18_gl000207_random >> 249250621 243199373 198022430 ... 16571 15008 4262 >> > >> >> > Date: Mon, 10 Sep 2012 17:29:24 -0400 >> > From: jmacdon at uw.edu >> > To: ying_chen at live.com >> > CC: bioconductor at r-project.org >> > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon coordinates? >> > >> > Hi Ying, >> > >> > On 9/10/2012 4:54 PM, ying chen wrote: >> > > >> > > >> > > Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help! >> > >> > You don't give much to go on. Assuming you are working with a common >> > species, it is simple. Let's assume you are working with mice. >> > >> > Something like this should work: >> > >> > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE) >> > library(TxDb.Mmusculus.UCSC.mm9.knownGene) >> > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns = >> > c("exon_id","gene_id")) >> > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start, >> > end=yourdata$end)) >> > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),] >> > >> > If you are planning on doing this sort of stuff, do yourself a favor and >> > read the GenomicFeatures and GenomicRanges vignettes. They are chock >> > full of info that you will need. >> > >> > Best, >> > >> > Jim >> > >> > >> > >> > > Ying >> > > [[alternative HTML version deleted]] >> > > >> > > _______________________________________________ >> > > Bioconductor mailing list >> > > Bioconductor at r-project.org >> > > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> > -- >> > James W. MacDonald, M.S. >> > Biostatistician >> > University of Washington >> > Environmental and Occupational Health Sciences >> > 4225 Roosevelt Way NE, # 100 >> > Seattle WA 98105-6099 >> > > > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- A model is a lie that helps you see the truth. Howard Skipper

ADD REPLY • link 11.6 years ago Tim Triche ★ 4.2k

0

Entering edit mode

Hi Tim, Any update on the availability of FDb.* package with the GAF ranges and their metadata? I am sorry for pushing your guys a little bit :). With more and more TCGA RNASeq data becoming available, this package will definitely benefit a lot bioconductor users. Thanks a lot, Ying > From: tim.triche@gmail.com > Date: Tue, 11 Sep 2012 13:34:15 -0700 > Subject: Re: [BioC] GenomicRanges : How to retrieve exon ID and gene ID from exon coordinates? > To: jmacdon@uw.edu; mcarlson@fhcrc.org > CC: ying_chen@live.com; bioconductor@r-project.org > > TCGA BAMs aren't usually available to the public because they contain > personally identifiable information about actual humans (not cell lines). > > BED and WIG files are not produced, as they are for (say) ENCODE or > the Roadmap project. > Instead, a file format specific to TCGA is used, and it typically > looks like this, e.g. for exons: > > (from https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpuse rs/anonymous/tumor/coad/cgcc/unc.edu/illuminaga_rnaseq/rnaseq/unc.edu_ COAD.IlluminaGA_RNASeq.Level_3.3.4.0/UNCID_278446.TCGA-AA-3860-01A-02R -0905-07.100908_UNC6-RDR300211_00030_FC_62EL4AAXX.1.trimmed.annotated. translated_to_genomic.exon.quantification.txt) > > exon raw_counts median_length_normalized RPKM > chr1:11874-12227:+ 0 0 0 > ... > chrM:14762-15888:+ 9917181 8799.62821650399 4252.73411539598 > chrM:15999-16571:+ 287037 500.937172774869 242.095751310737 > > Since all of the ranges in the first column will be the same for a > given tumor type, those are mapped to the GAF, where their annotations > are kept, e.g. (from > https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/ano nymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf) > > #EntryNumber FeatureID FeatureType FeatureDBSource FeatureDBVersion FeatureDBDate FeatureSeqFileName Composite CompositeType CompositeDBSource CompositeDBVersion CompositeDBDate AlignmentType FeatureCoordinates CompositeCoordinates Gene GeneLocus FeatureAliases FeatureInfo > 1 CPA1|1357 gene calculated genomic GRCh37 genome NCBI GRCh37 pairwise 1-137, 138-219,220-453,454-555,556-657,658-1064,1065-1155,1156-1355,1356-1440 ,1441-1724 chr7:130020290-130020426,130020939-130021020,130021471-130 021704,130021949-130022050,130023232-130023333,130023525-130023931,130 024377-130024467,130024987-130025186,130025680-130025764,130027665-130 027948:+ CPA1|1357 chr7:130020290-130027948:+ > 2 GUCY2D|3000 gene calculated genomic GRCh37 genome NCBI GRCh37 pairwise 1-65,6 6-795,796-1100,1101-1452,1453-1537,1538-1640,1641-1742,1743-1823,1824- 2030,2031-2187,2188-2337,2338-2486,2487-2650,2651-2843,2844-3018,3019- 3117,3118-3212,3213-3298,3299-3410,3411-3623 chr17:7905988-7906052,79 06357-7907086,7907170-7907474,7909681-7910032,7910378-7910462,7910744- 7910846,7911249-7911350,7912824-7912904,7915462-7915668,7915768-791592 4,7916421-7916570,7917198-7917346,7917919-7918082,7918177-7918369,7918 646-7918820,7919061-7919159,7919245-7919339,7919523-7919608,7919761-79 19872,7923446-7923658:+ > > It is a drag to deal with this so I am going to cc: Marc Carlson on > this with a suggestion that we (possibly me, but better someone at UNC > or UBC) upload FDb.* packages with the GAF ranges and their metadata > contained, as annotations for the release cycle. I won't have time > until this weekend to look. > > Best, > > --t > > > On Tue, Sep 11, 2012 at 1:24 PM, James W. MacDonald <jmacdon@uw.edu> wrote: > > > > Hi Ying, > > > > > > On 9/11/2012 4:06 PM, ying chen wrote: > >> > >> Hi, > >> > >> I read the vignettes and tried the code Jim suggested, but got an error message I could not figure out why. > >> > >> Here is what I did: > >> > >> > library(GenomicFeatures) > >> > library(GenomicRanges) > >> > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > >> > TCGA_CRC <- > >> read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",string sAsFactors=FALSE) > >> > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = > >> c("exon_id","gene_id","tx_id")) > >> > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom, > >> ranges=IRanges(start=TCGA_CRC$Start,end=TCGA_CRC$End),strand= TCGA_CRC$Strand,name=TCGA_CRC$ExonID) > >> > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC, > >> ex),] > >> Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] : > >> selecting rows: subscript contains NAs or out of bounds indices > >> > > > > You will get that error if there are ranges in TCGA_CRC that are not in ex. So the assumption here is that the exons from your RNA-seq experiment are exons that match up with what is in the knownGene table from UCSC. > > > > The fact that you have counts from exons that don't align to the expected ranges from the version of the knownGene table you are using is problematic, and implies that something is amiss. A first check is to ensure that your RNA-seq data (and counts) refer to the hg19 build. > > > > You can find out which ranges are different between the two using the %in% operator: > > > > x <- which(!TCGA_CRC %in% ex) > > > > Which will at least tell you how far off you are. Hypothetically, if there are only a few mismatching ranges, you could subset the TCGA_CRC GRanges object, but personally I would want to know why things aren't agreeing. > > > > Alternatively you could use the HTSeq python scripts to get the counts, in which case you won't need to bother with all of this, or you could use the summarizeOverlaps() function in GenomicRanges to do the counting, which will again ensure that the counts you have align to the exons from the knownGene table. > > > > Best, > > > > Jim > > > > > > > > > > > >> > > >> > >> I cannot tell what is subscript it mentioned. > >> > >> Any suggestion? > >> > >> Thanks a lot fo the help! > >> > >> Ying > >> > >> The following are the some info: > >> > >> > sessionInfo() > >> R version 2.15.1 (2012-06-22) > >> Platform: x86_64-pc-mingw32/x64 (64-bit) > >> > >> locale: > >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 > >> [4] LC_NUMERIC=C LC_TIME=English_United States.1252 > >> > >> attached base packages: > >> [1] stats graphics grDevices utils datasets methods base > >> > >> other attached packages: > >> [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 GenomicFeatures_1.8.3 AnnotationDbi_1.18.3 > >> [4] Biobase_2.16.0 GenomicRanges_1.8.13 IRanges_1.14.4 > >> [7] BiocGenerics_0.2.0 > >> > >> loaded via a namespace (and not attached): > >> [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6 > >> [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1 tools_2.15.1 XML_3.9-4.1 zlibbioc_1.2.0 > >> > > >> > >> > TCGA_CRC > >> GRanges with 239886 ranges and 1 elementMetadata col: > >> seqnames ranges strand | name > >> <rle> <iranges> <rle> | <character> > >> [1] chr10 [100003848, 100004653] + | chr10:100003848-100004653:+ > >> [2] chr10 [100007443, 100008748] - | chr10:100008748-100007443:- > >> [3] chr10 [100010822, 100010933] - | chr10:100010933-100010822:- > >> [4] chr10 [100011323, 100011459] - | chr10:100011459-100011323:- > >> [5] chr10 [100012110, 100012225] - | chr10:100012225-100012110:- > >> [6] chr10 [100013310, 100013553] - | chr10:100013553-100013310:- > >> [7] chr10 [100015334, 100015496] - | chr10:100015496-100015334:- > >> [8] chr10 [100016537, 100016704] - | chr10:100016704-100016537:- > >> [9] chr10 [100017407, 100017561] - | chr10:100017561-100017407:- > >> ... ... ... ... ... ... > >> [239878] chrY [9611654, 9611928] - | chrY:9611928-9611654:- > >> [239879] chrY [9638762, 9638916] + | chrY:9638762-9638916:+ > >> [239880] chrY [9642383, 9642494] + | chrY:9642383-9642494:+ > >> [239881] chrY [9646920, 9646994] + | chrY:9646920-9646994:+ > >> [239882] chrY [9647680, 9647718] + | chrY:9647680-9647718:+ > >> [239883] chrY [9650809, 9650854] + | chrY:9650809-9650854:+ > >> [239884] chrY [9748407, 9748463] + | chrY:9748407-9748463:+ > >> [239885] chrY [9748577, 9748722] + | chrY:9748577-9748722:+ > >> [239886] chrY [9749263, 9749571] + | chrY:9749263-9749571:+ > >> --- > >> seqlengths: > >> chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY > >> NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > >> > > >> > ex > >> GRanges with 286852 ranges and 3 elementMetadata cols: > >> seqnames ranges strand | exon_id gene_id tx_id > >> <rle> <iranges> <rle> | <integer> <compressedcharacterlist> <compressedintegerlist> > >> [1] chr1 [ 11874, 12227] + | 1 NA 1,2,3 > >> [2] chr1 [ 12595, 12721] + | 5 NA 3 > >> [3] chr1 [ 12613, 12721] + | 2 NA 1 > >> [4] chr1 [ 12646, 12697] + | 4 NA 2 > >> [5] chr1 [ 13221, 14409] + | 3 NA 1,2 > >> [6] chr1 [ 13403, 14409] + | 6 NA 3 > >> [7] chr1 [ 69091, 70008] + | 37 79501 25 > >> [8] chr1 [321084, 321115] + | 44 NA 28 > >> [9] chr1 [321146, 321207] + | 45 NA 29 > >> ... ... ... ... ... ... ... ... > >> [286844] chrY [27603458, 27603499] - | 142741 NA 40070 > >> [286845] chrY [27604352, 27604379] - | 142742 NA 40071 > >> [286846] chrY [27605645, 27605678] - | 142743 NA 40072 > >> [286847] chrY [27606394, 27606421] - | 142744 NA 40073 > >> [286848] chrY [27607404, 27607432] - | 142745 NA 40074 > >> [286849] chrY [27635919, 27635954] - | 142750 NA 40078 > >> [286850] chrY [59358329, 59359508] - | 142802 NA 40099 > >> [286851] chrY [59360007, 59360115] - | 142803 NA 40099 > >> [286852] chrY [59360501, 59360854] - | 142804 NA 40099 > >> --- > >> seqlengths: > >> chr1 chr2 chr3 ... chrM chrUn_gl000226 chr18_gl000207_random > >> 249250621 243199373 198022430 ... 16571 15008 4262 > >> > > >> > >> > Date: Mon, 10 Sep 2012 17:29:24 -0400 > >> > From: jmacdon@uw.edu > >> > To: ying_chen@live.com > >> > CC: bioconductor@r-project.org > >> > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon coordinates? > >> > > >> > Hi Ying, > >> > > >> > On 9/10/2012 4:54 PM, ying chen wrote: > >> > > > >> > > > >> > > Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help! > >> > > >> > You don't give much to go on. Assuming you are working with a common > >> > species, it is simple. Let's assume you are working with mice. > >> > > >> > Something like this should work: > >> > > >> > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE) > >> > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > >> > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns = > >> > c("exon_id","gene_id")) > >> > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start, > >> > end=yourdata$end)) > >> > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),] > >> > > >> > If you are planning on doing this sort of stuff, do yourself a favor and > >> > read the GenomicFeatures and GenomicRanges vignettes. They are chock > >> > full of info that you will need. > >> > > >> > Best, > >> > > >> > Jim > >> > > >> > > >> > > >> > > Ying > >> > > [[alternative HTML version deleted]] > >> > > > >> > > _______________________________________________ > >> > > Bioconductor mailing list > >> > > Bioconductor@r-project.org > >> > > https://stat.ethz.ch/mailman/listinfo/bioconductor > >> > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > >> > > >> > -- > >> > James W. MacDonald, M.S. > >> > Biostatistician > >> > University of Washington > >> > Environmental and Occupational Health Sciences > >> > 4225 Roosevelt Way NE, # 100 > >> > Seattle WA 98105-6099 > >> > > > > > > > -- > > James W. MacDonald, M.S. > > Biostatistician > > University of Washington > > Environmental and Occupational Health Sciences > > 4225 Roosevelt Way NE, # 100 > > Seattle WA 98105-6099 > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > > -- > A model is a lie that helps you see the truth. > > Howard Skipper [[alternative HTML version deleted]]

ADD REPLY • link 11.6 years ago ying chen ▴ 340

0

Entering edit mode

Hi guys, I am sorry for the weird format of my earlier email. Here I try to format it one more time and hope it comes out OK. I read the vignettes and tried the code Jim suggested, but got an error message I could not figure out why. Here is what I did: > library(GenomicFeatures) > library(GenomicRanges) > library(TxDb.Hsapiens.UCSC.hg19.knownGene) > TCGA_CRC <- read.delim("TCGA_CRC_UNC_RNASeq_ExonID_detail_new.txt",s tringsAsFactors=FALSE) > ex <- exons(TxDb.Hsapiens.UCSC.hg19.knownGene, columns = c("exon_id","gene_id","tx_id")) > > TCGA_CRC <- GRanges(seqnames=TCGA_CRC$Chrom, ranges=IRanges(start=TC GA_CRC$Start,end=TCGA_CRC$End),strand=TCGA_CRC$Strand,name=TCGA_CRC$Ex onID) > elementMetadata(TCGA_CRC) <- elementMetadata(ex)[match(TCGA_CRC, ex),] Error in elementMetadata(ex)[match(TCGA_CRC, ex), ] : selecting rows: subscript contains NAs or out of bounds indices I cannot tell what is subscript it mentioned. Any suggestion? Thanks a lot fo the help! Ying The following are some info: > sessionInfo() R version 2.15.1 (2012-06-22) Platform: x86_64-pc-mingw32/x64 (64-bit)locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1 GenomicFeatures_1.8.3 AnnotationDbi_1.18.3 [4] Biobase_2.16.0 GenomicRanges_1.8.13 IRanges_1.14.4 [7] BiocGenerics_0.2.0 loaded via a namespace (and not attached): [1] biomaRt_2.12.0 Biostrings_2.24.1 bitops_1.0-4.1 BSgenome_1.24.0 DBI_0.2-5 RCurl_1.91-1.1 Rsamtools_1.8.6 [8] RSQLite_0.11.1 rtracklayer_1.16.3 stats4_2.15.1 tools_2.15.1 XML_3.9-4.1 zlibbioc_1.2.0 > > TCGA_CRC GRanges with 239886 ranges and 1 elementMetadata col: seqnames ranges strand | name <rle> <iranges> <rle> | <character> [1] chr10 [100003848, 100004653] + | chr10:100003848-100004653:+ [2] chr10 [100007443, 100008748] - | chr10:100008748-100007443:- [3] chr10 [100010822, 100010933] - | chr10:100010933-100010822:- [4] chr10 [100011323, 100011459] - | chr10:100011459-100011323:- [5] chr10 [100012110, 100012225] - | chr10:100012225-100012110:- [6] chr10 [100013310, 100013553] - | chr10:100013553-100013310:- [7] chr10 [100015334, 100015496] - | chr10:100015496-100015334:- [8] chr10 [100016537, 100016704] - | chr10:100016704-100016537:- [9] chr10 [100017407, 100017561] - | chr10:100017561-100017407:- ... ... ... ... ... ... [239878] chrY [9611654, 9611928] - | chrY:9611928-9611654:- [239879] chrY [9638762, 9638916] + | chrY:9638762-9638916:+ [239880] chrY [9642383, 9642494] + | chrY:9642383-9642494:+ [239881] chrY [9646920, 9646994] + | chrY:9646920-9646994:+ [239882] chrY [9647680, 9647718] + | chrY:9647680-9647718:+ [239883] chrY [9650809, 9650854] + | chrY:9650809-9650854:+ [239884] chrY [9748407, 9748463] + | chrY:9748407-9748463:+ [239885] chrY [9748577, 9748722] + | chrY:9748577-9748722:+ [239886] chrY [9749263, 9749571] + | chrY:9749263-9749571:+ --- seqlengths: chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr2 chr20 chr21 chr22 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chrM chrX chrY NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > > ex GRanges with 286852 ranges and 3 elementMetadata cols: seqnames ranges strand | exon_id gene_id tx_id <rle> <iranges> <rle> | <integer> <compressedcharacterlist> <compressedintegerlist> [1] chr1 [ 11874, 12227] + | 1 NA 1,2,3 [2] chr1 [ 12595, 12721] + | 5 NA 3 [3] chr1 [ 12613, 12721] + | 2 NA 1 [4] chr1 [ 12646, 12697] + | 4 NA 2 [5] chr1 [ 13221, 14409] + | 3 NA 1,2 [6] chr1 [ 13403, 14409] + | 6 NA 3 [7] chr1 [ 69091, 70008] + | 37 79501 25 [8] chr1 [321084, 321115] + | 44 NA 28 [9] chr1 [321146, 321207] + | 45 NA 29 ... ... ... ... ... ... ... ... [286844] chrY [27603458, 27603499] - | 142741 NA 40070 [286845] chrY [27604352, 27604379] - | 142742 NA 40071 [286846] chrY [27605645, 27605678] - | 142743 NA 40072 [286847] chrY [27606394, 27606421] - | 142744 NA 40073 [286848] chrY [27607404, 27607432] - | 142745 NA 40074 [286849] chrY [27635919, 27635954] - | 142750 NA 40078 [286850] chrY [59358329, 59359508] - | 142802 NA 40099 [286851] chrY [59360007, 59360115] - | 142803 NA 40099 [286852] chrY [59360501, 59360854] - | 142804 NA 40099 --- seqlengths: chr1 chr2 chr3 ... chrM chrUn_gl000226 chr18_gl000207_random 249250621 243199373 198022430 ... 16571 15008 4262 > Date: Mon, 10 Sep 2012 17:29:24 -0400 > From: jmacdon@uw.edu > To: ying_chen@live.com > CC: bioconductor@r-project.org > Subject: Re: [BioC] How to retrieve exon ID and gene ID from exon coordinates? > > Hi Ying, > > On 9/10/2012 4:54 PM, ying chen wrote: > > > > > > Hi guys, I have a RNASeq data table which has exon cooridinates (chrom, start. end) and raw count. I want to use DEXseq to see differential transcripts. To do it I need to get geneIDs and exonIDs from corresponding exon cooridinates. Any suggestion how to do it? Thanks a lot for the help! > > You don't give much to go on. Assuming you are working with a common > species, it is simple. Let's assume you are working with mice. > > Something like this should work: > > yourdata <- read.table("yourdata.txt", stringsAsFactors=FALSE) > library(TxDb.Mmusculus.UCSC.mm9.knownGene) > ex <- exons(TxDb.Mmusculus.UCSC.mm9.knownGene, columns = > c("exon_id","gene_id")) > yourdata <- GRanges(yourdata$chrom, IRanges(start=yourdata$start, > end=yourdata$end)) > elementMetadata(yourdata) <- elementMetadata(ex)[match(yourdata, ex),] > > If you are planning on doing this sort of stuff, do yourself a favor and > read the GenomicFeatures and GenomicRanges vignettes. They are chock > full of info that you will need. > > Best, > > Jim > > > > > Ying > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > Bioconductor mailing list > > Bioconductor@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > -- > James W. MacDonald, M.S. > Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > [[alternative HTML version deleted]]

ADD REPLY • link 11.6 years ago ying chen ▴ 340

Login before adding your answer.