read coverage per transcript for RNASeq
4
0
Entering edit mode
@jessica-hunter-3643
Last seen 9.6 years ago
Hello all. I have short read sequences for RNASeq that I have then aligned to the reference transcript sequences. The reads are from a Solexa platform and they were aligned using MAQ. I have imported the data using ShortReads (readAligned) and all looks good. However, I am interested in limiting the data to those transcripts with at least 85% coverage. I can’t find any documentation to do this that doesn’t involved looking at each transcript individually. Any way to do this for all transcripts (30,000+)? Also, while I am writing, I was interested in only looking at reads above a certain quality score. When I input the data with readAligned, I couldn’t figure out how to filter on this variable, so I ended up reading the data just into R, filtering that way, and then reading the data into ShortRead. Is there a better way to do this? All the transferring of the data back and forth took a lot of time to process. Lastly, any good online resources for RNASeq applications with Bioconductor that people can recommend? Thank you, Jessica [[alternative HTML version deleted]]
RNASeq Coverage PROcess ShortRead RNASeq Coverage PROcess ShortRead • 1.9k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 6 weeks ago
United States
Hi Jessica -- mixing up the order a bit... > Also, while I am writing, I was interested in only looking at reads > above a certain quality score. When I input the data with > readAligned, I couldn???t figure out how to filter on this variable, > so I ended up reading the data just into R, filtering that way, and > then reading the data into ShortRead. Is there a better way to do > this? All the transferring of the data back and forth took a lot of > time to process. readAligned takes a filter argument, see ?srFilter (maybe alignQualityFilter ? 'quality' is a bit ambiguous) and the final example on ?readAligned. Another route is along the lines of aln[ quality(alignQuality(aln)) > certainQualityScore ] > I have short read sequences for RNASeq that I have then aligned to > the reference transcript sequences. The reads are from a Solexa > platform and they were aligned using MAQ. I have imported the data > using ShortReads (readAligned) and all looks good. However, I am > interested in limiting the data to those transcripts with at least > 85% coverage. I can???t find any documentation to do this that > doesn???t involved looking at each transcript individually. Any way > to do this for all transcripts (30,000+)? Hmm. You say they're aligned to the reference transcript sequences, so it sounds like chromosome(aln) gives you the transcript each is aligned to and cvg <- coverage(aln) gives you a list of 'run-length-encoded' coverage vectors, one for each transcript. I'm not sure what '85% coverage' means, but maybe you're after something like sum(cvg) > .85 * sapply(cvg, length) A little more detail might lead to a clearer solution. The bioc-sig-sequencing news groups https://stat.ethz.ch/pipermail/bioc-sig-sequencing/ has several posts recently that might be helpful. Hope that helps, Martin "Jessica Hunter" <hunter at="" ohsu.edu=""> writes: > Hello all. > > > > I have short read sequences for RNASeq that I have then aligned to > the reference transcript sequences. The reads are from a Solexa > platform and they were aligned using MAQ. I have imported the data > using ShortReads (readAligned) and all looks good. However, I am > interested in limiting the data to those transcripts with at least > 85% coverage. I can???t find any documentation to do this that > doesn???t involved looking at each transcript individually. Any way > to do this for all transcripts (30,000+)? > > > > > > > Lastly, any good online resources for RNASeq applications with > Bioconductor that people can recommend? > > > > Thank you, > > > > Jessica > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
Thanks for your help. The command for filtering for quality worked perfectly. As for the coverage, by 85%, I mean I want to figure out which transcript sequences I used as a reference have at least 85% of their sequence aligning to one or more reads. And, yes, 'chromosome(aln)' gives me the list of transcripts that each read is aligned to. There are over 7 million reads and 'unique(chromosome(aln))' gives me the list of 8497 unique transcript names aligned the reads to. I tried, 'cvg <- coverage(aln)', but it didn't run, however 'cvg <- coverage(aln,width=NULL) did run. However, 'cvg' gives me: A GenomeData Instance Chromosomes(8497): chr1.....chr8497 I know there are 8497 unique reference transcripts I am aligning to, so this looks right, but there is actually no read data there, as far as number of reads aligned to the transcript or what the coverage is. If I look at one transcript using 'cvg[1]' I just get: A GenomeData Instance Chromosomes(1): chr1 Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run and I get the error message: cannot coerce type 'S4' to vector of type 'integer'. As a shot in the dark, I tried 'cvgnum <- as.integer(cvg)', but this gives me the same error message. Any ideas? Jessica -----Original Message----- From: Martin Morgan [mailto:mtmorgan@fhcrc.org] Sent: Tuesday, September 01, 2009 4:36 PM To: Jessica Hunter Cc: bioconductor at stat.math.ethz.ch Subject: Re: [BioC] read coverage per transcript for RNASeq Hi Jessica -- mixing up the order a bit... > Also, while I am writing, I was interested in only looking at reads > above a certain quality score. When I input the data with > readAligned, I couldn???t figure out how to filter on this variable, > so I ended up reading the data just into R, filtering that way, and > then reading the data into ShortRead. Is there a better way to do > this? All the transferring of the data back and forth took a lot of > time to process. readAligned takes a filter argument, see ?srFilter (maybe alignQualityFilter ? 'quality' is a bit ambiguous) and the final example on ?readAligned. Another route is along the lines of aln[ quality(alignQuality(aln)) > certainQualityScore ] > I have short read sequences for RNASeq that I have then aligned to > the reference transcript sequences. The reads are from a Solexa > platform and they were aligned using MAQ. I have imported the data > using ShortReads (readAligned) and all looks good. However, I am > interested in limiting the data to those transcripts with at least > 85% coverage. I can???t find any documentation to do this that > doesn???t involved looking at each transcript individually. Any way > to do this for all transcripts (30,000+)? Hmm. You say they're aligned to the reference transcript sequences, so it sounds like chromosome(aln) gives you the transcript each is aligned to and cvg <- coverage(aln) gives you a list of 'run-length-encoded' coverage vectors, one for each transcript. I'm not sure what '85% coverage' means, but maybe you're after something like sum(cvg) > .85 * sapply(cvg, length) A little more detail might lead to a clearer solution. The bioc-sig-sequencing news groups https://stat.ethz.ch/pipermail/bioc-sig-sequencing/ has several posts recently that might be helpful. Hope that helps, Martin "Jessica Hunter" <hunter at="" ohsu.edu=""> writes: > Hello all. > > > > I have short read sequences for RNASeq that I have then aligned to > the reference transcript sequences. The reads are from a Solexa > platform and they were aligned using MAQ. I have imported the data > using ShortReads (readAligned) and all looks good. However, I am > interested in limiting the data to those transcripts with at least > 85% coverage. I can???t find any documentation to do this that > doesn???t involved looking at each transcript individually. Any way > to do this for all transcripts (30,000+)? > > > > > > > Lastly, any good online resources for RNASeq applications with > Bioconductor that people can recommend? > > > > Thank you, > > > > Jessica > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD REPLY
0
Entering edit mode
@martin-morgan-1513
Last seen 6 weeks ago
United States
"Jessica Hunter" <hunter at="" ohsu.edu=""> writes: > Thanks for your help. The command for filtering for quality worked > perfectly. > > As for the coverage, by 85%, I mean I want to figure out which > transcript sequences I used as a reference have at least 85% of > their sequence aligning to one or more reads. And, yes, > 'chromosome(aln)' gives me the list of transcripts that each read is > aligned to. There are over 7 million reads and > 'unique(chromosome(aln))' gives me the list of 8497 unique > transcript names aligned the reads to. I tried, 'cvg <- > coverage(aln)', but it didn't run, however 'cvg <- it's not so helpful to say 'it didn't run'; better to cut-n-paste the command and error. Also it's useful to provide the output of sessionInfo() -- the next gen sequencing packages in particular are moving quickly, and there are fairly substantial differences between released and development versions, and between versions in the development branch. So I'm using (for this reply, anyway...) > sessionInfo() R version 2.9.2 Patched (2009-09-02 r49533) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 Biostrings_2.12.8 [5] IRanges_1.2.3 loaded via a namespace (and not attached): [1] Biobase_2.4.1 grid_2.9.2 hwriter_1.1 > coverage(aln,width=NULL) did run. However, 'cvg' gives me: > > A GenomeData Instance > Chromosomes(8497): chr1.....chr8497 I did > example(readAligned) > aln = readAligned(ap, "s_2_export.txt", "SolexaExport") > aln1 = aln[!is.na(position(aln))] > cvg = coverage(aln1) > I know there are 8497 unique reference transcripts I am aligning to, > so this looks right, but there is actually no read data there, as > far as number of reads aligned to the transcript or what the > coverage is. If I look at one transcript using 'cvg[1]' I just get: > > A GenomeData Instance > Chromosomes(1): chr1 yes, '[' is subsetting; more insight comes from "[[", which extracts > cvg[[1]] # or cvg[["chr10.fa"]] 'integer' Rle instance of length 121262769 with 47 runs Lengths: 35 15600363 35 1003552 35 10075118 35 4665882 35 26227977 ... Values : 1 0 1 0 1 0 1 0 1 0 ... The idea is that the coverage depth (i.e., 'Values') is '1' for a length (i.e., 'Lengths') of 35 nt, then the coverage depth is '0' for 15600363 nt, then... Likely your 'Rle' will look much different from this -- greater coverage values and shorter run lengths. Something like sum(cvg[[1]]) is a short way of doing runValue(cvg[[1]]) * runLength(cvg[[1]]), which is to say the total number of nucleotides covering cvg[[1]]. > As for the coverage, by 85%, I mean I want to figure out which > transcript sequences I used as a reference have at least 85% of > their sequence aligning to one or more reads. And, yes, Here I think you want to back up and arrange to call coverage in such a way that the elements of cvg span the length of the transcript (as written above, coverage(aln) just uses the limits of the range in which reads are aligned). The 'start' might be easy, if the transcripts all start at nt '1'. The 'end' needs to be the length of the transcript along the lines of end = c(chr10.fa=12345, chr11.fa=54321, ...) and then cvg = coverage(aln, start=1, end=end) > Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run > and I get the error message: cannot coerce type 'S4' to vector of > type 'integer'. As a shot in the dark, I tried 'cvgnum <- > as.integer(cvg)', but this gives me the same error message. Sorry I was assuming you were using a recent development version, where some of the kinks have been smoothed out. The task is to write a filter that satisfies your criterion, e.g., returning TRUE with... f = function(elt) { sum(runValue(elt) != 0) > 0.85 * length(elt) } You might then convert cvg to a list (of Rle's), use the Filter function to remove elements not satisfying the criterion, and then store it conveniently as a GenomeData object again cvgf = GenomeData(Filter(f, as.list(cvg))) cvgf now contains those transcripts satisfying f. Returning to aln1, you could select the reads aligning to these transcripts with aln2 = aln1[chromosome(aln1) %in% names(cvgf)] Martin > Any ideas? > > Jessica > > -----Original Message----- > From: Martin Morgan [mailto:mtmorgan at fhcrc.org] > Sent: Tuesday, September 01, 2009 4:36 PM > To: Jessica Hunter > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] read coverage per transcript for RNASeq > > Hi Jessica -- mixing up the order a bit... > >> Also, while I am writing, I was interested in only looking at reads >> above a certain quality score. When I input the data with >> readAligned, I couldn???t figure out how to filter on this variable, >> so I ended up reading the data just into R, filtering that way, and >> then reading the data into ShortRead. Is there a better way to do >> this? All the transferring of the data back and forth took a lot of >> time to process. > > readAligned takes a filter argument, see ?srFilter (maybe > alignQualityFilter ? 'quality' is a bit ambiguous) and the final > example on ?readAligned. Another route is along the lines of > > aln[ quality(alignQuality(aln)) > certainQualityScore ] > >> I have short read sequences for RNASeq that I have then aligned to >> the reference transcript sequences. The reads are from a Solexa >> platform and they were aligned using MAQ. I have imported the data >> using ShortReads (readAligned) and all looks good. However, I am >> interested in limiting the data to those transcripts with at least >> 85% coverage. I can???t find any documentation to do this that >> doesn???t involved looking at each transcript individually. Any way >> to do this for all transcripts (30,000+)? > > Hmm. You say they're aligned to the reference transcript sequences, so > it sounds like chromosome(aln) gives you the transcript each is > aligned to and > > cvg <- coverage(aln) > > gives you a list of 'run-length-encoded' coverage vectors, one for > each transcript. I'm not sure what '85% coverage' means, but maybe > you're after something like > > sum(cvg) > .85 * sapply(cvg, length) > > A little more detail might lead to a clearer solution. The > bioc-sig-sequencing news groups > > https://stat.ethz.ch/pipermail/bioc-sig-sequencing/ > > has several posts recently that might be helpful. > > Hope that helps, > > Martin > > > "Jessica Hunter" <hunter at="" ohsu.edu=""> writes: > >> Hello all. >> >> >> >> I have short read sequences for RNASeq that I have then aligned to >> the reference transcript sequences. The reads are from a Solexa >> platform and they were aligned using MAQ. I have imported the data >> using ShortReads (readAligned) and all looks good. However, I am >> interested in limiting the data to those transcripts with at least >> 85% coverage. I can???t find any documentation to do this that >> doesn???t involved looking at each transcript individually. Any way >> to do this for all transcripts (30,000+)? >> >> >> >> >> >> >> Lastly, any good online resources for RNASeq applications with >> Bioconductor that people can recommend? >> >> >> >> Thank you, >> >> >> >> Jessica >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
@jessica-hunter-3643
Last seen 9.6 years ago
My session info is: > sessionInfo() R version 2.9.1 (2009-06-26) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 Biostrings_2.12.8 [5] IRanges_1.2.3 loaded via a namespace (and not attached): [1] Biobase_2.4.1 grid_2.9.1 hwriter_1.1 and I get: > cvg <- coverage(qualhighalign,width=NULL) > cvg[[1]] 'integer' Rle instance of length 4697 with 161 runs Lengths: 50 210 50 31 20 30 20 28 4 45 ... Values : 1 0 1 0 1 2 1 0 1 2 ... I was going to go back and use your suggestion of > cvg = coverage(aln, start=1, end=end) using a with a list of lengths for all transcripts for 'end'. However, before I create this list for all 9700+ transcripts, I was under the impression that the version of IRAnges I am using (IRanges newer than 1.1.58) had moved on to shift/width rather than start/end which was why I used 'width=NULL" above. Let me know if this is correct of if I should go forward with start/end. Thanks! Jessica Jessica Ezzell Hunter, MS PhD Postdoctoral Fellow Department of Behavioral Neuroscience Oregon Health & Science University 3181 SW Sam Jackson Park Road L470 Portland, OR 97239 (503) 220-8262 x51988 -----Original Message----- From: Martin Morgan [mailto:mtmorgan@fhcrc.org] Sent: Wed 9/2/2009 11:09 AM To: Jessica Hunter Cc: bioconductor@stat.math.ethz.ch Subject: Re: [BioC] read coverage per transcript for RNASeq "Jessica Hunter" <hunter@ohsu.edu> writes: > Thanks for your help. The command for filtering for quality worked > perfectly. > > As for the coverage, by 85%, I mean I want to figure out which > transcript sequences I used as a reference have at least 85% of > their sequence aligning to one or more reads. And, yes, > 'chromosome(aln)' gives me the list of transcripts that each read is > aligned to. There are over 7 million reads and > 'unique(chromosome(aln))' gives me the list of 8497 unique > transcript names aligned the reads to. I tried, 'cvg <- > coverage(aln)', but it didn't run, however 'cvg <- it's not so helpful to say 'it didn't run'; better to cut-n-paste the command and error. Also it's useful to provide the output of sessionInfo() -- the next gen sequencing packages in particular are moving quickly, and there are fairly substantial differences between released and development versions, and between versions in the development branch. So I'm using (for this reply, anyway...) > sessionInfo() R version 2.9.2 Patched (2009-09-02 r49533) x86_64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US .UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_N AME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTI FICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] ShortRead_1.2.1 lattice_0.17-25 BSgenome_1.12.3 Biostrings_2.12.8 [5] IRanges_1.2.3 loaded via a namespace (and not attached): [1] Biobase_2.4.1 grid_2.9.2 hwriter_1.1 > coverage(aln,width=NULL) did run. However, 'cvg' gives me: > > A GenomeData Instance > Chromosomes(8497): chr1.....chr8497 I did > example(readAligned) > aln = readAligned(ap, "s_2_export.txt", "SolexaExport") > aln1 = aln[!is.na(position(aln))] > cvg = coverage(aln1) > I know there are 8497 unique reference transcripts I am aligning to, > so this looks right, but there is actually no read data there, as > far as number of reads aligned to the transcript or what the > coverage is. If I look at one transcript using 'cvg[1]' I just get: > > A GenomeData Instance > Chromosomes(1): chr1 yes, '[' is subsetting; more insight comes from "[[", which extracts > cvg[[1]] # or cvg[["chr10.fa"]] 'integer' Rle instance of length 121262769 with 47 runs Lengths: 35 15600363 35 1003552 35 10075118 35 4665882 35 26227977 ... Values : 1 0 1 0 1 0 1 0 1 0 ... The idea is that the coverage depth (i.e., 'Values') is '1' for a length (i.e., 'Lengths') of 35 nt, then the coverage depth is '0' for 15600363 nt, then... Likely your 'Rle' will look much different from this -- greater coverage values and shorter run lengths. Something like sum(cvg[[1]]) is a short way of doing runValue(cvg[[1]]) * runLength(cvg[[1]]), which is to say the total number of nucleotides covering cvg[[1]]. > As for the coverage, by 85%, I mean I want to figure out which > transcript sequences I used as a reference have at least 85% of > their sequence aligning to one or more reads. And, yes, Here I think you want to back up and arrange to call coverage in such a way that the elements of cvg span the length of the transcript (as written above, coverage(aln) just uses the limits of the range in which reads are aligned). The 'start' might be easy, if the transcripts all start at nt '1'. The 'end' needs to be the length of the transcript along the lines of end = c(chr10.fa=12345, chr11.fa=54321, ...) and then cvg = coverage(aln, start=1, end=end) > Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run > and I get the error message: cannot coerce type 'S4' to vector of > type 'integer'. As a shot in the dark, I tried 'cvgnum <- > as.integer(cvg)', but this gives me the same error message. Sorry I was assuming you were using a recent development version, where some of the kinks have been smoothed out. The task is to write a filter that satisfies your criterion, e.g., returning TRUE with... f = function(elt) { sum(runValue(elt) != 0) > 0.85 * length(elt) } You might then convert cvg to a list (of Rle's), use the Filter function to remove elements not satisfying the criterion, and then store it conveniently as a GenomeData object again cvgf = GenomeData(Filter(f, as.list(cvg))) cvgf now contains those transcripts satisfying f. Returning to aln1, you could select the reads aligning to these transcripts with aln2 = aln1[chromosome(aln1) %in% names(cvgf)] Martin > Any ideas? > > Jessica > > -----Original Message----- > From: Martin Morgan [mailto:mtmorgan@fhcrc.org] > Sent: Tuesday, September 01, 2009 4:36 PM > To: Jessica Hunter > Cc: bioconductor@stat.math.ethz.ch > Subject: Re: [BioC] read coverage per transcript for RNASeq > > Hi Jessica -- mixing up the order a bit... > >> Also, while I am writing, I was interested in only looking at reads >> above a certain quality score. When I input the data with >> readAligned, I couldnâ?Tt figure out how to filter on this variable, >> so I ended up reading the data just into R, filtering that way, and >> then reading the data into ShortRead. Is there a better way to do >> this? All the transferring of the data back and forth took a lot of >> time to process. > > readAligned takes a filter argument, see ?srFilter (maybe > alignQualityFilter ? 'quality' is a bit ambiguous) and the final > example on ?readAligned. Another route is along the lines of > > aln[ quality(alignQuality(aln)) > certainQualityScore ] > >> I have short read sequences for RNASeq that I have then aligned to >> the reference transcript sequences. The reads are from a Solexa >> platform and they were aligned using MAQ. I have imported the data >> using ShortReads (readAligned) and all looks good. However, I am >> interested in limiting the data to those transcripts with at least >> 85% coverage. I canâ?Tt find any documentation to do this that >> doesnâ?Tt involved looking at each transcript individually. Any way >> to do this for all transcripts (30,000+)? > > Hmm. You say they're aligned to the reference transcript sequences, so > it sounds like chromosome(aln) gives you the transcript each is > aligned to and > > cvg <- coverage(aln) > > gives you a list of 'run-length-encoded' coverage vectors, one for > each transcript. I'm not sure what '85% coverage' means, but maybe > you're after something like > > sum(cvg) > .85 * sapply(cvg, length) > > A little more detail might lead to a clearer solution. The > bioc-sig-sequencing news groups > > https://stat.ethz.ch/pipermail/bioc-sig-sequencing/ > > has several posts recently that might be helpful. > > Hope that helps, > > Martin > > > "Jessica Hunter" <hunter@ohsu.edu> writes: > >> Hello all. >> >> >> >> I have short read sequences for RNASeq that I have then aligned to >> the reference transcript sequences. The reads are from a Solexa >> platform and they were aligned using MAQ. I have imported the data >> using ShortReads (readAligned) and all looks good. However, I am >> interested in limiting the data to those transcripts with at least >> 85% coverage. I canâ?Tt find any documentation to do this that >> doesnâ?Tt involved looking at each transcript individually. Any way >> to do this for all transcripts (30,000+)? >> >> >> >> >> >> >> Lastly, any good online resources for RNASeq applications with >> Bioconductor that people can recommend? >> >> >> >> Thank you, >> >> >> >> Jessica >> >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 6 weeks ago
United States
Hi Jessica -- "Jessica Hunter" <hunter at="" ohsu.edu=""> writes: > My session info is: >> sessionInfo() > R version 2.9.1 (2009-06-26) > x86_64-unknown-linux-gnu > locale: > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_ US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC _NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDEN TIFICATION=C > attached base packages: > [1] stats???? graphics? grDevices utils???? datasets? methods?? base???? > other attached packages: > [1] ShortRead_1.2.1?? lattice_0.17-25?? BSgenome_1.12.3?? Biostrings_2.12.8 > [5] IRanges_1.2.3??? > loaded via a namespace (and not attached): > [1] Biobase_2.4.1 grid_2.9.1??? hwriter_1.1? > and I get: >> cvg <- coverage(qualhighalign,width=NULL) >> cvg[[1]] > ? 'integer' Rle instance of length 4697 with 161 runs > ? Lengths:? 50 210 50 31 20 30 20 28 4 45 ... > ? Values :? 1 0 1 0 1 2 1 0 1 2 ... > I was going to go back and use your suggestion of >> cvg = coverage(aln, start=1, end=end) > using a with a list of lengths for all transcripts for 'end'.? > However, before I create this list for all 9700+ transcripts, I was > under the impression that the version of IRAnges I am Hopefully this won't be too hard, reading the existing info from a text file with read.table (and friends), from bed/wig/gff etc (with rtracklayer::import) or even from fasta files (Biostrings::readFASTA) containing the transcripts followed by something like sapply(ff, function(x) nchar(x[["seq"]])) where ff is the return value of readFASTA. > using (IRanges newer than 1.1.58) had moved on to shift/width rather > than start/end which was why I used 'width=NULL" above.? Let me know > if this is correct of if I should go forward with start/end. I think in your case you want to specify width (which turns out to have values that are the same as end, when everything starts at 1), but to tell you the truth I'm not exactly sure and the best way to find out is to try it (it'll either work or fail in a not-subtle way). Hope that helps, Martin > Thanks! > Jessica > Jessica Ezzell Hunter, MS PhD > Postdoctoral Fellow > Department of Behavioral Neuroscience > Oregon Health & Science University > 3181 SW Sam Jackson Park Road > L470 > Portland, OR 97239 > (503) 220-8262 x51988 > -----Original Message----- > From: Martin Morgan [[[mailto:mtmorgan at fhcrc.org]]] > Sent: Wed 9/2/2009 11:09 AM > To: Jessica Hunter > Cc: bioconductor at stat.math.ethz.ch > Subject: Re: [BioC] read coverage per transcript for RNASeq > "Jessica Hunter" <hunter at="" ohsu.edu=""> writes: >> Thanks for your help.? The command for filtering for quality worked >> perfectly. >> >> As for the coverage, by 85%, I mean I want to figure out which >> transcript sequences I used as a reference have at least 85% of >> their sequence aligning to one or more reads.? And, yes, >> 'chromosome(aln)' gives me the list of transcripts that each read is >> aligned to.? There are over 7 million reads and >> 'unique(chromosome(aln))' gives me the list of 8497 unique >> transcript names aligned the reads to.? I tried, 'cvg <- >> coverage(aln)', but it didn't run, however 'cvg <- > it's not so helpful to say 'it didn't run'; better to cut-n-paste the > command and error. Also it's useful to provide the output of > sessionInfo() -- the next gen sequencing packages in particular are > moving quickly, and there are fairly substantial differences between > released and development versions, and between versions in the > development branch. So I'm using (for this reply, anyway...) >> sessionInfo() > R version 2.9.2 Patched (2009-09-02 r49533) > x86_64-unknown-linux-gnu > locale: > LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_ US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC _NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDEN TIFICATION=C > attached base packages: > [1] stats???? graphics? grDevices utils???? datasets? methods?? base > other attached packages: > [1] ShortRead_1.2.1?? lattice_0.17-25?? BSgenome_1.12.3 > Biostrings_2.12.8 > [5] IRanges_1.2.3 > loaded via a namespace (and not attached): > [1] Biobase_2.4.1 grid_2.9.2??? hwriter_1.1 >> coverage(aln,width=NULL) did run.? However, 'cvg' gives me: >> >> A GenomeData Instance >> Chromosomes(8497):? chr1.....chr8497 > I did >> example(readAligned) >> aln = readAligned(ap, "s_2_export.txt", "SolexaExport") >> aln1 = aln[!is.na(position(aln))] >> cvg = coverage(aln1) >> I know there are 8497 unique reference transcripts I am aligning to, >> so this looks right, but there is actually no read data there, as >> far as number of reads aligned to the transcript or what the >> coverage is.? If I look at one transcript using 'cvg[1]' I just get: >> >> A GenomeData Instance >> Chromosomes(1): chr1 > yes, '[' is subsetting; more insight comes from "[[", which extracts >> cvg[[1]] # or cvg[["chr10.fa"]] > ?'integer' Rle instance of length 121262769 with 47 runs > ?Lengths:? 35 15600363 35 1003552 35 10075118 35 4665882 35 26227977 ... > ?Values :? 1 0 1 0 1 0 1 0 1 0 ... > ??? > The idea is that the coverage depth (i.e., 'Values') is '1' for a > length (i.e., 'Lengths') of 35 nt, then the coverage depth is '0' for > 15600363 nt, then... Likely your 'Rle' will look much different from > this -- greater coverage values and shorter run lengths. Something > like > ? sum(cvg[[1]]) > is a short way of doing runValue(cvg[[1]]) * runLength(cvg[[1]]), > which is to say the total number of nucleotides covering cvg[[1]]. >> As for the coverage, by 85%, I mean I want to figure out which >> transcript sequences I used as a reference have at least 85% of >> their sequence aligning to one or more reads.? And, yes, > Here I think you want to back up and arrange to call coverage in such > a way that the elements of cvg span the length of the transcript (as > written above, coverage(aln) just uses the limits of the range in > which reads are aligned). The 'start' might be easy, if the > transcripts all start at nt '1'. The 'end' needs to be the length of > the transcript along the lines of end = c(chr10.fa=12345, chr11.fa=54321, > ...) and then > ? cvg = coverage(aln, start=1, end=end) >> Needless to say, 'sum(cvg) > .85 * sapply(cvg, length)' doesn't run >> and I get the error message: cannot coerce type 'S4' to vector of >> type 'integer'.? As a shot in the dark, I tried 'cvgnum <- >> as.integer(cvg)', but this gives me the same error message. > Sorry I was assuming you were using a recent development version, > where some of the kinks have been smoothed out. > The task is to write a filter that satisfies your criterion, e.g., > returning TRUE with... > ?? f = function(elt) { > ?????? sum(runValue(elt) != 0) > 0.85 * length(elt) > ?? } > You might then convert cvg to a list (of Rle's), use the Filter > function to remove elements not satisfying the criterion, and then > store it conveniently as a GenomeData object again > ? cvgf = GenomeData(Filter(f, as.list(cvg))) > cvgf now contains those transcripts satisfying f. Returning to aln1, > you could select the reads aligning to these transcripts with > ? aln2 = aln1[chromosome(aln1) %in% names(cvgf)] > Martin >> Any ideas? >> >> Jessica? >> >> -----Original Message----- >> From: Martin Morgan [[[mailto:mtmorgan at fhcrc.org]]] >> Sent: Tuesday, September 01, 2009 4:36 PM >> To: Jessica Hunter >> Cc: bioconductor at stat.math.ethz.ch >> Subject: Re: [BioC] read coverage per transcript for RNASeq >> >> Hi Jessica -- mixing up the order a bit... >> >>> Also, while I am writing, I was interested in only looking at reads >>> above a certain quality score.? When I input the data with >>> readAligned, I couldn??Tt figure out how to filter on this variable, >>> so I ended up reading the data just into R, filtering that way, and >>> then reading the data into ShortRead.? Is there a better way to do >>> this?? All the transferring of the data back and forth took a lot of >>> time to process. >> >> readAligned takes a filter argument, see ?srFilter (maybe >> alignQualityFilter ? 'quality' is a bit ambiguous) and the final >> example on ?readAligned. Another route is along the lines of >> >>?? aln[ quality(alignQuality(aln)) > certainQualityScore ] >> >>> I have short read sequences for RNASeq that I have then aligned to >>> the reference transcript sequences.? The reads are from a Solexa >>> platform and they were aligned using MAQ.? I have imported the data >>> using ShortReads (readAligned) and all looks good.? However, I am >>> interested in limiting the data to those transcripts with at least >>> 85% coverage.? I can??Tt find any documentation to do this that >>> doesn??Tt involved looking at each transcript individually.? Any way >>> to do this for all transcripts (30,000+)? >> >> Hmm. You say they're aligned to the reference transcript sequences, so >> it sounds like chromosome(aln) gives you the transcript each is >> aligned to and >> >>?? cvg <- coverage(aln) >> >> gives you a list of 'run-length-encoded' coverage vectors, one for >> each transcript. I'm not sure what '85% coverage' means, but maybe >> you're after something like? >> >>?? sum(cvg) > .85 * sapply(cvg, length) >> >> A little more detail might lead to a clearer solution. The >> bioc-sig-sequencing news groups >> >>?? [[https://stat.ethz.ch/pipermail/bioc-sig-sequencing/]] >> >> has several posts recently that might be helpful. >> >> Hope that helps, >> >> Martin >> >> >> "Jessica Hunter" <hunter at="" ohsu.edu=""> writes: >> >>> Hello all. >>> >>>? >>> >>> I have short read sequences for RNASeq that I have then aligned to >>> the reference transcript sequences.? The reads are from a Solexa >>> platform and they were aligned using MAQ.? I have imported the data >>> using ShortReads (readAligned) and all looks good.? However, I am >>> interested in limiting the data to those transcripts with at least >>> 85% coverage.? I can??Tt find any documentation to do this that >>> doesn??Tt involved looking at each transcript individually.? Any way >>> to do this for all transcripts (30,000+)? >>> >>>? >>> >>> >>>? >>> >>> Lastly, any good online resources for RNASeq applications with >>> Bioconductor that people can recommend? >>> >>>? >>> >>> Thank you, >>> >>>? >>> >>> Jessica >>> >>> >>> ???? [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at stat.math.ethz.ch >>> [[https://stat.ethz.ch/mailman/listinfo/bioconductor]] >>> Search the archives: [[http://news.gmane.org/gmane.science.biology.informatics.conductor]] -- Martin Morgan Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 645 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6