Question

R for normalizing counts by gene length in next-generation sequencing data

Andrew Wang ▴ 20

@andrew-wang-4438

Hello everyone,

I am wondering how to use R packages to generate a count table with samples as columns and tags as rows. In addition, how can I normalize the counts to the length of each gene? That is, all gene counts should be normalized from 0 to 1 along the gene length, and then I would like to draw a distribution of the counts. Finally, how do I access the R objects that store these data and manipulate them using R commands or scripts?

Thanks.

Best wishes,
Andrew


Updated 13.5 years ago by Steve Lianoglou ★ 13k • written 13.5 years ago by Andrew Wang ▴ 20

score 0 · Answer 1 · 2011-04-22

Hi,

On Fri, Apr 22, 2011 at 7:49 PM, Andrew Wang <andrew.wang.2010.2011 at gmail.com> wrote:
> Hello, everyone
>
> I am wondering how to use R packages to generate a count table with
> samples as columns and tags as rows. In addition, how to normalize
> the counts to the length of each gene. That is, all gene counts
> should be normalized from 0 to 1 in gene length and then draw a
> distribution of counts. Finally, how to access these R objects that
> store these data and to manipulate them using R commands/scripts. Thanks.

You will want to get very comfortable with the following packages:

* IRanges and GenomicRanges
  Use the data structures in these packages (IRanges or GRanges) to store and manipulate your reads.

* GenomicFeatures
  Provides functionality to access gene/transcript info from different annotation sources (RefSeq, UCSC, etc.) and exposes it as GRanges objects. This makes it easy to quantify which reads overlap which genes, exons, etc., assuming you are storing your reads in IRanges/GRanges objects (use GRanges).

* Maybe Rsamtools, to query your BAM files and load them into appropriate data structures.

Read through the vignettes in these packages. You will be able to do all the things you are asking for once you get comfortable with the packages above.

Also:

* The Biostrings and BSgenome.* packages will be your friends.

Read through this stuff, too:
http://www.bioconductor.org/help/workflows/high-throughput-sequencing/

Tutorial/course material here:
http://www.bioconductor.org/help/course-materials/2010/EMBL2010/

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
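To make the workflow described in this answer a bit more concrete, here is a minimal sketch of how the pieces might fit together with GenomicRanges. Everything in it is an assumption made for illustration: the gene coordinates, gene IDs, sample names, and reads are toy data, not anything from the original question. In a real analysis the gene models would come from an annotation source via GenomicFeatures and the reads from BAM files (for example via Rsamtools). The last part shows one possible reading of "normalized from 0 to 1 in gene length", namely rescaling each read's position to a fraction of the gene it falls in; the asker may have meant something else.

```r
library(GenomicRanges)

# Toy gene models (hypothetical coordinates and IDs).
genes <- GRanges(
  seqnames = c("chr1", "chr1"),
  ranges   = IRanges(start = c(100, 1000), end = c(599, 2999)),
  strand   = c("+", "-")
)
names(genes) <- c("geneA", "geneB")

# Toy reads for two hypothetical samples, stored as GRanges (36 bp each).
samples <- list(
  sample1 = GRanges("chr1", IRanges(start = c(120, 300, 1500), width = 36)),
  sample2 = GRanges("chr1", IRanges(start = c(150, 1200, 1800, 2500), width = 36))
)

# Count table: genes (tags) as rows, samples as columns.
counts <- sapply(samples, function(reads) countOverlaps(genes, reads))
rownames(counts) <- names(genes)

# Normalize counts by gene length (reads per kilobase of gene).
# The length vector is recycled down the rows, so each row is divided
# by the length of its own gene.
gene_len_kb   <- width(genes) / 1000
counts_per_kb <- counts / gene_len_kb

# One interpretation of "0 to 1 in gene length": the relative position
# of each read start within its gene, measured from the gene's 5' end.
relative_positions <- function(reads, genes) {
  hits <- findOverlaps(reads, genes)
  pos  <- start(reads)[queryHits(hits)]
  g    <- genes[subjectHits(hits)]
  rel  <- (pos - start(g)) / width(g)
  minus <- as.vector(strand(g) == "-")
  rel[minus] <- 1 - rel[minus]      # flip genes on the minus strand
  pmin(pmax(rel, 0), 1)             # clamp reads hanging over a gene edge
}

rel_pos <- unlist(lapply(samples, relative_positions, genes = genes))

# Distribution of read positions along the rescaled gene body.
hist(rel_pos, breaks = seq(0, 1, by = 0.1),
     xlab = "Relative position in gene (0 = 5' end, 1 = 3' end)",
     main = "Read distribution along genes")
```

`counts` and `counts_per_kb` are ordinary R matrices, so they can be indexed and manipulated with the usual commands, e.g. `counts["geneA", ]` or `colSums(counts)`.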