Question

filter low expression tags

0

Entering edit mode

Vittoria Roncalli ▴ 40

@vittoria-roncalli-5633

Last seen 9.6 years ago

Hi, Thanks a lot for your suggestions, they were helpful. In order to better understand how the filter really works, I considered only a subset of my dataset, consisting in 30 tags. I have 9 column in total, consisting of 3 different samples(C-L-T) made by 3 replicates each(1-3). So, I used the filter command : > keep <- rowSums (cpm(d)>5) >=3 > d <- d[keep,] Then I looked at the output of `cpm(d) > 5, and at the list of the filtered libraries. Where I get confused is how the 5cpm are calculated. If I am selecting the counts that in the single row (gene) are at least grater then 5cpm in the 3 replicates of 1 sample, how is possible that the filter keep a gene that have 0 counts in all the 3 replicates of the sample? Are the 5cpm calculated for each gene as an average of all the 9 samples in my case? How the 5cpm value is calculated? Is it dependent on the library size of each sample? How can I get the minimum value considered for the filter? gene 1 C1 C2 C3 L1 L2 L3 H1 H2 H3 0 0 0 1 0 0 0 1 2 output cpm>5 F F F T F F F T T I apologize for the silly questions, but I am really new in this field, and I am trying to understand the tool before using it. Thanks in advance Vittoria On Wed, Nov 28, 2012 at 5:54 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Wed, Nov 28, 2012 at 10:14 PM, Vittoria Roncalli <roncalli@hawaii.edu> > wrote: > > Hi, > > > > I would like to understand how the filter of low expression tags works. > If > > I run the command > > > >>keep <- rowSums (cpm(d)>100) >=2 > > d <- d[keep,] > > dim(d) > > > > as in the use guide page 32, this means that I am using a cutoff of > 100cpm, > > but how are treated the 2 samples? Did are they averaged and then the low > > tags are removed? > > Is each sample considered separate and filtered by itself? > > Thanks foe the help in advance > > How many samples (columns) do you have? > > You should first look at the output of `cpm(d) > 100` to see what you > are getting -- this will be a logical (boolean) matrix that has the > same dimensionality as `cpm(d)`. > > rowSums( a logical matrix ) > > returns a vector that is as long as there are rows in the logical > matrix, and each value indicates how many columns are TRUE in that > row. > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Vittoria Roncalli Graduate Research Assistant Center Békésy Laboratory of Neurobiology Pacific Biosciences Research Center University of Hawaii at Manoa 1993 East-West Road Honolulu, HI 96822 USA Tel: 808-4695693 [[alternative HTML version deleted]]

Cancer Cancer • 1.0k views

ADD COMMENT • link updated 11.4 years ago by Steve Lianoglou ★ 13k • written 11.4 years ago by Vittoria Roncalli ▴ 40

score 0 · Answer 1 · 2012-12-04

Hi Vitorria, I'm finding it a bit hard to follow what you're doing, largely because much of the output you present is being summarized by you, but let me try: On Tue, Dec 4, 2012 at 3:59 PM, Vittoria Roncalli <roncalli at="" hawaii.edu=""> wrote: > Hi, > Thanks a lot for your suggestions, they were helpful. > > In order to better understand how the filter really works, I considered only > a subset of my dataset, consisting in 30 tags. Nice, this is a good idea to start with. > I have 9 column in total, consisting of 3 different samples(C-L-T) made by > 3 replicates each(1-3). > So, I used the filter command : > >> keep <- rowSums (cpm(d)>5) >=3 >> d <- d[keep,] > > Then I looked at the output of `cpm(d) > 5, and at the list of the filtered > libraries. > Where I get confused is how the 5cpm are calculated. If I am selecting the > counts that in the single row (gene) are at least grater then 5cpm in the 3 > replicates of 1 sample, how is possible that the filter keep a gene that > have 0 counts in all the 3 replicates of the sample? > Are the 5cpm calculated for each gene as an average of all the 9 samples in > my case? No, `cpm` give you the "counts per million" of each "gene" in each sample. > How the 5cpm value is calculated? Is it dependent on the library size of > each sample? You can see for yourself by: R> library(edgeR) R> cpm function (x, normalized.lib.sizes = FALSE) { if (is(x, "DGEList")) { if (normalized.lib.sizes) lib.size <- x$samples$lib.size * x$samples$norm.factors else lib.size <- x$samples$lib.size x <- x$counts } else { if (normalized.lib.sizes) warning("Matrix of counts supplied, so normalized library sizes are not known. Library sizes are column sums of the count matrix.\n") lib.size <- colSums(x) } cpm <- 1e+06 * t(t(x)/lib.size) cpm } It's (1e6 / library size) * raw.counts > How can I get the minimum value considered for the filter? This is what I mean about your output being aggressively summarized by you -- I'm not sure what these things are below: > gene 1 C1 C2 C3 L1 L2 L3 H1 H2 H3 > 0 0 0 1 0 0 0 1 What are you showing us here? Are these the raw counts for gene one in each sample? > output cpm>5 F F F T F F F T T And this is the related cpm for gene1? This suggests that you have very low read counts per sample, if 1 read is > 5 cpm. Am I getting this right? Ignoring your presumably low read count problem, the call to: keep <- rowSums (cpm(d)>5) >=3 is doing what it's supposed to. There are three samples for this gene that have a cpm > 5 (`T` values, as you write), so you are keeping the rows. This "filter" doesn't consider what sample/condition the cpm's pass your threshold of 5, it's just asking if there are any three samples that have a cpm >= 5, and (if I am interpreting your output correctly) this example does. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact