filter low expression tags
1
0
Entering edit mode
@vittoria-roncalli-5633
Last seen 9.6 years ago
Hi, Thanks a lot for your suggestions, they were helpful. In order to better understand how the filter really works, I considered only a subset of my dataset, consisting in 30 tags. I have 9 column in total, consisting of 3 different samples(C-L-T) made by 3 replicates each(1-3). So, I used the filter command : > keep <- rowSums (cpm(d)>5) >=3 > d <- d[keep,] Then I looked at the output of `cpm(d) > 5, and at the list of the filtered libraries. Where I get confused is how the 5cpm are calculated. If I am selecting the counts that in the single row (gene) are at least grater then 5cpm in the 3 replicates of 1 sample, how is possible that the filter keep a gene that have 0 counts in all the 3 replicates of the sample? Are the 5cpm calculated for each gene as an average of all the 9 samples in my case? How the 5cpm value is calculated? Is it dependent on the library size of each sample? How can I get the minimum value considered for the filter? gene 1 C1 C2 C3 L1 L2 L3 H1 H2 H3 0 0 0 1 0 0 0 1 2 output cpm>5 F F F T F F F T T I apologize for the silly questions, but I am really new in this field, and I am trying to understand the tool before using it. Thanks in advance Vittoria On Wed, Nov 28, 2012 at 5:54 PM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi, > > On Wed, Nov 28, 2012 at 10:14 PM, Vittoria Roncalli <roncalli@hawaii.edu> > wrote: > > Hi, > > > > I would like to understand how the filter of low expression tags works. > If > > I run the command > > > >>keep <- rowSums (cpm(d)>100) >=2 > > d <- d[keep,] > > dim(d) > > > > as in the use guide page 32, this means that I am using a cutoff of > 100cpm, > > but how are treated the 2 samples? Did are they averaged and then the low > > tags are removed? > > Is each sample considered separate and filtered by itself? > > Thanks foe the help in advance > > How many samples (columns) do you have? > > You should first look at the output of `cpm(d) > 100` to see what you > are getting -- this will be a logical (boolean) matrix that has the > same dimensionality as `cpm(d)`. > > rowSums( a logical matrix ) > > returns a vector that is as long as there are rows in the logical > matrix, and each value indicates how many columns are TRUE in that > row. > > HTH, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > -- Vittoria Roncalli Graduate Research Assistant Center Békésy Laboratory of Neurobiology Pacific Biosciences Research Center University of Hawaii at Manoa 1993 East-West Road Honolulu, HI 96822 USA Tel: 808-4695693 [[alternative HTML version deleted]]
Cancer Cancer • 1.0k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 13 months ago
United States
Hi Vitorria, I'm finding it a bit hard to follow what you're doing, largely because much of the output you present is being summarized by you, but let me try: On Tue, Dec 4, 2012 at 3:59 PM, Vittoria Roncalli <roncalli at="" hawaii.edu=""> wrote: > Hi, > Thanks a lot for your suggestions, they were helpful. > > In order to better understand how the filter really works, I considered only > a subset of my dataset, consisting in 30 tags. Nice, this is a good idea to start with. > I have 9 column in total, consisting of 3 different samples(C-L-T) made by > 3 replicates each(1-3). > So, I used the filter command : > >> keep <- rowSums (cpm(d)>5) >=3 >> d <- d[keep,] > > Then I looked at the output of `cpm(d) > 5, and at the list of the filtered > libraries. > Where I get confused is how the 5cpm are calculated. If I am selecting the > counts that in the single row (gene) are at least grater then 5cpm in the 3 > replicates of 1 sample, how is possible that the filter keep a gene that > have 0 counts in all the 3 replicates of the sample? > Are the 5cpm calculated for each gene as an average of all the 9 samples in > my case? No, `cpm` give you the "counts per million" of each "gene" in each sample. > How the 5cpm value is calculated? Is it dependent on the library size of > each sample? You can see for yourself by: R> library(edgeR) R> cpm function (x, normalized.lib.sizes = FALSE) { if (is(x, "DGEList")) { if (normalized.lib.sizes) lib.size <- x$samples$lib.size * x$samples$norm.factors else lib.size <- x$samples$lib.size x <- x$counts } else { if (normalized.lib.sizes) warning("Matrix of counts supplied, so normalized library sizes are not known. Library sizes are column sums of the count matrix.\n") lib.size <- colSums(x) } cpm <- 1e+06 * t(t(x)/lib.size) cpm } It's (1e6 / library size) * raw.counts > How can I get the minimum value considered for the filter? This is what I mean about your output being aggressively summarized by you -- I'm not sure what these things are below: > gene 1 C1 C2 C3 L1 L2 L3 H1 H2 H3 > 0 0 0 1 0 0 0 1 What are you showing us here? Are these the raw counts for gene one in each sample? > output cpm>5 F F F T F F F T T And this is the related cpm for gene1? This suggests that you have very low read counts per sample, if 1 read is > 5 cpm. Am I getting this right? Ignoring your presumably low read count problem, the call to: keep <- rowSums (cpm(d)>5) >=3 is doing what it's supposed to. There are three samples for this gene that have a cpm > 5 (`T` values, as you write), so you are keeping the rows. This "filter" doesn't consider what sample/condition the cpm's pass your threshold of 5, it's just asking if there are any three samples that have a cpm >= 5, and (if I am interpreting your output correctly) this example does. HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT

Login before adding your answer.

Traffic: 727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6