Question

edgeR: general advice on using edgeR for sRNA analysis

Kenlee Nakasugi ▴ 30

@kenlee-nakasugi-6076

Hi,

I was hoping someone would be able to provide me with some general advice on using edgeR for some sRNA datasets I have received.

I have 3 sRNA datasets, and I have calculated the abundance (just read counts) of every sequence in each dataset. Unfortunately, there are no replicates. The goal is to find specific sRNA sequences that are higher in abundance in dataset1 and dataset2 compared to dataset3. As there are no replicates, I understand that no statistical analysis can be done on them with confidence, so I just want to first get a 'general' indication of which sequences may be higher in abundance in datasets 1 and 2, and follow up with other experiments.

I have already generated a subset of 'common' sRNA sequences that are present in all of dataset1, 2 and 3, along with their counts. The original library sizes differ between the three, and there will also be a high level of duplicate sequences, as these are sRNA sequences.

1. I am not sure whether I should let edgeR calculate the library sizes as the column sums of the read counts, or supply the actual library size of each dataset, prior to normalization. Because I am working on just the 'common' subset of sRNA sequences between the datasets, there may be highly abundant sRNA sequences unique to each dataset that are missing, and which may have skewed the distribution of sRNA abundances within each dataset.

2. What dispersion value should I use? These are plant sRNA sequences, so from experience, can someone suggest a number and I will go from there.

Apart from this, are there any other issues I need to be concerned about when analyzing such data in edgeR?

Any advice greatly appreciated!

Best regards,
Ken

---
School of Molecular Biosciences
University of Sydney

Normalization edgeR • 934 views

updated 10.7 years ago by Gordon Smyth 50k • written 10.7 years ago by Kenlee Nakasugi ▴ 30

score 0 · Answer 1 · 2013-08-06

Dear Ken,

> Date: Mon, 5 Aug 2013 04:56:09 +0000
> From: Kenlee Nakasugi <kenlee.nakasugi at sydney.edu.au>
> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: [BioC] EdgeR: general advice on using edgeR for sRNA analysis
>
> Hi,
>
> I was hoping someone would be able to provide me with some general
> advice on using EdgeR for some sRNA datasets I have received.
>
> I have 3 sRNA datasets, and I have calculated all abundances (just read
> counts) of every sequence in each dataset. Unfortunately, there are no
> replicates. The goal is to find specific sRNA sequences that are higher
> in abundance in dataset1 and dataset2 compared to dataset3. As there are
> no replicates, I understand that no stats analyses with confidence can
> be done on them, and so just want to first get a 'general' indication of
> what sequences may be higher in abundance in datasets 1 and 2, and
> follow up with other experiments.
>
> I have already generated a subset of 'common' sRNA sequences that are
> present in dataset1, 2 and 3, along with their counts. Because the
> original library sizes are different between the three, and also there
> will be high level of duplicate sequences as these are sRNA sequences,
>
> 1. I am not sure if I should just use the edgeR setting to calculate the
> library sizes via the sum of the column of the read counts, or use the
> actual library size of each dataset, prior to normalization. Because I
> am working on just the 'common' subset of sRNA sequences between the
> datasets, there may be highly abundant sRNA sequences unique to each
> dataset that are missing, and which may have skewed the distribution of
> sRNA abundances within each dataset.

You should recompute the lib.sizes from the column sums for the sequences that you are analysing, and then run calcNormFactors().

I am unclear why you are restricting to common sRNA sequences. Doesn't this exclude the most differentially expressed sequences, which might have zero counts in one or two libraries, which you might want to know about?

> 2. what dispersion value should I use - these are plant sRNA sequences,
> so from experience, can someone suggest a number and I will go from
> there
>
> Apart from this, are there any other issues I need to be concerned about
> when analyzing such data in edgeR?

I haven't analysed plant sRNA, so cannot give any general advice for this type of data. You could try a few dispersion values and go from there.

Alternatively, here is a conservative way to estimate the dispersion without replicates:

    dge2 <- dge
    dge2$samples$group <- rep(1,3)
    dge2 <- estimateDisp(dge2,robust=TRUE,winsor.tail=c(0.05,0.2))
    plotBCV(dge2)

This will estimate the dispersions allowing for about 20% of the sequences to be differentially expressed (treated as outliers). Then

    results <- exactTest(dge, dispersion=dge2$trended.dispersion)

etc.

Best wishes
Gordon

> Any advice greatly appreciated!
> Best regards,
> Ken
>
> ---
> School of Molecular Biosciences
> University of Sydney
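For readers who want to try the suggestions above end to end, here is a minimal sketch, not part of the original reply. It assumes a count matrix called `counts` with one column per dataset; the simulated values, the sequence names and the group labels `d1`, `d2`, `d3` are all made up for illustration. It spells the Winsorizing argument out in full as `winsor.tail.p`, and uses the `pair` argument of `exactTest()` to compare datasets 1 and 2 each against dataset 3.

```r
library(edgeR)

# Hypothetical stand-in for the real data: 1000 sequences x 3 libraries.
set.seed(1)
counts <- matrix(rnbinom(3000, mu = 50, size = 5), ncol = 3,
                 dimnames = list(paste0("sRNA", 1:1000),
                                 c("dataset1", "dataset2", "dataset3")))

# With no lib.size supplied, DGEList() uses the column sums of the
# counts actually being analysed as the library sizes.
dge <- DGEList(counts = counts, group = factor(c("d1", "d2", "d3")))
dge <- calcNormFactors(dge)

# Conservative dispersion estimate without replicates: treat the three
# libraries as one group, with robust estimation against outliers.
dge2 <- dge
dge2$samples$group <- rep(1, 3)
dge2 <- estimateDisp(dge2, robust = TRUE, winsor.tail.p = c(0.05, 0.2))

# Pairwise exact tests; the comparison is second group versus first,
# so positive logFC means higher abundance than in dataset 3.
res13 <- exactTest(dge, pair = c("d3", "d1"),
                   dispersion = dge2$trended.dispersion)
res23 <- exactTest(dge, pair = c("d3", "d2"),
                   dispersion = dge2$trended.dispersion)

topTags(res13)
topTags(res23)
```

Without replicates, the resulting p-values are best read as a ranking for follow-up experiments rather than as firm evidence, in line with the question's stated aim.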