edgeR question tags per million

0

Entering edit mode

David ▴ 860

@david-3335

Last seen 6.1 years ago

Hi, I'm going through a sequencing dataset (GSE21630). It's a 46 different libraries experiment with a large dispersion: > d$common.dispersion [1] 3.101696 > sqrt(d$common.dispersion) [1] 1.761163 I guess i need to cope with that as the sequencing libraries were prepared with an old protocol not as efficient as current 1.5 solexa protocol. With thar being said the librarires are given in tags per 1 million (TPM). On average the experiment had 4 million reads per library. I was wondering how edge analysis would be affected by using reads (sort of normalized to 1million size library (TPM)) rather then using the total number of reads. What do you think ?? Should i multiply 4 the current number of reads given in TPM or can i stay with my libraries at 1million reads all. thanks, david

Sequencing Sequencing • 2.3k views

ADD COMMENT • link updated 13.9 years ago by Mark Robinson ★ 1.1k • written 13.9 years ago by David ▴ 860

0

Entering edit mode

Mark Robinson ★ 1.1k

@mark-robinson-2171

Last seen 9.6 years ago

Hi David. We generally recommend raw counts as input to edgeR. Presumably you can back to this by reversing the TPM calculation -- multiply the TPM in a library by the total number of reads and divide by 1 million. I presume this info is available. That is some seriously high dispersion. You may also want to look into normalization -- see ?calcNormFactors or see the manual. Cheers, Mark > Hi, > I'm going through a sequencing dataset (GSE21630). > It's a 46 different libraries experiment with a large dispersion: > > > d$common.dispersion > [1] 3.101696 > > > sqrt(d$common.dispersion) > [1] 1.761163 > > I guess i need to cope with that as the sequencing libraries were > prepared with an old protocol not as efficient as current 1.5 solexa > protocol. > > With thar being said the librarires are given in tags per 1 million > (TPM). On average the experiment had 4 million reads per library. I was > wondering how edge analysis would be affected by using reads (sort of > normalized to 1million size library (TPM)) rather then using the total > number of reads. > > What do you think ?? Should i multiply 4 the current number of reads > given in TPM or can i stay with my libraries at 1million reads all. > > thanks, > david > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 13.9 years ago Mark Robinson ★ 1.1k

0

Entering edit mode

Thanks Mark, That would do the trick !!!! I have also written some functions to do multiple comparisons in a loop as this is clear missing. Similar to what is in limma. Some functions are writtent at the bottom of this page. You might want to include this code snippet to do pairwise comparisons of all your conditions. I'm not a R guru and probably people would do the same easily but it's quite useful to get some working code. library(edgeR) library(Biobase) #### mydataFile contains the reads for each condition ##### mydataFile=read.table(file="rnaseq_reads.txt",header=TRUE,row.names=1, sep="\t", check.names=FALSE) I start reading a file that contains the link between tissus/cell names and description ### pDataFile.txt ### Experiment Group mouse pro B cells [09-002] pro-bcells mouse pre B cells [09-002] pre-bcells mouse immature B cells [09-002] Imm-bcells mouse mature B cells (spleen) replicate 1 [09-002] mBcells-1 mouse mature B cells (spleen) replicate 2 [09-002] mBcells-2 mouse B cells activated in vitro with LPS/IL4 replicate 1 [09-002] LPS-IL4-1 mouse B cells activated in vitro with LPS/IL4 replicate 2 [09-002] LPS-IL4-2 mouse B cells activated in vitro with LPS/IL4 replicate 3 [09-002] LPS-IL4-3 mouse B cells activated in vitro with LPS/aIgD-Dextran [09-002] LPS- algD pData=read.table(pDataFile,header=TRUE,row.names=1,sep="\t") At this stage you need to make sure that the samples in your pData file are sorted the same way they are in your reads_file. if (all(rownames(pData) == colnames(mydataFile)) == FALSE) { idx = which(rownames(pData) != colnames(mydataFile)) for (i in 1:length(idx)) { print(paste("pDATA", i)) print(rownames(pData)[i]) print(paste("mydataFile", i)) print(colnames(mydataFile[i])) } stop ("Problems with pData file. check sorting") } #Create your groups phenotype <- new("AnnotatedDataFrame", data = pData) #All groups group=c(as.character(unique(phenotype$Group))) groups = factor(c(design.list(phenotype$Group,group))) design <- model.matrix(~ 0+ groups) colnames(design) <- c(group) #Prepare contrast matrix contrast.matrix <- design.pairs(c(group)) comparisons = names(contrast.matrix[1,]) l= length(comparisons) #Run pairwise comparisons and plot smearplots # pdf("smearplots.pdf") for (i in 1:l) { #print(paste("Running Comparison ",comparisons[i])) tmp = unlist(strsplit(comparisons[i],"-")) condA = tmp[1] condB = tmp[2] print(paste("Running Comparison ",condB, " vs ", condA) ) #You submit condA vs condB but the comparison is done the other way around condB vs conD A. just need to remember, that's the way exacttest waorks !!! de.com <- exactTest( d, c( condA, condB ) ) outputfile = paste(condB,"vs",condA,".csv",sep="") detags500.com <- rownames(topTagsde.com, n = 500)$table) title = paste("FC plot using common dispersion for ",condB," vs ",condA, "\nhighlighted top 500 genes",sep="") plotSmear(d, de.tags = detags500.com, main = title) abline(h = c(-2, 2), col = "dodgerblue", lwd = 2) write.table(file=outputfile,de.com$table,sep="\t") } and you're done. This creates a csv file for each comparison and a pdf file with all smearplots. Would be useful to get volcanoplots as well but don't know how to do it. ##### FUNCTIONS ##### design.list <- function(phenotype,cond) { n <- length(phenotype) k = length(cond) design <- matrix(1,1,(n)) for (i in 1:k) { #print(paste("Reading:",i)) ind = (which(phenotype == cond[i])) design[,ind] = i #design[,ind] = cond[i] } design } design.pairs <- function(levels) { n <- length(levels) design <- matrix(0,n,choose(n,2)) rownames(design) <- levels colnames(design) <- 1:choose(n,2) k <- 0 for (i in 1:(n-1)) for (j in (i+1):n) { k <- k+1 design[i,k] <- 1 design[j,k] <- -1 colnames(design)[k] <- paste(levels[i],"-",levels[j],sep="") } design } On 06/16/2010 02:17 PM, Mark Robinson wrote: > Hi David. > > We generally recommend raw counts as input to edgeR. Presumably you can > back to this by reversing the TPM calculation -- multiply the TPM in a > library by the total number of reads and divide by 1 million. I presume > this info is available. > > That is some seriously high dispersion. You may also want to look into > normalization -- see ?calcNormFactors or see the manual. > > Cheers, > Mark > >> Hi, >> I'm going through a sequencing dataset (GSE21630). >> It's a 46 different libraries experiment with a large dispersion: >> >> > d$common.dispersion >> [1] 3.101696 >> >> > sqrt(d$common.dispersion) >> [1] 1.761163 >> >> I guess i need to cope with that as the sequencing libraries were >> prepared with an old protocol not as efficient as current 1.5 solexa >> protocol. >> >> With thar being said the librarires are given in tags per 1 million >> (TPM). On average the experiment had 4 million reads per library. I was >> wondering how edge analysis would be affected by using reads (sort of >> normalized to 1million size library (TPM)) rather then using the total >> number of reads. >> >> What do you think ?? Should i multiply 4 the current number of reads >> given in TPM or can i stay with my libraries at 1million reads all. >> >> thanks, >> david >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:10}}

ADD REPLY • link 13.9 years ago David ▴ 860

Login before adding your answer.