Question

calcNormFactors in edgeR - quick question from a very inexperienced user

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

Dear Jan,

The first step is to read the documentation! Page 9 of the edgeR User's Guide says:

"If the counts for different samples are stored in separate files, then the files have to be read separately and collated together. The edgeR function readDGE is provided to do this. Files need to contain two columns, one for the counts and one for a gene identifier. See the SAGE and deepSAGE case studies for examples of this."

The readDGE() function does exactly what you want to do. Type ?readDGE at the R prompt.

Best wishes
Gordon

> Date: Fri, 10 May 2013 16:32:38 +0100
> From: Jan Zaucha <jan.zaucha at bristol.ac.uk>
> To: bioconductor at r-project.org
> Subject: [BioC] calcNormFactors in edgeR - quick question from a very inexperienced user
>
> Hi,
>
> I'm totally new to the field, I've never used R before, but I need to
> normalize some expression data.
>
> In every file I have many columns corresponding to different samples
> (different source cells) and rows corresponding to different genes.
> However I have many different files corresponding to different
> experiments and they have different total numbers of rows (genes).
>
> I want to use RLE normalization to normalize all of the data, which is
> implemented in the function calcNormFactors from the package edgeR, but
> I don't understand how can I put the read counts into a matrix since my
> files contain different numbers of genes (rows).
>
> I thought I should have a giant matrix containing data from all of my
> files where the columns are the samples and rows are the genes.
>
> Should I perhaps take the file that has the highest number of genes and
> input "0" for these genes if they are not present in the other files?
>
> Thanks for your time.
> Jan
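For readers who want to see what "collated together" amounts to, here is a minimal base-R sketch of the idea. It is not readDGE() itself, and it does not read files: the data frames `file_a` and `file_b`, the column name `gene`, and the sample names are all hypothetical stand-ins for tables that have already been loaded. It assumes that a gene missing from a table should be treated as a zero count, which is the convention the question proposes.

```r
# Hypothetical in-memory stand-ins for two count tables with different gene sets.
file_a <- data.frame(gene = c("g1", "g2", "g3"), s1 = c(10, 0, 5), s2 = c(7, 3, 1))
file_b <- data.frame(gene = c("g2", "g4"), s3 = c(4, 9))

tables <- list(file_a, file_b)

# The union of all gene identifiers across the tables defines the rows.
all_genes <- sort(unique(unlist(lapply(tables, function(tb) as.character(tb$gene)))))

# For each table, copy its sample columns into a zero-filled block whose rows
# are aligned to all_genes (assumption: absent genes count as zero).
blocks <- lapply(tables, function(tb) {
  counts <- as.matrix(tb[, setdiff(names(tb), "gene"), drop = FALSE])
  block <- matrix(0, nrow = length(all_genes), ncol = ncol(counts),
                  dimnames = list(all_genes, colnames(counts)))
  block[as.character(tb$gene), ] <- counts
  block
})

# Bind the aligned blocks side by side: genes in rows, samples in columns.
counts_all <- do.call(cbind, blocks)
```

Indexing rows by gene identifier rather than by position is what lets tables with different numbers of genes line up in a single matrix.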

Normalization edgeR • 1.3k views

updated 11.0 years ago by Jan Zaucha ▴ 20 • written 11.0 years ago by Gordon Smyth 50k

score 0 · Answer 1 · 2013-05-12

Thank you for the reply, Gordon.

I spent all of yesterday trying to get this to work and realized that my data files would need a considerable amount of parsing before they could be imported into R. I therefore decided that the easiest and least error-prone approach would be to use the MySQL database into which I had already imported the data. From it I export a file with three columns: "gene id", "sample id" and "expression value". Then I do the following:

    # load edgeR, which provides calcNormFactors
    library(edgeR)

    # initialize a matrix of zeros of the correct size (I have 76455 genes and 3001 samples)
    data_matrix <- matrix(0, 76455, 3001)

    # read in the expression data exported from the MySQL database
    expression <- read.table(file.choose(), header=TRUE)

    # populate the matrix, using gene indices as rows and sample indices as columns
    mtx <- as.matrix(expression)
    data_matrix[mtx[,1:2]] <- mtx[,3]

    # use the normalization method from edgeR
    d <- calcNormFactors(data_matrix, method="RLE")

I thought this should work (and it does when I try it with a small dummy file), but when I apply it to my own data (1.4 GB), all the normalization factors come out as "NA". I printed out part of 'data_matrix' and it does seem to be populated correctly, so I really don't know what is causing this. Could it be something to do with the amount of data? Would I have to use doubles instead of floats, or something like that?

Best wishes,
Jan

On 12 May 2013 04:43, Gordon K Smyth <smyth@wehi.edu.au> wrote:

> Dear Jan,
>
> The first step is to read the documentation! Page 9 of the edgeR User's
> Guide says:
>
> "If the counts for different samples are stored in separate files, then
> the files have to be read separately and collated together. The edgeR
> function readDGE is provided to do this. Files need to contain two
> columns, one for the counts and one for a gene identifier. See the SAGE
> and deepSAGE case studies for examples of this."
>
> The readDGE() function does exactly what you want to do. Type ?readDGE at
> the R prompt.
>
> Best wishes
> Gordon
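The matrix-filling step described in this reply can be tried on a small scale with the base-R sketch below. The column names and values are hypothetical, and it assumes the gene and sample identifiers are already integer row and column indices. The function `rle_factors` is a simplified median-of-ratios calculation written for illustration only; it is not edgeR's implementation and it omits the rescaling that calcNormFactors applies. It is included because it shows one way all-NA factors can arise: if every gene has a zero in at least one sample, every per-gene geometric mean is zero and no gene is left to take a median over. Whether that is what happened with the 76455 x 3001 matrix is not something the thread confirms.

```r
# Hypothetical long-format export: integer gene index, integer sample index, value.
expression <- data.frame(
  gene_id   = c(1, 2, 3, 1, 3, 2),
  sample_id = c(1, 1, 1, 2, 2, 3),
  value     = c(10, 4, 6, 20, 12, 8)
)

n_genes   <- max(expression$gene_id)
n_samples <- max(expression$sample_id)
data_matrix <- matrix(0, n_genes, n_samples)

# A two-column numeric matrix used as an index addresses one (row, column)
# cell per row, so each triple lands in its own cell.
idx <- cbind(expression$gene_id, expression$sample_id)
data_matrix[idx] <- expression$value

# Simplified median-of-ratios (RLE-style) scaling; an illustrative stand-in,
# not edgeR's code.
rle_factors <- function(counts) {
  geo_mean <- exp(rowMeans(log(counts)))  # 0 whenever a gene has any zero count
  apply(counts, 2, function(u) median((u / geo_mean)[geo_mean > 0]))
}

# Every gene in data_matrix has at least one zero, so no gene is usable.
factors_sparse <- rle_factors(data_matrix)

# A matrix without zeros gives finite factors.
factors_dense <- rle_factors(matrix(c(10, 4, 20, 8), nrow = 2))
```

Building the index with cbind() on the two numeric columns, rather than with as.matrix() on the whole data frame, keeps the index numeric even if another column in the export happens to be read as text.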