Dear Jan,
The first step is to read the documentation! Page 9 of the edgeR User's
Guide says:
"If the counts for different samples are stored in separate files, then
the files have to be read separately and collated together. The edgeR
function readDGE is provided to do this. Files need to contain two
columns, one for the counts and one for a gene identifier. See the SAGE
and deepSAGE case studies for examples of this."
The readDGE() function does exactly what you want to do. Type ?readDGE
at the R prompt.
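As an illustration of what "collated together" means, here is a minimal
base-R sketch. It is not edgeR's own code: the two small tables and the
helper name collate_counts are hypothetical, and filling genes that are
missing from a sample with zero counts is an assumption about the
desired behaviour.

```r
# Hypothetical per-sample tables: one gene identifier column, one count column.
s1 <- data.frame(gene = c("g1", "g2", "g3"), count = c(10, 0, 5))
s2 <- data.frame(gene = c("g2", "g4"), count = c(7, 3))

# Collate a named list of two-column tables into one genes x samples matrix.
collate_counts <- function(tables) {
  # Union of all gene identifiers seen in any table.
  genes <- sort(unique(unlist(lapply(tables, function(t) as.character(t$gene)))))
  # Start from zeros so genes absent from a sample get a count of 0.
  m <- matrix(0, nrow = length(genes), ncol = length(tables),
              dimnames = list(genes, names(tables)))
  for (s in names(tables)) {
    t <- tables[[s]]
    m[as.character(t$gene), s] <- t$count
  }
  m
}

counts <- collate_counts(list(sample1 = s1, sample2 = s2))
print(counts)
```

With edgeR loaded and one two-column file per sample, readDGE() should
make this kind of hand-written collation unnecessary.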
Best wishes
Gordon
> Date: Fri, 10 May 2013 16:32:38 +0100
> From: Jan Zaucha <jan.zaucha at bristol.ac.uk>
> To: bioconductor at r-project.org
> Subject: [BioC] calcNormFactors in edgeR - quick question from a very
> inexperienced user
>
> Hi,
>
> I'm totally new to the field, I've never used R before, but I need to
> normalize some expression data.
>
> In every file I have many columns corresponding to different samples
> (different source cells) and rows corresponding to different genes.
> However I have many different files corresponding to different
> experiments and they have different total numbers of rows (genes).
>
> I want to use RLE normalization to normalize all of the data, which is
> implemented in the function calcNormFactors from the package edgeR, but
> I don't understand how can I put the read counts into a matrix since my
> files contain different numbers of genes (rows).
>
> I thought I should have a giant matrix containing data from all of my
> files where the columns are the samples and rows are the genes.
>
> Should I perhaps take the file that has the highest number of genes and
> input "0" for these genes if they are not present in the other files?
>
> Thanks for your time.
> Jan
>
Thank you for the reply, Gordon.
I spent the whole of yesterday trying to get this to work and realized
that my data files would need a considerable amount of parsing before
they could be imported into R. I therefore decided that the easiest and
least error-prone approach would be to use the MySQL database into
which I had already imported the data. From it I export a file with 3
columns corresponding to "gene id", "sample id" and "expression value".
Then I do the following:
# load edgeR, which provides calcNormFactors
library(edgeR)
# initialize a matrix of zeros of the correct size
# (I have 76455 genes and 3001 samples)
data_matrix <- matrix(0, 76455, 3001)
# read in the expression data exported from the mysql database
expression <- read.table(file.choose(), header=TRUE)
# populate the matrix, using column 1 (gene index) as the row and
# column 2 (sample index) as the column of each expression value
mtx <- as.matrix(expression)
data_matrix[mtx[,1:2]] <- mtx[,3]
# use the normalization method from edgeR
d <- calcNormFactors(data_matrix, method="RLE")
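The indexing step relies on the gene and sample identifiers already
being consecutive integers starting at 1. If they are not, they cannot
be used directly as matrix positions. A minimal sketch of one way to
handle that, assuming hypothetical column names gene_id, sample_id and
value and a tiny made-up data set, maps each identifier to a row or
column position with match():

```r
# Hypothetical long-format export: one row per (gene, sample) pair.
expression <- data.frame(
  gene_id   = c("ENSG1", "ENSG2", "ENSG1", "ENSG3"),
  sample_id = c(101, 101, 205, 205),
  value     = c(12, 4, 30, 8)
)

genes   <- sort(unique(as.character(expression$gene_id)))
samples <- sort(unique(expression$sample_id))

# Start from zeros so that absent (gene, sample) pairs count as 0.
data_matrix <- matrix(0, nrow = length(genes), ncol = length(samples),
                      dimnames = list(genes, as.character(samples)))

# Two-column integer index matrix: row position, column position.
idx <- cbind(match(as.character(expression$gene_id), genes),
             match(expression$sample_id, samples))
data_matrix[idx] <- expression$value

# Sanity checks before normalizing.
stopifnot(!anyNA(data_matrix), all(data_matrix >= 0))
print(data_matrix)
```

Keeping the identifiers as dimnames also makes it easier to spot-check
individual entries against the database.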
I thought this approach should work (and it does when I try it with a
small dummy file), but when I apply it to my own data (1.4GB) all of the
normalization factors come out as "NA". I printed out part of my
'data_matrix' and it does seem to be populated correctly, so I really
don't know what is causing this. Could it be something to do with the
amount of data - would I have to use doubles instead of floats or
something like that?
Best wishes,
Jan
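
On the floats question: R stores numeric values in double precision by
default, so precision is unlikely to be the cause. One possible
explanation, offered as an assumption rather than a diagnosis of this
data set, is the zero filling. RLE-style normalization takes the median
ratio of each sample to a per-gene geometric mean, and a gene with a
zero in any sample has a geometric mean of zero and drops out. With
thousands of samples it is plausible that every gene has at least one
zero, which leaves nothing to take the median of. The base-R sketch
below imitates that calculation on made-up numbers; rle_factors is a
hypothetical helper, not edgeR's implementation.

```r
# Hypothetical helper imitating a median-of-ratios (RLE-style) calculation.
rle_factors <- function(counts) {
  # Per-gene geometric mean; any zero count makes it exactly 0.
  gm <- exp(rowMeans(log(counts)))
  # Median ratio to the geometric mean, over genes with a positive mean.
  apply(counts, 2, function(u) median((u / gm)[gm > 0]))
}

# Every gene is positive in every sample: factors are defined.
ok <- matrix(c(10, 20, 30,
               20, 40, 60), nrow = 3,
             dimnames = list(paste0("g", 1:3), c("s1", "s2")))
print(rle_factors(ok))

# Every gene has a zero in at least one sample: all factors are NA.
sparse <- matrix(c(10, 0, 30,
                   0, 40, 0), nrow = 3,
                 dimnames = list(paste0("g", 1:3), c("s1", "s2")))
print(rle_factors(sparse))

# Number of genes with no zero in any sample (0 here).
print(sum(rowSums(sparse == 0) == 0))
```

Running the last line on the real matrix would show how many genes have
no zero in any sample; if the answer is 0, this mechanism would account
for the NA values.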
On 12 May 2013 04:43, Gordon K Smyth <smyth@wehi.edu.au> wrote: