Question

RPKM normalization for several samples using a for loop

0

Entering edit mode

humberto_munoz • 0

@humberto_munoz-10903

Last seen 7.8 years ago

I am using a for loop to compute the RPKM values for 11 samples, however, I am getting only the gene IDs from the last sample. I would like to know what commands I am missing in the lines below. If somebody knows a more efficient way to do this task, please tell me.

Thanks

Humberto

setwd("/Users/hmunozbarona/Documents/Normalization-R")
rm(list=ls(all=TRUE)) # remove all variables
files <- dir(pattern="*\.csv$")
RPKM = NULL
for (i in 1:11)
  {
  data<-read.csv(files[[i]])
  id<- data['GeneID']
  cnts<- data['ReadCount']
  lens<- data['Length']
  y <- DGEList(genes=data.frame(gene = id,Length=lens), counts=cnts)
  RPKM[i]= data.frame(gene =id, counts=cnts, rpkm(y))
print(RPKM[i])
}

edger forloop • 1.9k views

ADD COMMENT • link updated 3.0 years ago by Gordon Smyth 50k • written 7.7 years ago by humberto_munoz • 0

0

Entering edit mode

I wonder why all your samples are in separate files in the first place. Tools like featureCounts or htseq-counts collate the read counts into a matrix for you, so reading separate files of counts is generally unnecessary.

ADD REPLY • link 7.7 years ago Gordon Smyth 50k

Gordon Smyth · Answer 1 · 2016-07-28

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 11 hours ago

WEHI, Melbourne, Australia

Well, one obvious problem is that you are treating RPKM as a list, but you haven't made it a list. You need

RPKM <- list()

instead of

RPKM <- NULL

and

RPKM[[i]] <- data.frame(etc)

instead of

RPKM[i] <- data.frame(etc)

As for being more efficient, you have posted many questions on this support site about readDGE() and rpkm(), so you already know that there is no need for a for loop. I can't imagine the code as you have written it can possibly be useful to you.

ADD COMMENT • link 7.7 years ago • updated 3.0 years ago Gordon Smyth 50k

0

Entering edit mode

In my updated code I corrected these lines below, however, I still got some error with the list RPKM.

 setwd("/Users/hmunozbarona/Documents/Normalization-R")
rm(list=ls(all=TRUE)) # remove all variables
# files <- dir(pattern="*\.csv")
print(files)
my_data <- list()
RPKM <- list()
for (i in seq_along(files)) {
  my_data[[i]] <- read.csv(file = files[i])
  id<- my_data[[i]]['GeneID']
  cnts<- my_data[[i]]['ReadCount']
  lens<- my_data[[i]]['Length']
  y <- DGEList(genes=data.frame(gene = id,Length=lens), counts=cnts)
  RPKM[[i]] <- data.frame(gene =id, rpkm(y))
}
# countDf <- data.frame(gene = id, count = cnts, length = lens)
group<- c(1,1,1,1,1,1,1,1,1,1,1)
RG<-readDGE(RPKM, columns=c(1,2), group=NULL, labels=NULL)
RG$samples
keep <-rowSums(cpm(RG)>1) >=1
RG<- RG[keep, , keep.lib.sizes=FALSE] 

rownames(design)<-colnames(RG)
rownames(design)
design
RG <- estimateDisp(RG, design, robust=TRUE)
x<-RGtagwise.dispersion)
plotBCV(RG)
fit <- glmQLFit(RG, design, robust=TRUE)
plotQLDisp(fit)
qlf <-glmQLFTest(fit, coef=4)
topTags(qlf,n=50, sort.by="logFC")
is.de <- decideTestsDGE(qlf, p.value=0.05)
summaryis.de)
plot(qlfcoefficients[,5])

The error starts in these lines:

> # countDf <- data.frame(gene = id, count = cnts, length = lens)
> group<- c(1,1,1,1,1,1,1,1,1,1,1)
> RG<-readDGE(RPKM, columns=c(1,2), group=NULL, labels=NULL)
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'list(GeneID = c(641612588, 641610273, 641612823, 641611428, 641610552, 641612589, 641612814, 641611971, 641612541, 641610260, 641610167, 641610059, 641611190, 641612220, 641611972, 641610056, 641611932, 641611769, 641610514, 641612072, 641612379, 641612089, 641611662, 641612486,  [... truncated]

My 11 CVS files all have more than 2800 rows and 4 columns as follows:

GeneID  Length  ReadCount   Normalized Coverage1 * 109
641612588   2604    13466   10730.812
641610273   582 11767   41954.421
641612823   1713    10837   13127.64

ADD REPLY • link updated 3.0 years ago by Gordon Smyth 50k • written 7.7 years ago by humberto_munoz • 0

0

Entering edit mode

Why would you think that it makes sense to pass RPKM values to readDGE()? Obviously readDGE() needs a vector of 11 file names to read, not a list of thousands of numerical values.

Why would you compute RPKM at all, since edgeR requires counts and not RPKM?

I just have to give up at this point, as I can't make sense of the logic of your code. I have answered your original question, and this followup question doesn't seem a logical progression from the original post.

ADD REPLY • link 7.7 years ago Gordon Smyth 50k

0

Entering edit mode

Gordon thanks for your comments. Excuse me R-ignorance in this topic, I am trying to compare the effect on the False Discovery Rate (FDR) of the normalization methods (TMM, RLE, Upper-quartile and RPKM) using these set of 11 samples. In edgeR, I calculated easily TMM, RLE and Upper-quartile normalizations for all samples, but not RPKM. With the lines below, I am trying to calculate the RPKM for the 11 samples individually using the for loop. My problem is when I am trying to collate all rpkm values to perform the other functions for the dispersion, CBV, fitting with the glmQLFit, conduct the glmQLFTest and make the plots. If there is a better way to do it, please let me know it. I will try your suggested tools featureCounts or htseq-counts collate the read counts into a matrix.

Humberto

ADD REPLY • link 7.7 years ago humberto_munoz • 0

0

Entering edit mode

The tools featureCounts and htseq-counts are in packages not available for the R version 3.3.0 that I have.

ADD REPLY • link 7.7 years ago humberto_munoz • 0

0

Entering edit mode

So you did biocLite("featureCounts") and biocLite("htseq-counts") and they both failed. Well that's because these are not R packages. featureCounts is a function defined in the Rsubread package, which is available for R 3.3.0. And htseq-counts is part of the HTSeq software which is a Python package.

H.

ADD REPLY • link 7.7 years ago Hervé Pagès 16k

0

Entering edit mode

I tried to install the Rsubread package in my R 3.3.0 version and this is what I got.

 > install.packages("Rsubread")
Warning in install.packages :
  package ‘Rsubread’ is not available (for R version 3.3.0)

ADD REPLY • link updated 3.0 years ago by Gordon Smyth 50k • written 7.7 years ago by humberto_munoz • 0

0

Entering edit mode

.. which is because the function to install Bioconductor packages is biocLite rather than install.packages. The Bioconductor website explains how to install Bioconductor packages.

ADD REPLY • link 3.0 years ago Gordon Smyth 50k