Question

EDASeq, RUVSeq and edgeR together

0

Entering edit mode

zoppoli pietro ▴ 10

@zoppoli-pietro-4792

Last seen 5.6 years ago

United States

Hi,

I'd like to use together EDASeq,RUVSeq and edgeR

I'd like to know if It's correct to run the folling code

According to EDASeq manual:

dataOffset <- withinLaneNormalization(data,"gc",which="upper",offset=TRUE)

dataOffset <- betweenLaneNormalization(dataOffset,which="upper",offset=TRUE)

then according to RUVSeq using control genes taken from

Eisenberg E, Levanon EY. Human housekeeping genes, revisited. Trends Genet. 2013

I do:

set1 <- RUVg(x = dataOffset cIdx = HKgenes, k=1)

RUVg does NOT fill the offset slot of the EDASeq object.

To use the offsets I need the result from EDASeq but I still want to Remove Unwanted Variation

so I modify the edgeR paragraph of the EDASeq manual:

W_1 <- pData(set1)$ W_1

pData(dataOffset) <- cbind(pData(dataOffset),W_1)

design <- model.matrix(~conditions + W_1, data=pData(dataOffset))

then all by manual

disp <- estimateGLMCommonDisp(counts(dataOffset),

design, offset=-offst(dataOffset))

fit <- glmFit(counts(dataOffset), design, disp, offset=-offst(dataOffset))

lrt <- glmLRT(fit, coef=2)

topTags(lrt)

By the way other than common dispersion (estimateGLMCommonDisp )

in edgeR manual there are 2 other steps (raccomanded):

To estimate trended dispersions:

y <- estimateGLMTrendedDisp(y, design)

To estimate tagwise dispersions:

y <- estimateGLMTagwiseDisp(y, design)

I need to integrate these steps too ?

Thank,

Pietro

edaseq ruvseq edgeR • 2.6k views

ADD COMMENT • link updated 5.6 years ago by davide risso ▴ 950 • written 5.6 years ago by zoppoli pietro ▴ 10

score 1 · Answer 1 · 2018-09-27

1

Entering edit mode

davide risso ▴ 950

@davide-risso-5075

Last seen 6 weeks ago

University of Padova

Hi Pietro,

I think that your code is fine. And yes, it's likely better to have a tagwise and possibly trended dispersion and not a global dispersion parameter.

I believe that edgeR now offers a simpler wrapper function to carry out the three steps that you describe. Have a look at ?edgeR::estimateDisp.

Best, Davide

ADD COMMENT • link 5.6 years ago davide risso ▴ 950

0

Entering edit mode

Hi Davide,

following your tips I changed this dispersion:

disp <- estimateGLMCommonDisp(counts(dataOffset),

design, offset=-offst(dataOffset))

with this dispersion

disp2 <- estimateDisp(counts(dataOffset),design, offset=-offst(dataOffset))

and changed the fit step:

fit <- glmFit(counts(dataOffset), design, disp, offset=-offst(dataOffset))

with this:

fit2 <- glmFit(counts(dataOffset), design, disp2$tagwise.dispersion,
offset=-offst(dataOffset))

Thanks,

Pietro

ADD REPLY • link 5.6 years ago zoppoli pietro ▴ 10

0

Entering edit mode

Hi Davide,

I'm going to visualize the results of the analysis using a simple boxplot and I find a myself troubled to decide which matrix to use.

Let's say I get lrt2 from glmLRT and I want to plot geneA

contrast_BvsA <- makeContrasts(TvsN=conditionsB-conditionA, levels=design)

contrast_BvsA

#                      Contrasts
# Levels               BvsA
# conditionA            -1
# conditionsB           1
#  W_1                  0           

#  W_1 is a vector to Remove Unwanted Variation

lrt2 <- glmLRT(fit2,contrast = contrast_BvsA )

lrt2$table["geneA",]

         logFC   logCPM       LR      PValue
geneA 5.605485 27.49028 580.2279 3.3458e-128

I need to give a vector to boxplot.

According to manuals you can use the log2(normCounts) or better use something like

rld <- rlogTransformation(dataOffset@assayData@normalizedCounts, blind=TRUE)

or

vsd <- varianceStabilizingTransformation(dataOffset@assayData@normalizedCounts, blind=TRUE)

I check for geneA FC

FC <- aggregate(vsd["geneA",] ~ (conditions),FUN= mean)

      BvsA            vsd ["geneA", ]
1          A            8.710527
2          B            9.881888

FC[2,2]-FC[1,2]

# 1.171361

such FC is very different from the lrt2 FC so if I plot this any reviewer can argument boxplot is NOT representative of the reported FC.

I try to use counts or normalizedCounts (though if they worked I could not explain myself why adopt the discussed pipeline)

log2(mean(Data2@assayData$counts["geneA",names(which(condition=="B"))],na.rm=T)
  /
mean(Data2@assayData$counts["geneA",names(which(condition=="A"))],na.rm=T))

# 3.312596

log2(mean(Data2@assayData$normalizedCounts["geneA",names(which(condition=="B"))],na.rm=T)
  /
mean(Data2@assayData$normalizedCounts["geneA",names(which(condition=="A"))],na.rm=T))

# 1.673753

Also these FCs are far away from the lrt2 result.

So I try to get the data from lrt2 results, in particular i extract fitted.values and coefficients

FC <- aggregate(log2(lrt2$fitted.values["geneA",])~ (conditions),FUN= mean)
FC[2,2] - FC[1,2]

# 3.146288

lrt2$coefficients["geenA",2] - lrt2$coefficients["geneA",1]

#  3.885426

these values are still very different from logFC = 5.605485 from glmLRT function.

Worst scenario is geneB

 lrt2$table["geneB",]

         logFC   logCPM       LR       PValue
geneB -1.088669 31.60722 23.45083 1.281471e-06

vsd FC is 0.42

counts FC is 1.95

normalizedCounts is 0.51

lrt2$coefficients is -0.75

ltr2$fitted.values FC is 1.93

As u can see only lrt2$coefficients goes near the value reported by lrt2 while the others have even different sign

I think I'm missing something, maybe more than that.

Please can you comment these results and pinpoint the correct matrix to use for boxplot.

Best,

Pietro

ADD REPLY • link 5.5 years ago zoppoli pietro ▴ 10

0

Entering edit mode

Hi Pietro, the issue here is in the terminology. What edgeR calls "logFC" is not a log-fold-change per se but the estimated value of the particular contrast based on the estimates of the coefficient of the regression model. When you use the additional covariates from RUV, you are changing the value of the estimated coefficient, because these coefficients are "adjusted" by the presence of the confounding factors from RUV. Obviously, none of your manual log-fold-changes are. The interpretation of the values in this case would be something along the lines of "the value that the log-fold-change would have if the data weren't influenced by the unwanted variation". I hope this helps.

ADD REPLY • link 5.5 years ago davide risso ▴ 950

0

Entering edit mode

Hi Davide,

thanks for your reply. I explained myself the discrepancy in the same way BUT this means I can't use boxplot or similar plots to show the data.
I can do a HM of the genes using the edgeR "logFC" but i have no way to show the distribution of the samples around that value, right ?

By the way, is there no way to generate a "corrected" matrix using the confounding factors from RUV together with the estimates of the coefficient of the regression model ?

How you would visualize the result and explain it to a review asking to look at distribution of the expression of a gene in the two tested conditions ?

Here seems that RUV strongly changes the expression values (maybe because the total reads of the first condition are the double of the other condition).

Thank you very much,
P

ADD REPLY • link 5.5 years ago zoppoli pietro ▴ 10

0

Entering edit mode

You can of course look at the "corrected" values once you fit the model, by looking at $Y - W alpha$, or you could plot the fitted values for the model instead of the observed data. Note that neither RUV nor any other statistical method will tell you if the difference is biological or technical if you have complete confounding. But if that's not the case, then, $Y - W alpha$ is probably what you want to plot to show the estimated effect.

ADD REPLY • link 5.5 years ago davide risso ▴ 950

0

Entering edit mode

Hi Davide,

my bad but i really can't understand the formula $Y - W alpha$. Can you please explain it to me ?

Please consider that in my example fit2 is the result of the glmFit having pData(set1)$ W_1 integrated into design and set1 is the result of RUVg.

Sorry to bother but i'd really like to fully implement and standardize the pipeline so I can use it every time (when reasonable).

Best,

P

ADD REPLY • link 5.5 years ago zoppoli pietro ▴ 10

0

Entering edit mode

Sorry, I was assuming that you were familiar with the notation we use in RUV, where $W$ is the matrix with the unwanted variation factors and $\alpha$ its associated parameter vector.

$Y - W \alpha$ essentially means to look at the coefficient estimates related to the unwanted variation factors in the glmFit result object, %*% that by the related columns of your design matrix and subtract the resulting matrix from the matrix of observed counts.

This would be almost like we provide as "normalizedCounts" in RUV, but after fitting the full model (including the variable of interests).

I hope this is more clear, sorry that there isn't an automated function to do that, but it would work on edgeR objects so I always thought it is out of the scope of the RUVSeq package. But if you want to contribute a PR to the RUVSeq package I would consider adding it to the package.

ADD REPLY • link 5.5 years ago davide risso ▴ 950

0

Entering edit mode

Hi Davide,

if I understand correctly, in my example:

$Y is the fit2$counts

$W$ is the lrt2$coefficients (Ngenes X 3conditions)

$\alpha$ is the design$W_1 (W_1 is a vector result of RUVg) (Nsamples X 1)

so

MatrixByDavide <-  fit_TvsN$coefficients[,W_1]%*%t(design$W_1)

MatrixToMakePlots <- log2(fit_TvsN$counts)- MatrixByDavide 

bb <- aggregate(log2(MatrixToMakePlots [geneB,])~ (conditions),FUN= mean)
bb[2,2]-bb[1,2]

# 0.074 instead of  -1.088669

if I use all the columns of fit_TvsN$coefficients then

MatrixByDavide2 <-  fit_TvsN$coefficients%*%t(design$W_1)

MatrixToMakePlots2 <- log2(fit_TvsN$counts)- MatrixByDavide2 

bb <- aggregate(log2(MatrixToMakePlots2 [geneB,])~ (conditions),FUN= mean)

bb[2,2]-bb[1,2]

# 0.5405 instead of  -1.088669

Please let me know if I made something wrong because I'm still not able to get a matrix I can use to plot the results.

Best,

P

ADD REPLY • link 5.4 years ago zoppoli pietro ▴ 10