Question

Intepretation of RUVseq/edgeR logFC

0

Entering edit mode

Marie • 0

@marie-7720

Last seen 8.9 years ago

France

Dear All,

I'm working on a genomic plateform and I'm in charge of analysing biological data. A team of researchers asked me to compare two different methods to compute differentially expressed genes in a given dataset, from illumina human RNAseq data, aligned with Tophat2.

My dataset in composed of 12 samples, divided in 4 groups A, B, C and D (each of the groups containing 3 samples).

The first approach consists in using Cufflinks to compute FPKM values, calculate the mean FPKM value for each group, and then comparing the groups by performing a t-test (only if the coverage is > 1 in at least the 3 members of one of the 2 compared groups). So for each gene in each pairwise comparison, I get a p-value and a fold-change corresponding to the ratio of the means. A cutoff of p-value < 0.05 and FC > 1.5 was then applied.

The second approach consists in using the RUVseq method (http://www.bioconductor.org/packages/release/bioc/html/RUVSeq.html), based on a GLM approach. I'm using RUV-g with 11 housekeeping genes and I used the model :

set<-RUVg(set_raw_counts,row.names(controls),k=4)

where set_raw_counts is the SeqExpressionSet of the raw counts and controls contains housekeeping genes.
designAD<-model.matrix(~0+groups+W_1+W_2+W_3+W_4,data=pData(set))
yAD<-DGEList(counts=counts(set_raw_counts),group=groups)
yAD<-calcNormFactors(yAD,method="upperquartile")
yAD<-estimateGLMCommonDisp(yAD,designAD)
yAD<-estimateGLMTagwiseDisp(yAD,designAD)
fitAD<-glmFit(yAD,designAD)

Then for computing the DE genes in A vs B :

A_vs_B<-makeContrasts(groupsA-groupsB,levels=designAD)
myContrasts<-makeContrasts(A_vs_B=groupsA-groupsB,levels=designAD)

lrt<-glmLRT(fitAD,contrast=myContrasts[,"A_vs_B"])

But I'm not sure to understand well the output in lrt$table. I got 3 columns, logFC, logCPM and p-value.

I'm sorry this is a recurrent question but how is calculated the logFC and the logCPM ? Is it possible to have the details of the calculation ? How can I make it comparable with the FC I got from the first approach (with FPKM ?) Because when I tried to convert the logFC to FC, this lead to FC with very different orders of magnitude from the FPKM ones.

Thank you very much for your time and help.

RUVseq logFC edgeR • 3.7k views

ADD COMMENT • link 9.0 years ago Marie • 0

score 1 · Answer 1 · 2015-05-07

I'm not surprised that the fold changes from your second approach are different from those with your first approach. This is because RUVSeq will estimate and eliminate any sources of variability in your data set. If you have some batch effect between replicates in one group, the fold change between groups should change considerably after RUV removes that batch.

If you want to make the fold changes more comparable, I would suggest repeating the analysis without the estimated RUV factors in your design matrix. Even then, there should still be some differences from simply averaging the FPKMs and computing the fold changes between groups. For starters, the FPKMs don't account for elimination of bias during normalization; the mean for each group in a negative binomial model is not equivalent to the average count when effective library sizes are different; and the log-FCs reported by edgeR use a prior count to avoid extreme values when one group mean is very low.

As an aside, I'm not sure that 11 house-keeping genes are enough for stable estimation of the hidden factors. You could try the RUVr method instead, which doesn't rely on control genes at all (despite having an argument for it in the function - the example in the documentation seems to use all genes as controls). Also, is there any need to use k=4? This seems like a lot; for comparison, the RUV paper uses k=6 for the SEQC data set with 128 samples. Remember, the more hidden factors you use, the fewer residual d.f. you'll have for estimating the dispersion.

score 0 · Answer 2 · 2015-05-07

0

Entering edit mode

Marie • 0

@marie-7720

Last seen 8.9 years ago

France

Thank you very much for your helpful answer.

I will try to repeat the analysis without the estimated RUV factors as you suggest.

In my current results, I can see a logFC=18 for a gene which has raw counts like 0,1,8 vs 2,1,4 so I really don't get how it calculates the logFC. It computes a ratio of prior counts ? Do you know if it is possible to output these prior counts ?

I chose k=4 by looking at RLE plots with different k which were not satisfying with k=1, 2 or 3.

Thank you again.

ADD COMMENT • link 9.0 years ago Marie • 0

0

Entering edit mode

By default, a prior count of 0.125 is added to all counts before the log-fold change is calculated; see the glmFit documentation for more details. In the case you've presented above, though, I wouldn't expect the prior count to be responsible for that result. I'd need more details to determine why that gene has such a large log-fold change. One possibility might be if there are very big differences in the library sizes between the samples in the two groups.

ADD REPLY • link 9.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Ok I'm not sure of what prior count really means... I need to have the 2 values that are used to calculate the logFC in my output but I just can't figured out how I can find them, any ideas ?

Thank you !

ADD REPLY • link 9.0 years ago Marie • 0

0

Entering edit mode

In the example you've given above, the "2 values" for all genes are in the columns of fitAD$coefficients with names groupsA and groupsB, where each row corresponds to a gene. I can't imagine why you would need them, though.

As to your other question, the prior count is just a count that is added to all other counts. This avoids getting, e.g., infinite fold changes if one of the groups has zero counts in all libraries.

ADD REPLY • link 9.0 years ago Aaron Lun ★ 28k

score 0 · Answer 3 · 2015-05-07

Ok thanks a lot for your answer.

I tried to do the log 2 ratio of fitAD$coefficients but I don't find the same value as logFC. I surely miss something here...

I would like to have these "2 values" just to show where the logFC comes from, as I have for the FPKM analysis, to make things clearer for my colleagues...

Thanks again.

score 0 · Answer 4 · 2015-05-07

Hi Marie,

what you find in fitAD$coefficients are the estimated parameters from the GLM regression (note that the link function here is natural log, so this is the scale of the parameters).

You should be able to compute the logFC values by doing something similar to:

(fitAD$coefficients[,"groupB"] - fitAD$coefficients[,"groupA"]) / log(2)

Note that the division by log(2) has the only effect of transforming the natural log in base 2 log.

As for your original question, in our experience 11 genes are definitely too few to capture correctly the unwanted variation, so this could explain the strange fold-changes that you observe.

score 0 · Answer 5 · 2015-05-07

Ok thank you so much, it is clear now !!

Yes I know that 11 genes are way too few, we are working on making a longer list.

I tried to compute RUVg with k=2 and I have filtered the genes which the expression is lower than 10, and I don't have weird fold change anymore. On the contrary, they are pretty low now...

Thanks again for you help !