multiple voom runs with duplicateCorrelation and negative values in limma
2
0
Entering edit mode
Hilary ▴ 40
@hilary-8596
Last seen 22 months ago
United States

I have a general question on running limma-voom on an RNA-Seq study with repeated measures.

I have a repeated measures design, where 19 subjects were measured over multiple timepoints in two different conditions. The PCA/MDS plots indicate a strong impact of Subject, so initially I ran the study in edgeR including "subject" as a separate factor in the design matrix. However I am interesting in also applying the limma approach to allow the "subject" impact to be modeled as a random contributor; the impact of "subject" is not really of interest. I started following the standard commands, creating a DGEList object and removing low expressed genes. I kept genes with a rowSum cpm of at least 1 in 19 or more samples. (Increasing stringency to only keep genes with a rowSum cpm of at least 5 in 19 or more samples still results in the same negative counts problem noted below). Then I used calcNormFactors to normalize, set up my design matrix, and ran voom a first time. Next I estimated the correlation with duplicateCorrelation. My cor$consensus value is 0.3549613.

Following the example in 18.1.9 of the limma user guide (revision 14 Nov 2021), I then tried to run voom a second time in case the correlation could modify voom weights. However, this resulted in an error message that "negative counts are not allowed."

Digging into the issue, I found that my original filtered/normalized DGEList did NOT have negative counts; the minimum value of any single gene in any single sample was 0. However after running voom the first time, negative values were generated, so I am inhibited from running voom a second time since the dataset contains negative values.

I tried to find explanations in bioconductor, and I found one posting (Voom on TCGA data shifts count distributions towards negative values ) suggesting that it is OK to have negative values generated by a run of voom. That was a relief! However I am also seeing a posting (using duplicateCorrelation with limma+voom for RNA-seq data) suggesting that running voom twice is recommended.

What should I do if the first voom run creates negative values? Is it sufficient to run voom once? Is it better to stick to an edgeR approach and just encode subjects as a factor in the design matrix/model?

NB: I am running R version 4.2.1, edgeR version 3.38.4, limma version 3.52.2.

Thank you in advance for your help!

limma voom • 2.9k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 12 hours ago
United States

Just use voomLmFit. It handles all that for you.

ADD COMMENT
0
Entering edit mode

Thank you; I was not aware of this function.

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 4 hours ago
WEHI, Melbourne, Australia

initially I ran the study in edgeR including "subject" as a separate factor in the design matrix. However I am interesting in also applying the limma approach to allow the "subject" impact to be modeled as a random contributor; the impact of "subject" is not really of interest.

If all subjects are measured for all time points and conditions then it is normal to include subject in the design matrix, even if subject is not of interest. For balanced designs, there is no advantage is fitting it as a random effect. Random effects are more useful for unbalanced designs.

Following the example in 18.1.9 of the limma user guide (revision 14 Nov 2021), I then tried to run voom a second time in case the correlation could modify voom weights. However, this resulted in an error message that "negative counts are not allowed."

I am guessing that you have made a programming error. Look closely the example in the User's Guide and you will see that voom is run on the original counts y both times and only the input correlation is changed. You will not get an error from running voom twice, if done correctly.

As James MacDonald has pointed out, there is a new function edgeR::voomLmFit now, which will do it all automatically for you. If you specify a blocking variable, then voomLmFit will automatically run voom and duplicateCorrelation twice for you.

However after running voom the first time, negative values were generated

voom doesn't change the counts at all, so it does not generate negative counts.

I tried to find explanations in bioconductor, and I found one posting (Voom on TCGA data shifts count distributions towards negative values ) suggesting that it is OK to have negative values generated by a run of voom.

That's a different issue. That's a question about negative logCPM values, which are indeed perfectly normal.

ADD COMMENT
0
Entering edit mode

Thank you very much. I will attempt to troubleshoot my limma coding but meanwhile I am working to setup the voomLmFit. I am not finding an example in the UserGuide; I hope this is the right way to set this up (code below). My design is not perfectly balanced; a few samples failed. (I should have 19*8*2=304 samples; after removing failed points I have 288). Thank you again; I really appreciate your input.

Sys.Date()
[1] "2022-08-30"

library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:
format.pval, units

library(edgeR)
Loading required package: limma
library(stringr)

###Load data and set up variables
#Modify the read-in file names (CSVs) as appropriate

rawC = read.csv("MainCountV8.csv", header=T, row.names=1)
GC = as.matrix(rawC)
dim(GC)
[1] 60651   288
rawT = read.csv("Score_Combined_V6.csv", header=T, row.names=18)

#Setup variables

Tar = rawT
rownames(Tar) = str_replace(rownames(Tar), "-", ".")
Subj = as.factor(Tar$SubjectID) #Distinguish the 19 subjects
Score = as.numeric(Tar$Score) #Score is a covariate of discrete integers (e.g., 5, 7, 22).
Hour = as.numeric(Tar$Hour) #Hour is of the form 2, 4, 24, etc. for the 24 hours of the day
Countermeasure = as.factor(Tar$Countermeasure) #Countermeasure is a binary factor

#Set up DGEList

y = DGEList(GC)

#Filter to remove genes with low expression

keep = rowSums(cpm(y) >1) >=19
table(keep)
keep
FALSE  TRUE 
41686 18965 
yK = y[keep, , keep.lib.sizes=FALSE]
NumbGenes = nrow(yK)
NumbGenes
[1] 18965

#Normalize

yKC = calcNormFactors(yK)

#Set up model parameters to allow for circadian rhythms

CirCycle = 24
sinphaseCC = sin(2*pi*Hour/CirCycle)
cosphaseCC = cos(2*pi*Hour/CirCycle)
head(sinphaseCC)
[1]  1.224606e-16 -8.660254e-01 -8.660254e-01  0.000000e+00  8.660254e-01
[6]  8.660254e-01
head(cosphaseCC)
[1] -1.0 -0.5  0.5  1.0  0.5 -0.5

#Set up design matrix and estimate dispersion

designCC = model.matrix(~Subj + Score + Countermeasure + sinphaseCC + cosphaseCC)
yKC_CC = estimateDisp(yKC, designCC)
yKC_CC$common.dispersion
[1] 0.04326189

###Set up models and test for DEGs

fitVLF = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, save.plot=TRUE)
First intra-block correlation  0
Final intra-block correlation  0
colnames(fitVLF)
 [1] "(Intercept)"        "Subj3"              "Subj4"             
 [4] "Subj6"              "Subj8"              "Subj10"            
 [7] "Subj11"             "Subj13"             "Subj14"            
[10] "Subj15"             "Subj16"             "Subj19"            
[13] "Subj23"             "Subj25"             "Subj26"            
[16] "Subj28"             "Subj29"             "Subj30"            
[19] "Subj31"             "Score"                "CountermeasureNone"
[22] "sinphaseCC"         "cosphaseCC"        

eB_fitVLF = eBayes(fitVLF)
summary(decideTests(eB_fitVLF))

#Export DEGs for the covariate Score

DEG_Score = topTable(eB_fitVLF, coef="Score")
DEG_Score = topTable(eB_fitVLF, coef="Score", n=NumbGenes)
write.csv(DEG_Score, "DEG_Score.csv")

####

sessionInfo()

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.4.1   edgeR_3.38.4    limma_3.52.2    Hmisc_4.7-1    
[5] ggplot2_3.3.6   Formula_1.2-4   survival_3.4-0  lattice_0.20-45

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.6      tidyselect_1.1.2    xfun_0.32          
 [4] purrr_0.3.4         splines_4.2.1       colorspace_2.0-3   
 [7] vctrs_0.4.1         generics_0.1.3      htmltools_0.5.3    
[10] base64enc_0.1-3     utf8_1.2.2          rlang_1.0.4        
[13] pillar_1.8.1        foreign_0.8-82      glue_1.6.2         
[16] withr_2.5.0         DBI_1.1.3           RColorBrewer_1.1-3 
[19] jpeg_0.1-9          lifecycle_1.0.1     munsell_0.5.0      
[22] gtable_0.3.0        htmlwidgets_1.5.4   latticeExtra_0.6-30
[25] knitr_1.40          fastmap_1.1.0       fansi_1.0.3        
[28] htmlTable_2.4.1     Rcpp_1.0.9          scales_1.2.1       
[31] backports_1.4.1     checkmate_2.1.0     deldir_1.0-6       
[34] interp_1.1-3        gridExtra_2.3       png_0.1-7          
[37] digest_0.6.29       stringi_1.7.8       dplyr_1.0.9        
[40] grid_4.2.1          cli_3.3.0           tools_4.2.1        
[43] magrittr_2.0.3      tibble_3.1.8        cluster_2.1.4      
[46] pkgconfig_2.0.3     Matrix_1.4-1        data.table_1.14.2  
[49] assertthat_0.2.1    rstudioapi_0.14     R6_2.5.1           
[52] rpart_4.1.16        nnet_7.3-17         compiler_4.2.1
ADD REPLY
0
Entering edit mode

You need to specify Subj either in the design matrix or as a random block. You cannot do both as you are currently trying to do. voomLmFit recognizes the problem and so just returns correction=0, which is equivalent to not estimating a random effect at all.

voomLmFit is not mentioned in the limma User's Guide because it is new and because it is an edgeR package function, but the input arguments are almost the same as for voom. Just read the help page.

ADD REPLY
0
Entering edit mode

I re-ran the code, removing Subj from the design matrix and only using it as a random effect with the term block-Subj. This increased my common.dispersion estimate to 0.09, and resulted in a final intra-block correlation of ~0.355.

When I do this a second time, I end up with a higher correlation (code snippet and output below).

designCC = model.matrix(~Countermeasure + PVT + sinphaseCC + cosphaseCC)

fitVLF = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, save.plot=TRUE, sample.weights=TRUE)

First sample weights (min/max) 0.1374427/2.1670148

First intra-block correlation 0.4232958

Final sample weights (min/max) 0.139459/2.181011

Final intra-block correlation 0.4233958

THANK YOU Dr. Smyth and Dr. MacDonald; I truly appreciate your taking the time to review code, point out new features, explain, and help troubleshoot. I am hopeful the results of my team's research will benefit society, and am very grateful for your assistance to improve the models.

ADD REPLY
0
Entering edit mode

As I said in ny previous comment, you need to specify Subj either in the design matrix or as a random block. It doesn't make sense to run estimateDisp ignoring Subj entirely. edgeR doesn't support random effects so, for edgeR analyses, Subj should stay in the design matrix.

I don't understand what you mean by running the code a second time. You only need to run voomLmFit once. Even if you run it twice, it will give the same results both times.

ADD REPLY
0
Entering edit mode

I apologize; I did not explain myself correctly in my last post. My mention of running the code a second time was trying to refer to running voomLmFit two separate ways, one with sample weights and one without it. Following your prior advice I ONLY applied the Subj in the voomLmFit as block=Subj.

I am thinking your comment about estimateDisp needing Subj is only for traditional edgeR and NOT for voomLmFit, right?

(Code below).

Load data and set up variables

Modify the read-in file names (CSVs) as appropriate

rawC = read.csv("MainCountV8.csv", header=T, row.names=1)

GC = as.matrix(rawC)

rawT = read.csv("SCORE_Combined_V6.csv", header=T, row.names=18)

Setup variables

Tar = rawT

rownames(Tar) = str_replace(rownames(Tar), "-", ".")

Subj = as.factor(Tar$SubjectID)

SCORE = as.numeric(Tar$SCORE)

Hour = as.numeric(Tar$Hour)

Countermeasure = as.factor(Tar$Countermeasure)

Set up DGEList

y = DGEList(GC)

Filter to remove genes with low expression

keep = rowSums(cpm(y) >1) >=19

table(keep)

yK = y[keep, , keep.lib.sizes=FALSE]

NumbGenes = nrow(yK)

NumbGenes

Normalize

yKC = calcNormFactors(yK)

Set up model parameters to allow for circadian rhythms

CirCycle = 24

sinphaseCC = sin(2*pi*Hour/CirCycle)

cosphaseCC = cos(2*pi*Hour/CirCycle)

Set up design matrix and estimate dispersion

designCC = model.matrix(~Countermeasure + SCORE + sinphaseCC + cosphaseCC)

yKC_CC = estimateDisp(yKC, designCC)

colnames(designCC)

yKC_CC$common.dispersion

Set up models and test for DEGs - no sample weights

fitVLF1 = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, save.plot=TRUE)

First intra-block correlation 0.3549613

Final intra-block correlation 0.3550964

eB_fitVLF1 = eBayes(fitVLF1)

Export DEGs for the covariate SCORE

DEG_SCORE1 = topTable(eB_fitVLF1, coef="SCORE", n=NumbGenes)

write.csv(DEG_SCORE1, "DEG_SCORE1.csv")

Set up models and test for DEGs - WITH sample weights

fitVLF2 = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, sample.weights=TRUE)

First sample weights (min/max) 0.1374427/2.1670148

First intra-block correlation 0.4232958

Final sample weights (min/max) 0.139459/2.181011

Final intra-block correlation 0.4233958

eB_fitVLF2 = eBayes(fitVLF2)

Export DEGs for the covariate SCORE

DEG_SCORE2 = topTable(eB_fitVLF2, coef="SCORE", n=NumbGenes)

write.csv(DEG_SCORE2, "DEG_SCORE2.csv")

ADD REPLY
0
Entering edit mode

estimateDisp is part of edgeR pipelines and has nothing to do with voom or voomLmFit. We often include estimateDisp in a limma reports as a diagnostic data exploration, but it has no effect on any limma analysis pipeline.

ADD REPLY
0
Entering edit mode

Very good to know; thank you. Again I appreciate your help!

ADD REPLY

Login before adding your answer.

Traffic: 564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6