multiple voom runs with duplicateCorrelation and negative values in limma
Entering edit mode
Hilary ▴ 30
@hilary-8596
Last seen 13 months ago
United States

I have a general question on running limma-voom on an RNA-Seq study with repeated measures.

I have a repeated measures design, where 19 subjects were measured over multiple timepoints in two different conditions. The PCA/MDS plots indicate a strong impact of Subject, so initially I ran the study in edgeR including "subject" as a separate factor in the design matrix. However I am interesting in also applying the limma approach to allow the "subject" impact to be modeled as a random contributor; the impact of "subject" is not really of interest. I started following the standard commands, creating a DGEList object and removing low expressed genes. I kept genes with a rowSum cpm of at least 1 in 19 or more samples. (Increasing stringency to only keep genes with a rowSum cpm of at least 5 in 19 or more samples still results in the same negative counts problem noted below). Then I used calcNormFactors to normalize, set up my design matrix, and ran voom a first time. Next I estimated the correlation with duplicateCorrelation. My cor$consensus value is 0.3549613.

Following the example in 18.1.9 of the limma user guide (revision 14 Nov 2021), I then tried to run voom a second time in case the correlation could modify voom weights. However, this resulted in an error message that "negative counts are not allowed."

Digging into the issue, I found that my original filtered/normalized DGEList did NOT have negative counts; the minimum value of any single gene in any single sample was 0. However after running voom the first time, negative values were generated, so I am inhibited from running voom a second time since the dataset contains negative values.

I tried to find explanations in bioconductor, and I found one posting (Voom on TCGA data shifts count distributions towards negative values ) suggesting that it is OK to have negative values generated by a run of voom. That was a relief! However I am also seeing a posting (using duplicateCorrelation with limma+voom for RNA-seq data) suggesting that running voom twice is recommended.

What should I do if the first voom run creates negative values? Is it sufficient to run voom once? Is it better to stick to an edgeR approach and just encode subjects as a factor in the design matrix/model?

NB: I am running R version 4.2.1, edgeR version 3.38.4, limma version 3.52.2.

Thank you in advance for your help!

limma voom • 2.0k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 2 hours ago
United States

Just use voomLmFit. It handles all that for you.

ADD COMMENT
0
Entering edit mode

Thank you; I was not aware of this function.

ADD REPLY
0
Entering edit mode
@gordon-smyth
Last seen 8 hours ago
WEHI, Melbourne, Australia

initially I ran the study in edgeR including "subject" as a separate factor in the design matrix. However I am interesting in also applying the limma approach to allow the "subject" impact to be modeled as a random contributor; the impact of "subject" is not really of interest.

If all subjects are measured for all time points and conditions then it is normal to include subject in the design matrix, even if subject is not of interest. For balanced designs, there is no advantage is fitting it as a random effect. Random effects are more useful for unbalanced designs.

Following the example in 18.1.9 of the limma user guide (revision 14 Nov 2021), I then tried to run voom a second time in case the correlation could modify voom weights. However, this resulted in an error message that "negative counts are not allowed."

I am guessing that you have made a programming error. Look closely the example in the User's Guide and you will see that voom is run on the original counts y both times and only the input correlation is changed. You will not get an error from running voom twice, if done correctly.

As James MacDonald has pointed out, there is a new function edgeR::voomLmFit now, which will do it all automatically for you. If you specify a blocking variable, then voomLmFit will automatically run voom and duplicateCorrelation twice for you.

However after running voom the first time, negative values were generated

voom doesn't change the counts at all, so it does not generate negative counts.

I tried to find explanations in bioconductor, and I found one posting (Voom on TCGA data shifts count distributions towards negative values ) suggesting that it is OK to have negative values generated by a run of voom.

That's a different issue. That's a question about negative logCPM values, which are indeed perfectly normal.

ADD COMMENT
0
Entering edit mode

Thank you very much. I will attempt to troubleshoot my limma coding but meanwhile I am working to setup the voomLmFit. I am not finding an example in the UserGuide; I hope this is the right way to set this up (code below). My design is not perfectly balanced; a few samples failed. (I should have 19*8*2=304 samples; after removing failed points I have 288). Thank you again; I really appreciate your input.

Sys.Date()
[1] "2022-08-30"

library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Loading required package: ggplot2

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:base’:
format.pval, units

library(edgeR)
Loading required package: limma
library(stringr)

###Load data and set up variables
#Modify the read-in file names (CSVs) as appropriate

rawC = read.csv("MainCountV8.csv", header=T, row.names=1)
GC = as.matrix(rawC)
dim(GC)
[1] 60651   288
rawT = read.csv("Score_Combined_V6.csv", header=T, row.names=18)

#Setup variables

Tar = rawT
rownames(Tar) = str_replace(rownames(Tar), "-", ".")
Subj = as.factor(Tar$SubjectID) #Distinguish the 19 subjects
Score = as.numeric(Tar$Score) #Score is a covariate of discrete integers (e.g., 5, 7, 22).
Hour = as.numeric(Tar$Hour) #Hour is of the form 2, 4, 24, etc. for the 24 hours of the day
Countermeasure = as.factor(Tar$Countermeasure) #Countermeasure is a binary factor

#Set up DGEList

y = DGEList(GC)

#Filter to remove genes with low expression

keep = rowSums(cpm(y) >1) >=19
table(keep)
keep
FALSE  TRUE 
41686 18965 
yK = y[keep, , keep.lib.sizes=FALSE]
NumbGenes = nrow(yK)
NumbGenes
[1] 18965

#Normalize

yKC = calcNormFactors(yK)

#Set up model parameters to allow for circadian rhythms

CirCycle = 24
sinphaseCC = sin(2*pi*Hour/CirCycle)
cosphaseCC = cos(2*pi*Hour/CirCycle)
head(sinphaseCC)
[1]  1.224606e-16 -8.660254e-01 -8.660254e-01  0.000000e+00  8.660254e-01
[6]  8.660254e-01
head(cosphaseCC)
[1] -1.0 -0.5  0.5  1.0  0.5 -0.5

#Set up design matrix and estimate dispersion

designCC = model.matrix(~Subj + Score + Countermeasure + sinphaseCC + cosphaseCC)
yKC_CC = estimateDisp(yKC, designCC)
yKC_CC$common.dispersion
[1] 0.04326189

###Set up models and test for DEGs

fitVLF = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, save.plot=TRUE)
First intra-block correlation  0
Final intra-block correlation  0
colnames(fitVLF)
 [1] "(Intercept)"        "Subj3"              "Subj4"             
 [4] "Subj6"              "Subj8"              "Subj10"            
 [7] "Subj11"             "Subj13"             "Subj14"            
[10] "Subj15"             "Subj16"             "Subj19"            
[13] "Subj23"             "Subj25"             "Subj26"            
[16] "Subj28"             "Subj29"             "Subj30"            
[19] "Subj31"             "Score"                "CountermeasureNone"
[22] "sinphaseCC"         "cosphaseCC"        

eB_fitVLF = eBayes(fitVLF)
summary(decideTests(eB_fitVLF))

#Export DEGs for the covariate Score

DEG_Score = topTable(eB_fitVLF, coef="Score")
DEG_Score = topTable(eB_fitVLF, coef="Score", n=NumbGenes)
write.csv(DEG_Score, "DEG_Score.csv")

####

sessionInfo()

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.4.1   edgeR_3.38.4    limma_3.52.2    Hmisc_4.7-1    
[5] ggplot2_3.3.6   Formula_1.2-4   survival_3.4-0  lattice_0.20-45

loaded via a namespace (and not attached):
 [1] locfit_1.5-9.6      tidyselect_1.1.2    xfun_0.32          
 [4] purrr_0.3.4         splines_4.2.1       colorspace_2.0-3   
 [7] vctrs_0.4.1         generics_0.1.3      htmltools_0.5.3    
[10] base64enc_0.1-3     utf8_1.2.2          rlang_1.0.4        
[13] pillar_1.8.1        foreign_0.8-82      glue_1.6.2         
[16] withr_2.5.0         DBI_1.1.3           RColorBrewer_1.1-3 
[19] jpeg_0.1-9          lifecycle_1.0.1     munsell_0.5.0      
[22] gtable_0.3.0        htmlwidgets_1.5.4   latticeExtra_0.6-30
[25] knitr_1.40          fastmap_1.1.0       fansi_1.0.3        
[28] htmlTable_2.4.1     Rcpp_1.0.9          scales_1.2.1       
[31] backports_1.4.1     checkmate_2.1.0     deldir_1.0-6       
[34] interp_1.1-3        gridExtra_2.3       png_0.1-7          
[37] digest_0.6.29       stringi_1.7.8       dplyr_1.0.9        
[40] grid_4.2.1          cli_3.3.0           tools_4.2.1        
[43] magrittr_2.0.3      tibble_3.1.8        cluster_2.1.4      
[46] pkgconfig_2.0.3     Matrix_1.4-1        data.table_1.14.2  
[49] assertthat_0.2.1    rstudioapi_0.14     R6_2.5.1           
[52] rpart_4.1.16        nnet_7.3-17         compiler_4.2.1
ADD REPLY
0
Entering edit mode

You need to specify Subj either in the design matrix or as a random block. You cannot do both as you are currently trying to do. voomLmFit recognizes the problem and so just returns correction=0, which is equivalent to not estimating a random effect at all.

voomLmFit is not mentioned in the limma User's Guide because it is new and because it is an edgeR package function, but the input arguments are almost the same as for voom. Just read the help page.

ADD REPLY
0
Entering edit mode

I re-ran the code, removing Subj from the design matrix and only using it as a random effect with the term block-Subj. This increased my common.dispersion estimate to 0.09, and resulted in a final intra-block correlation of ~0.355.

When I do this a second time, I end up with a higher correlation (code snippet and output below).

designCC = model.matrix(~Countermeasure + PVT + sinphaseCC + cosphaseCC)

fitVLF = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, save.plot=TRUE, sample.weights=TRUE)

First sample weights (min/max) 0.1374427/2.1670148

First intra-block correlation 0.4232958

Final sample weights (min/max) 0.139459/2.181011

Final intra-block correlation 0.4233958

THANK YOU Dr. Smyth and Dr. MacDonald; I truly appreciate your taking the time to review code, point out new features, explain, and help troubleshoot. I am hopeful the results of my team's research will benefit society, and am very grateful for your assistance to improve the models.

ADD REPLY
0
Entering edit mode

As I said in ny previous comment, you need to specify Subj either in the design matrix or as a random block. It doesn't make sense to run estimateDisp ignoring Subj entirely. edgeR doesn't support random effects so, for edgeR analyses, Subj should stay in the design matrix.

I don't understand what you mean by running the code a second time. You only need to run voomLmFit once. Even if you run it twice, it will give the same results both times.

ADD REPLY
0
Entering edit mode

I apologize; I did not explain myself correctly in my last post. My mention of running the code a second time was trying to refer to running voomLmFit two separate ways, one with sample weights and one without it. Following your prior advice I ONLY applied the Subj in the voomLmFit as block=Subj.

I am thinking your comment about estimateDisp needing Subj is only for traditional edgeR and NOT for voomLmFit, right?

(Code below).

Load data and set up variables

Modify the read-in file names (CSVs) as appropriate

rawC = read.csv("MainCountV8.csv", header=T, row.names=1)

GC = as.matrix(rawC)

rawT = read.csv("SCORE_Combined_V6.csv", header=T, row.names=18)

Setup variables

Tar = rawT

rownames(Tar) = str_replace(rownames(Tar), "-", ".")

Subj = as.factor(Tar$SubjectID)

SCORE = as.numeric(Tar$SCORE)

Hour = as.numeric(Tar$Hour)

Countermeasure = as.factor(Tar$Countermeasure)

Set up DGEList

y = DGEList(GC)

Filter to remove genes with low expression

keep = rowSums(cpm(y) >1) >=19

table(keep)

yK = y[keep, , keep.lib.sizes=FALSE]

NumbGenes = nrow(yK)

NumbGenes

Normalize

yKC = calcNormFactors(yK)

Set up model parameters to allow for circadian rhythms

CirCycle = 24

sinphaseCC = sin(2*pi*Hour/CirCycle)

cosphaseCC = cos(2*pi*Hour/CirCycle)

Set up design matrix and estimate dispersion

designCC = model.matrix(~Countermeasure + SCORE + sinphaseCC + cosphaseCC)

yKC_CC = estimateDisp(yKC, designCC)

colnames(designCC)

yKC_CC$common.dispersion

Set up models and test for DEGs - no sample weights

fitVLF1 = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, save.plot=TRUE)

First intra-block correlation 0.3549613

Final intra-block correlation 0.3550964

eB_fitVLF1 = eBayes(fitVLF1)

Export DEGs for the covariate SCORE

DEG_SCORE1 = topTable(eB_fitVLF1, coef="SCORE", n=NumbGenes)

write.csv(DEG_SCORE1, "DEG_SCORE1.csv")

Set up models and test for DEGs - WITH sample weights

fitVLF2 = voomLmFit(counts=yKC_CC, design=designCC, block=Subj, plot=TRUE, sample.weights=TRUE)

First sample weights (min/max) 0.1374427/2.1670148

First intra-block correlation 0.4232958

Final sample weights (min/max) 0.139459/2.181011

Final intra-block correlation 0.4233958

eB_fitVLF2 = eBayes(fitVLF2)

Export DEGs for the covariate SCORE

DEG_SCORE2 = topTable(eB_fitVLF2, coef="SCORE", n=NumbGenes)

write.csv(DEG_SCORE2, "DEG_SCORE2.csv")

ADD REPLY
0
Entering edit mode

estimateDisp is part of edgeR pipelines and has nothing to do with voom or voomLmFit. We often include estimateDisp in a limma reports as a diagnostic data exploration, but it has no effect on any limma analysis pipeline.

ADD REPLY
0
Entering edit mode

Very good to know; thank you. Again I appreciate your help!

ADD REPLY

Login before adding your answer.

Traffic: 727 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6