Duplicate correlation with limma in before-after exposure experiment
0
Entering edit mode
@283b118b
Last seen 8 weeks ago

Hi,

I have the following study design:

PatientID Time_Resp R or NR at T1 Timepoint

1 T0_no No T0

1 T1_no No T1

2 T0_yes Yes T0

2 T1_yes Yes T1

3 T0_no No T0

3 T1_no No T1

4 T0_yes Yes T0

4 T1_yes Yes T1

Questions.

Patients are sampled at two timepoints; before and after treatment with a study drug. I have 2 (T0 T1) x20 patients. Some of them respond (n = 9) to the drug and some do not (n = 11). Are there any differences in genexpression over time in general or between groups over time.

Q1) As I have repeated sampling of the same individuals I consider it would be appropriate to use Limma and Duplicate correlation (in comparisons involving paired samples) - correct?

Q2) We have discussed using Deseq2 as well, but if I understand, the duplicate correlation function is not available there and therefore it is better to use the code above and limma in the entire project (even in comparisons only involving unpaired data such as within T0).

Q3) I have basic skills in bioinformnatics and have followed the limma manual 9.7, thus can you check that my code is correct please, especially regarding paired data and dupl corr?

Q4) Can it be improved to fx exclude potential outliers (if that is not done by default), I have also read in other Bioconductor messages fx https://support.bioconductor.org/p/59700/, that it is recommended to do voom twise (although the manual doesnt say anything about repeated voom). Is that what you recommend and how and where can I incoorporate it correctly in my code?

Q5) Also I have read that the Duplicate correlation function takes time to excecute - that is not my experience and therefore my question regarding the code in Q3 - with my data this step takes less than a minute -- have I done anything wrong when defining block variables etc?

All the best //Sofie

# load packages and get version numbers using sapply()
sapply(c("Rsubread", "limma", "edgeR", "Glimma", "RColorBrewer", "gplots"), library, character.only = TRUE)
sapply(c("Rsubread", "limma", "edgeR", "Glimma", "RColorBrewer", "gplots"), packageVersion)

#there is a lot of code here between (based on Law 2017 RNAseq is easy as 1-2-3 ...) not relevant to the query

#Groupcomparison using duplicate correlation. first design matrix and contrasts:
Fourgroup <- factor(targets$Time_Resp, levels = c('T0_yes','T1_yes','T0_no','T1_no')) design <- model.matrix( ~ 0 + Fourgroup) colnames(design) <- gsub("Fourgroup", "", colnames(design)) contr.matrix <- makeContrasts( Responder_T1vsT0 = T1_yes-T0_yes, Nonresponder_T1vsT0 = T1_no-T0_no, DiffbwDiffRespNonresp = (T1_yes-T0_yes)-(T1_no-T0_no), levels=design) # show the number of samples per category colSums(design) #Removing heteroscedascity with voom v <- voom(x, design, plot=TRUE) #x = DGElist #Fitting linear models (lmFit) for comparisons of interest, and statistics using eBayes #Unpaired data - not used in my query but appreciate check code correct? #vfit <- lmFit(v, design) #vfit <- contrasts.fit(vfit, contrasts=contr.matrix) #efit <- eBayes(vfit) #PAIRED DATA IN THE EXAMPLE - check code correct? PatientID <- factor(targets$PatientID)  #to be used as block-variable
table(PatientID) #just to check 2 datapoints each

corfit <- duplicateCorrelation(v, design, block=PatientID)
corfit$consensus ## = 0.614, excpected to be positive in study with repeated sampling of the same patient at different time points vfit <- lmFit(v, design, block=PatientID, correlation=corfit$consensus)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)

View(summary(decideTests(efit)))
head(efit$coefficients, n = 2) #List all genes and corrsponding Log2FC, P-values, BH adjusted P-values to investigate further C1 <- topTable(efit, coef=1, n=Inf, sort.by = "P") C2 <- topTable(efit, coef=2, n=Inf, sort.by = "P") C3 <- topTable(efit, coef=3, n=Inf, sort.by = "P") #and more code follows fx ranking of lists for GSEA etc  RNASeqR • 316 views ADD COMMENTlink 0 Entering edit mode @james-w-macdonald-5106 Last seen 25 minutes ago United States I don't see anything obviously wrong with your code. However, if you have complete pairs, I would tend to simply block on patient, in which case you could use edgeR, limma, or DESeq2, depending on your preference. In other words, see section 3.4.1 in the edgeR User's Guide, or section 9.4.1 in the limma User's Guide, or the DESeq2 FAQ. 0 Entering edit mode OP has a multlevel experiment, because response is between patients. Hence paired analysis is not sufficient and OP is following limma User's Guide Section 9.7. ADD REPLYlink 0 Entering edit mode Thank you, here in my example i listed only a few patients to describe how the study is built, but as I describe above in the text I have 20x2 samples in total (each patient n= 20 is sampled before and after drug exposure and they diverge into responders or non-responders at follow up), tot 40 datapoints; so stick to voom, lmma and duplicate correlation? Recommend to do voom twize? ADD REPLYlink 0 Entering edit mode Try voomLmFit in the edgeR package, which automates calls to voom and duplicateCorrelation. ADD REPLYlink 0 Entering edit mode Thank you for this advise However trying to run the following gives me some error messages I dont understand. vfit = voomLmFit(x, design, block = ID_PairedSamples, prior.weights = NULL, sample.weights = FALSE, var.design = NULL, var.group = NULL, lib.size = NULL, normalize.method = "none", span = 0.5, plot = TRUE, save.plot = FALSE) vfit <- contrasts.fit(vfit, contrasts=contr.matrix) efit <- eBayes(vfit) Error 1) received after …., save.plot = FALSE: C stack usage 7972548 is too close to the limit Error 2) Received after: vfit <- contrasts.fit(vfit, contrasts=contr.matrix) Error in contrasts.fit(vfit, contrasts = contr.matrix) : Number of rows of contrast matrix must match number of coefficients in fit In addition: Warning messages: 1: In rn != cn : longer object length is not a multiple of shorter object length 2: In contrasts.fit(vfit, contrasts = contr.matrix) : row names of contrasts don't match col names of coefficients Code copied from https://rdrr.io/bioc/edgeR/man/voomLmFit.html. Q1: First of all - shall the code run as above - obviously something wrong Q2: What does the error messages mean (tried google it but didnt get any solution for a non-IT/non-bioinformatics). Any problem with the version of my package?:$Rsubread [1] 2 0 1

$limma [1] 3 42 2$edgeR [1] 3 28 1

$Glimma [1] 1 14 0$RColorBrewer [1] 1 1 2

$gplots [1] 3 0 4 Q3: I also tried just to simply run voom and duplicate correlation two times each as described in https://support.bioconductor.org/p/59700 (#Double voom). With my data I got a correlation coeff of 0.6138559 after one round of voom and after two rounds the corfit consensus was 0.6142262 - thus no bigger difference. Comment ? I really appreciate your comments on these issues as well. Please mind the uneducated reader here. //All the best ADD REPLYlink 0 Entering edit mode Please have a read of the posting guide: http://www.bioconductor.org/help/support/posting-guide/ Yes, there is a problem with the package versions you are using. When you post a question here it is normally expected that you will be using the current release of Bioconductor. Your packages are two Bioconductor releases out of date. You are using Bioconductor 3.10 instead of Bioconductor 3.12. Please consider using sessionInfo as advised in the posting guide, which provides a more complete way to see package versions. The old packages you are using don't have a function called voomLmFit, so I wonder how you have run it. It seems from your post that you have gone to a 3rd party website, found a code version of voomLmFit and sourced that into your R session. That are many problems with doing that. The first error message appears to be a problem with your R installation. I haven't seen the error message before, but it is presumably due to the fact that you have mixed code from different Bioconductor releases. Or possibly because your machine doesn't have enough memory. You should be using the documentation that comes with limma and edgeR rather than googling for a 3rd party website like rdrr.io. Just type ?voomLmFit at the R prompt. That way you can be sure that the documentation matches the software version you are using. There is no need to copy entire usage lines from the documenation into your R session. You only need to specify arguments that you are specifying a non-default value for, for example: vfit = voomLmFit(x, design, block = ID_PairedSamples, plot = TRUE)  As far as your analysis is concerned, your manual running of voom() and duplicateCorrelation() twice appears to be fine and probably similar to what you would get from voomLmFit(). Yes I agree the consensus correlation appears to have converged. ADD REPLYlink 0 Entering edit mode Some of my questions stems from our lack of experience, so please apologize me. Via the function BiocManager::version() I received the same as in your answer: version 3.10. So that explains alot. Also when I didn’t get any output in R for ?voomLmFit I assumed I did something wrong and googled it instead. However, we also learn along the way and are getting more familiar with the Bioconductor project and found the info also here https://bioconductor.org/packages/devel/bioc/manuals/edgeR/man/edgeR.pdf. I thank you all for all the helpful feedback regarding duplicate correlation and voomLmFit. However also related to the my topic and my code is the filtering out of lowly expressed genes before the statistics You mention in https://support.bioconductor.org/p/p133796/ , related with my topic, that filtering out genes with low expression should be done using the design. In my code I have used the recommendations given in Law 2017 RNAseq is easy as 1-2-3 ... using the smallest study group (my responders n = 9, and a cut of, of 2 cpm). Briefly described in the code below. <h2>1 ### Assemble a DGEList object</h2> # DGEList() fc = featurecount, x = DGEList, groups created earlier x <- DGEList(fc$counts, group = group, genes = fc$annotation[,c("GeneID","Length")]) x dim(x)  <h2>2 ### Remove genes with low expression, keep all with >2 cpm in at least 9 samples (smallest study group)</h2> summary(group) #number of subjects per study group cpm_all_genes <- cpm(x, normalized.lib.sizes = TRUE) keep.exprs <- rowSums(cpm_all_genes > 2) >= 9 x <- x[keep.exprs,, keep.lib.sizes = FALSE] dim(x)  <h2>3 ### Normalising gene expression distributions TMM.</h2> x <- calcNormFactors(x, method = "TMM") x$samples\$norm.factors

<h2>4 ### Creating a design matrix and contrasts,</h2>

# and so on for group comparisons and with duplicate correlation for paired data as described above in my thread

<h2>Question:</h2>

Would you recommend to insert the “create a design matrix and contrasts” before step ##2 and then replace my step ##2 remove genes with low expression, with the filtering as below? Or is it just different ways of reaching the same goal?

<h2>Filtering out lowly expressed genes</h2>
keep.expr <- filterByExpr(x, design)
x <- x[keep.exprs,, keep.lib.sizes = FALSE]
dim(x)


Then followed by TMM normalization, voom and duplicate correlation and lmfit

The code runs smoothly but due to our basic skills I would appreciate your feedback on this one as well and I thank you so much in advance!

0
Entering edit mode

Please see the latest revision of the article by Law et al at

https://f1000research.com/articles/5-1408/v3

or

https://www.bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html

Our recommendations for filtering lowly expressed genes are discussed at length in that paper.

0
Entering edit mode
@gordon-smyth
Last seen 15 minutes ago
WEHI, Melbourne, Australia

with my data this step takes less than a minute

You have only 8 samples so duplicateCorrelation is quick. Larger datasets will take a bit longer.