Hi,
I have the following study design:
PatientID Time_Resp R or NR at T1 Timepoint
1 T0_no No T0
1 T1_no No T1
2 T0_yes Yes T0
2 T1_yes Yes T1
3 T0_no No T0
3 T1_no No T1
4 T0_yes Yes T0
4 T1_yes Yes T1
Questions.
Patients are sampled at two timepoints; before and after treatment with a study drug. I have 2 (T0 T1) x20 patients. Some of them respond (n = 9) to the drug and some do not (n = 11). Are there any differences in genexpression over time in general or between groups over time.
Q1) As I have repeated sampling of the same individuals I consider it would be appropriate to use Limma and Duplicate correlation (in comparisons involving paired samples) - correct?
Q2) We have discussed using Deseq2 as well, but if I understand, the duplicate correlation function is not available there and therefore it is better to use the code above and limma in the entire project (even in comparisons only involving unpaired data such as within T0).
Q3) I have basic skills in bioinformnatics and have followed the limma manual 9.7, thus can you check that my code is correct please, especially regarding paired data and dupl corr?
Q4) Can it be improved to fx exclude potential outliers (if that is not done by default), I have also read in other Bioconductor messages fx https://support.bioconductor.org/p/59700/, that it is recommended to do voom twise (although the manual doesnt say anything about repeated voom). Is that what you recommend and how and where can I incoorporate it correctly in my code?
Q5) Also I have read that the Duplicate correlation function takes time to excecute - that is not my experience and therefore my question regarding the code in Q3 - with my data this step takes less than a minute -- have I done anything wrong when defining block variables etc?
I really appreciate your comments
All the best //Sofie
# load packages and get version numbers using sapply()
sapply(c("Rsubread", "limma", "edgeR", "Glimma", "RColorBrewer", "gplots"), library, character.only = TRUE)
sapply(c("Rsubread", "limma", "edgeR", "Glimma", "RColorBrewer", "gplots"), packageVersion)
#there is a lot of code here between (based on Law 2017 RNAseq is easy as 1-2-3 ...) not relevant to the query
#Groupcomparison using duplicate correlation. first design matrix and contrasts:
Fourgroup <- factor(targets$Time_Resp, levels = c('T0_yes','T1_yes','T0_no','T1_no'))
design <- model.matrix( ~ 0 + Fourgroup)
colnames(design) <- gsub("Fourgroup", "", colnames(design))
contr.matrix <- makeContrasts(
Responder_T1vsT0 = T1_yes-T0_yes,
Nonresponder_T1vsT0 = T1_no-T0_no,
DiffbwDiffRespNonresp = (T1_yes-T0_yes)-(T1_no-T0_no), levels=design)
# show the number of samples per category
colSums(design)
#Removing heteroscedascity with voom
v <- voom(x, design, plot=TRUE) #x = DGElist
#Fitting linear models (lmFit) for comparisons of interest, and statistics using eBayes
#Unpaired data - not used in my query but appreciate check code correct?
#vfit <- lmFit(v, design)
#vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
#efit <- eBayes(vfit)
#PAIRED DATA IN THE EXAMPLE - check code correct?
PatientID <- factor(targets$PatientID) #to be used as block-variable
table(PatientID) #just to check 2 datapoints each
corfit <- duplicateCorrelation(v, design, block=PatientID)
corfit$consensus ## = 0.614, excpected to be positive in study with repeated sampling of the same patient at different time points
vfit <- lmFit(v, design, block=PatientID, correlation=corfit$consensus)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
View(summary(decideTests(efit)))
head(efit$coefficients, n = 2)
#List all genes and corrsponding Log2FC, P-values, BH adjusted P-values to investigate further
C1 <- topTable(efit, coef=1, n=Inf, sort.by = "P")
C2 <- topTable(efit, coef=2, n=Inf, sort.by = "P")
C3 <- topTable(efit, coef=3, n=Inf, sort.by = "P")
#and more code follows fx ranking of lists for GSEA etc
OP has a multlevel experiment, because response is between patients. Hence paired analysis is not sufficient and OP is following limma User's Guide Section 9.7.
Thank you, here in my example i listed only a few patients to describe how the study is built, but as I describe above in the text I have 20x2 samples in total (each patient n= 20 is sampled before and after drug exposure and they diverge into responders or non-responders at follow up), tot 40 datapoints; so stick to voom, lmma and duplicate correlation?
Recommend to do voom twize?
Try
voomLmFit
in the edgeR package, which automates calls to voom and duplicateCorrelation.Thank you for this advise
However trying to run the following gives me some error messages I dont understand.
vfit = voomLmFit(x, design, block = ID_PairedSamples, prior.weights = NULL, sample.weights = FALSE, var.design = NULL, var.group = NULL, lib.size = NULL, normalize.method = "none", span = 0.5, plot = TRUE, save.plot = FALSE)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
Code copied from https://rdrr.io/bioc/edgeR/man/voomLmFit.html.
Q1: First of all - shall the code run as above - obviously something wrong
Q2: What does the error messages mean (tried google it but didnt get any solution for a non-IT/non-bioinformatics). Any problem with the version of my package?:
$Rsubread [1] 2 0 1
$limma [1] 3 42 2
$edgeR [1] 3 28 1
$Glimma [1] 1 14 0
$RColorBrewer [1] 1 1 2
$gplots [1] 3 0 4
Q3: I also tried just to simply run voom and duplicate correlation two times each as described in https://support.bioconductor.org/p/59700 (#Double voom). With my data I got a correlation coeff of 0.6138559 after one round of voom and after two rounds the corfit consensus was 0.6142262 - thus no bigger difference. Comment ?
I really appreciate your comments on these issues as well. Please mind the uneducated reader here.
//All the best
Please have a read of the posting guide: http://www.bioconductor.org/help/support/posting-guide/
Yes, there is a problem with the package versions you are using. When you post a question here it is normally expected that you will be using the current release of Bioconductor. Your packages are two Bioconductor releases out of date. You are using Bioconductor 3.10 instead of Bioconductor 3.12.
Please consider using
sessionInfo
as advised in the posting guide, which provides a more complete way to see package versions.The old packages you are using don't have a function called
voomLmFit
, so I wonder how you have run it. It seems from your post that you have gone to a 3rd party website, found a code version of voomLmFit and sourced that into your R session. That are many problems with doing that.The first error message appears to be a problem with your R installation. I haven't seen the error message before, but it is presumably due to the fact that you have mixed code from different Bioconductor releases. Or possibly because your machine doesn't have enough memory.
You should be using the documentation that comes with limma and edgeR rather than googling for a 3rd party website like
rdrr.io
. Just type ?voomLmFit at the R prompt. That way you can be sure that the documentation matches the software version you are using.There is no need to copy entire usage lines from the documenation into your R session. You only need to specify arguments that you are specifying a non-default value for, for example:
As far as your analysis is concerned, your manual running of voom() and duplicateCorrelation() twice appears to be fine and probably similar to what you would get from voomLmFit(). Yes I agree the consensus correlation appears to have converged.
Some of my questions stems from our lack of experience, so please apologize me. Via the function BiocManager::version() I received the same as in your answer: version 3.10. So that explains alot. Also when I didn’t get any output in R for ?voomLmFit I assumed I did something wrong and googled it instead. However, we also learn along the way and are getting more familiar with the Bioconductor project and found the info also here https://bioconductor.org/packages/devel/bioc/manuals/edgeR/man/edgeR.pdf. I thank you all for all the helpful feedback regarding duplicate correlation and voomLmFit.
However also related to the my topic and my code is the filtering out of lowly expressed genes before the statistics You mention in https://support.bioconductor.org/p/p133796/ , related with my topic, that filtering out genes with low expression should be done using the design. In my code I have used the recommendations given in Law 2017 RNAseq is easy as 1-2-3 ... using the smallest study group (my responders n = 9, and a cut of, of 2 cpm). Briefly described in the code below.
<h2>1 ### Assemble a DGEList object</h2> <h2>2 ### Remove genes with low expression, keep all with >2 cpm in at least 9 samples (smallest study group)</h2> <h2>3 ### Normalising gene expression distributions TMM.</h2> <h2>4 ### Creating a design matrix and contrasts,</h2>and so on for group comparisons and with duplicate correlation for paired data as described above in my thread
<h2>Question:</h2>Would you recommend to insert the “create a design matrix and contrasts” before step ##2 and then replace my step ##2 remove genes with low expression, with the filtering as below? Or is it just different ways of reaching the same goal?
<h2>Filtering out lowly expressed genes</h2>The code runs smoothly but due to our basic skills I would appreciate your feedback on this one as well and I thank you so much in advance!
Please see the latest revision of the article by Law et al at
https://f1000research.com/articles/5-1408/v3
or
https://www.bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html
Our recommendations for filtering lowly expressed genes are discussed at length in that paper.