how plotSA shall look like for good or bad data, any example of good or bad data shall look like in plotSA regarding whether the data is OK for limma analysis or not
1
0
Entering edit mode
Ming ▴ 360
@ming-yi-6363
Last seen 3 months ago
United States

Hi, there:

I have a question about how plotSA shall look like for good or bad data, any example of good or bad data shall look like in plotSA regarding whether the data is OK for limma analysis or not?

I indeed looked through web or the user guide of limma or the limma method paper, I could not find example plots of good or bad plotSA plots shall look like. So hereby I wonder if one can offer examples on what kind of plotSA plots shall look like or what criteria to determine regarding to whether the data is OK for limma analysis or not. Since if plotSA plot can be a diagnosis tool to determine if limma can be applied to the data as discussed initially and partially at URL: old discussion, it is better to know how and what is OK and what is not OK look like. I attached example of my plotSA plot here with also plotMD plots for reference here and code is below. The plotSA plot for my data look kind of bent and skewed, although plotMD look great.

Ming

Code should be placed in three backticks as shown below


> show(mydata[1:5,1:3]);
COLON.Mut.LS513 COLON.Mut.CL11 COLON.Mut.LS1034
A1BG..1.             0.009157821   -0.001640884      -0.05054882
A1CF..29974.        -0.021848711    0.036079619       0.05448993
A2M..2.              0.099182798    0.105247983       0.04387373
A2ML1..144568.       0.107724902    0.043404811       0.02711932
A3GALT2..127550.    -0.045207710   -0.146916234      -0.37110402
> show(tar[1:5,c(8,7,3)])
TypesLines RAS Types
COLON.Mut.LS513     COLON.Mut.LS513 Mut COLON
COLON.Mut.CL11       COLON.Mut.CL11 Mut COLON
COLON.Mut.LS1034   COLON.Mut.LS1034 Mut COLON
COLON.Mut.SNU1033 COLON.Mut.SNU1033 Mut COLON
COLON.Mut.COLO678 COLON.Mut.COLO678 Mut COLON

eset <- ExpressionSet(assayData = as.matrix(mydata))

group1<-paste(tar$Types,tar$RAS,sep=".");
group<-factor(group1,levels=c("COLON.Mut","COLON.WT","LUNG.Mut","LUNG.WT"));
design <- model.matrix(~0+group)
colnames(design) <- levels(group)

fit <- lmFit(eset, design)

> show(fit)
An object of class "MArrayLM"
$coefficients COLON.Mut COLON.WT LUNG.Mut LUNG.WT A1BG..1. -0.006184935 -0.01589437 -0.01650645 -0.03327325 A1CF..29974. -0.036653106 -0.05470048 -0.03705100 -0.05318098 A2M..2. 0.059910009 0.05123221 0.08906358 0.07615483 A2ML1..144568. 0.062222203 0.12934582 0.10457048 0.12544309 A3GALT2..127550. -0.142014114 -0.12313766 -0.15668482 -0.17035788 17640 more rows ... ....... expName<-paste("Test_PlotSA_LungColon_V2", "",sep="");; pngF<-paste(outPath,paste(expName,".png",sep=""),sep="/"); png(file= pngF, width = 1000, height = 400); par(mfrow=c(1, 3)) cont.matrix<-makeContrasts(COLON.Mut_WT=COLON.Mut-COLON.WT, LUNG.Mut_WT=LUNG.Mut-LUNG.WT,levels=design); fit2 <- contrasts.fit(fit, cont.matrix) fit3 <- eBayes(fit2) > show(fit3) An object of class "MArrayLM"$coefficients
Contrasts
COLON.Mut_WT LUNG.Mut_WT
A1BG..1.          0.009709434  0.01676680
A1CF..29974.      0.018047372  0.01612999
A2M..2.           0.008677799  0.01290875
A2ML1..144568.   -0.067123620 -0.02087261
A3GALT2..127550. -0.018876454  0.01367307
17640 more rows ...
.......

plotMD(fit3, coef=1,main=paste0("Mean-Difference Plot:", colnames(fit3$coefficients)[1])); abline(0,0,col="blue") plotMD(fit3, coef=2,main=paste0("Mean-Difference Plot:", colnames(fit3$coefficients)[2]));
abline(0,0,col="blue")
plotSA(fit3, main="Residual standard deviation vs. average log expression")

dev.off()

I atttached the image of the created plots here for your view

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] calibrate_1.7.7     MASS_7.3-54         Biobase_2.50.0
[4] BiocGenerics_0.38.0 limma_3.46.0        reshape_0.8.8
[7] reshape2_1.4.4      ggrepel_0.9.1       ggplot2_3.3.5

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7       pillar_1.6.2     compiler_4.0.2   plyr_1.8.6
[5] tools_4.0.2      digest_0.6.27    lifecycle_1.0.0  tibble_3.1.3
[9] gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.11     DBI_1.1.1
[13] withr_2.4.2      dplyr_1.0.7      stringr_1.4.0    generics_0.1.0
[17] vctrs_0.3.8      grid_4.0.2       tidyselect_1.1.1 glue_1.4.2
[21] R6_2.5.0         fansi_0.5.0      farver_2.1.0     purrr_0.3.4
[25] magrittr_2.0.1   scales_1.1.1     ellipsis_0.3.2   assertthat_0.2.1
[29] colorspace_2.0-2 labeling_0.4.2   utf8_1.2.2       stringi_1.7.3
[33] munsell_0.5.0    crayon_1.4.1

plotSA limma • 482 views
0
Entering edit mode
@gordon-smyth
Last seen 6 minutes ago
WEHI, Melbourne, Australia

I already answered your question 4 months ago (Is there a need or general principle/rule to assess whether the data types/distribution etc is suitable for limma R package to derive differential features between interested contrasting groups?). I already told you that that I see no evidence that limma would not work on your data but your data looks to be poor quality and you could improve the analysis if you stabilized the variance better. You don't seem to have followed my hints but have instead repeated the same question both on this forum and in a series of emails sent to me. This is the 5th time now that you have asked me the same question about the same plots.

You have been demanding a simple test of whether data is OK for a limma analysis. In my opinion, your campaign for a universal test that can be applied for any type of data, any technology platform and any preprocessing practice is unrealistic and also misguided. I think your questions are based on some mistaken premises.

1. You ask whether the data is suitable for a limma analysis, but suitable compared to what alternative? If you decide the data is not suitable, what will you do then? Throw the data out? That's not very constructive. Apply a t-test or a Wilcoxon test? That will just make problems worse. I am not aware of generally applicable alternatives that will make fewer assumptions than limma.

2. limma can be applied to data from most genomic technologies provided the data is quantitative and is summarized on a scale for which linear modeling is reasonable. The challenge with each new technology is to preprocess and normalize the data to make the linear model assumptions as good as possible. There is no cutoff between good and bad, just a continuum of better or worse preprocessing practices. Figuring out the best preprocessing for each new technology is one of the key concerns of statistical bioinformatics research and is the topic of many published journal articles.

3. You give no information about the genomic technology that you are using, but almost all the data values appear to be close to zero with a few negative values. The data seem to have an upper bound of some sort, so it is not like log-expression. You say that the data has been scaled to be between -2 and 1, which seems strange. Rather than trying to shift the onus to the DE analysis tool, you should be trouble-shooting the data quality and improving the data preprocessing.

4. I did not suggest plotSA as a universal test but rather as a way to examine whether you had stabilized the variance properly. Frankly, it shows the shortcomings of your data pretty clearly. I would never put an example like this in the limma User's Guide because I would never process data like this.

5. Have you looked at the SA plot closely? Your data shows a distinct mean-variance trend so the curve fitted by limma looks ok, or at least as good as can be done. The uptick of the trend-line on the far right of the plot is a region where there almost no genes, so it will have little or no influence on your analysis. limma is doing as well as can be done. What more do you expect??

0
Entering edit mode

Dr, Smyth:

0
Entering edit mode

Can I suggest a way to optimize your data? No I can't. Data preprocessing is technology specific. No one can tell you how to preprocess your data because we don't have examples of data from your assay technology. We don't even know what your technology is. Please read the last sentence of my point 2 above. You have already explained that the data is proprietary and not under you control so it is very difficult indeed to understand what you are expecting me to suggest.

0
Entering edit mode

Thanks for the input! What I am asking is not to optimize the whole data preprocessing, since it has been done and it is techonology dependent, and no need and would be more issues to do anything on that side. I am just wondering about any potential way of stabilizing the variance of processed data properly as observed in plotSA and as you suggested.

0
Entering edit mode

I have no idea how to improve your data. I've never seen anything similar to it. If you want info about standard variance stabiliizing transformations, do a Google search (as I suggested 4 months ago). But I have no expectation that any of the standard transformations will be appropriate for your data. The need for variance stabilization generally needs to be incorporated at early stages of the data processing steps. It is not always possible to fix the data by tacking on a transformation at the end.

It seems from your comments that you already have perfectly satisfactory results and that the limma package has performed correctly. If you still need more assistance, the logical step would be to collaborate with a statistical bioinformatician at your own institution.

0
Entering edit mode

thank you for your great help!