Is there a need or general principle/rule to assess whether the data types/distribution etc is suitable for limma R package to derive differential features between interested contrasting groups?
1
1
Entering edit mode
Ming ▴ 380
@ming-yi-6363
Last seen 2.2 years ago
United States

Hi, there:

I have a general question for limma R package: I have used limma package for many occasions of microarray datasets in the past, but besides microarrays, RNA-seq (with voom) and quantitative PCR data as primarily mentioned that limma can be used for. What about other types of high-thoughput data that can be used for limma analysis? Although linear model usually requires Linearity,Homoscedasticity,Independence, and Normality etc, quantitative PCR data may not necessarily meet all these criteria. So my question is: Is there a need or general principle/rule to assess whether the data types/distribution etc is suitable for limma analysis to derive differential features between interested contrasting groups? for example, one dataset I wish to use limma was derived from high-throughput assays, and had been scaled between around -2 and 1 (and we can essentially treat the data as log2 intensity like in microrray and high values mean highly expressed etc for the same probe after normalized), since more posiitve and more negative have opposite biological meanings. Another dataset we have is compound labeling high throghput assays for interested protein at intended sites/pockets as percentage levels of successful labeling, data ranged from 0 to 100%. highly lableing % certainly is favored as for good compounds. Maybe I shall log transform the percentage data if can be used for limma analysis?

I was wondering about the potentials/possibilities of using limma in these datasets. Any advice would be highly appreciated! Thanks a lot in advance! Ming

limma highthroughputassays R • 3.2k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

linear model usually requires Linearity,Homoscedasticity,Independence, and Normality

That is the principle, although homoscedasticity is not strictly required because limma can model variance trends and can estimate sample weights. Dependence can also be handled by intra-block correlations. Apart from independence, the assumptions only need to apply approximately.

quantitative PCR data may not necessarily meet all these criteria.

Actually qPCR meet the criteria pretty closely. There is a long history of t-tests being applied to qPCR data, even before limma

one dataset I wish to use limma was derived from high-throughput assays, and had been scaled between around -2 and 1 ... Another dataset we have is compound labeling high throghput assays for interested protein at intended sites/pockets as percentage levels of successful labeling, data ranged from 0 to 100%.

Data that are constrained between limits (such at 0 and 100%) have strong mean-variance relationships at both boundaries and hence are not generally suitable for linear modelling.

Maybe I shall log transform the percentage data if can be used for limma analysis?

No, you would need a variance-stabilizing transformation for the percentages. The log-transformation is only suitable for small percentages, not for percentages close to 100%.

ADD COMMENT
0
Entering edit mode

Dear Dr. Smyth:

Really appreciate the time you took and the inputs you wrote on my questions, very helpful and informative!

I have a bit follow-ups to be clarified further and looking forward to your advice.

Data that are constrained between limits (such at 0 and 100%) have strong mean-variance relationships at both boundaries and hence are not generally suitable for linear modelling.

for dataset scaled between -2 and 1 using some internal controls etc, which is actually not bounded strictly by this boundary, the data can be higher or lower than the boundaries dependent upon samples or cell lines, although majority of the data is falling within this interval. My question is: do I need to run through procedures to check data normality (such as qqplot etc),or formally via a normality test (Shapiro-Wilk test for instance) etc?

Indeed, for this dataset (scaled between -2 and 1), I did check histogram quickly and found the data in rough bell shaped. And indeed I already applied limma quickly to some interested contrasts within the data and already found some very interesting findings that make quite biological sense, which is why I am really hoping that the limma method can be suitable for the dataset. That is why I am asking if there is standard procedure to assess if the datasets are suitable for limma method at beginning?

For the percentage dataset (bounded by 0 to 100 (%)),you suggested that the original data is not suitable for limma, and also log transform is not working here. Although you suggest variance-stabilizing transformation, do you mean that after variance-stabilizing transformation, I can use limma on the transformed data? And for variance-stabilizing transformation, I found some potential R packages DDHFm and vsn, and DESeq2 package has vst transformation for RNAseq data, can you specifically suggest one R function or package that is suitable to use for limma analysis? I am also a bit concerned if these transformation ways are relatively dramatic. In these percentage data, higher percentage means better compound labeling capacity, not sure if these transformations still maintain the meaning of these values although it allows the limma analysis if my understanding is correct.

Thx in advance! Ming

ADD REPLY
0
Entering edit mode

do I need to run through procedures to check data normality (such as qqplot etc),or formally via a normality test (Shapiro-Wilk test for instance) etc?

No. Such tests are both impossible and unnecessary.

It is much more relevant to use limma's provided plots, especially plotSA but also plotMDS and plotMD.

I did check histogram quickly and found the data in rough bell shaped.

No, you cannot check normality with a histogram. limma does not make any assumption that the data appears normal in a histogram. limma also does not make any assumption about how the expression values vary across genes.

I am really hoping that the limma method can be suitable for the dataset.

If the usual limma plots look fine then it is almost certainly ok.

And for variance-stabilizing transformation, I found some potential R packages DDHFm and vsn, and DESeq2 package has vst transformation for RNAseq data

No, none of those are at all relevant. I suggested using a variance-stabilizing tranformation that is specifically for percentages, not for some other type of data. The standard transformation for proportions would be arcsin-square-root (do a Google search). There is no need for any fancy software package, if y contains percentages then just compute y2 <- asin(sqrt(y/100)). Or an empirical logit tranformation might also be used.

ADD REPLY
0
Entering edit mode

Dear Dr. Smyth:

Thanks so much for the excellent advice and info, which are very helpful!! Really appreciated as always what you did for me and for others in the community!! Best Ming

ADD REPLY
0
Entering edit mode

Dear Dr. Smyth:

Just want to follow up with your suggestion on our previous communication: It is much more relevant to use limma's provided plots, especially plotSA but also plotMDS and plotMD.

I did try to find the updated user guide for limma: https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf based on this user guide, and I did try to make plotMD and plotSA plots for the dataset (scaled between -2 and 1). I used plotMD before, but not plotSA. plotMDS I did test and may not be much helpful here and so I skipped here.

As I asked in my original comunication above, I wish to determine if my data is suitable for limma analysis or not. I did not find from the user guide any relevant info to help me make the decision especially from plotSA (I did not see any example of plotSA how the plot shall look like for typical data that can be used for limma analysis). As seen below for the code and included plot results, plotMD for fit3 object is what I usually saw and seem reasonable (on section 16.1, there is example of plotMD, and I did use plotMD a while ago). but for fit object, I am not sure why the plot look like that either (the 1st panel of the plot file I attached). Hope you can give assessment and advice on whether my data is reasonable for limma aalysis or not. Appreciated very much!! Ming

here is the brief main codes and some related outputs to make the plots:

> show(mydata[1:5,1:3]);
                 COLON.Mut.LS513 COLON.Mut.CL11 COLON.Mut.LS1034
A1BG..1.             0.009157821   -0.001640884      -0.05054882
A1CF..29974.        -0.021848711    0.036079619       0.05448993
A2M..2.              0.099182798    0.105247983       0.04387373
A2ML1..144568.       0.107724902    0.043404811       0.02711932
A3GALT2..127550.    -0.045207710   -0.146916234      -0.37110402
> show(tar[1:5,c(8,7,3)])
                         TypesLines RAS Types
COLON.Mut.LS513     COLON.Mut.LS513 Mut COLON
COLON.Mut.CL11       COLON.Mut.CL11 Mut COLON
COLON.Mut.LS1034   COLON.Mut.LS1034 Mut COLON
COLON.Mut.SNU1033 COLON.Mut.SNU1033 Mut COLON
COLON.Mut.COLO678 COLON.Mut.COLO678 Mut COLON


eset <- ExpressionSet(assayData = as.matrix(mydata))

group1<-paste(tar$Types,tar$RAS,sep=".");
group<-factor(group1,levels=c("COLON.Mut","COLON.WT","LUNG.Mut","LUNG.WT"));
design <- model.matrix(~0+group)
colnames(design) <- levels(group)

fit <- lmFit(eset, design)

> show(fit)
An object of class "MArrayLM"
$coefficients
                    COLON.Mut    COLON.WT    LUNG.Mut     LUNG.WT
A1BG..1.         -0.006184935 -0.01589437 -0.01650645 -0.03327325
A1CF..29974.     -0.036653106 -0.05470048 -0.03705100 -0.05318098
A2M..2.           0.059910009  0.05123221  0.08906358  0.07615483
A2ML1..144568.    0.062222203  0.12934582  0.10457048  0.12544309
A3GALT2..127550. -0.142014114 -0.12313766 -0.15668482 -0.17035788
17640 more rows ...
.......

expName<-"Test_PlotSA_LungColon";
pngF<-paste(outPath,paste(expName,".png",sep=""),sep="/");
png(file= pngF, width = 1000, height = 800);
par(mfrow=c(2, 2))

plotMD(fit, coef=1,main="Mean-Difference Plot of fit,coef=1")

plotSA(fit,main="Residual standard deviation versus average log expression for fit")


cont.matrix<-makeContrasts(COLON.Mut_WT=COLON.Mut-COLON.WT, LUNG.Mut_WT=LUNG.Mut-LUNG.WT,levels=design);

fit2 <- contrasts.fit(fit, cont.matrix)
fit3 <- eBayes(fit2)

> show(fit3)
An object of class "MArrayLM"
$coefficients
                  Contrasts
                   COLON.Mut_WT LUNG.Mut_WT
  A1BG..1.          0.009709434  0.01676680
  A1CF..29974.      0.018047372  0.01612999
  A2M..2.           0.008677799  0.01290875
  A2ML1..144568.   -0.067123620 -0.02087261
  A3GALT2..127550. -0.018876454  0.01367307
17640 more rows ...
.......

plotMD(fit3, coef=1,main="Mean-Difference Plot of fit3,coef=1")
abline(0,0,col="blue")

plotSA(fit3, main="Residual standard deviation versus average log expression for fit3")

dev.off()

I atttached the image of the created plots here for your view

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRblas.so
LAPACK: /mnt/nasapps/development/R/4.0.2/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] calibrate_1.7.7     MASS_7.3-54         Biobase_2.50.0
[4] BiocGenerics_0.38.0 limma_3.46.0        reshape_0.8.8
[7] reshape2_1.4.4      ggrepel_0.9.1       ggplot2_3.3.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       pillar_1.6.2     compiler_4.0.2   plyr_1.8.6
 [5] tools_4.0.2      digest_0.6.27    lifecycle_1.0.0  tibble_3.1.3
 [9] gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.11     DBI_1.1.1
[13] withr_2.4.2      dplyr_1.0.7      stringr_1.4.0    generics_0.1.0
[17] vctrs_0.3.8      grid_4.0.2       tidyselect_1.1.1 glue_1.4.2
[21] R6_2.5.0         fansi_0.5.0      farver_2.1.0     purrr_0.3.4
[25] magrittr_2.0.1   scales_1.1.1     ellipsis_0.3.2   assertthat_0.2.1
[29] colorspace_2.0-2 labeling_0.4.2   utf8_1.2.2       stringi_1.7.3
[33] munsell_0.5.0    crayon_1.4.1

enter image description here

ADD REPLY
0
Entering edit mode

Dear Dr. Smyth:

Although lack of direct example of created plotSA plot from limma user guide as mentioned in my last post, Later I realized that maybe I can test the plotSA by using the data provided from the limma user guide: https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf

So I used the data specified from section 17.3.1 on page 108, and followed the flow from page 108 to 112 (up to secrion 17.3.8) to create the fit and fit2 objects and using the code for plotSA on page 34 (first page of chapter 7) to plot both fit and fit2 objects (I used the data from section 17.3.1 is because I can not get the data for page 34 to plot the plotSA). The main code is below and so are the plots. I noticed that for this example dataset from limma use guide, the 4th panel plotSA has a smooth line following the data well, but comparing to my data in my last post (see above), it seems not able to create the smooth line following the data trend but as a horizontal line. Does this suggest a problem for my data or my data is not suitable for limma analysis? Although it seems that the plotMD plots seem all OK for my data and the example data of limma user guide. Thanks so much in advance for your advice!!

targets <- readTargets()
x <- read.ilmn(files="probe profile.txt",ctrlfiles="control probe profile.txt",other.columns="Detection")
options(digits=3)

expName<-"Test_PlotSA_limma_dataExample";
pngF<-paste(outPath,paste(expName,".png",sep=""),sep="/");
png(file= pngF, width = 1000, height = 800);
par(mfrow=c(2, 2))

pe <- propexpr(x)
dim(pe) <- c(4,3)
dimnames(pe) <- list(CellType=c("MS","Stroma","ML","LP"),Donor=c(1,2,3))
y <- neqc(x)
expressed <- rowSums(y$other$Detection < 0.05) >= 3
y <- y\[expressed,\]
ct <- factor(targets$CellType)
design <- model.matrix(~0+ct)
colnames(design) <- levels(ct)
dupcor <- duplicateCorrelation(y,design,block=targets$Donor)
dupcor$consensus.correlation
fit <- lmFit(y,design,block=targets$Donor,correlation=dupcor$consensus.correlation)
contrasts <- makeContrasts(ML-MS, LP-MS, ML-LP, levels=design)
fit2 <- contrasts.fit(fit, contrasts)
fit2 <- eBayes(fit2, trend=TRUE)

plotMD(fit, coef=1,main="Mean-Difference Plot of fit,coef=1")
plotSA(fit,main="Residual standard deviation versus average log expression for fit")

plotMD(fit2, coef=1,main="Mean-Difference Plot of fit2,coef=1")
abline(0,0,col="blue")
plotSA(fit2,main="Residual standard deviation versus average log expression for fit2")

dev.off()

enter image description here

ADD REPLY
0
Entering edit mode

The reason why the SA plot of your data doesn't have a smooth curve is because you called eBayes with trend=FALSE. You should set trend=TRUE, same as you did for the User's Guide dataset.

MD plots are only intended to plot differences, not intercepts or group means. So MD plots make sense for fit3 but not for fit.

ADD REPLY
0
Entering edit mode

Thanks so much for the advcie and info, Dr. Smyth. I quickyly added trend=TRUE and replotted again as shown here with the smooth line showed up. Most importantly, I wish to confirm with you about that such plots can justfy whether my data is OK for limma analysis or not? I have not got clear sense what typical data shall look like in this plotSA plot? The plot from the example data of user guide seems somewhat similar to mine although my data falls into a narrow range, but exmple data may be a bit more evenly distributed Does this impact whether it is OK for limma analysis?

Also I plotted MD plot for fit object, is because there is one example on page 83 of userguide. But after your advice, I double checked and realized that the data they plot is from two-color array data, maybe in that case, plotMD for fit object (derived from lmfit() but not from the eBayes()) seems OK. Thx for the advice!

Thx a lot in advance!

Ming enter image description here

ADD REPLY
0
Entering edit mode

I see no evidence of any serious problems that would cause limma not to work on your data. Your data does not look very good quality, with all almost all the values clustered around 0 with a tail of negative values. But that is not an issue with limma or the statistical tests.

ADD REPLY
0
Entering edit mode

Thx a lot for the confirmation, Dr. Smyth! That is a great news! Appreciated very much!

ADD REPLY
0
Entering edit mode

my data is genecoverage , calculated in percenatges. i ran vst and got my batch corrected values in range of -6 to 7 (approx) how come i got negative values?

and the genes that got 100 % covered are now said to be just 6 % covered... please comment

below is the code that i have used

Install and load required packages

install.packages(c("limma", "edgeR", "ggplot2"))

library(limma) library(edgeR) library(ggplot2)

Load data

genecov <- read.csv("C:/Users/coverage/10geq_gene_coverage.csv", row.names = 1) batch_info <- read.csv("C:/Users/batch_info.csv", row.names = 1)

Set paths for batch-corrected counts file

batch_corrected_cov_file <- "C:/Users/coverage/batch_corrected_cov.csv"

Check the structure of the data

head(genecov) head(batch_info)

Create a design matrix

design_matrix <- model.matrix(~ condition, data = batch_info)

Convert genecov to a DGEList object from edgeR

dge <- DGEList(counts = as.matrix(genecov))

Add batch information to the DGEList object

dge$samples$Batch <- batch_info$Batch

Apply VST and perform batch effect correction using limma

vst_dge <- voom(dge, design = design_matrix, plot = FALSE) vst_dge <- removeBatchEffect(vst_dge, batch = vst_dge$samples$Batch, design = design_matrix)

Save batch-corrected counts to a file

write.csv(vst_dge, file = batch_corrected_cov_file)

have i missed something or overlooked ??

ADD REPLY
0
Entering edit mode

hi again.

I tried again with arcsine method. the range of data was now 0-1.5 so, I did back transformation and got the figure below. The percentage range are now almost much improved.

it ran successfully though I got a warning stating that I have some got NaN values. I traced back to see if these NaN values are 0 values in original data but then, the values are varying like 26 or 50 or more than that.... How should I proceed further with such values. Also, in my plot the genes highest coverage percentage was 100 earlier. Batch effect was removed, then the percentage drops to max 89.9 %

Will that be correct to state that now the highest percentage of gene coverage is 90% approx. and not 100% Does not it affect my biological inference then, that earlier I was getting all the nucleotides mapped by our reads in a particular sample and now 90% of the nucleotides (or total length) is covered.

Please comment

this is the figure showing box plots before and after batch effect correctio

these are the codes i used

Install and load required packages install.packages(c("limma", "edgeR", "ggplot2")) library(limma)

library(edgeR)

library(ggplot2)

Load data

percentage_data <- read.csv("C:/Users/coverage/10geq_gene_coverage.csv", row.names = 1)

batch_info <- read.csv("C:/Users/batch_info.csv", row.names = 1)

write.csv(back_transformed_data * 100, file = "C:/Users/coverage/back_transformed_data.csv")

Set paths for batch-corrected counts file

batch_corrected_cov_file <- "C:/Users/coverage/batch_corrected_cov.csv"

batch_corrected_plot <- "C:/Users/coverage/batch_corrected_boxplot.png"

Check the structure of the data

head(percentage_data)

head(batch_info)

Arcsine square root transformation

sqrt_transformed_data <- asin(sqrt(percentage_data / 100))

Create a design matrix

design_matrix <- model.matrix(~ condition, data = batch_info)

Convert transformed data to a DGEList object from edgeR

dge <- DGEList(counts = sqrt_transformed_data)

Add batch information to the DGEList object

dge$samples$Batch <- batch_info$Batch

Perform batch effect correction using limma

vst_dge <- removeBatchEffect(dge, batch = dge$samples$Batch, design = design_matrix)

Save batch-corrected counts to a file

write.csv(vst_dge, file = batch_corrected_counts_file)

Back-transform the data with handling for edge cases

back_transformed_data <- sin(sqrt(asin(vst_dge)))^2

Visualization before and after batch correction

par(mfrow = c(1, 2))

boxplot(percentage_data, main = "Original Data", ylim = c(0, 100))

boxplot(back_transformed_data * 100, main = "Batch-Corrected Data", ylim = c(0, 100))

ADD REPLY

Login before adding your answer.

Traffic: 440 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6