Question

News:Bug in ChAMP package champ.SVD() function

Yuan Tian ▴ 90

Hello:

This is Yuan Tian, the new developer and maintainer of the ChAMP package. I am really sorry to say that we have found a serious bug in the current (and older) versions of the champ.SVD() function. It is clearly a coding bug, not an algorithmic one.

The bug is on line 157 of champ.SVD() in the current version of ChAMP (1.10.0): the code sorts the pd file (Sample_Sheet.csv) but forgets to sort the beta matrix in the same way. The singular value decomposition itself is correct, but the correlation p-values calculated between each covariate and each latent component rely on the samples in the dataset matching the rows of the pd file. The SVD plot generated by champ.SVD() is therefore correct only if the Sample_Name column in your dataset happens to be sorted already; otherwise the plot is wrong.

The SVD plot helps researchers decide which factors and batches should be corrected before DMP, DMR, Block, and GSEA analysis, so I consider this a serious bug for anyone who has used our package before.

As the new developer, I am really sorry to post this bad news here. Scientists who have encountered this problem, are still stuck on it, or have experience with it may comment here or send an email to champ450k@gmail.com. I will do my best to answer your questions.

The original paper of ChAMP is ChAMP: 450k Chip Analysis Methylation Pipeline.

The new version of ChAMP is now finished, and I am writing the vignette, which should be released this week. I promise I will double-check the code and make sure everything is correct. The new version has changed a lot compared with the current one: it is much more powerful and easier to use, and it will provide more functions in the pipeline. So please continue to trust ChAMP and take it as your preferred tool for analyzing methylation array data.

Again, I am so sorry about the mistake we made in ChAMP package. T_T

Best

Yuan Tian

champ methylation 450k SVD EPIC News • 1.9k views

score 0 · Answer 1 · 2016-08-30

Hello:

Here is a more detailed explanation and an example of the bug. It lies on line 157 of the champ.SVD.R script, which sorts the pd file (Sample_Sheet.csv) but forgets to sort the beta matrix. The champ.SVD() function has two parts: the singular value decomposition of the beta matrix, and the correlation between each estimated latent component and each covariate. The bug affects the second part.
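To make the two parts concrete, here is a minimal base-R sketch. It is not the code inside champ.SVD(): the toy beta matrix, the row centring, and the use of kruskal.test() for the association step are my own assumptions for illustration. It only shows how sorting the phenotype table without sorting the beta columns changes the p-value while the decomposition stays the same.

```r
# Toy data, not ChAMP output: 8 samples in an unsorted sheet order.
set.seed(1)
sheet_order <- c("C2", "C3", "T4", "T3", "C4", "T1", "C1", "T2")
pd <- data.frame(Sample_Name = sheet_order,
                 Sample_Group = substr(sheet_order, 1, 1),
                 stringsAsFactors = FALSE)

# Simulated beta matrix (200 probes x 8 samples); T samples are shifted up.
beta <- matrix(rnorm(200 * 8, mean = 0.4, sd = 0.02), nrow = 200)
beta <- sweep(beta, 2, ifelse(pd$Sample_Group == "T", 0.3, 0), "+")
colnames(beta) <- pd$Sample_Name

# Part 1: SVD of the row-centred beta matrix. The columns of $v hold the
# per-sample loadings on each latent component.
centred <- beta - rowMeans(beta)
component1 <- svd(centred)$v[, 1]

# Part 2: test the first component against Sample_Group (aligned labels).
p_aligned <- kruskal.test(component1, factor(pd$Sample_Group))$p.value

# The bug pattern: the phenotype table is sorted, the beta columns are not,
# so the labels no longer belong to the samples they are tested against.
pd_sorted <- pd[order(pd$Sample_Name), ]
p_misaligned <- kruskal.test(component1, factor(pd_sorted$Sample_Group))$p.value

print(c(aligned = p_aligned, misaligned = p_misaligned))
```

With the labels aligned, the group effect on the first component comes out below 0.05; with the sorted-but-unmatched labels it does not, even though the component itself is identical in both tests.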

Below is an example of the effect of this bug on a simulated dataset I produced:

This dataset contains 8 samples, but their Sample_Name values are not sorted in the Sample_Sheet.csv file:

[Header],,,,,,,
Investigator Name,,,,,,,
Project Name,,,,,,,
Experiment Name,,,,,,,
Date,3/18/2012,,,,,,
,,,,,,,
[Data],,,,,,,
Sample_Name,Sample_Plate,Sample_Group,Pool_ID,Project,Sample_Well,Sentrix_ID,Sentrix_Position
C2,,C,,,G09,7990895118,R05C02
C3,,C,,,E02,9247377086,R01C01
T4,,T,,,C09,7990895118,R01C02
T3,,T,,,E08,7990895118,R01C01
C4,,C,,,F02,9247377086,R02C01
T1,,T,,,B09,7766130112,R06C01
C1,,C,,,E09,7990895118,R03C02
T2,,T,,,C09,7766130112,R01C02

We can see that the Sample_Name values (C2, C3, T4, T3, ...) are in no particular order.

We can then use the following code to analyze this dataset and generate the SVD plot:

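library(ChAMP)
set.seed(100)
# Load the IDAT files and Sample_Sheet.csv from the data directory
myLoad <- champ.load("./Simulation_Data_Random/")
# Normalize; called without arguments, it uses myLoad from the workspace
myNorm <- champ.norm()
# Draw the SVD plot; called without arguments, it uses myNorm and myLoad
champ.SVD()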

We then get the plot below, in which Sample_Group shows no significant difference between the C and T phenotypes:

Simulation Data (Random) SVD plot from uncorrected champ.SVD() function.

In the plot above, we cannot see any significant correlation between Sample_Group and the first three components. That is because in previous versions of ChAMP, champ.SVD() reordered pd into C1, C2, C3, ..., T3, T4 but did not reorder the beta matrix.
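The mislabelling for this particular sample sheet can be tabulated in a few lines of base R. This is only an illustration: it assumes the beta matrix columns stay in the original sample-sheet order, and the variable names are made up for the example.

```r
# Column order of the beta matrix (assumed to follow the unsorted sheet).
beta_columns <- c("C2", "C3", "T4", "T3", "C4", "T1", "C1", "T2")

# True group of each column, taken from the first letter of its name.
true_group <- substr(beta_columns, 1, 1)

# Group each column is paired with once pd alone has been sorted by name.
assigned_group <- substr(sort(beta_columns), 1, 1)

labels <- data.frame(beta_columns, true_group, assigned_group,
                     stringsAsFactors = FALSE)
print(labels)

# Number of samples tested under the wrong group label.
n_mislabelled <- sum(true_group != assigned_group)
print(n_mislabelled)
```

Half of the eight samples end up with the wrong Sample_Group, which is enough to hide the group effect in the plot.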

Then, if we sort the Sample_Sheet.csv file as below:

[Header],,,,,,,
Investigator Name,,,,,,,
Project Name,,,,,,,
Experiment Name,,,,,,,
Date,3/18/2012,,,,,,
,,,,,,,
[Data],,,,,,,
Sample_Name,Sample_Plate,Sample_Group,Pool_ID,Project,Sample_Well,Sentrix_ID,Sentrix_Position
C1,,C,,,E09,7990895118,R03C02
C2,,C,,,G09,7990895118,R05C02
C3,,C,,,E02,9247377086,R01C01
C4,,C,,,F02,9247377086,R02C01
T1,,T,,,B09,7766130112,R06C01
T2,,T,,,C09,7766130112,R01C02
T3,,T,,,E08,7990895118,R01C01
T4,,T,,,C09,7990895118,R01C02

In the Sample_Sheet.csv file above, the Sample_Name values are sorted, so it is not affected by this bug.

Then, by rerunning the code above, we get another SVD plot:

Simulation Data (Ordered) SVD plot from uncorrected champ.SVD() function.

In this plot, we can clearly see that Sample_Group is very significant. That is why I recommend that scientists who have used our package recheck their datasets and SVD plots, because I know the SVD plot is a fundamental plot for detecting batch effects and deciding on subsequent actions.
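If you want a quick way to recheck your own data, something like the sketch below may help. Both helper functions are hypothetical and not part of ChAMP, and they assume that the columns of your beta matrix are named by Sample_Name.

```r
# Hypothetical helper, not part of ChAMP: TRUE when the beta columns are in
# the same order as the Sample_Name column of the phenotype table.
check_sample_alignment <- function(beta, pd) {
  sheet_names <- as.character(pd$Sample_Name)
  if (!setequal(colnames(beta), sheet_names)) {
    stop("beta columns and pd$Sample_Name do not contain the same samples")
  }
  identical(colnames(beta), sheet_names)
}

# Hypothetical helper: reorder the phenotype rows to follow the beta columns.
realign_pd <- function(beta, pd) {
  pd[match(colnames(beta), as.character(pd$Sample_Name)), , drop = FALSE]
}

# Toy demonstration with made-up values.
toy_beta <- matrix(c(0.1, 0.2, 0.8, 0.9), nrow = 2,
                   dimnames = list(c("cg1", "cg2"), c("T1", "C1")))
toy_pd <- data.frame(Sample_Name = c("C1", "T1"),
                     Sample_Group = c("C", "T"),
                     stringsAsFactors = FALSE)

before <- check_sample_alignment(toy_beta, toy_pd)                      # FALSE
after <- check_sample_alignment(toy_beta, realign_pd(toy_beta, toy_pd)) # TRUE
```

On real data you would pass your own objects instead of the toy ones, for example the normalized beta matrix and the pd table returned by champ.load(). The exact object names depend on your ChAMP version, so treat them as an assumption.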

I have already fixed the bug in the current development version of ChAMP (1.11.0) and have uploaded version 1.11.1, which will be available for you to download and use tomorrow.

Below are two SVD plots I generated using the corrected version of the package:

For the simulated data with the random Sample_Sheet.csv file shown above:

Simulation Data (Random) SVD plot from corrected champ.SVD() function.

And for the simulated data with the ordered Sample_Sheet.csv file:

Simulation Data (Ordered) SVD plot from corrected champ.SVD() function.

So we now get the same result, that Sample_Group is significant, no matter how Sample_Name is ordered.

Again, I am so sorry that this bug existed for such a long time.

Best

Yuan Tian