Question

microarray data analsis with limma?

0

Entering edit mode

Mehmet Ilyas Cosacak • 0

@mehmet-ilyas-cosacak-9020

Last seen 7.3 years ago

Germany/Dresden/ CRTD - DZNE

Hi Everyone,

I am trying to analyse a micro array data and wonder if anyone could check if I am doing something wrong.

I am analyzing the microarray data from (Hamilton et al. 2015: Cell Stem Cell. 2015 Oct 1;17(4):397-411).

http://www.sciencedirect.com/science/article/pii/S1934590915003562

my script is as below:

library("limma")
library("plyr")
targets <- data.frame(rbind(
    c("GSE60460_Pair-E-Samples9+10.gpr","3xTg","WT"),
    c("GSE60460_Pair-F-Samples11+12.gpr","3xTg","WT"),
    c("GSE60460_Pair-G-Samples13+14.gpr","3xTg","WT"),
    c("GSE60460_Pair-H-Samples15+16.gpr","3xTg","WT")
))
colnames(targets) <- c("FileName","Cy5","Cy3")
rownames(targets) <- targets$FileName
# Required files can be downloaded from:
# "GSE60460_Pair-E-Samples9+10.gpr"  --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-F-Samples11+12.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-G-Samples13+14.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-H-Samples15+16.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# and IDs that have been used in Hamilton et al. 2015. (http://www.sciencedirect.com/science/article/pii/S1934590915003562)
# "GSE60460_series_matrix.txt" -->  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60460/matrix/
# "GPL10787-9758.txt" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10787
targets
fF <- function(x) as.numeric(x$Flags > -99)
calMeanDupIds <- function(rawMatrix, col4Mean, col4Values){
    initialColID <- colnames(rawMatrix)[col4Mean]
    colnames(rawMatrix)[col4Mean] <- "col2Use"
    empIds <- rawMatrix$col2Use == "" # check for empty IDs!
    empIdsMatrix <- rawMatrix[empIds,]
    remainedIds <- rawMatrix[!empIds,]
    d <- duplicated(remainedIds$col2Use)
    duplicList <- remainedIds[d,]
    uniqueList <- remainedIds[!d,]
    tmp_duplicList <- uniqueList[(uniqueList$col2Use %in% duplicList$col2Use),]
    uniqueList <- uniqueList[!(uniqueList$col2Use %in% duplicList$col2Use),]
    duplicList <- rbind(duplicList,tmp_duplicList)
    duplicListIDs <- duplicList[,1:(col4Values-1)]
    duplicList <- ddply(duplicList[,c(col4Mean,col4Values:ncol(duplicList))],.(col2Use), colwise(mean))
    duplicList <- merge(duplicListIDs,duplicList, by = "col2Use")
    duplicListIDs <- duplicListIDs[order(duplicListIDs$col2Use),]
    duplicList <- duplicList[order(duplicList$col2Use),]
    duplicList <- cbind(duplicListIDs,duplicList[,col4Values:ncol(duplicList)])
    if(nrow(empIdsMatrix) != 0){
        result <- rbind(empIdsMatrix,duplicList,uniqueList)
    }else{
        result <- rbind(duplicList,uniqueList)
    }
    colnames(result)[col4Mean] <- initialColID
    return(result)
}
annotIDs <- read.delim("GPL10787-9758.txt",sep = "\t")
annotIDs <- annotIDs[,1:16]
iDs <- read.delim("GSE60460_series_matrix.txt",sep = "\t")
annIds <- annotIDs[,c("ID","ENSEMBL_ID","GENE")]
#targets <- readTargets("Targets.txt")
targets <- readTargets()
RG <- read.maimages(targets, source = "genepix.median", wt.fun = fF)
RG <- backgroundCorrect(RG, method = "normexp", offset = 50)
RG <- RG[RG$genes$ID %in% annotIDs$ID,]
RG$genes <- merge(RG$genes,annIds, by =  "ID")
tmpValues <- cbind(RG$genes,RG$R,RG$G)
col4Mean <- 1
col4Values <- 8
tmpValues <- calMeanDupIds(tmpValues, col4Mean, col4Values)
RG$genes <- tmpValues[,1:(col4Values-1)]
RG$R <- tmpValues[,col4Values:(col4Values+3)]
RG$G <- tmpValues[,(col4Values+4):(col4Values+7)]
RG <- RG[!(duplicated(RG$genes$ID)),]
annotIDs <- annotIDs[(annotIDs$ID %in% iDs$ID_REF),]
RG <- RG[RG$genes$ID %in% annotIDs$ID,]
MA <- normalizeWithinArrays(RG , method = "loess")
MA <- normalizeBetweenArrays(MA, method="quantile", "normalizeQuantiles")
fit <- lmFit(MA)
fit <- eBayes(fit)
y <- topTable(fit, n = length(fit$genes$ID))
table(abs(y$logFC) >= 1.0)
FALSE  TRUE
32495   189
table(abs(y$logFC) >= 1.4 & y$P.Value < 0.005)
FALSE  TRUE
32632    52

table(y$logFC >= 1.4 & y$P.Value < 0.005)
FALSE  TRUE
32657    27
table(y$logFC <= -1.4 & y$P.Value < 0.005)
FALSE  TRUE
32659    25

In the original papers, 448 probes down regulated and 545probes up regulated (Figure-5C).

This is the summary of the data analysis that have been done in Hamilton et al. 2015:

"... Hybridizations were carried out using the SurePrint G3 Mouse GE 8x60K Microarray (Agilent Technologies) containing probes targeting 39,430 Entre Gene RNAs and 16,251 lincRNAs. Arrays were scanned at 5μm resolution using a GenePix4000B scanner (Molecular Devices). Data from scanned images were extracted using GenePix 6.1 (Axon). Chip annotations file was updated to a more recent version (20130207). Expression data was analyzed using Bioconductor packages (http://www.bioconductor.org/) and R statistical language www.r-project.org). Bioconductor limma (Smyth et al., 2005) package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities. Probe intensities were averaged when more than one probe was associated with a given probe identifier. The resulting matrix showing 55,681 probes as rows and 8 samples as columns was filtered to exclude lincRNA entries and probes in intensities below background in more than half of the samples. 32,684 probes were used as input for linear modeling using the method available in limma which uses a moderated t-statistic to assess significance. A 3xTg-AD vs WT in 7 months was evaluated and probes presenting a p-value lower than 0.005 and a foldchange greater than 1.4 or lower than –1.4 were considered for further analysis. ..."

This is the summary of the data analysis that have been done in GEO:

""Bioconductor limma26 package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities. Probe intensities were averaged when more than one probe was associated with a given probe identifier.""

best,

ilyas.

limma genepix microarray agilent microarrays • 2.7k views

ADD COMMENT • link 9.1 years ago Mehmet Ilyas Cosacak • 0

0

Entering edit mode

Hi Gordon,

I am really grateful and thank you very much for your help. Yes, definitely you are right. I automatically assumed that "fold change cut-off" as "log2 fold change"! Now, it makes sense.

avereps(), I did not know this function! Really helpful and quicker!

Thanks a lot!

ilyas.

ADD REPLY • link 9.1 years ago Mehmet Ilyas Cosacak • 0

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 3 hours ago

The city by the bay

Trying to reproduce someone else's analyses from a plain-English description is a thankless task. There's a lot of room for interpretation, e.g., what are the background correction options? Did they do loess normalization within each array? You've also done a lot of manipulation before running limma; did the authors do the same? I would suggest contacting the authors and asking them directly for the R script they used to do their analysis. If the script you're showing in your post is from the authors, and you still can't reproduce the results... well, something's wrong. Best to ask whether the authors can reproduce the results on their own system, using only public data. Obviously, you should try to match whatever package versions they used, though this is easier said than done.

ADD COMMENT • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

thanks for your comments.

I asked the author (corresponding) for the R script they used and did not get any reply (more than 1 month). That is why actually I am posting. I will try again.

But at least, we could get similar expression pattern.

is the logic of my analysis correct? or What would you do to perform the micro array analysis. I will use the data to compare with our RNA-Seq data. That is why I am trying to reproduce at least similar expression pattern as done in the paper.

best,

ilyas.

ADD REPLY • link 9.1 years ago Mehmet Ilyas Cosacak • 0

0

Entering edit mode

For starters, I would have a look at the genes that were reported as being DE in the paper. If they're not DE (or at least in the same direction) in your analysis, then there's clearly something wrong. Also, for the sake of trying to reproduce things, I would suggest that you cut out a lot of your extra analysis code, just in case it changes things (or is wrong). Get rid of calMeanDupIds, etc.; only put in the minimal set of analysis steps corresponding to what was described in the paper.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

calMeanDupIds is just used to calculate the mean of probe ids. If there are many probes with the same id, I calculated the mean of them. That is also written in the GEO. or you can use ddply directly on RG$genes.

"Bioconductor limma26 package was used to background correct, log2-transform, aggregate and quantile-normalize the median probe intensities. Probe intensities were averaged when more than one probe was associated with a given probe identifier."

ADD REPLY • link 9.1 years ago Mehmet Ilyas Cosacak • 0

0

Entering edit mode

They don't seem to have done multiple testing correction which I think is a bit dodgy.

ADD REPLY • link 9.1 years ago chris86 ▴ 420

0

Entering edit mode

Unbelievable, to get something published in Cell Stem Cell without multiple testing correction. Are peer reviewers sleeping?

ADD REPLY • link 9.1 years ago b.nota ▴ 370

score 4 · Accepted Answer · 2016-05-19

4

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 4 minutes ago

WEHI, Melbourne, Australia

Here are a few issues:

First, the original paper used a fold change cutoff of 1.4, whereas you have used a logFC cutoff of 1.4, which corresponds to a fold change cutoff of 2^1.4 = 2.6. So naturally you have fewer DE genes.

It is not correct to average over duplicate Ids before normalization. If you want to take average over duplicate (and I don't usually do it) then just run MA2 <- avereps(MA, ID) after normalization. This is no need to write your own function to do this sort of thing. avereps() will be very much quicker, and will take account of the weights, which your code is ignoring.

It is also not correct to quantile normalize after loess normalization, so better to omit the normalizeBetweenArrays() step. The normalizeBetweenArrays call given in your post will give an error anyway.

It would improve the analysis to filter low expressed probes or alternatively to use eBayes() with trend=TRUE.

Using FDR control would be far better than repeating the kludgy ad hoc DE cutoff from the Cell Stem Cell paper.

ADD COMMENT • link 9.1 years ago Gordon Smyth 52k

0

Entering edit mode

Hi Gordon,

First of all, sorry for my stupid questions! I am trying to understand clearly.

What do you mean by FDR control? Shall I consider adj.P.val (adjusted p-value) for cut-off?

or shall I do a multiple correction test with treat?

by the way, this is the current status of the code based on your suggestion:

library("limma")
targets <- data.frame(rbind(
    c("GSE60460_Pair-E-Samples9+10.gpr","3xTg","WT"),
    c("GSE60460_Pair-F-Samples11+12.gpr","3xTg","WT"),
    c("GSE60460_Pair-G-Samples13+14.gpr","3xTg","WT"),
    c("GSE60460_Pair-H-Samples15+16.gpr","3xTg","WT")))
colnames(targets) <- c("FileName","Cy5","Cy3")
rownames(targets) <- targets$FileName
# Required files can be downloaded from:
# "GSE60460_Pair-E-Samples9+10.gpr"  --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-F-Samples11+12.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-G-Samples13+14.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# "GSE60460_Pair-H-Samples15+16.gpr" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60460
# and IDs that have been used in Hamilton et al. 2015. (http://www.sciencedirect.com/science/article/pii/S1934590915003562)
# "GSE60460_series_matrix.txt" -->  ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE60nnn/GSE60460/matrix/
# "GPL10787-9758.txt" --> http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL10787
fF <- function(x) as.numeric(x$Flags > -99)
RG <- read.maimages(targets, source = "genepix.median", wt.fun = fF)
RG <- backgroundCorrect(RG, method = "normexp", offset = 50)
MA <- normalizeWithinArrays(RG, method = "median")
MA <- normalizeBetweenArrays(MA, method="Aquantile","normalizeQuantiles")
MA <- avereps(MA, ID = MA$genes$ID)
fit <- lmFit(MA)
fit <- eBayes(fit, trend = TRUE)
y <- topTable(fit, n = row(fit$genes)

thank you very much in advance!

ilyas.

ADD REPLY • link 9.1 years ago Mehmet Ilyas Cosacak • 0

1

Entering edit mode

Yes, adj.P.Value is FDR.

normalizeWithinArrays() with method="loess" (as you had before) was much better.

"normalizeQuantiles" isn't a valid argument for normalizeBetweenArrays(). No need for it.

You would do well to filter control probes (as always with Agilent arrays). Suggest you try this:

AnnCol <- c("ID","Name","ControlType","GeneName","TopHit","Description")
x <- read.maimages(targets,source="genepix.median",annotation=AnnCol)