Dear bioconductor forum,
I am using the Agilent RAW .txt microarray data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=+GSE59408. My goal is to determine a presence/absence (on/off) pattern for the respective genes on the microarrays.
It would be very kind if someone experienced with limma, especially the functions lmfit and model.matrix , could help me with my questions. Thanks in advance.
I have read the limma userguide (https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) and tried to do it like 17.4 in the userguide.
First I have read in the platform design file from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL18948:
> library(limma) > platform <- read.table(file="GPL18948_platform_design_file2.txt", sep="\t")
Then I have read in the different RAW files with help of the taget.txt file. Target .txt has the following format:
SampleNumber FileName Condition 1 GSM1436498_Parent_replicate1.txt NONE 2 GSM1436499_Parent_replicate2.txt NONE 3 GSM1436500_CPZ1.txt CPZ 4 GSM1436501_CPZ2.txt CPZ
I read in the data and process it:
> x <- read.maimages(targets, source="agilent.median",green.only=TRUE) > y <- backgroundCorrect(x, method="normexp") > y <- normalizeBetweenArrays(y,method="quantile")
Then I filter the data according to the limma userguide 17.4:
> neg95 <- apply(y$E[y$genes$ControlType==-1,],2,function(x) quantile(x,p=0.95)) > cutoff <- matrix(1.1*neg95,nrow(y),ncol(y),byrow=TRUE) > isexpr <- rowSums(y$E > cutoff) >= 2 > table(isexpr) isexpr FALSE TRUE 3489 59478 > y0 <- y[y$genes$ControlType==0 & isexpr,]
The different conditions are the different antibiotica treatments of the 42 strains/microarrays (NONE, CPZ, CFI, AMK, NM, DOXY, CP, AZM, TP, ENX, CPFX => 11, NONE=2 replicates, Rest=4 replicates, altogether=42).
I then create the model.matrix also according to limma userguide 17.4:
> Treatment <- factor(targets$Condition, levels = unique(targets$Condition)) > design <- model.matrix(~Treatment) > colnames(design) <- levels(Treatment)
And then get the results:
> fit <- lmFit(y0,design) > fit <- eBayes(fit,trend=TRUE) > plotSA(fit, main="Probe-level") > summary(decideTests(fit[,-1])) CPZ CFI AMK NM DOXY CP AZM TP ENX CPFX -1 2504 1924 586 1116 1662 2851 878 1896 1963 1258 0 54929 56244 57860 56986 56412 54460 57527 55701 56305 56886 1 1132 397 119 463 491 1254 160 968 297 421
Same analysis, but now averaging probes for each gene:
> yave <- avereps(y0,ID=y0$genes[,"SystematicName"]) > fit <- lmFit(yave,design) > fit <- eBayes(fit,trend=TRUE) > summary(decideTests(fit[,-1])) CPZ CFI AMK NM DOXY CP AZM TP ENX CPFX -1 2431 1870 552 1080 1628 2819 843 1838 1938 1177 0 53447 54702 56298 55462 54861 52907 55967 54172 54733 55383 1 1091 397 119 427 480 1243 159 959 298 409
As one can see, this is way too big. The problem in my case is that the yave$genes$ProbeNames are the same as the yave$genes$SytematicNames, so there can be no average. I use the GeneSymbol from the platform design file and swap it with the SystematicNames to get a better average value:
> x$genes$SystematicName <- platform$GeneSymbol > y <- backgroundCorrect(x, method="normexp") > y <- normalizeBetweenArrays(y,method="quantile") > y0 <- y[y$genes$ControlType==0 & isexpr,] > yave <- avereps(y0,ID=y0$genes[,"SystematicName"]) > fit <- lmFit(yave,design) > fit <- eBayes(fit,trend=TRUE) > summary(decideTests(fit[,-1])) CPZ CFI AMK NM DOXY CP AZM TP ENX CPFX -1 184 143 38 72 128 189 93 170 141 115 0 4204 4307 4426 4372 4310 4185 4364 4208 4316 4320 1 83 21 7 27 33 97 14 93 14 36
This looks way better. My questions are now:
1) Is my contrast.matrix design right in terms of my goal of finding a presence/absence pattern for the genes?
2) Can I swap GeneSymbol with SystematicNames to get a better average value for the respective genes?
3) Can I use the results of the last table as a presence/absence pattern for the genes?
Thanks in advance, Max
Dear Mr. Smyth,
thank you for your answer. I apologize for my unclear question.
You are right, with the posted code above I find the differentially expressed genes.
However, my goal is to find a present/absent pattern for the genes of the mentioned dataset and then use the COBRA toolbox to create tissue specific models (function createTissueSpecificModel).
In a tutorial of this COBRA function limma was used to process the data and then the function mas5calls was used to get the presence/absence calls of the probes on the microarrays. But I was told that this function doesn't work for Agilent single-color microarrays.
As you wrote above, to distinguish the present genes I could just use the computed cutoff value. But do you think that I could still use the results of summary(decideTests(fit[,-1])) from my code above and use the regular expressed genes (with value 0) to create tissue specific models?
Kind regards, Max