GSE68849 contains Illumina BeadChip arrays, so you need to apply preprocessing and normalization appropriate for that type of microarray.
This is what you should be doing to read, background correct and normalize the arrays:
> library(limma)
> x <- read.ilmn("GSE68849_non-normalized.txt.gz", probeid="ID_REF")
Reading file GSE68849_non-normalized.txt.gz ... ...
> y <- neqc(x)
Note: inferring mean and variance of negative control probe intensities from the detection p-values.
> dim(y)
[1] 47323 10
> dim(y$E)
[1] 47323 10
> length(y$E)
[1] 473230
Note:
You should not be trying to use rma() because these are not Affymetrix arrays.
You should not be subsetting or disassembling the EListRaw object from read.ilmn(), because doing so throws away crucial information that is required to properly background correct and normalize the arrays.
You should not be converting to an ExpressionSet, certainly not before the normalization is complete anyway.
There is no limit to the number of features that an ExpressionSet or EListRaw object can contain. The arrays here contain 47323 probes but the dataset contains 473230 values, because there are ten samples, each with one expression value per probe. In your posts here you have confused the number of expression values in a dataset with the number of probes or features. The number of probes is the number of rows, not the total number of values (i.e., length) of the data matrix.
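A quick way to convince yourself of that relationship (a minimal sketch, assuming the y object produced by neqc() above):
> nrow(y$E)                              # number of probes = number of rows
[1] 47323
> ncol(y$E)                              # number of samples = number of columns
[1] 10
> nrow(y$E) * ncol(y$E) == length(y$E)   # 47323 x 10 = 473230 expression values
[1] TRUE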
Having normalized the arrays, you can then proceed to a differential expression analysis:
> targets <- readSampleInfoFromGEO("GSE68849_series_matrix.txt.gz")
> Donor <- factor(targets$CharacteristicsCh1[,"donor"])
> Agent <- factor(targets$CharacteristicsCh1[,"agent"])
> data.frame(Donor,Agent)
Donor Agent
1 Donor 1 No virus control
2 Donor 1 Influenza A
3 Donor 2 No virus control
4 Donor 2 Influenza A
5 Donor 3 No virus control
6 Donor 3 Influenza A
7 Donor 4 No virus control
8 Donor 4 Influenza A
9 Donor 5 No virus control
10 Donor 5 Influenza A
> design <- model.matrix(~Donor+Agent)
> fit <- lmFit(y,design)
> keep <- (fit$Amean > 5)
> fit2 <- eBayes(fit[keep,], trend=TRUE)
> topTable(fit2, coef=6)
logFC AveExpr t P.Value adj.P.Val B
ILMN_2202096 -8.91 8.93 -67.4 6.16e-20 7.29e-16 32.6
ILMN_1664595 -9.09 9.50 -66.7 7.26e-20 7.29e-16 32.5
ILMN_2176225 -8.25 8.73 -62.8 1.78e-19 9.44e-16 32.0
ILMN_1675192 -8.43 9.87 -62.3 2.00e-19 9.44e-16 31.9
ILMN_1660195 -8.70 8.87 -61.6 2.35e-19 9.44e-16 31.8
ILMN_2110561 -8.60 9.27 -57.7 6.16e-19 1.78e-15 31.3
ILMN_2186745 -9.34 9.64 -57.7 6.19e-19 1.78e-15 31.3
ILMN_1781057 -8.07 9.09 -56.1 9.34e-19 2.35e-15 31.0
ILMN_1688663 -8.43 9.14 -52.5 2.50e-18 5.59e-15 30.4
ILMN_1769988 -9.03 9.83 -49.1 6.89e-18 1.34e-14 29.7
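As a sanity check (a small sketch, assuming the design and fit2 objects above), you can confirm that coefficient 6 is the Agent effect and count how many probes are called differentially expressed at the default 5% FDR:
> colnames(design)             # the sixth column corresponds to the Agent (virus) effect
> summary(decideTests(fit2))   # up/down/not-significant counts for each coefficient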
I had a look at GSE68849 and see that the array platform has about 47k probes. The dataset has 10 samples. So the $E expression matrix contains 47323 probes x 10 samples = 473230 values.
When the ExpressionSet is formed, you have the correct number of probes (47323), with values for all 10 samples in the dataset, totalling 473230 values.
The only limits I know of are the limits on how many elements R can store in a vector. For really huge datasets you might want to use a SummarizedExperiment with an HDF5 back-end so you don't have to keep all the data in memory at once. But as far as I know there isn't a limit on an ExpressionSet itself.
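For what it's worth, a minimal sketch of the HDF5-backed route (the matrix here is random toy data, purely for illustration):
> library(SummarizedExperiment)
> library(HDF5Array)
> big <- matrix(rnorm(1e7), nrow=1e5, ncol=100)        # stand-in for a very large expression matrix
> hbig <- writeHDF5Array(big)                          # values now live on disk rather than in RAM
> se <- SummarizedExperiment(assays=list(exprs=hbig))  # container analogous to an ExpressionSet
> dim(se)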
Okay. I think I've found an explanation for that, but I'm not quite sure.
In fact, I already learned in a previous course on Bioconductor that Affymetrix arrays use a design where each gene is measured by many different probes at once (i.e., a probeset). As part of the preprocessing step for Affymetrix arrays, the measurements for all probes in a probeset need to be combined into one expression measure.
So at the beginning (i.e., in the raw data) I had 473230 rows, which are in fact the 473230 probes used on the microarray, but when I put that into the ExpressionSet I got only 47323 features.
A simple computation shows that the ExpressionSet contains only 10% of what is in the raw data.
My guess is that the ExpressionSet automatically takes all the probes from the raw data and combines them in groups of 10, as if 10 probes per gene were used on the microarray.
So that is my reasoning. Any further clarification or explanation would be greatly appreciated!
I am not exactly sure what you mean by 'when I put that in the ExpressionSet'. In general, one doesn't usually create an ExpressionSet, but instead it is generated by the software being used. As an example, if I were to do
> library(oligo)
> dat <- read.celfiles(list.celfiles())
> eset <- rma(dat)
I would technically be putting data in an ExpressionSet, but not really. I told oligo to summarize the data, it did its magic, and then I end up with an ExpressionSet that has fewer rows than the starting data because it's been summarized. So if that is what you mean, then yes. If you are doing something else (and you don't know exactly what you are doing), then you should probably not do that, and instead do what I have outlined above.
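If you want to see that reduction for yourself, a minimal sketch (assuming the dat and eset objects from the code just above):
> dim(exprs(dat))    # one row per individual probe on the array
> dim(exprs(eset))   # one row per probeset after rma() has summarized the probes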
Creating an ExpressionSet that contains the expression data from the raw data:
library(Biobase)
esetraw <- new("ExpressionSet", exprs=rawdata$E)
Checking how many features I have got:
length(featureNames(esetraw))
47323
Here, you might have noticed that the ExpressionSet contains only 10% of what is in rawdata$E.
I hope you have already read what I wrote above about Affymetrix and probesets.
So my question is why that happened, i.e., why the ExpressionSet here contains only 10% of the raw data. I gave an explanation above, but I'm not sure it is correct.
I'm very grateful to you for sparing the time to write all this useful information. Thank you, Sir!