Question: Am I on the right tracks with a 3 factor design in Limma?
0
10 months ago by
reubenmcgregor880 wrote:

IMPORTANT: I am using a "dummy" dataset to just get my working right and get used to Limma workflows, as pointed out the values are raw fluorescence value while I await the appropriate transformed expression data. The "real" data will come in the same txt format but with the normalised and transformed data.

I have put explicit questions highlighted in yellow.

1) I have some expression data from a protein microarray in txt format:

>head(exprs[,1:5])          RF0001 RF0002 RF0005 RF0006 RF0008 COL4A3BP   4244   9961   8018   4867   3819 IRAK4       626  27775   8451  14370    336 BCL7C       998   1110    563    842    285 EIF4EBP3    763   2150    963   1748    805 PHGDH      2202   8624   4421   3264   5085 PTPN2       590    972    387   1718    679​

2) And have constructed some corresponding phenotypic data:

> head(pData)          group       days card RF0001    case          8  yes RF0002 control control_NA   na RF0005    case          8  yes RF0006 control control_NA   na RF0008    case          7  yes RF0009    case         19  yes

> summary(pData)      group            days     card     case   :74   control_NA:78   na :78    control:75   10        : 5   no :15                 7         : 5   yes:56                 11        : 4                          22        : 4                          28        : 4                          (Other)   :49             

As you can see I have 2 "groups" with 74 cases and 75 controls, but for this analysis I would also like to compare the "card" groups with "na (being the control group) vs yes", "na vs no" and "yes vs no".

>metadata <- data.frame(labelDescription=c("Case/control status","days of blood draw","cardirus status or control"),row.names=c("case_control", "days", "card"))

>phenoData <- new("AnnotatedDataFrame",data=pData, varMetadata=metadata)

Which works fine:

>focussed_array_set <- ExpressionSet(assayData=exprs,phenoData=phenoData)

> focussed_array_set ExpressionSet (storageMode: lockedEnvironment) assayData: 34 features, 149 samples    element names: exprs  protocolData: none phenoData   sampleNames: RF0001 RF0002 ... RF0650 (149 total)   varLabels: group days card   varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation:  

This is where I get a little lost, and I have read the "Limma" user guide and a few tutorials to guide me but would like to make sure I am doing things correctly.

4) I set the factors to the groups in my ExpressionSet, set up my design,

>groups <- factor(focussed_array_set$card) > head(design) groupsna groupsno groupsyes 1 0 0 1 2 1 0 0 3 0 0 1 4 1 0 0 5 0 0 1 6 0 0 1 5) I fit the linear model, set a contrast matrix and run the contrasts and a bayes fit. >fit <- lmFit(focussed_array_set, design) >contrast.matrix <- makeContrasts(groupsyes-groupsno, groupsno-groupsna, groupsyes-groupsna, levels=design) >fit2 <- contrasts.fit(fit, contrast.matrix) >fit2 <- eBayes(fit2) Q1) I would like to re-nename the columns in the design matrix using something like: >colnames(design) <- c("na","no","yes") But if I try this I later get an error message which is fairly self explanatory but how would I get around this? "Warning message: In contrasts.fit(fit, contrast.matrix) : row names of contrasts don't match col names of coefficients 6) I can then pull out the differentially expressed proteins for each contrast: topTable(fit2, coef=1, adjust.method ="BH") topTable(fit2, coef=2, adjust.method ="BH") topTable(fit2, coef=3, adjust.method ="BH") > topTable(fit2, coef=1, adjust.method ="BH") logFC AveExpr t P.Value adj.P.Val B BOD1L2 1938.9833 1902.443 3.446549 0.0007404505 0.02517532 -4.594686 SCHIP1 -2816.1060 1773.396 -2.234502 0.0269579456 0.42485775 -4.594953 PDGFC -1172.7036 1509.685 -1.989672 0.0484812293 0.42485775 -4.594996 RHBDD1 1729.8881 2693.557 1.767317 0.0792490734 0.42485775 -4.595030 TIGD4 -5225.1845 5548.067 -1.746642 0.0827869724 0.42485775 -4.595033 SLC4A1AP 3523.9500 5021.651 1.583170 0.1155306453 0.42485775 -4.595056 CLUL1 -875.0940 1611.497 -1.564839 0.1197697363 0.42485775 -4.595058 LASP1 581.8988 1160.899 1.561791 0.1204862427 0.42485775 -4.595059 COL4A3BP -3950.4571 8249.201 -1.558048 0.1213710510 0.42485775 -4.595059 IRAK4 6789.8845 9758.685 1.530200 0.1281152297 0.42485775 -4.595063 Q2) I understand that coef=1 will mean the first contrast ("groupsyes - groupsno") which I can get the order from: > contrast.matrix Contrasts Levels groupsyes - groupsno groupsno - groupsna groupsyes - groupsna groupsna 0 -1 -1 groupsno -1 1 0 groupsyes 1 0 1 But is there a way to explicitly ask for "groupsyes - groupsno" in order to not call the wrong contrast? • Q3) How can I view and export a table giving the results of each contract (coef=1,2 and 3) as in the "topTable" call but with all of the proteins? • And how can I extract the actual "expression" values for each protein in order to plot in heat-maps and clustering etc? Q4) The main general question here is if a) is this analysis achieving what I want it to? b) have I excepted the code in the most efficient way? biobase limma R design matrix • 302 views ADD COMMENTlink modified 10 months ago by Gordon Smyth37k • written 10 months ago by reubenmcgregor880 No, your analysis isn't right. You've input raw fluescences to limma, whereas the expression data needs to be on the log-scale. (As the help page for lmFit will tell you.) The solution might be as simple as logging the exprs matrix (with log2), but I want to first check that you don't have negative or zero values. Can you show me the output from this: summary( apply(exprs,2,min) ) ? ADD REPLYlink modified 10 months ago • written 10 months ago by Gordon Smyth37k Thanks Gordon, Sorry I should have clarified I am not using the right data. I am using the raw values to get my head around the structures etc in Lima. I have edited the post to highlight this. ADD REPLYlink written 10 months ago by reubenmcgregor880 No, you are misunderstanding. You don't need to wait for the appropriate transformed data. You just make the transformation quickly and easy as part of the analysis in R. There is no point at all in doing a meaningless analysis. It is just as quick and easy to do the right analysis. And then you will have some chance of judging whether the results you get look correct. Could you please answer the question I asked you? I can't help you unless you give me the information I need. ADD REPLYlink modified 10 months ago • written 10 months ago by Gordon Smyth37k Whoops, miss read your comment again sorry, yes sure: output of original data: > summary( apply(exprs,2,min) ) Min. 1st Qu. Median Mean 3rd Qu. Max. 5.0 60.0 101.0 127.9 165.0 716.0 Same but with log2 transformed values: > summary( apply(exprs,2,min) ) Min. 1st Qu. Median Mean 3rd Qu. Max. 2.322 5.907 6.658 6.599 7.366 9.484  So looks like no negative values, Yes the data make allot more sense once log2 transformed thank you for that. ADD REPLYlink modified 10 months ago • written 10 months ago by reubenmcgregor880 OK, thanks. Now a 2nd question. The output you give from head(pData) and summary(pData) are contradictory. The output from summary says that pData$card has 3 possible values: "na", "no" and "yes". However the output from head shows that the first 6 values are all either "carditis" or "control_NA", neither of which belong to 3 possible values above.

What gives? It's pretty hard to figure out the structure of your experiment from conflicting information.

I feel like I will be apologising allot if I'm not careful, I will be next time, sorry.

I changed all me phenotypic data half way through writing the post as I realised some of my nomenclature was too long winded so I originally had control_NA which I replaced with na,

I have updated the post to reflect this now hope it makes sense

2
10 months ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:

Simpler analysis

I would analyse the data like this:

y <- normalizeBetweenArrays( log2(exprs), method="quantile")
group <- factor(pData\$card)
design <- model.matrix(~0+group)
colnames(design) <- levels(group)
contrast.matrix <- makeContrasts(
CarditisVsNone = yes-no,
NoneVsControl = no-na,
CarditisVsControl = yes-na, levels=design)
fit <- lmFit(y, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2, trend=TRUE)

Then

topTable(fit2, coef="CarditisVsNone", n=Inf)


and so on.

Note that there's no need to fuss around constructing complex data objects. You can just work directly from the original expression matrix you started with.

Q1. No, that doesn't lead to the message you state, and it is a warning message rather than an error anyway. Naturally you have to be consistent and rename the columns of design right at the start. You can't change the column names halfway through, after having already run lmFit(), which would lead to the warning you state.

Q2. You just use the names of the contrasts, see above. In your original code, you could have used coef="groupyes - groupno". Just type colnames(fit2) to see what names are available.

Q3a. n=Inf.

Q3b. The "actual" expression values are just the same object you give to lmFit(). There's no need to extract anything. For example"

plotMDS(y)
coolmap(y)


Q4. See above.

Thanks. I see there is something I am not quite understanding. I am trying to understand the analysis in term of my experience with another package on RNAseq data where un-notmalised data is put in and normalised data is output by the model. I have asked for the raw expression, normalised values from the company that ran the arrays so I hope what they send me will be in the right format to input into Limma in this was.

I am a bit confused ,as I did not do the re-processing steps and have come in late to the project, so am trying to get my head around what kind of data is needed at what stage?

Well, now you are confusing me. limma can analyse the raw fluescent values just fine, as I've already shown you. I'd much rather use limma's normalization than who-knows-what from the unnamed company. So I'm not quite sure what the problem is.

It doesn't really help to try understand the analysis in terms of package that you are not using and a data type (RNA-seq) that you are not analysing. Microarray analysis is actually simpler. The RNA-seq context where you have to compute data summaries for plotting independently of the DE analysis is messier.

Ok, understood I will look into Limmas normalisation procedures and QC etc and do it from there and then compare to the  "who-knows-what company. Always good to check I guess.

Thank you for all the help