Question

Am I on the right tracks with a 3 factor design in Limma?

0

Entering edit mode

reubenmcgregor88 • 0

@reubenmcgregor88-13722

Last seen 3.2 years ago

IMPORTANT: I am using a "dummy" dataset to just get my working right and get used to Limma workflows, as pointed out the values are raw fluorescence value while I await the appropriate transformed expression data. The "real" data will come in the same txt format but with the normalised and transformed data.

I have put explicit questions highlighted in yellow.

1) I have some expression data from a protein microarray in txt format:

>head(exprs[,1:5]) RF0001 RF0002 RF0005 RF0006 RF0008 COL4A3BP 4244 9961 8018 4867 3819 IRAK4 626 27775 8451 14370 336 BCL7C 998 1110 563 842 285 EIF4EBP3 763 2150 963 1748 805 PHGDH 2202 8624 4421 3264 5085 PTPN2 590 972 387 1718 679

2) And have constructed some corresponding phenotypic data:

> head(pData) group days card RF0001 case 8 yes RF0002 control control_NA na RF0005 case 8 yes RF0006 control control_NA na RF0008 case 7 yes RF0009 case 19 yes

> summary(pData) group days card case :74 control_NA:78 na :78 control:75 10 : 5 no :15 7 : 5 yes:56 11 : 4 22 : 4 28 : 4 (Other) :49

As you can see I have 2 "groups" with 74 cases and 75 controls, but for this analysis I would also like to compare the "card" groups with "na (being the control group) vs yes", "na vs no" and "yes vs no".

3) I add some metadata and create an ExpressionSet

>metadata <- data.frame(labelDescription=c("Case/control status","days of blood draw","cardirus status or control"),row.names=c("case_control", "days", "card"))

>phenoData <- new("AnnotatedDataFrame",data=pData, varMetadata=metadata)

Which works fine:

>focussed_array_set <- ExpressionSet(assayData=exprs,phenoData=phenoData)

> focussed_array_set ExpressionSet (storageMode: lockedEnvironment) assayData: 34 features, 149 samples element names: exprs protocolData: none phenoData sampleNames: RF0001 RF0002 ... RF0650 (149 total) varLabels: group days card varMetadata: labelDescription featureData: none experimentData: use 'experimentData(object)' Annotation:

This is where I get a little lost, and I have read the "Limma" user guide and a few tutorials to guide me but would like to make sure I am doing things correctly.

4) I set the factors to the groups in my ExpressionSet, set up my design,

>groups <- factor(focussed_array_set$card)

> head(design) groupsna groupsno groupsyes 1 0 0 1 2 1 0 0 3 0 0 1 4 1 0 0 5 0 0 1 6 0 0 1

5) I fit the linear model, set a contrast matrix and run the contrasts and a bayes fit.

>fit <- lmFit(focussed_array_set, design)

>contrast.matrix <- makeContrasts(groupsyes-groupsno, groupsno-groupsna, groupsyes-groupsna, levels=design)

>fit2 <- contrasts.fit(fit, contrast.matrix) >fit2 <- eBayes(fit2)

Q1) I would like to re-nename the columns in the design matrix using something like:

>colnames(design) <- c("na","no","yes")

But if I try this I later get an error message which is fairly self explanatory but how would I get around this?

"Warning message: In contrasts.fit(fit, contrast.matrix) : row names of contrasts don't match col names of coefficients

6) I can then pull out the differentially expressed proteins for each contrast:

topTable(fit2, coef=1, adjust.method ="BH") topTable(fit2, coef=2, adjust.method ="BH") topTable(fit2, coef=3, adjust.method ="BH")

> topTable(fit2, coef=1, adjust.method ="BH") logFC AveExpr t P.Value adj.P.Val B BOD1L2 1938.9833 1902.443 3.446549 0.0007404505 0.02517532 -4.594686 SCHIP1 -2816.1060 1773.396 -2.234502 0.0269579456 0.42485775 -4.594953 PDGFC -1172.7036 1509.685 -1.989672 0.0484812293 0.42485775 -4.594996 RHBDD1 1729.8881 2693.557 1.767317 0.0792490734 0.42485775 -4.595030 TIGD4 -5225.1845 5548.067 -1.746642 0.0827869724 0.42485775 -4.595033 SLC4A1AP 3523.9500 5021.651 1.583170 0.1155306453 0.42485775 -4.595056 CLUL1 -875.0940 1611.497 -1.564839 0.1197697363 0.42485775 -4.595058 LASP1 581.8988 1160.899 1.561791 0.1204862427 0.42485775 -4.595059 COL4A3BP -3950.4571 8249.201 -1.558048 0.1213710510 0.42485775 -4.595059 IRAK4 6789.8845 9758.685 1.530200 0.1281152297 0.42485775 -4.595063

Q2) I understand that coef=1 will mean the first contrast ("groupsyes - groupsno") which I can get the order from:

> contrast.matrix Contrasts Levels groupsyes - groupsno groupsno - groupsna groupsyes - groupsna groupsna 0 -1 -1 groupsno -1 1 0 groupsyes 1 0 1

But is there a way to explicitly ask for "groupsyes - groupsno" in order to not call the wrong contrast?

Q3) How can I view and export a table giving the results of each contract (coef=1,2 and 3) as in the "topTable" call but with all of the proteins?
And how can I extract the actual "expression" values for each protein in order to plot in heat-maps and clustering etc?

Q4) The main general question here is if a) is this analysis achieving what I want it to? b) have I excepted the code in the most efficient way?

r limma design matrix biobase • 1.8k views

ADD COMMENT • link updated 5.8 years ago by Gordon Smyth 50k • written 5.8 years ago by reubenmcgregor88 • 0

0

Entering edit mode

No, your analysis isn't right. You've input raw fluescences to limma, whereas the expression data needs to be on the log-scale. (As the help page for lmFit will tell you.)

The solution might be as simple as logging the exprs matrix (with log2), but I want to first check that you don't have negative or zero values. Can you show me the output from this:

summary( apply(exprs,2,min) )

?

ADD REPLY • link 5.8 years ago Gordon Smyth 50k

0

Entering edit mode

Thanks Gordon,

Sorry I should have clarified I am not using the right data. I am using the raw values to get my head around the structures etc in Lima. I have edited the post to highlight this.

ADD REPLY • link 5.8 years ago reubenmcgregor88 • 0

0

Entering edit mode

No, you are misunderstanding. You don't need to wait for the appropriate transformed data. You just make the transformation quickly and easy as part of the analysis in R.

There is no point at all in doing a meaningless analysis. It is just as quick and easy to do the right analysis. And then you will have some chance of judging whether the results you get look correct.

Could you please answer the question I asked you? I can't help you unless you give me the information I need.

ADD REPLY • link 5.8 years ago Gordon Smyth 50k

0

Entering edit mode

Whoops, miss read your comment again sorry, yes sure:

output of original data:

> summary( apply(exprs,2,min) ) Min. 1st Qu. Median Mean 3rd Qu. Max. 5.0 60.0 101.0 127.9 165.0 716.0

Same but with log2 transformed values:

> summary( apply(exprs,2,min) ) Min. 1st Qu. Median Mean 3rd Qu. Max. 2.322 5.907 6.658 6.599 7.366 9.484

So looks like no negative values,

Yes the data make allot more sense once log2 transformed thank you for that.

ADD REPLY • link 5.8 years ago reubenmcgregor88 • 0

0

Entering edit mode

OK, thanks.

Now a 2nd question. The output you give from head(pData) and summary(pData) are contradictory. The output from summary says that pData$card has 3 possible values: "na", "no" and "yes". However the output from head shows that the first 6 values are all either "carditis" or "control_NA", neither of which belong to 3 possible values above.

What gives? It's pretty hard to figure out the structure of your experiment from conflicting information.

ADD REPLY • link 5.8 years ago Gordon Smyth 50k

0

Entering edit mode

I feel like I will be apologising allot if I'm not careful, I will be next time, sorry.

I changed all me phenotypic data half way through writing the post as I realised some of my nomenclature was too long winded so I originally had control_NA which I replaced with na,

I have updated the post to reflect this now hope it makes sense

ADD REPLY • link 5.8 years ago reubenmcgregor88 • 0

score 2 · Accepted Answer · 2018-07-04

2

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 9 hours ago

WEHI, Melbourne, Australia

Simpler analysis

I would analyse the data like this:

y <- normalizeBetweenArrays( log2(exprs), method="quantile")
group <- factor(pData$card)
design <- model.matrix(~0+group)
colnames(design) <- levels(group)
contrast.matrix <- makeContrasts(
       CarditisVsNone = yes-no,
       NoneVsControl = no-na,
       CarditisVsControl = yes-na, levels=design)
fit <- lmFit(y, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2, trend=TRUE)

Then

topTable(fit2, coef="CarditisVsNone", n=Inf)

and so on.

Note that there's no need to fuss around constructing complex data objects. You can just work directly from the original expression matrix you started with.

Answers to your questions

Q1. No, that doesn't lead to the message you state, and it is a warning message rather than an error anyway. Naturally you have to be consistent and rename the columns of design right at the start. You can't change the column names halfway through, after having already run lmFit(), which would lead to the warning you state.

Q2. You just use the names of the contrasts, see above. In your original code, you could have used coef="groupyes - groupno". Just type colnames(fit2) to see what names are available.

Q3a. n=Inf.

Q3b. The "actual" expression values are just the same object you give to lmFit(). There's no need to extract anything. For example"

plotMDS(y)
coolmap(y)

Q4. See above.

ADD COMMENT • link 5.8 years ago • updated 5.3 years ago Gordon Smyth 50k

0

Entering edit mode

Thanks. I see there is something I am not quite understanding. I am trying to understand the analysis in term of my experience with another package on RNAseq data where un-notmalised data is put in and normalised data is output by the model. I have asked for the raw expression, normalised values from the company that ran the arrays so I hope what they send me will be in the right format to input into Limma in this was.

I am a bit confused ,as I did not do the re-processing steps and have come in late to the project, so am trying to get my head around what kind of data is needed at what stage?

ADD REPLY • link 5.8 years ago reubenmcgregor88 • 0

0

Entering edit mode

Well, now you are confusing me. limma can analyse the raw fluescent values just fine, as I've already shown you. I'd much rather use limma's normalization than who-knows-what from the unnamed company. So I'm not quite sure what the problem is.

It doesn't really help to try understand the analysis in terms of package that you are not using and a data type (RNA-seq) that you are not analysing. Microarray analysis is actually simpler. The RNA-seq context where you have to compute data summaries for plotting independently of the DE analysis is messier.

ADD REPLY • link 5.8 years ago Gordon Smyth 50k

0

Entering edit mode

Ok, understood I will look into Limmas normalisation procedures and QC etc and do it from there and then compare to the "who-knows-what company. Always good to check I guess.

Thank you for all the help

ADD REPLY • link 5.8 years ago reubenmcgregor88 • 0