Am I on the right tracks with a 3 factor design in Limma?
1
0
Entering edit mode
@reubenmcgregor88-13722
Last seen 3.2 years ago

IMPORTANT: I am using a "dummy" dataset to just get my working right and get used to Limma workflows, as pointed out the values are raw fluorescence value while I await the appropriate transformed expression data. The "real" data will come in the same txt format but with the normalised and transformed data.

I have put explicit questions highlighted in yellow. 

1) I have some expression data from a protein microarray in txt format:

>head(exprs[,1:5])
         RF0001 RF0002 RF0005 RF0006 RF0008
COL4A3BP   4244   9961   8018   4867   3819
IRAK4       626  27775   8451  14370    336
BCL7C       998   1110    563    842    285
EIF4EBP3    763   2150    963   1748    805
PHGDH      2202   8624   4421   3264   5085
PTPN2       590    972    387   1718    679​

2) And have constructed some corresponding phenotypic data:

> head(pData)
         group       days card
RF0001    case          8  yes
RF0002 control control_NA   na
RF0005    case          8  yes
RF0006 control control_NA   na
RF0008    case          7  yes
RF0009    case         19  yes

> summary(pData)
     group            days     card   
 case   :74   control_NA:78   na :78  
 control:75   10        : 5   no :15  
              7         : 5   yes:56  
              11        : 4           
              22        : 4           
              28        : 4           
              (Other)   :49             

As you can see I have 2 "groups" with 74 cases and 75 controls, but for this analysis I would also like to compare the "card" groups with "na (being the control group) vs yes", "na vs no" and "yes vs no".

3) I add some metadata and create an ExpressionSet

>metadata <- data.frame(labelDescription=c("Case/control status","days of blood draw","cardirus status or control"),row.names=c("case_control", "days", "card"))

>phenoData <- new("AnnotatedDataFrame",data=pData, varMetadata=metadata)

Which works fine:

>focussed_array_set <- ExpressionSet(assayData=exprs,phenoData=phenoData)

> focussed_array_set
ExpressionSet (storageMode: lockedEnvironment)
assayData: 34 features, 149 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: RF0001 RF0002 ... RF0650 (149 total)
  varLabels: group days card
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation:  

This is where I get a little lost, and I have read the "Limma" user guide and a few tutorials to guide me but would like to make sure I am doing things correctly.

4) I set the factors to the groups in my ExpressionSet, set up my design,

>groups <- factor(focussed_array_set$card)

> head(design)
  groupsna groupsno groupsyes
1        0        0         1
2        1        0         0
3        0        0         1
4        1        0         0
5        0        0         1
6        0        0         1

5) I fit the linear model, set a contrast matrix and run the contrasts and a bayes fit.

>fit <- lmFit(focussed_array_set, design)

>contrast.matrix <- makeContrasts(groupsyes-groupsno, groupsno-groupsna, groupsyes-groupsna,
                                 levels=design)

>fit2 <- contrasts.fit(fit, contrast.matrix)
>fit2 <- eBayes(fit2)

Q1) I would like to re-nename the columns in the design matrix using something like:

>colnames(design) <- c("na","no","yes")

 

But if I try this I later get an error message which is fairly self explanatory but how would I get around this?

"Warning message:
In contrasts.fit(fit, contrast.matrix) :
  row names of contrasts don't match col names of coefficients

6) I can then pull out the differentially expressed proteins for each contrast:

topTable(fit2, coef=1, adjust.method  ="BH")
topTable(fit2, coef=2, adjust.method  ="BH")
topTable(fit2, coef=3, adjust.method  ="BH")

> topTable(fit2, coef=1, adjust.method  ="BH")
              logFC  AveExpr         t      P.Value  adj.P.Val         B
BOD1L2    1938.9833 1902.443  3.446549 0.0007404505 0.02517532 -4.594686
SCHIP1   -2816.1060 1773.396 -2.234502 0.0269579456 0.42485775 -4.594953
PDGFC    -1172.7036 1509.685 -1.989672 0.0484812293 0.42485775 -4.594996
RHBDD1    1729.8881 2693.557  1.767317 0.0792490734 0.42485775 -4.595030
TIGD4    -5225.1845 5548.067 -1.746642 0.0827869724 0.42485775 -4.595033
SLC4A1AP  3523.9500 5021.651  1.583170 0.1155306453 0.42485775 -4.595056
CLUL1     -875.0940 1611.497 -1.564839 0.1197697363 0.42485775 -4.595058
LASP1      581.8988 1160.899  1.561791 0.1204862427 0.42485775 -4.595059
COL4A3BP -3950.4571 8249.201 -1.558048 0.1213710510 0.42485775 -4.595059
IRAK4     6789.8845 9758.685  1.530200 0.1281152297 0.42485775 -4.595063

Q2) I understand that coef=1 will mean the first contrast ("groupsyes - groupsno") which I can get the order from:

> contrast.matrix
           Contrasts
Levels      groupsyes - groupsno groupsno - groupsna groupsyes - groupsna
  groupsna                     0                  -1                   -1
  groupsno                    -1                   1                    0
  groupsyes                    1                   0                    1

But is there a way to explicitly ask for "groupsyes - groupsno" in order to not call the wrong contrast?

  • Q3) How can I view and export a table giving the results of each contract (coef=1,2 and 3) as in the "topTable" call but with all of the proteins?
  • And how can I extract the actual "expression" values for each protein in order to plot in heat-maps and clustering etc?

Q4) The main general question here is if a) is this analysis achieving what I want it to? b) have I excepted the code in the most efficient way?

r limma design matrix biobase • 1.8k views
ADD COMMENT
0
Entering edit mode

No, your analysis isn't right. You've input raw fluescences to limma, whereas the expression data needs to be on the log-scale. (As the help page for lmFit will tell you.)

The solution might be as simple as logging the exprs matrix (with log2), but I want to first check that you don't have negative or zero values. Can you show me the output from this:

summary( apply(exprs,2,min) )

?

ADD REPLY
0
Entering edit mode

Thanks Gordon,

Sorry I should have clarified I am not using the right data. I am using the raw values to get my head around the structures etc in Lima. I have edited the post to highlight this. 

ADD REPLY
0
Entering edit mode

No, you are misunderstanding. You don't need to wait for the appropriate transformed data. You just make the transformation quickly and easy as part of the analysis in R.

There is no point at all in doing a meaningless analysis. It is just as quick and easy to do the right analysis. And then you will have some chance of judging whether the results you get look correct.

Could you please answer the question I asked you? I can't help you unless you give me the information I need.

ADD REPLY
0
Entering edit mode

Whoops, miss read your comment again sorry, yes sure:

output of original data:

> summary( apply(exprs,2,min) )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    5.0    60.0   101.0   127.9   165.0   716.0

Same but with log2 transformed values:

> summary( apply(exprs,2,min) )
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.322   5.907   6.658   6.599   7.366   9.484 

So looks like no negative values,

Yes the data make allot more sense once log2 transformed thank you for that.

 

ADD REPLY
0
Entering edit mode

OK, thanks.

Now a 2nd question. The output you give from head(pData) and summary(pData) are contradictory. The output from summary says that pData$card has 3 possible values: "na", "no" and "yes". However the output from head shows that the first 6 values are all either "carditis" or "control_NA", neither of which belong to 3 possible values above.

What gives? It's pretty hard to figure out the structure of your experiment from conflicting information.

ADD REPLY
0
Entering edit mode

I feel like I will be apologising allot if I'm not careful, I will be next time, sorry.

I changed all me phenotypic data half way through writing the post as I realised some of my nomenclature was too long winded so I originally had control_NA which I replaced with na,

I have updated the post to reflect this now hope it makes sense

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 9 hours ago
WEHI, Melbourne, Australia

Simpler analysis

I would analyse the data like this:

y <- normalizeBetweenArrays( log2(exprs), method="quantile")
group <- factor(pData$card)
design <- model.matrix(~0+group)
colnames(design) <- levels(group)
contrast.matrix <- makeContrasts(
       CarditisVsNone = yes-no,
       NoneVsControl = no-na,
       CarditisVsControl = yes-na, levels=design)
fit <- lmFit(y, design)
fit2 <- contrasts.fit(fit, contrast.matrix)
fit2 <- eBayes(fit2, trend=TRUE)

Then

topTable(fit2, coef="CarditisVsNone", n=Inf)

and so on.

Note that there's no need to fuss around constructing complex data objects. You can just work directly from the original expression matrix you started with.

Answers to your questions

Q1. No, that doesn't lead to the message you state, and it is a warning message rather than an error anyway. Naturally you have to be consistent and rename the columns of design right at the start. You can't change the column names halfway through, after having already run lmFit(), which would lead to the warning you state.

Q2. You just use the names of the contrasts, see above. In your original code, you could have used coef="groupyes - groupno". Just type colnames(fit2) to see what names are available.

Q3a. n=Inf.

Q3b. The "actual" expression values are just the same object you give to lmFit(). There's no need to extract anything. For example"

plotMDS(y)
coolmap(y)

Q4. See above.

 

ADD COMMENT
0
Entering edit mode

Thanks. I see there is something I am not quite understanding. I am trying to understand the analysis in term of my experience with another package on RNAseq data where un-notmalised data is put in and normalised data is output by the model. I have asked for the raw expression, normalised values from the company that ran the arrays so I hope what they send me will be in the right format to input into Limma in this was.

I am a bit confused ,as I did not do the re-processing steps and have come in late to the project, so am trying to get my head around what kind of data is needed at what stage? 

ADD REPLY
0
Entering edit mode

Well, now you are confusing me. limma can analyse the raw fluescent values just fine, as I've already shown you. I'd much rather use limma's normalization than who-knows-what from the unnamed company. So I'm not quite sure what the problem is.

It doesn't really help to try understand the analysis in terms of package that you are not using and a data type (RNA-seq) that you are not analysing. Microarray analysis is actually simpler. The RNA-seq context where you have to compute data summaries for plotting independently of the DE analysis is messier.

ADD REPLY
0
Entering edit mode

Ok, understood I will look into Limmas normalisation procedures and QC etc and do it from there and then compare to the  "who-knows-what company. Always good to check I guess. 

Thank you for all the help

ADD REPLY

Login before adding your answer.

Traffic: 714 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6