designing matrices practice
1
0
Entering edit mode
@christopherclarskon15-9965
Last seen 8.0 years ago

I have a really basic understanding of how the package limma works in that it fits a linear model to each row/sample in a micro-array dataset. What I do not understand is how to use a design matrix to specify how to fit a model to the micro-array data.

Where should I start if I want to learn how this is done? Do I have to study basic linear algebra?

I find the vignette for limma oversimplified for the project that I am working on as the dataset that I have has 10226 obs. of  120 variables.

My knowledge thus far extends as far as the following:

"str(data) # first 5 lines

'data.frame':    10226 obs. of  120 variables:

$ X266_1_1.txt: num  5.9 6.06 5.97 5.86 6 ...
$ X270_2_2.txt: num  6 6.87 6.27 5.94 6.04 ...
$ X266_2_3.txt: num  5.92 6.26 5.91 5.97 5.99 ...
$ X279_1_4.txt: num  5.98 6.01 5.95 6 6.04 ...

# The individuals are each labeled unconventionally where each label ends in ".txt" I just want to be able to fit a model to the data within

# should I be fitting the model to the columns i.e. c(data$X266_1_1.txt, data$X270_2_2.txt.....)?

Esentially I'm confused by this command:

model.matrix(~0+paste(not sure whether to specify rows or columns here))

 

limma microarray • 1.3k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 53 minutes ago
WEHI, Melbourne, Australia

Well, the first step is far more basic than linear algebra. The first step to designing an analysis is to have a question. What scientific question is it that are you trying to answer?

Really you don't have 120 variables, you just have 120 columns of data. Although you haven't said so, I will assume that each column corresponds to an RNA sample and the numbers are normalized log-expression values. If you want to undertake a differential expression (DE) analysis, then you first need to have sample groups between which you want to evaluate DE, or else predictor variables that you want to relate expression to. In other words, you need to know something about the columns of data that you have.

In the limma User's Guide, the data.frame of explanatory variables is called a "targets frame". The targets frame links the column IDs (X266_1_1 etc) to information about the treatment conditions.

So, the first question that you need to answer is this: do your samples (columns) belong to groups? If not, do you have a targets file of explanatory information?

ADD COMMENT
0
Entering edit mode

Thank you that's very informative- yes I have a file of explanatory information. I have two expression datasets one is from mRNA and the other from ncRNA and I have two correlate the two to find differentially expressed ncRNAs. I have two separate "information files" that can be used to describe the data in each of the mRNA and ncRNA files. 

The following is a sample of the colnames of the mRNA file: "HC_50_SMH, OD_10_KEN, greyu_20_SMH" expression values descend in each of these columns, and the accompanying information file is also organised into columns- one of these columns contains the names of the columns of the mRNA file. So with the following code, I reorganise the information file to match the order of the colnames of the mRNA file:

"E.mRNA <- read.table('expression_set.txt',header=TRUE,sep='\t')
sample.details <- read.table('sample_details.txt',header=TRUE,stringsAsFactors=FALSE,comment.char = "", quote="", sep='\t' ) # all IDs and some clinical data for individuals with mRNA measures
ptr <- match( colnames(E.mRNA), sample.details$my_category_2 )
sample.details <- sample.details[ptr,]"

The case is the same for ncRNA with it's corresponding information file, only now I reorganise the ncRNA file rather than the information file:

"E.ncRNA <- read.csv('120arrays.flatfile.csv',header=TRUE,stringsAsFactors=FALSE)

clinical <-  read.csv('clinical.csv',header=TRUE,stringsAsFactors=FALSE) # all IDs and some clinical data for individuals with ncRNA measures

ptr <- match( colnames(E.ncRNA), paste('X',clinical$ncRNA.array.ID,sep="") )

colnames(E.ncRNA) <- clinical$Sample.label[ptr]"

So in summary: the information that accompanies the mRNA data has been reorganised according to the mRNA data-colnames and the colnames of the nRNA data has been relabelled as the array IDs from its corresponding information file.

PROBLEM: I don't know how to correlate these newly reorganised files. I would have guessed that I need to create a design matrix to fit a linear model to each of the datasets or is there some thing that I'm missing. I realise that there is very little context for you to pick up on but I'm a little desperate and was hoping you might have some clues?

Thanks

ADD REPLY
1
Entering edit mode

You haven't really answered the question; what is your experimental design? Are the patients organized into groups, e.g., treated or untreated? You haven't provided any context, so we don't know what the patients are, what your biological hypothesis is, etc. All of this information is absolutely essential to designing an analysis, so we'd just be spinning our wheels if we tried to proceed without it. And whether it makes any sense to combine mRNA and ncRNA analyses - well, that's a whole other problem.

ADD REPLY
0
Entering edit mode

Hi Aaron,

The point of the experiment is to distinguish differentially expressed genes in TB patients. The patients are divided into "Culture confirmed TB, Culture negative TB, Other diagnoses, healthy contact and healthy LTBI".

Culture confirmed TB  Culture negative TB      healthy contact         healthy LTBI 
                  18                   35                   12                   11 
     Other diagnoses 
                  44 

This is to find biomarker expression profiles from TB patients. Is it faulted to combine the experiment results of the nc vs mRNA arrays? I thought I could correlate them somehow

ADD REPLY
1
Entering edit mode

I would analyze the mRNA and ncRNA arrays separately, especially if they're different array platforms with different normalization schemes, mean-variance trends, etc. Besides, it's not clear how would you correlate the ncRNA and mRNA results. The features are fundamentally different; you can't check whether a transcript that's DE in the mRNA analysis is also DE in the ncRNA analysis, because a transcript can't be both a mRNA and a ncRNA.

ADD REPLY
0
Entering edit mode

I have found a source that makes a little bit of sense: the model.matrix command allows me to specify a hypothetical log change in each of the conditions so my I could use Culture negative as my negative control and hypothesise that the difference in expression will be 1 for the Culture confirmed TB: cbind(Cultneg=c(0,0,0,0......), TB=c(1,1,1,1....))

Am I on the right track?

ADD REPLY
0
Entering edit mode

Or even better:

design <- cbind( 1, clinical2$Diagnostic.category=="Culture confirmed TB", clinical2$Diagnostic.category=="Culture negative TB")

    fit <- lmFit( E.mRNA2, design ) ???

ADD REPLY
1
Entering edit mode

If you just want to model each group, let model.matrix do the work for you.

grouping <- factor(sub(" ", "", clinical2$Diagnostic.category))
design <- model.matrix(~0 + grouping)
colnames(design) <- levels(grouping)

In this intercept-free model, each coefficient represents the average log-expression in its named group. To compare two groups, use makeContrasts to set up a contrast between their coefficients. For example, to detect DE between confirmed and negative TB groups:

con <- makeContrasts(CultureconfirmedTB - CulturenegativeTB, levels=design)

... and process that through contrasts.fit. Note that the sub command just gets rid of spaces in your group names, as makeContrasts doesn't like them.

ADD REPLY

Login before adding your answer.

Traffic: 772 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6