I have a really basic understanding of how the package limma works in that it fits a linear model to each row/sample in a micro-array dataset. What I do not understand is how to use a design matrix to specify how to fit a model to the micro-array data.
Where should I start if I want to learn how this is done? Do I have to study basic linear algebra?
I find the vignette for limma oversimplified for the project that I am working on as the dataset that I have has 10226 obs. of 120 variables.
My knowledge thus far extends as far as the following:
"str(data) # first 5 lines
'data.frame': 10226 obs. of 120 variables:
$ X266_1_1.txt: num 5.9 6.06 5.97 5.86 6 ...
$ X270_2_2.txt: num 6 6.87 6.27 5.94 6.04 ...
$ X266_2_3.txt: num 5.92 6.26 5.91 5.97 5.99 ...
$ X279_1_4.txt: num 5.98 6.01 5.95 6 6.04 ...
# The individuals are each labeled unconventionally where each label ends in ".txt" I just want to be able to fit a model to the data within
# should I be fitting the model to the columns i.e. c(data$X266_1_1.txt, data$X270_2_2.txt.....)?
Esentially I'm confused by this command:
model.matrix(~0+paste(not sure whether to specify rows or columns here))
Thank you that's very informative- yes I have a file of explanatory information. I have two expression datasets one is from mRNA and the other from ncRNA and I have two correlate the two to find differentially expressed ncRNAs. I have two separate "information files" that can be used to describe the data in each of the mRNA and ncRNA files.
The following is a sample of the colnames of the mRNA file: "HC_50_SMH, OD_10_KEN, greyu_20_SMH" expression values descend in each of these columns, and the accompanying information file is also organised into columns- one of these columns contains the names of the columns of the mRNA file. So with the following code, I reorganise the information file to match the order of the colnames of the mRNA file:
"E.mRNA <- read.table('expression_set.txt',header=TRUE,sep='\t')
sample.details <- read.table('sample_details.txt',header=TRUE,stringsAsFactors=FALSE,comment.char = "", quote="", sep='\t' ) # all IDs and some clinical data for individuals with mRNA measures
ptr <- match( colnames(E.mRNA), sample.details$my_category_2 )
sample.details <- sample.details[ptr,]"
The case is the same for ncRNA with it's corresponding information file, only now I reorganise the ncRNA file rather than the information file:
"E.ncRNA <- read.csv('120arrays.flatfile.csv',header=TRUE,stringsAsFactors=FALSE)
clinical <- read.csv('clinical.csv',header=TRUE,stringsAsFactors=FALSE) # all IDs and some clinical data for individuals with ncRNA measures
ptr <- match( colnames(E.ncRNA), paste('X',clinical$ncRNA.array.ID,sep="") )
colnames(E.ncRNA) <- clinical$Sample.label[ptr]"
So in summary: the information that accompanies the mRNA data has been reorganised according to the mRNA data-colnames and the colnames of the nRNA data has been relabelled as the array IDs from its corresponding information file.
PROBLEM: I don't know how to correlate these newly reorganised files. I would have guessed that I need to create a design matrix to fit a linear model to each of the datasets or is there some thing that I'm missing. I realise that there is very little context for you to pick up on but I'm a little desperate and was hoping you might have some clues?
Thanks
You haven't really answered the question; what is your experimental design? Are the patients organized into groups, e.g., treated or untreated? You haven't provided any context, so we don't know what the patients are, what your biological hypothesis is, etc. All of this information is absolutely essential to designing an analysis, so we'd just be spinning our wheels if we tried to proceed without it. And whether it makes any sense to combine mRNA and ncRNA analyses - well, that's a whole other problem.
Hi Aaron,
The point of the experiment is to distinguish differentially expressed genes in TB patients. The patients are divided into "Culture confirmed TB, Culture negative TB, Other diagnoses, healthy contact and healthy LTBI".
Culture confirmed TB Culture negative TB healthy contact healthy LTBI
18 35 12 11
Other diagnoses
44
This is to find biomarker expression profiles from TB patients. Is it faulted to combine the experiment results of the nc vs mRNA arrays? I thought I could correlate them somehow
I would analyze the mRNA and ncRNA arrays separately, especially if they're different array platforms with different normalization schemes, mean-variance trends, etc. Besides, it's not clear how would you correlate the ncRNA and mRNA results. The features are fundamentally different; you can't check whether a transcript that's DE in the mRNA analysis is also DE in the ncRNA analysis, because a transcript can't be both a mRNA and a ncRNA.
I have found a source that makes a little bit of sense: the model.matrix command allows me to specify a hypothetical log change in each of the conditions so my I could use Culture negative as my negative control and hypothesise that the difference in expression will be 1 for the Culture confirmed TB: cbind(Cultneg=c(0,0,0,0......), TB=c(1,1,1,1....))
Am I on the right track?
Or even better:
design <- cbind( 1, clinical2$Diagnostic.category=="Culture confirmed TB", clinical2$Diagnostic.category=="Culture negative TB")
fit <- lmFit( E.mRNA2, design ) ???
If you just want to model each group, let
model.matrix
do the work for you.In this intercept-free model, each coefficient represents the average log-expression in its named group. To compare two groups, use
makeContrasts
to set up a contrast between their coefficients. For example, to detect DE between confirmed and negative TB groups:... and process that through
contrasts.fit
. Note that thesub
command just gets rid of spaces in your group names, asmakeContrasts
doesn't like them.