Problems with analysing Agilent G3 GEx Rat V2 chips
1
0
Entering edit mode
AB • 0
@3d8c51e7
Last seen 3.4 years ago
Belgium

Hi,

We recently received a dataset of 48 rat samples on 6 Agilent G3 GEx Rat V2 chips. I have no experience with Agilent arrays so I’m really struggling with this one.

A couple of questions I was hoping someone could help me with:

1) This dataset contains samples from 3 different tissues that were randomized across all chips. I was wondering whether it is ok to read in all the data in at once and then normalize or whether it would be better to normalize per tissue? If that is even ok to do so..

2) I had some issues reading in the datasets that were provided to me so I used the command below to read in the files but I’m not sure if this is ok:

targets <- readTargets("targets.txt", row.names="FileName")
x <- read.maimages(targets$FileName, source="agilent", columns=list(R="F635 Median", Rb="B635 Median"), annotation=c("Row","Column","ID","Name"))

3) The chip contains ERCC spike in probes but I’m not sure how to handle these in the analysis. I tried to analyze the data using commands below

bc<-backgroundCorrect.matrix(x, method="normexp")
E <- normalizeBetweenArrays(bc, method="quantile")
ct <- factor(targets$Type)
design <- model.matrix(~0+ct)
colnames(design) <- levels(ct)
fit <- lmFit (E,design)
contrasts <- makeContrasts (treatment-control, levels=design)
contrasts.fit <- eBayes(contrasts.fit(fit, contrasts))
summary(decideTests(contrasts.fit, method="global"))
a<-topTable(contrasts.fit, coef=1)
write.table(a, file="treatment-control.txt", sep="\t")

But I guess that is not the correct way to do so because the spike ins turned up being differentially expressed.. Can anyone help me on this? Thanks so much..

Best, Annelies

Agilent limma • 1.3k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 4 days ago
United States

It looks like these Agilent chips are green only, in which case you can just do

x <- read.maimages(targets$FileName, source="agilent", green.only = TRUE)

The remainder is in general what you would do if all you have are treatment and control samples. But you say that there are three tissues! In which case you wouldn't want to simply compare treatment and controls because that assumes no differences due to tissue (in which case why would you do separate tissues?).

Anyway, the easiest thing to do is to generate a composite factor that includes both the tissue and type, doing something like

targets$combo <- paste(targets$Type, targets$Tissue, sep = "_")
## here I assume the tissue is in a Tissue column
design <- model.matrix(~0+combo, targets)
colnames(design) <- gsub("combo", "", colnames(design))
contrast <- makeContrasts(tissue1_treatment - tissue1_control, tissue2_treatment - tissue2_control,
                          tissue3_treatment - tissue3_control, levels = design)
## and go on from there, testing each of the three coefficients

As for the spike-ins, I have yet to find a real reason to do anything with them. This is mainly due to the fact that (so far as I know) the spike-ins are added using some really small volume like 1 µl, which in the world of molecular biology is something that people think is actually a thing. But it's not! People who work in fields where you actually have to accurately aliquot small volumes like that will tell you that it's nearly impossible to do accurately, particularly with an air-aspirating pipettor, and especially if there are proteins or surfactants in the solution that will break the water tension. You can have more than 1 µl of fluid just clinging to the exterior of a pipettor tip. Even doing 50 µl is hard. You can get a five decimal scale and try to pipet 50 µl of RO water (the easiest thing to do) repeatedly, and you will likely be surprised at how hard it is to reproducibly get 50 ng of water from a Rainin or Eppendorf. Now maybe if you used a glass syringe...

Long story short, I would argue that you can safely ignore the ERCC spike-ins.

ADD COMMENT
0
Entering edit mode

Thank you so much for your reply. Greatly appreciated.

To clarify, we have sham, treated and untreated samples and we always investigate the influence of treatment in different tissues. I'm just not sure why they were now all randomized over all the chips. I would probably put sham, treated and untreated together per tissue. Do you think it is ok to normalize them all together? Different tissues have different RNA contents..

When I try your suggested read in command, I get following error:

Error in readGenericHeader(fullname, columns = columns, sep = sep) : Specified column headings not found in file

I'm not sure how the header specifically should look like for that command to work.

It also doesn't want to read in .gpr files. The only thing that has worked so far is:

x <- read.maimages(targets$FileName, source="agilent", columns=list(R="F635 Median", Rb="B635 Median"), annotation=c("Row","Column","ID","Name"))

Thanks for your advice on the spike-ins.

Kind regards, Annelies

ADD REPLY
0
Entering edit mode

What do you mean by 'randomized over all the chips'? If these are Agilent single color arrays, then there is one sample per array, so I can't interpret what you mean by that.

Anyway, you shouldn't be reading the gpr files anyway, you should be reading the txt files. There should be something like 9 garbage lines at the top and then the data. I have some Agilent single color arrays, and they look like this:

$ sed -n '10p'  ../../rawData/US23502338_252643710042_S01_GE1_107_Sep09_1_1.txt
FEATURES    FeatureNum  Row Col accessions  chr_coord   SubTypeMask SubTypeName Start   Sequence    ProbeUID    ControlType ProbeName   GeneName    SystematicName  Description 
PositionX   PositionY   gSurrogateUsed  gIsFound    gProcessedSignal    gProcessedSigError  gNumPixOLHi gNumPixOLLo gNumPix gMeanSignal gMedianSignal   gPixSDev    gPixNormIQR gBGNumPix   gBGMeanSignal   gBGMedianSignal gBGPixSDev  gBGPixNormIQR   
gNumSatPix  gIsSaturated    gIsFeatNonUnifOL    gIsBGNonUnifOL  gIsFeatPopnOL   gIsBGPopnOL IsManualFlag    gBGSubSignal    gBGSubSigError  gIsPosAndSignif gPValFeatEqBG   
gNumBGUsed  gIsWellAboveBG  gBGUsed gBGSDUsed   ErrorModel  gSpatialDetrendIsInFilteredSet  gSpatialDetrendSurfaceValueSpotExtentX  SpotExtentY gNetSignal  gMultDetrendSignal  
gProcessedBackground    gProcessedBkngError IsUsedBGAdjust  gInterpolatedNegCtrlSub gIsInNegCtrlRange   gIsUsedInMD

And the arguments for read.maimages should pick out the gMedianSignal and the gBMedianSignal, which are the green foreground and background columns, respectively, for an Agilent array.

ADD REPLY
0
Entering edit mode

As for putting them all together, the assumption for the quantile normalization (as with most microarray normalization methods) is that the bulk of the genes aren't changing between samples. There is always a tradeoff that you have to make - more samples are probably better, particularly for the model fitting step, but maybe the tissues are so different that you can't really normalize well - and it's not usually possible to know, really, what the underlying truth is. So as an analyst you have to make decisions without knowing what the underlying truth is, and have to be able to credibly defend what you have done.

ADD REPLY
0
Entering edit mode

Yes, these are Agilent single color array, one sample per array. One chip contains 8 arrays and we used a total of 6 chips (48 samples). If it turns out that there is too much variation between the tissues, can I normalize per tissue although some sample might be on one chip and others on the the other chips or do I need to take variation between the chips into account? Sorry might be a stupid question but I'm quite new at this and want to make sure I'm doing this right.

Thanks a lot for header example, helps a lot!

ADD REPLY
0
Entering edit mode

I would look at a PCA plot and a plot of the densities (plotDensities from limma) prior to normalizing in order to say if it's reasonable to normalize together. You can also use PCA plots to see if there is an array-specific signal, which will probably be in the higher principal components, so you should look at PC3 vs PC4, etc to check that out. A super easy way to do that is to use the Glimma package, and you can even add in group identifiers so you can change the plotting colors on the fly. I normally use glMDSplot, which is the original function for that sort of thing (note that using gene.selection = "common" is how you get a PCA plot). There is also a new function called glimmaMDS that might be better - I'm all old school and stuff so haven't even tried it yet - but either way it's nice to be able to have an interactive plot to play around with.

Anyway, if there does seem to be a plate-specific signal, then you will probably want to account for that in your model. If the samples were randomized well, the plate should be orthogonal to the treatment and you could hypothetically just add in a fixed factor for plate. But from what you've said that sounds unlikely, so you might need to use duplicateCorrelation instead, to fit a linear mixed model.

Given that you are new to this, you should read the limma User's Guide like five times or so. It's too long and complex to get in one shot, so I would read it several times and do your best to understand. It's pretty clear and comprehensive, but IMO it's hard for someone as smart/experienced as Gordon Smyth to write things in a completely approachable manner for new people. So flounder through it until it starts to stick.

ADD REPLY
0
Entering edit mode

I'll definitely do so.. Again, thanks so much for your help..

All the best, Annelies

ADD REPLY

Login before adding your answer.

Traffic: 390 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6