Question

Filtering absent transcripts from Gene ST array

0

Entering edit mode

Sophie Marion de Proce • 0

@sophie-marion-de-proce-11984

Last seen 2.7 years ago

University of Edinburgh

Dear all,

I've been going round in circles with a microarray analysis, as this is the first time I analyse microarray data!

I had initially used oligo for my analysis, and managed to get results of differential expression both at the transcript level and at the probeset level.

The reason why I thought oligo wasn't appropriate was that it can't do transcript-level presence/absence calls, it can only do it at the probeset level.

The samples I'm analysing are not brown Norway rat, but SHR strain rats, so there will probably be some probes from the Rat 2.1 Gene ST array that don't bind to this strain.

Hence I thought it was important to do the presence / absence calls at the transcript level. I believe a transcript-level analysis is best suited for gene ST arrays.

I then tried to use xps but documentation seemed limited and I struggled with it.

What would you recommend?

Are the presence/absence calls necessary in my case? Or should my results from oligo at the transcript level using only rma normalisation be reliable?

Or perhaps there is a way to do the presence / absence calls at the probeset level and then use this to do the transcript level summarisation?

I'm at a loss!

I copy my code for oligo below, including my presence / absence calls at the probeset level.

Thanks a lot for your help,

Best wishes,

Sophie.

# Read the raw data (CEL files)

library(oligo)
(celFiles <- list.celfiles(baseDir, full.names=TRUE))
rawData <- read.celfiles(celFiles) # Gives an GeneFeatureSet, ExpressionFeatureSet

# Add the phenotype data to geneCore

tab <- exprs(geneCore)
phenoData<-read.table("phenotypeData.txt",h=T)
row.names(phenoData) <- phenoData$id
phenoData <- phenoData[colnames(tab),]
pData(geneCore)<-phenoData

# Normalise rawData using rma, target = “core”

geneCore <- rma(rawData, target = "core")
featureData(geneCore) <- getNetAffx(geneCore, "transcript")

# Transcript level differential expression (2x2 factorial design)

library(limma)
TS <- paste(phenoData$Strain, phenoData$Tissue, sep=".")
TS <- factor(TS,)
design <- model.matrix(~0+TS)
colnames(design) <- levels(TS)
fitGeneCore <- lmFit(geneCore, design)

cont.matrix <- makeContrasts(SvsC2inLV=S.LV-C2.LV,SvsC2inEF=S.EF-C2.EF,Diff=(S.EF-C2.EF)-(S.LV-C2.LV),levels=design)
fitGeneCore <- contrasts.fit(fitGeneCore, cont.matrix)
fitGeneCore <- eBayes(fitGeneCore)
fit2<-fitGeneCore

geneNames<-unlist(lapply(fit2$genes$geneassignment, function(x) if(!is.na(x)) {return(strsplit(strsplit(x, " // ")[[1]][2]," /// ")[[1]][1])}else{return("NA")}))
entrezNames<-as.character(sapply(sapply(sapply(lapply(fit2$genes$geneassignment, function(x) if(!is.na(x)) {strsplit(x, " // ")}else{return(list(list("NA","NA","NA","NA","NA")))}),"[[",1),"[[",5), function(x) gsub(" /// .+","",x)))

topTable(fit2,coef=1,adjust.method="BH",number=Inf,genelist=geneNames)

# Presence / absence calls at the probeset level (remove probesets for which more than 50% of samples do not have a p-value < 0.05)

eset <- rma(rawData,target="probeset")
featureData(eset) <- getNetAffx(eset, "probeset")
dabgPS<-paCalls(rawData,"PSDABG")
ind <- apply(dabgPS, 1, function(x) sum(x<0.05) > 4)
sorted.ind<-ind[match(featureNames(eset),names(ind))] # sort ind according to probe order in eset
not_ctrls <- featureNames(eset) %in% names(sorted.ind[sorted.ind=="TRUE"])
eset.filtered <- eset[not_ctrls,]

microarray oligo gene ST pacalls • 1.3k views

ADD COMMENT • link updated 7.0 years ago by James W. MacDonald 65k • written 7.0 years ago by Sophie Marion de Proce • 0

score 3 · Answer 1 · 2017-04-28

The Gene ST arrays aren't really suited for probeset-level analysis, and as you note there isn't a way to get P/A calls at the transcript level. You could hypothetically do some things to try to figure out which probes are not compatible with the strain of rat that you are using, and then exclude them when summarizing your data. It might actually be fun to do so, if you have a warped sense of what is fun.

However, the RMA algorithm will control for this sort of thing, and so long as most probes in a probeset are complementary to the sequence they are intended to measure, you should be OK. Also please note that P/A calls can't distinguish between probesets that aren't complementary (and hence are not binding to transcript that exists in your sample) and probesets for which there is no transcript (where the expression level of the probeset is close to background because there isn't anything to bind to). Both will have expression levels close to background, and it isn't possible (without actually comparing the probe sequences to the strain's particular transcript sequences) to say which is happening.

When analyzing data, one of the goals is to understand what assumptions you are making. There is nothing wrong with assuming that the SHR strain's transcriptome is similar enough to the Wistar strain for the Affy array to (mostly) accurately assess transcript abundance, while still acknowledging that this might not be true for some transcripts. Other people (say, reviewers) may disagree with this assumption and think you should 'do something' to ensure that is the case, but in my experience most people have only a cursory understanding of how microarrays work, so I would be sort of surprised if anybody raised that point.