Question

Methods for removing unexpressed probes of Human Transcriptome Array HTA 2.0

Yang Shi ▴ 10

Dear Bioconductor members, I have been asked to re-analyze the microarray data of GSE76297 (CEL files). As far as I know, paCalls cannot be used with an HTAFeatureSet. Prof. James W. MacDonald has described a method based on a density plot of the main, intronic, and antigenomic probesets (see the post "Appropriate pre-processing pipeline for Human Transcriptome Array HTA 2.0 with oligo for DE analysis"). However, I still have some questions about this method. Why is the cutoff based on the intronic probesets? What is the purpose of the antigenomic probesets in the filtering? And are there quantitative methods like paCalls for filtering out unexpressed probes? Thanks in advance! Yang Shi

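library(oligo)
library(affycoretools)  # provides getMainProbes() and annotateEset()
library(hta20transcriptcluster.db)

# affyRaw (the raw array data) and probeType (probeset ID in column 1, type code in column 2)
# are created earlier and are not shown in this snippet
eset <- oligo::rma(object = affyRaw, target = 'core')
eset.main <- getMainProbes(input = eset, level = 'core')
eset.main <- annotateEset(eset.main, hta20transcriptcluster.db)
# Density of the main probesets, then overlays for type 2 (drawn as antigenomic)
# and type 7 (drawn as intronic), using the first sample only
plot(density(exprs(eset.main)), main = "Probeset distribution", xaxt = 'n')
for(i in c(2,7)) lines(density(exprs(eset.main)[as.character(probeType[probeType[,2] %in% i,1]),1]), lty = if(i == 2) 2 else 3)
legend("topright", c("Main","Antigenomic","Intronic"), lty = 1:3, bty="n")
axis(1, xaxp=c(1,14,15), las=2)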

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8 
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

attached base packages:
[1] stats4    stats     graphics  utils     datasets  grDevices
[7] methods   base

[Image: density plot of the main, antigenomic, and intronic probesets]

score 2 · Accepted Answer · 2022-05-16

In the previous post I noted that paCalls doesn't exist for this array, and that is still true today. No similar methods have been invented in the last five years either, because nobody uses these arrays any longer. It's cheaper and better to use RNA-Seq, and methods development for Affy arrays was halted many years ago.

What paCalls (which means Present/Absent calls) is meant to do is tell you whether a given probeset is 'present' or not, where 'present' means that there is evidence that a given transcript is being expressed in a given sample. To make those calls you need to know the distribution of 'absent' probesets, so you can compare and say whether the intensity of a given probeset is sufficiently larger than that of an 'absent' probeset to conclude that it is probably measuring some transcript.
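
To make that comparison concrete, here is a minimal sketch of an empirical present/absent call in base R. It is not what paCalls does internally, just one way to compare an intensity against a reference distribution of 'absent' probesets. The simulated intensities, the probe names, the helper `empirical_p()`, and the 0.05 threshold are all assumptions made up for illustration.

```r
# Simulated log2 intensities; every number here is invented for illustration.
set.seed(1)
absent_ref <- rnorm(5000, mean = 3, sd = 0.5)   # stand-in for 'absent' probesets
probe_vals <- c(lowProbe = 2.8, midProbe = 4.2, highProbe = 8.0)

# Empirical p-value: fraction of 'absent' intensities at least as large as the
# observed value (the +1 terms keep the p-value from being exactly zero).
empirical_p <- function(x, ref) {
  vapply(x, function(v) (sum(ref >= v) + 1) / (length(ref) + 1), numeric(1))
}

p <- empirical_p(probe_vals, absent_ref)
calls <- ifelse(p < 0.05, "present", "absent")   # 0.05 is an arbitrary choice
print(data.frame(intensity = probe_vals, p = p, call = calls))
```

A probeset whose intensity sits well inside the 'absent' distribution gets a large p-value and is called absent, while one far out in the right tail is called present.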

The intronic probesets are meant to interrogate introns, which for the most part shouldn't exist in your total RNA, assuming you are getting it from the cytosol, where most transcripts are mature mRNA (nuclear transcripts may still have introns, but those should be a minority of your total RNA). If you assume that most of your mRNA no longer contains the intron, then any expression of probesets meant to measure intronic sequences really just represents background binding or noise or whatever, and certainly not any transcript that could be considered 'present'. The intronic probesets therefore represent a measure of 'absent' probesets, and you can then assume that any main probesets that have comparable expression levels are themselves 'absent' as well.

Your plot indicates that most of the intronic probesets have an intensity < 4, so you could use that as a cutoff between absent and present and exclude any probesets below that level.
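
If you would rather compute the cutoff than read it off the plot, something like the following sketch could work. It uses simulated matrices in place of the real expression values so that it runs on its own. The 95th percentile, the `min_samples` rule, and all object names are assumptions of this example, not part of the approach described above.

```r
set.seed(42)
n_samples <- 6
# Simulated log2 expression matrices (rows = probesets, columns = samples).
intronic <- matrix(rnorm(2000 * n_samples, mean = 3, sd = 0.5), ncol = n_samples)
main_expr <- rbind(
  matrix(rnorm(300 * n_samples, mean = 3, sd = 0.5), ncol = n_samples),  # unexpressed
  matrix(rnorm(700 * n_samples, mean = 7, sd = 1.5), ncol = n_samples)   # expressed
)
rownames(main_expr) <- paste0("TC", seq_len(nrow(main_expr)))

# Assumption: use the 95th percentile of the intronic intensities as the cutoff.
cutoff <- unname(quantile(intronic, probs = 0.95))

# Assumption: keep a probeset if it exceeds the cutoff in at least min_samples arrays.
min_samples <- 3
keep <- rowSums(main_expr > cutoff) >= min_samples
main_filtered <- main_expr[keep, , drop = FALSE]

cat("cutoff:", round(cutoff, 2), "\n")
cat("kept", sum(keep), "of", nrow(main_expr), "probesets\n")
```

With real data the two matrices would come from exprs() on the summarized data, split by probeset type. Requiring the cutoff to be exceeded in several arrays rather than in all of them keeps transcripts that are expressed in only one experimental group.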

The antigenomic probes are sequences that Affy determined are not present in nature and therefore should not bind to any complementary sequences. But they are sort of weird, ranging from 100% AT to 100% GC. The reason I included those sequences was to show that they are not a good candidate for 'absent' probesets, because as the GC content increases, the binding increases as well. And if you have 100% GC content, those probes will bind to all sorts of things, which is why the tail of the distribution of the antigenomic probes goes all the way out to 12 or so, which is pretty close to saturated binding.
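
For reference, GC content is simply the fraction of G and C bases in a probe sequence. The sketch below computes it in base R for three made-up 25-mers; the sequences and the helper `gc_fraction()` are hypothetical and are not taken from the array annotation.

```r
# Hypothetical 25-mer probe sequences, invented for illustration only.
probes <- c(
  allAT = paste0(strrep("AT", 12), "A"),
  mixed = "ATGCATGCATGCATGCATGCATGCA",
  allGC = paste0(strrep("GC", 12), "G")
)

# Fraction of G and C bases in each sequence.
gc_fraction <- function(seqs) {
  vapply(strsplit(seqs, ""), function(b) mean(b %in% c("G", "C")), numeric(1))
}

gc <- gc_fraction(probes)
names(gc) <- names(probes)
print(gc)
```

Binning real background probes by a value like this and plotting intensity per bin is one way to see the trend of binding increasing with GC content.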