Question: avereps with probes mapping to multiple genes (Agilent dual color microarray)
0
9 weeks ago by
wd20
Germany
wd20 wrote:

Hi

I have a set of old Agilent microarray gene expression data   (e.g. GSE80337 ) and would like to reanalyze the results using an updated  annotation. (fyi, the orignal array design was done in such a way that each transcript had 2 to3 probes).

I first mapped the array probes against the new annotation (cDNA) using Bowtie2 (restricting to those probes that have a perfect match). Based on the Bowtie2 output, some of these probes map to multiple genes/IDs (multi-mapping probes).

Now I would like to use the avereps function to group/average the probe signal for each gene (instead of probes individually, as I would like to do some GO enrichment and to "easily" compare with other expression (RNAseq) data).

So my question is, how should I use the avereps function to also include multimapped probes when averaging (e.g. use probe 1 & 2 for NR_003039&NR_0030038/probe 1,3 & 4 for NR_003041 in the Agilent raw file example below)?

Row Col ControlType      ProbeName SystematicName(ID)

1:  13  12           0 probe_1      NR_003038, NR 003039, NR_003041
2:  13  53           0 probe_2      NR_003039, NR_0030038
3: 106  26           0 probe_3      NR_003041

4: 108  31           0 probe_4      NR_003041

If I do "MA_average<-avereps(MA_normalized, ID=MA_normalized$genes$SystematicName)" then it ignores the komma separator and considers e.g. "NR_003038, NR 003039, NR_003041" as one ID...

One could say/other posts suggest to remove multimapping probes, but then I exclude those genes that are 100% identical duplicates (and thus only having multimapping probes, NR_0030038/39 in this example)...

Thank you for helping out!

modified 9 weeks ago by Gordon Smyth36k • written 9 weeks ago by wd20
Answer: avereps with probes mapping to multiple genes (Agilent dual color microarray)
1
9 weeks ago by
United States
James W. MacDonald49k wrote:

You are confusing genes and transcripts. GenBank IDs are for transcripts, so it's quite possible that you have very few multi-mapping probes here if you are concerned with gene expression. But then again, there are lots of non-translated genes that come from within translated genes, so it's not that simple:

 > select(org.Hs.eg.db, c("NR_003038", "NR 003039", "NR_003041"), c("ENTREZID","SYMBOL"),"ACCNUM")
'select()' returned 1:1 mapping between keys and columns
ACCNUM ENTREZID  SYMBOL
1 NR_003038   387066   SNHG5
2 NR 003039     <NA>    <NA>
3 NR_003041   692084 SNORD13


So for the first 'gene' you have there, it's a combination of a non-coding (snoRNA) transcript, and transcripts from a couple of pseudogenes. It's difficult to know what to do with that probe, even after knowing what it measures, and you are trying to say what you should do with a bunch of these things without knowing what each one is measuring (other than they measure multiple things).

Whatever you do, it will be pretty naive, unless you are planning to go through each probe and figure out what's going on, and making individual decisions. But in the end your question is more of a philosophical question about how you should do your analysis, rather than a question about how to use Bioconductor software, so as the analyst it's up to you to make the decision.

I missed that one of your GeneBank IDs was missing the underscore:

  > select(org.Hs.eg.db, c("NR_003038", "NR_003039", "NR_003041"), c("ENTREZID","SYMBOL"),"ACCNUM")
'select()' returned 1:1 mapping between keys and columns
ACCNUM ENTREZID  SYMBOL
1 NR_003038   387066   SNHG5
2 NR_003039   644076 GLYCAM1
3 NR_003041   692084 SNORD13

ADD REPLYlink written 9 weeks ago by James W. MacDonald49k

Dear James

However, I copied (and modified) the raw Agilent file example from another topic (see here).

In my array design for a non-model arthropod species, all probes map to the CDS of a protein coding gene (and not to introns or snoRNAs).

My apologies for the confusion.

Well that's pretty confusing. You copied something that includes some GenBank IDs that are for human, but the array is in fact a spider mite array? OK, whatever. The main point here is that you have a decision to make as the analyst, and you are asking on a site intended for the support of people who are having technical problems with software.

Your question has nothing to do with the software, but instead is a decision you need to make as the analyst. If you are unable to make that decision yourself you should find someone local with experience who can help you. Asking random people on the internet how you should analyze your data is a suboptimal way to proceed.

ADD REPLYlink written 9 weeks ago by James W. MacDonald49k

Dear James

My apologies for the confusion. I did not know that these were well-established human-IDs, but thaught these were random IDs.

However, I do know the decision I want to make. I want to average/group probes, including multi-mapping ones, for each spider mite gene. I just could not figure out how to do it and I thought that this forum, on which many experts (like you and mr. Smyth) and not some random people are active, might help me with my problem.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by wd20

In that case you can't use avereps, but instead will have to use regular R functions to do what you want. So assuming you want an average for each unique GenBank (or whatever) ID you have, something like (untested)

lst <- split(MA_normalized$genes$SystematicName, ",")
fac <- factor(do.call(c, lst), levels = unique(do.call(c, lst)))
mapper <- rep(1:nrow(MA_normalized), sapply(lst, length))
sumdat <- rowsum(MA_normalized$E[mapper,], fac, reorder = FALSE, na.rm = TRUE) nper <- rowsum(1L - is.na(MA_normalized$E[mapper,], fac, reorder = FALSE)
avrep <- sumdat/nper

ADD REPLYlink written 9 weeks ago by James W. MacDonald49k

Thank you James for the proposed solution, but for some reason R always stops after running the third "mapper" line.

I also do not completely understand the code. You make a factor of each ID in the SystematicName column, but then the use of mapper in the 4th and 5th line I do not understand...

Like I said 'untested'. The first line should be strsplit, not split. Here is a trivial example that might be more explanatory:

> genes <- data.frame(SystematicName = c("NR_001,NR_002", "NR_002,NR_003,NR_004", "NR_001,NR_005"))
> genes
SystematicName
1        NR_001,NR_002
2 NR_002,NR_003,NR_004
3        NR_001,NR_005

> dat <- matrix(rnorm(9), ncol = 3)
> dat
[,1]        [,2]        [,3]
[1,] -1.11347206 -0.03608268 -0.15984818
[2,] -0.04748212  0.63844231  0.62600111
[3,] -0.23618088  0.47983796 -0.08863861
> lst <- strsplit(as.character(genes\$SystematicName), ",")
> lst
[[1]]
[1] "NR_001" "NR_002"

[[2]]
[1] "NR_002" "NR_003" "NR_004"

[[3]]
[1] "NR_001" "NR_005"

> fac <- factor(do.call(c, lst), levels = unique(do.call(c, lst)))
> fac
[1] NR_001 NR_002 NR_002 NR_003 NR_004 NR_001 NR_005
Levels: NR_001 NR_002 NR_003 NR_004 NR_005

> mapper <- rep(1:nrow(dat), sapply(lst, length))
> mapper
[1] 1 1 2 2 2 3 3

## to make the next step more explicit, consider this

> z <- dat[mapper,]
> row.names(z) <- as.character(fac)
> z
[,1]        [,2]        [,3]
NR_001 -1.11347206 -0.03608268 -0.15984818
NR_002 -1.11347206 -0.03608268 -0.15984818
NR_002 -0.04748212  0.63844231  0.62600111
NR_003 -0.04748212  0.63844231  0.62600111
NR_004 -0.04748212  0.63844231  0.62600111
NR_001 -0.23618088  0.47983796 -0.08863861
NR_005 -0.23618088  0.47983796 -0.08863861

## so we are duplicating the values for each probe, for each of the genes measured
## and then we sum them up

> sumdat <- rowsum(dat[mapper,], fac, reorder=FALSE, na.rm = TRUE)
> sumdat
[,1]      [,2]        [,3]
NR_001 -1.34965294 0.4437553 -0.24848679
NR_002 -1.16095418 0.6023596  0.46615293
NR_003 -0.04748212 0.6384423  0.62600111
NR_004 -0.04748212 0.6384423  0.62600111
NR_005 -0.23618088 0.4798380 -0.08863861

> nper <- rowsum(1L - is.na(dat)[mapper,], fac, reorder = FALSE)
> nper
[,1] [,2] [,3]
NR_001    2    2    2
NR_002    2    2    2
NR_003    1    1    1
NR_004    1    1    1
NR_005    1    1    1

## this gives us the number of replicates we had for each gene, which we need to divide by to get the mean

> sumdat/nper
[,1]      [,2]        [,3]
NR_001 -0.67482647 0.2218776 -0.12424340
NR_002 -0.58047709 0.3011798  0.23307646
NR_003 -0.04748212 0.6384423  0.62600111
NR_004 -0.04748212 0.6384423  0.62600111
NR_005 -0.23618088 0.4798380 -0.08863861

ADD REPLYlink written 9 weeks ago by James W. MacDonald49k
Answer: avereps with probes mapping to multiple genes (Agilent dual color microarray)
0
9 weeks ago by
Gordon Smyth36k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth36k wrote:

You can't use avereps for this purpose. avereps is deliberately designed for cases when each probe is associated with exactly one gene.

If you really did have large numbers of probes mapping to multiple genes or transcripts, and you needed to obtain expression values for those genes or transcripts, then you would need something like a microarray version of Salmon or kallisto, but such a thing doesn't exist. I don't agree with the summing approach that you have been discussing with James.

I am sorry to be sceptical, but I have some doubts that you have done the annotation properly. I agree that aligning probes to the transcriptome should in principle be a good idea, but I am not convinced that you have proper gene annotation. If you have good quality gene annotation, then most Agilent probes should map to a unique gene. It should be very unusual for a probe to map to more than one gene.

You don't say what annotation you are using. The examples you give of probes mapping to multiple genes are not correct, firstly because the IDs you give are human rather than spider mite, and secondly because the IDs are RefSeq sequence IDs, not gene IDs.

If I were you, I would just do a standard probe-level analysis and try to figure out what genes the probes belong to at the end of the analysis. If you want to summarize expression by gene then you really, truly need gene IDs rather than transcript or sequence IDs. Neither Agilent arrays nor limma are designed for transcript-level analyses.

ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by Gordon Smyth36k

Dear Gordon and James

Thank you for your replies. A pity that a microarray version of Salmon/kallisto does not exist.

Regarding Gordons scepticism for the annotation: I do want to let you know that identical (95-100%) duplicated protein coding genes do exist in the spider mite genome (you can find the cDNA for each gene (19086 genes in total), here) and, based on Bowtie2, probes (probe seqs can be found here) map multiple times (you can find output here; probe with SAM FLAG "256" are multi-mapping probes).

As an example: tetur13g00030 is highly identical to tetur09g00010, and tetur13g00030 has only multi-mapping probes (CUST25613PI426396793, CUST25615PI426396793, CUST32781PI426396793, CUST32782PI426396793)

Same for tetur35g01130 and tetur35g01150...

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by wd20

If you have genes that are 95-100% identical then it is unrealistic to expect a microarray to be able to distinguish them. So IMO what you are trying to do is essentially impossible.

A microarray version of Salmon wouldn't help either. Salmon and kallisto rely on comprehensive transcriptome annotation and comprehensive probe coverage in order to be successful, neither of which you have, and the results very noisy even then.

The annotation file that you give simply lists 19086 sequences. Although the IDs are called "genes" on the OrcAE website, the annotation file doesn't actually distinguish genes from transcripts. Since transcripts have sequences but genes don't (because a gene is a collection of exons rather than a single sequence), it would appear that the annotation file is actually listing transcripts rather than genes. I know nothing about spider mite at all, but I'm guessing that the so-called "genes" that are 95-100% identical may actually be different transcripts for the same gene.

If I were you, I would definitely to a probe-level analysis and try to figure out what genes the probes belong to at the end of the analysis. For example, you might choose to cluster together probes that map to neighbouring genomic locations and are DE in the same direction.

That's as much help as I can give you.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by Gordon Smyth36k

I am aware that it is not realistic to distinguish between identically duplicated genes with our microarray design. However, my aim was not to be able to distinguish these genes in microarray experiments, but that I do not have to exclude the multi-mapping probes from my array analysis, and hence, when grouping the probe signal for each gene, loose the possibility to detect duplicated genes that are differentially expressed and only have multi-mapping probes (for example tetur35g01130 and tetur35g01150 would not be detected in a differential expression analysis as they only have multi-mapping probes). Based on your advice, this seems however not possible at the gene level.

Regarding the annotation. The fasta file I sent you, does contain one transcript for each gene of the spider mite genome. For now, multiple/different transcripts for each gene have not yet been predicted/generated and only one transcript per gene is available. Thus, the so-called "genes "are not a collection of different transcripts of the same gene but one transcript for each gene. Tetur35g01130 and tetur35g01150, for example, are duplicated in tandem, with a transposon in between(see screenshot here, blue boxes= genes, orange box= transposon) and for each of these genes a transcript is available in the fasta.

Thank you for your help (and time)!

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by wd20