Question

Affymetrix Rat Transcriptome Array (Clariom D) annotation

0

Entering edit mode

Ed D • 0

@ed-d-11474

Last seen 8.6 years ago

I'm trying to analyze some Affymetrix Rat Clariom D array data, and from what I understand, it's basically the same as Rat Transcriptome Array. There are annotation packages in Bioconductor for both the human and mouse versions of this array platform (hta20sttranscriptcluster.db and mta10sttranscriptcluster.db), but nothing for the rat.

Are there any plans in creating an annotation library for the rat arrays?

I've also tried using pdInfoBuilder/oligo with the necessary files downloaded from Affymetrix, but encounter an error. I'm running R version 3.2.4 on OS X (10.11.6) with Bioconductor version 3.2.

> library(pdInfoBuilder)
> p <- new("AffyHTAPDInfoPkgSeed",
+          clfFile = "RTA-1_0.r3.clf",
+          pgfFile = "RTA-1_0.r3.pgf",
+          coreMps = "RTA-1_0.r3.Psrs.mps",
+          transFile = "RTA-1_0.na36.1.rn6.transcript.csv",
+          probeFile = "RTA-1_0.na36.1.rn6.probeset.csv",
+          author = "",
+          email = "someone@gmail dot com",
+          genomebuild="rn6",
+          organism="Rattus norvegicus",
+          species="Rattus norvegicus",
+          version = "0.0.1")
> makePdInfoPackage(p, unlink=TRUE)
==========================================================================================================================================
Building annotation package for Affymetrix HTA Array
PGF.........: RTA-1_0.r3.pgf
CLF.........: RTA-1_0.r3.clf
Probeset....: RTA-1_0.na36.1.rn6.probeset.csv
Transcript..: RTA-1_0.na36.1.rn6.transcript.csv
Core MPS....: RTA-1_0.r3.Psrs.mps
==========================================================================================================================================
Parsing file: RTA-1_0.r3.pgf... OK
Parsing file: RTA-1_0.r3.clf... OK
Creating initial table for probes... OK
Creating dictionaries... OK
Parsing file: RTA-1_0.na36.1.rn6.probeset.csv... OK
Parsing file: RTA-1_0.r3.Psrs.mps... OK
Creating package in ./pd.rta.1.0 
Inserting 2760 rows into table chrom_dict... OK
Inserting 5 rows into table level_dict... OK
Inserting 10 rows into table type_dict... OK
Inserting 425813 rows into table core_mps... OK
Inserting 726149 rows into table featureSet... OK
Inserting 6048415 rows into table pmfeature... OK
Counting rows in chrom_dict
Counting rows in core_mps
Counting rows in featureSet
Counting rows in level_dict
Counting rows in pmfeature
Counting rows in type_dict
Creating index idx_pmfsetid on pmfeature... OK
Creating index idx_pmfid on pmfeature... OK
Creating index idx_fsfsetid on featureSet... OK
Creating index idx_core_meta_fsetid on core_mps... OK
Creating index idx_core_fsetid on core_mps... OK
Saving DataFrame object for PM.
Saving NetAffx Annotation... Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘’

annotation microarray pdinfobuilder oligo • 2.8k views

ADD COMMENT • link 8.6 years ago Ed D • 0

score 1 · Answer 1 · 2016-09-13

Well, that last bit is quite the bummer! It takes like forever just to get to that step, so I feel your pain.

Affymetrix has an unfortunate habit of changing their file formats with no apparent rationale, so it makes it very difficult to create software to parse said files, given the shifting landscape. Well, not difficult per se, but a pain to deal with, given all the patches that need to be introduced for each array's infelicities.

This is another instance of random changes for obscure reasons. Back in the good old days (like two years ago, before we were inflicted with these newfangled transcript abominations), one could assume that the column labeled 'probeset_id' in the transcript csv file would be identical to the 'transcript_cluster_id' column, and one could use them interchangeably. Or more to the point, use the probeset_id column and ignore the transcript_cluster_id column.

Those days are over, and now the two columns are a mishmash of different IDs with no apparent purpose. Your problem arises because there are 318 (!) rows in the transcript csv file that have nothing in the probeset_id column. Because obviously. There are a bunch of IDs in the transcript_cluster_id column, and hypothetically they are useful, and can be simply copied over to the probeset_id column. Which is what will probably have to happen in the code for pdInfoBuilder. But for now you can just kludge it up yourself:

> z <- read.csv("RTA-1_0.na36.1.rn6.probeset.csv",header=TRUE, comment.char="#", na.strings="---", stringsAsFactors=FALSE)
> z$probeset_id <- ifelse(z$probeset_id == "", z$transcript_cluster_id, z$probeset_id)
> write.csv(z, "ugh_really.csv")
Edit: You don't have to do this part - it's just a check to see if it will get read in
## now make sure this reads in and gets converted correctly
> z <- pdInfoBuilder:::annot2fdata("ugh_really.csv")

> z
An object of class 'AnnotatedDataFrame'
  rowNames: AFFX-BkGr-GC03_st AFFX-BkGr-GC04_st ... RG1-pos-10701618_st
    (717875 total)
  varLabels: x probesetid ... mitomap (47 total)
  varMetadata: labelDescription

And now you should be able to generate the pdInfoPackage, but remember to point to your corrected transcript csv file instead of the original you got from Affy. Something like

> p <- new("AffyHTAPDInfoPkgSeed",
+          clfFile = "RTA-1_0.r3.clf",
+          pgfFile = "RTA-1_0.r3.pgf",
+          coreMps = "RTA-1_0.r3.Psrs.mps",
+          transFile = "ugh_really.csv",
+          probeFile = "RTA-1_0.na36.1.rn6.probeset.csv",
+          author = "",
+          email = "someone@gmail dot com",
+          genomebuild="rn6",
+          organism="Rattus norvegicus",
+          species="Rattus norvegicus",
+          version = "0.0.1")

score 0 · Answer 2 · 2016-09-14

0

Entering edit mode

Ed D • 0

@ed-d-11474

Last seen 8.6 years ago

That workaround did the trick. Thanks!

ADD COMMENT • link 8.6 years ago Ed D • 0

score 0 · Answer 3 · 2016-09-14

0

Entering edit mode

Ed D • 0

@ed-d-11474

Last seen 8.6 years ago

So I've tried using the new annotation library as described above and then went through the analysis using limma and topTable to get the list of differentially expressed probes. Now I'm trying to use that annotation library to help me get the list of RefSeq and/or gene symbols for those probes.

I followed the code that was outlined in an early post, Probeset ID's to entrez for Clariom D Human Affy Array, and now I'm running into a new problem.

> library(limma)
> library(oligo)
> library(affycoretools)
> library(pd.rta.1.0)
>
> targets <- readTargets("data/targets.txt", sep="\t")
> 
> affyRaw <- read.celfiles(paste0("data/09-08-16/",targets$Filename))
Platform design info loaded.
Reading in : data/09-08-16/090716DS01.CEL
Reading in : data/09-08-16/090716DS02.CEL
Reading in : data/09-08-16/090716DS03.CEL
Reading in : data/09-08-16/090716DS04.CEL
Reading in : data/09-08-16/090716DS05.CEL
Reading in : data/09-08-16/090716DS06_2.CEL
Reading in : data/09-08-16/090716DS07.CEL
Reading in : data/09-08-16/090716DS08_2.CEL
Reading in : data/09-08-16/090716DS09.CEL
Reading in : data/09-08-16/090716DS10.CEL
Reading in : data/09-08-16/090716DS11.CEL
Reading in : data/09-08-16/090716DS12.CEL
>
> normData <- rma(affyRaw, target="core")
Background correcting
Normalizing
Calculating Expression
>
> normData2 <- annotateEset(normData, pd.rta.1.0)
Error in `row.names<-.data.frame`(`*tmp*`, value = c("NA", "NA", "NA",  : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': 'NA', 'TC0100000002.rn.1', 'TC0100000006.rn.1', 'TC0100000007.rn.1', 'TC0100000009.rn.1', 'TC0100000010.rn.1', 'TC0100000011.rn.1', 'TC0100000013.rn.1', 'TC0100000015.rn.1', 'TC0100000016.rn.1', 'TC0100000019.rn.1', 'TC0100000020.rn.1', 'TC0100000021.rn.1', 'TC0100000022.rn.1', 'TC0100000023.rn.1', 'TC0100000025.rn.1', 'TC0100000027.rn.1', 'TC0100000028.rn.1', 'TC0100000029.rn.1', 'TC0100000030.rn.1', 'TC0100000031.rn.1', 'TC0100000034.rn.1', 'TC0100000035.rn.1', 'TC0100000036.rn.1', 'TC0100000039.rn.1', 'TC0100000040.rn.1', 'TC0100000041.rn.1', 'TC0100000042.rn.1', 'TC0100000043.rn.1', 'TC0100000044.rn.1', 'TC0100000047.rn.1', 'TC0100000048.rn.1', 'TC0100000049.rn.1', 'TC0100000054.rn.1', 'TC0100000055.rn.1', 'TC0100000057.rn.1', 'TC0100000058.rn.1', 'TC0100000059.rn.1', 'TC0100000061.rn.1', 'TC0100000071.rn.1', 'TC0100000077.rn.1', 'TC0100000078.rn.1', 'TC0100000079.rn.1', 'TC0100000080.rn.1', 'TC0100000081.rn.1', 'TC0100000083.rn.1', 'TC01 [... truncated]

ADD COMMENT • link 8.6 years ago Ed D • 0

0

Entering edit mode

Please don't post further questions using the 'Add your answer' box - a new question is by definition not an answer!

I'll take a look at this, but I doubt it will be this week.

ADD REPLY • link 8.6 years ago James W. MacDonald 68k

0

Entering edit mode

So, I am the original question asker on the post you referenced. from having dealt with this (but for the human array) I was able to find a work around.

The problem with mine ended up being when you try to annotate the object as an expression set, you can not have repeat names. As this array has stuff that does not map to a gene you are going to get a fair amount of 'NA's.

##Get Gene ID's from Netaffx files
celfiles.rma2 <- celfiles.rma
featureData(celfiles.rma2) <- getNetAffx(celfiles.rma2, "transcript")
geneid <- pData(featureData(celfiles.rma2))["geneassignment"]

##Extract the gene names from the other data and generate a matrix with the gene names
geneid.d <- data.frame(geneid)
b <- strsplit(as.character(geneid.d$geneassignment), "//", fixed = TRUE)
c <- sapply(b, "[", 1:3)
d <- c[2, ]
d <- as.matrix(d)

library(stringr)
d2 <- apply(d, 1, str_trim)
d2 <- as.matrix(d2)

##replace rownames from affy ID to gene names
exp_g <- exprs(celfiles.rma)
rownames(exp_g) <- d2
exp_g2 <- exp_g[!is.na(rownames(exp_g)), ]

##get Entrez ID from gene names
library(gage)
library(org.Hs.eg.db)
new_EG <- as.data.frame(org.Hs.egSYMBOL2EG)
new_EG <- as.matrix(new_EG)
data("egSymb")
egSymb <- new_EG
rn_eg <- sym2eg(rownames(exp_g2))
exp_g2_eg <- exp_g2
rownames(exp_g2_eg) <- rn_eg
exp_g2_eg2 <- exp_g2_eg[!is.na(rownames(exp_g2_eg)), ]

## add back gene names in a new column for later retrieval and save
eg2_sym <- eg2sym(rownames(exp_g2_eg2))
exp_g2_eg2_sym <- cbind(eg2_sym, exp_g2_eg2)
write.table(exp_g2_eg2_sym, "OvarianFPN.csv", row.names = TRUE, col.names = NA, sep = ",")

##remove duplicate genes only keeping those with the maximum average expression

OvarianFPN_exp <- exp_g2_eg2
get_mean=apply(OvarianFPN_exp, 1, mean)
sel_mean=tapply(1:nrow(OvarianFPN_exp), rownames(OvarianFPN_exp), function(x){x[which.max(get_mean[x])]})
OvarianFPN_exp_Unique <- OvarianFPN_exp[sel_mean, ]
write.table(OvarianFPN_exp_Unique, 'OvarianFPN_exp_Unique.csv', row.names = TRUE, col.names = NA, sep = ',')

##generate a gene name version for use later
OvarianFPN_exp_Unique_sym <- eg2sym(rownames(OvarianFPN_exp_Unique))
OvarianFPN_exp_Unique_annotation <- cbind(OvarianFPN_exp_Unique_sym, OvarianFPN_exp_Unique)
write.table(OvarianFPN_exp_Unique_annotation, 'OvarianFPN_exp_Unique_annotation.csv', row.names = TRUE, col.names = NA, sep = ',')

## Generate an expresion set
library(genefilter)
exprs<-as.matrix(OvarianFPN_exp_Unique)
pData<- read.table("phenodata.txt",row.names=1,header=TRUE,sep=",") 
phenoData <- new("AnnotatedDataFrame",data=pData)
Ovarian_ExpSet<-ExpressionSet(assayData=exprs,phenoData=phenoData,annotation="org.Hs.eg")

I can not promise 100% that this is the right work around, as I am just as new to this chip as you, and possibly newer to micro array analysis in general, but this is what worked for me.

This is after you do RMA normalization. also, you will need to make a pData.txt file. where the first row is just the names of your samples (and the name needs to match the sample names as they are in your expression set). In my code I used ',' as my separator, but you an use a tab delineated file (I believe that is the default).

Also, the str-split code when applying gene names may need to change for you slightly (if you need it at all). The Clariom D Human array lists the gene id as a concatenation of the gene id, the full written out name and some other things. In my case the gene id i needed was always listed in the second location. While they may also have Entrez id's listed in some locations, i found that the netAffx fie for the Clariom D Human had not been updated since 2013, so i broke it into two steps so I would have more updated mappings of the Affyids.

If that is not your speed you can also look into extracting the sequence information (also contained in the netAffx, calling Var on the netAffx object should allow you to see all the data types contained) you could remap yourself, but that is not something I know how to do.

ADD REPLY • link 8.6 years ago jayme.rickman • 0