Question

Annotation of Affy Porcine Gene 1.0 ST array data

0

Entering edit mode

aminbau14 • 0

@aminbau14-7116

Last seen 9.8 years ago

Germany

Dear List

I am beginner of R/bioconductor. I am analyzing a microarray dataset obtained from Affymetrix Porcine Gene 1.0 ST array. With the help of posts of the site, I have summarized the expression values for each array probes using ‘oligo’ package in R v3.1.2. However, I got every time the warning message below while RMA normalization.

> eset <- oligo::rma(oligoRaw)

Background correcting

Normalizing

Calculating Expression

Warning message:

'isIdCurrent' is deprecated.

Use 'dbIsValid' instead.

See help("Deprecated")

What does it mean? how can fix it?

While loading platform design package I got the warning message below-

> librarypd.porgene.1.0.st)

Warning message:

package ‘pd.porgene.1.0.st’ was built under R version 3.2.0

Does this making wrong in downstream process? As i am using R v 3.1.2

Secondly, the big battle I faced in annotation and biological interpretation of the data. I tried with ‘porcine.db’ package for annotation but not sure is it the right one or not?

With limma package I can run the lmFit function and get topTable but that contains only Affy probeset ID of eight digit number (e.g ‘15192707’, ‘15212448’,…).

How can I get the list of deferentially expressed genes with biological info like gene name, symbol, ensembl_id, fold change and adj p value ? How can I filter genes which up and down regulated in contrast group of interest?

Please help me with the suggestions/tips/codes for way to go. Thanks very much in advance.

Here is the code i am trying with-

#load libraby

library(oligo)

library(limma)

library(porcine.db)

library(pd.porgene.1.0.st)


#nomalization

oligoRaw <- read.celfiles(list.celfiles())

eset <- oligo::rma(oligoRaw)

write.exprs(eset,file="data.normal.txt")


#linear model fit

design <- model.matrix(~ 0+factor(c(1,1,1,2,2,2,3,3,3, 4,4,4,5,5,5,6,6,6,7,7,7)))

colnames(design) <- c("cont.0h", "cont.6h", "vac.6h", "cont.24h", "vac.24h", "cont.72h", "vac.72h")

rownames(design)<-colnames(eset)

fit <- lmFit(eset, design)


# contrast groups for Diff expressed genes

cont.dif1<-makeContrasts(Dif24hr=(vac.24h-cont.0h), levels=design)

fit1 <- contrasts.fit(fit, cont.dif1)

fit.eBayes <- eBayes(fit1)

topTable(fit.eBayes, coef=1, adjust="BH")

results <- decideTests(fit.eBayes)

vennDiagram(results)


#Annotation

my_frame <- data.frame(exprs(eset))

porcine()

Annot <- data.frame(ACCNUM=sapply(contents(porcineACCNUM), paste, collapse=", "), SYMBOL=sapply(contents(porcineSYMBOL), paste, collapse=", "), DESC=sapply(contents(porcineGENENAME), paste, collapse=", "))

all <- merge(Annot, my_frame, by.x=0, by.y=0, all=T)

write.table(all,file="data.annoted.txt",sep="\t")

summary(all)

sessionInfo()

R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] genefilter_1.48.1        pd.porgene.1.0.st_3.10.0 porcine.db_3.0.0        
 [4] org.Ss.eg.db_3.0.0       RSQLite_1.0.0            DBI_0.3.1               
 [7] AnnotationDbi_1.28.1     GenomeInfoDb_1.2.3       limma_3.22.1            
[10] oligo_1.30.0             Biostrings_2.34.1        XVector_0.6.0           
[13] IRanges_2.0.1            S4Vectors_0.4.0          Biobase_2.26.0          
[16] oligoClasses_1.28.0      BiocGenerics_0.12.1     

loaded via a namespace (and not attached):
 [1] affxparser_1.38.0     affyio_1.34.0         annotate_1.44.0      
 [4] BiocInstaller_1.16.1  bit_1.1-12            codetools_0.2-9      
 [7] ff_2.2-13             foreach_1.4.2         GenomicRanges_1.18.3 
[10] iterators_1.0.7       preprocessCore_1.28.0 splines_3.1.2        
[13] survival_2.37-7       tools_3.1.2           XML_3.98-1.1         
[16] xtable_1.7-4          zlibbioc_1.12.0

Please help me with the suggestions/tips/codes for way to go. Thanks very much in advance.

Sincerely

Amin

limma oligo annotation • 3.8k views

ADD COMMENT • link updated 9.7 years ago by alakatos ▴ 130 • written 10.0 years ago by aminbau14 • 0

score 2 · Answer 1 · 2015-01-05

You don't have to worry about the warning for 'isIdCurrent' this has to do with some underlying code changes that won't affect your results.

It looks like you have a devel version of pd.porgene.1.0.st. You can fix that by doing

library(BiocInstaller)
biocLite("pd.porgene.1.0.st")

##Also do this

biocLite("pig.db0")

The annotation question is a bit more tricky. Ideally you would have a porgene10sttranscriptcluster.db package to annotate with, but we haven't built an annotation package for this array. That said, it isn't that difficult to build your own. You first need to go to the Affy website (requires a free registration) http://www.affymetrix.com/Auth/analysis/downloads/na34/wtgene/PorGene-1_0-st-v1.na34.sscr2.transcript.csv.zip

Unzip this file somewhere convenient.

Then go to this gist:

EDIT: Now that's cool ^^^^^^^^^^^^^ Didn't expect that to happen.

probably the easiest thing to do is click the 'view raw' button at the bottom right of the code, then copy-paste into a text file, and save as makeTranscriptPkg.R, in the same directory that contains your PorGene-1_0-st-v1.na34.sscr2.transcript.csv file.

Now open R, change the directory to wherever you put those files, and source() the makeTranscriptPkg.R file. You can now build an annotation package by doing

pv <- parseCsvFiles("PorGene-1_0-st-v1.na34.sscr2.transcript.csv")

makePkg(pv$outname, pv$outsub, "8.2.1", "PIGCHIP_DB")

install.packages("porgene10sttranscriptcluster.db", repos = NULL, type = "source")

score 0 · Answer 2 · 2015-01-14

0

Entering edit mode

aminbau14 • 0

@aminbau14-7116

Last seen 9.8 years ago

Germany

Hi James,

Great! Many thanks for the support. I could built my package with the code you provided and it worked well except a typo, should 'parseCsv' be 'parseCsvFiles' in the first command after running the source () function.

Best regards

Amin

ADD COMMENT • link 9.9 years ago aminbau14 • 0

score 0 · Answer 3 · 2015-03-03

0

Entering edit mode

aminbau14 • 0

@aminbau14-7116

Last seen 9.8 years ago

Germany

Hi James and others,

I am curious to look for what comes out about alternate splicing events from my PorGene 1.0 ST array data set. After normalized with 'probeset' as target with oligo package, i could find some deferentially expressed probes/exons with limma, now i need to annotate them.

Could you please help me providing with the code for building package for exon level annotation (e.g. something like 'porgene10stprobeset.db').

Many thanks in advance.

Best regards

Amin

ADD COMMENT • link 9.8 years ago aminbau14 • 0

0

Entering edit mode

Just use the code above, but instead of using the transcript csv file, use the probeset csv file. But do note that the Gene ST arrays are (IMO) sort of compromised when it comes to probeset-level summarization, because a large proportion of the probesets have three or fewer probes/probeset.

> library(pd.porgene.1.0.st)

> con <- db(pd.porgene.1.0.st)

> z <- dbGetQuery(con, "select * from pmfeature;")
> table(table(z$fsetid))

    1     2     3     4     5     6     7     8     9    10    11    12    13
35931 42591 23674 13298  7779  2362  3111  1472  1953   827   675   629   967
   14    15    16    17    18    19    20    21    22    23    24    25    26
  267   429   220   226   235   235   567   324   373   408   600  4689   115
   27    28    29    30    31    32    33    34    35    38    40    43    44
  175   223    26   188    11    13    10     3    10     1     1     1     1
   50    52   268   322   407   585   697   703   813   849   873   912   914
    2     1     1     1     1     1     1     1     1     1     1     1     1
  940   942   949   952   959   960   963   968   973
    1     1     1     1     1     2     1     1     1

> cumsum(zz)/sum(zz)
        1         2         3         4         5         6         7         8
0.2484099 0.5428639 0.7065347 0.7984707 0.8522510 0.8685808 0.8900888 0.9002655
        9        10        11        12        13        14        15        16
0.9137676 0.9194851 0.9241517 0.9285003 0.9351857 0.9370316 0.9399975 0.9415185
       17        18        19        20        21        22        23        24
0.9430809 0.9447056 0.9463303 0.9502503 0.9524903 0.9550690 0.9578897 0.9620378
       25        26        27        28        29        30        31        32
0.9944554 0.9952504 0.9964603 0.9980020 0.9981817 0.9994815 0.9995575 0.9996474
       33        34        35        38        40        43        44        50
0.9997165 0.9997373 0.9998064 0.9998133 0.9998202 0.9998272 0.9998341 0.9998479
       52       268       322       407       585       697       703       813
0.9998548 0.9998617 0.9998686 0.9998756 0.9998825 0.9998894 0.9998963 0.9999032
      849       873       912       914       940       942       949       952
0.9999101 0.9999170 0.9999240 0.9999309 0.9999378 0.9999447 0.9999516 0.9999585
      959       960       963       968       973
0.9999654 0.9999793 0.9999862 0.9999931 1.0000000
>

So about 71% of probesets have 3 or fewer probes...

In addition, the probeset level of summarization isn't actually by exon - it's by what Affymetrix calls a probe set region (PSR), which may be the entire exon, or it may be just a portion of an exon, depending on how long the exon in question might be.

So you cannot blindly attribute signal from a given PSR to an exon - you have to explore further in order to determine exactly what the PSR is supposed to be measuring.

In addition, figuring out alternative splicing is more difficult than you might imagine. The Affy TAC software is supposed to be able to do alternative splicing with the Gene ST arrays, but they fit a 'splicing index' which is pretty similar to fitting a two-way ANOVA model with a 'gene' level measure and an 'exon' level measure, and looking for significant interactions. The problem with that idea is that the likelihood of coming up significant is directly correlated with exon length (e.g., longer exons, which have more probesets, have inherently lower variability, so they are more likely to appear significantly different than shorter exons, which have fewer probes and tend to have lower apparent expression and higher variability). In the end, most of the evidence for transcript variants has to do with long exons, and in particular, long exons in long genes with lots of exons. Short genes and/or short exons are usually too variable to attain significance.

ADD REPLY • link 9.8 years ago James W. MacDonald 67k

score 0 · Answer 4 · 2015-03-04

0

Entering edit mode

alakatos ▴ 130

@alakatos-6983

Last seen 5.2 years ago

United States

Hello,

I need to build an annotation package for my GeneChip® Mouse Transcriptome Assay 1.0 array.

I followed James' code but it did not work for this platform.

Would you please help to make it work?

Thank you in advance.

Anita

>pv <- parseCsvFiles("MTA-1_0.na34.1.mm10.transcript.csv")

>ls()

[1] "makePkg"       "parseCsvFiles" "pv"

>pv
$outname
[1] "mta10"

$outsub
[1] "transcript"

>makePkg(pv$outname, pv$outsub, "r1", "MOUSECHIP_DB")

Error in sqliteSendQuery(con, statement, bind.data) : error in statement: table metadata already exists

ADD COMMENT • link 9.8 years ago alakatos ▴ 130

0

Entering edit mode

The code will fail if you already have a partially-built package in your working directory. You need to remove those files first, then run the code:

unlink(dir(".", "db$|sqlite$"), recursive = TRUE)

makePkg(pv$outname, pv$outsub, "r1", "MOUSECHIP_DB")

ADD REPLY • link 9.8 years ago James W. MacDonald 67k

score 0 · Answer 5 · 2015-03-04

0

Entering edit mode

alakatos ▴ 130

@alakatos-6983

Last seen 5.2 years ago

United States

James, 
Another error message.
Would you please help?
Thank you.
Anita 

>unlink(dir(".", "db$|sqlite$"), recursive = TRUE)

> makePkg(pv$outname, pv$outsub, "r1", "MOUSECHIP_DB")
baseMapType is gb or gbNRef

Error in sqliteSendQuery(con, statement, bind.data) : error in statement: no such table: src.accession

In addition: Warning message: 'dbBeginTransaction' is deprecated. Use 'dbBegin' instead. See help("Deprecated")

ADD COMMENT • link 9.8 years ago alakatos ▴ 130

0

Entering edit mode

How about

library(BiocInstaller)

biocLite("mta10sttranscriptcluster.db")

ADD REPLY • link 9.8 years ago James W. MacDonald 67k

0

Entering edit mode

It works. Thank you very much for making it.

Anita

ADD REPLY • link 9.8 years ago alakatos ▴ 130

0

Entering edit mode

Hello,

I processed the my GeneChip® Mouse Transcriptome Assay 1.0 (MTA10) arrays.

Command:

library(affy)

raw <- ReadAffy(cdfname = "mta10.r1.genecdf", phenoData="phenotype.txt")

boxplot(raw)

Warning message: In data.row.names(row.names, rowsi, i) : some row.names duplicated: 16,17,18,23,24,25,26,27,28,29,31,33,34,35,36,38,42,43,44,45,47,49,50,51,54,55,56,62,64,66,67,69,70,71,73,74,76,81,82,83,85,87,88,89,91,92,94,96,97,98,99,101,102,105,108,109,112,115,117,118,119,121,124,125,127,128,129,131,132,133,134,135,137,138,139,140,142,144,145,146,147,151,152,153,154,155,156,159,161,162,166,169,176,180,181,183,186,190,191,192,193,194,195,197,198,203,208,211,212,213,215,220,222,223,224,226,228,229,232,233,236,238,241,242,244,245,246,247,248,251,254,255,257,261,262,266,267,270,273,274,275,276,278,280,284,286,287,288,293,294,295,297,298,299,302,308,309,312,314,316,318,321,323,326,327,328,330,331,333,336,338,339,340,341,344,346,349,354,356,357,359,360,361,362,364,365,366,368,370,374,379,380,381,382,383,386,387,391,392,393,396,398,399,400,403,404,405,409,410,413,415,416,420,422,424,425,426,428,429,430,434,435,436,439,440,441,442,443,444,446,447,450,452,453,454,458,460,461,464,465,466,467,470,475,478,480,482,484,486,487,490,491,493,494,495,498, [... truncated]

eset = rma(raw, target="core")
annotation(eset) <- "mta10sttranscriptcluster.db"

              0215F-02_01-(MTA-1_0).CEL        0215F-02_02-(MTA-1_0).CEL      etc    
20550008              5.609107                                    5.647651
20550009              6.488896                                    6.488896
20550010              6.284914                                    6.284914
20550011              6.738269                                    6.738269
20550012              6.668315                                    6.668315

In addition, the same files were processed in Affy Expression Console (SST-RMA Gene level).

When I compared the 2 results, the boxplots (Expression console (25% ~ 4.2; median ~6; 75%~ 8), R(affy)(25% ~ 6.25; median ~6.6; 75%~ 7) and the probe_ids were different. I double-checked the probe-ids in the original MTA10 annotation file.

Probe_ids in the MTA10 annotation file: transcript_cluster_id	probeset_id
TC0100000001.mm.1	TC0100000001.mm.1
TC0100000002.mm.1	TC0100000002.mm.1
TC0100000003.mm.1	TC0100000003.mm.1
TC0100000004.mm.1	TC0100000004.mm.1
TC0100000005.mm.1	TC0100000005.mm.1
TC0100000006.mm.1	TC0100000006.mm.1
TC0100000007.mm.1	TC0100000007.mm.1
TC0100000008.mm.1	TC0100000008.mm.1 etc.


question 1: I am wondering if I used the correct tool ("affy") to process this type of array? question 2: Why is there a discrepancy between R processed and Affy expression Console processed result with regards to expression values and annotation. question 3: Is there any better tool(s) (e.g "puma") or recommendation to process this array ?
Thank you for your help in advance. Anita P.S: the completion mta10cdf file is posted (C: mta10cdf not installed Bioconductor )
R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] oligo_1.30.0 Biostrings_2.34.1 XVector_0.6.0 [4] oligoClasses_1.28.0 mta10sttranscriptcluster.db_8.2.0 org.Mm.eg.db_3.0.0 [7] mta10.r1.genecdf_1.42.0 affy_1.44.0 biomaRt_2.22.0 [10] gplots_2.16.0 ggplot2_1.0.0 GGally_0.5.0 [13] hwriter_1.3.2 ReportingTools_2.6.0 RSQLite_1.0.0 [16] DBI_0.3.1 knitr_1.9 limma_3.22.6 [19] annotate_1.44.0 XML_3.98-1.1 AnnotationDbi_1.28.1 [22] GenomeInfoDb_1.2.4 IRanges_2.0.1 S4Vectors_0.4.0 [25] genefilter_1.48.1 Biobase_2.26.0 BiocGenerics_0.12.1 [28] RColorBrewer_1.1-2

ADD REPLY • link 9.7 years ago alakatos ▴ 130

0

Entering edit mode

I didn't notice before that you hijacked this thread. Please don't ask random questions on an existing thread that have nothing to do with that thread. If you have a question, open a new thread.

ADD REPLY • link 9.7 years ago James W. MacDonald 67k