Question: DGE matrix annotation
0
gravatar for maxininfa
14 months ago by
maxininfa0
maxininfa0 wrote:

Hi everyone!

I'm trying to apply pathway analysis to Array Express experiments. For that I downloaded the processed data, along with the sample and data relationship files (e.g. https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-15641/). I made every value an integer (I know this isn't how the proper analysis is made, but this is just a fast way to corraborate suspicions on something) and prepared bouth countData and colData files (both .csv).

colData:

sample type
GSM391107 normal human kidney tissue

countData:

ID_REF GSM391107 GSM391108 GSM391109 GSM391110 GSM391111
AFFX-TrpnX-M_at 24 30 25 21 22

Now in R Studio I run the following commands:
library(tidyverse)
files <- list.files()
metadata <-  read_csv(files[1])
counts <- read_csv(files[2])

library(DESeq2)
mycounts <- as.data.frame(counts)
metadata <- as.data.frame(metadata)

dds <- DESeqDataSetFromMatrix(countData=mycounts,colData=metadata,design=~type,tidy=TRUE)
dds <- DESeq(dds)
res <- results(dds, tidy=TRUE)

res <- tbl_df(res)

res %>% 

filter(padj<0.05) %>% 

write_csv("DGE.csv")

DGE.csv:

row baseMean log2FoldChange lfcSE stat pvalue padj
AFFX-TrpnX-5_at 51.4388012667095 0.422155144223 0.0677142709644 6.2343058862228 4.53785621436E-10 2.52792643745E-09
So now I want to run pathway analysis according to this guide: http://www.gettinggeneticsdone.com/2015/12/tutorial-rna-seq-differential.html

The experiment was made with Affymetrix Gene Chip HGU133A, so I:
library("AnnotationDbi") 
library("hgu133a.db")

And when I'm trying to merge annotation data to my matrix with:

res$entrez = mapIds(hgu133a.db, keys=row.names(res), column="ENTREZID", keytype="PROBEID", multiVals="first")
Error in .testForValidKeys(x, keys, keytype, fks) : 
  None of the keys entered are valid keys for 'PROBEID'. Please use the keys method to see a listing of valid arguments.

If I just type: res

# A tibble: 22,283 x 7
   row          baseMean log2FoldChange  lfcSE     stat   pvalue     padj
   <chr>           <dbl>          <dbl>  <dbl>    <dbl>    <dbl>    <dbl>
 1 AFFX-TrpnX-…     25.7        0.134   0.0882   1.52   1.28e- 1 1.84e- 1
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_AR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_AR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_AR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] hugene10sttranscriptcluster.db_8.7.0
 [2] annotate_1.56.2                     
 [3] XML_3.98-1.16                       
 [4] hgu133a.db_3.2.3                    
 [5] hgu95av2.db_3.2.3                   
 [6] org.Hs.eg.db_3.5.0                  
 [7] AnnotationDbi_1.40.0                
 [8] BiocInstaller_1.28.0                
 [9] DESeq2_1.18.1                       
[10] SummarizedExperiment_1.8.1          
[11] DelayedArray_0.4.1                  
[12] matrixStats_0.54.0                  
[13] Biobase_2.38.0                      
[14] GenomicRanges_1.30.3                
[15] GenomeInfoDb_1.14.0                 
[16] IRanges_2.12.0                      
[17] S4Vectors_0.16.0                    
[18] BiocGenerics_0.24.0                 
[19] forcats_0.3.0                       
[20] stringr_1.3.1                       
[21] dplyr_0.7.6                         
[22] purrr_0.2.5                         
[23] readr_1.1.1                         
[24] tidyr_0.8.1                         
[25] tibble_1.4.2                        
[26] ggplot2_3.0.0                       
[27] tidyverse_1.2.1                     

loaded via a namespace (and not attached):
 [1] nlme_3.1-131           bitops_1.0-6           lubridate_1.7.4       
 [4] bit64_0.9-7            RColorBrewer_1.1-2     httr_1.3.1            
 [7] tools_3.4.4            backports_1.1.2        utf8_1.1.4            
[10] R6_2.2.2               rpart_4.1-13           Hmisc_4.1-1           
[13] DBI_1.0.0              lazyeval_0.2.1         colorspace_1.3-2      
[16] nnet_7.3-12            withr_2.1.2            tidyselect_0.2.4      
[19] gridExtra_2.3          bit_1.1-14             compiler_3.4.4        
[22] cli_1.0.1              rvest_0.3.2            htmlTable_1.12        
[25] xml2_1.2.0             scales_1.0.0           checkmate_1.8.5       
[28] genefilter_1.60.0      digest_0.6.17          foreign_0.8-69        
[31] XVector_0.18.0         base64enc_0.1-3        pkgconfig_2.0.2       
[34] htmltools_0.3.6        htmlwidgets_1.3        rlang_0.2.2           
[37] readxl_1.1.0           rstudioapi_0.7         RSQLite_2.1.1         
[40] bindr_0.1.1            jsonlite_1.5           BiocParallel_1.12.0   
[43] acepack_1.4.1          RCurl_1.95-4.11        magrittr_1.5          
[46] GenomeInfoDbData_1.0.0 Formula_1.2-3          Matrix_1.2-14         
[49] fansi_0.3.0            Rcpp_0.12.19           munsell_0.5.0         
[52] stringi_1.2.4          yaml_2.2.0             zlibbioc_1.24.0       
[55] plyr_1.8.4             blob_1.1.1             grid_3.4.4            
[58] crayon_1.3.4           lattice_0.20-35        haven_1.1.2           
[61] splines_3.4.4          hms_0.4.2              locfit_1.5-9.1        
[64] knitr_1.20             pillar_1.3.0           geneplotter_1.56.0    
[67] glue_1.3.0             latticeExtra_0.6-28    data.table_1.11.8     
[70] modelr_0.1.2           cellranger_1.1.0       gtable_0.2.0          
[73] assertthat_0.2.0       xtable_1.8-3           broom_0.5.0           
[76] survival_2.41-3        memoise_1.1.0          bindrcpp_0.2.2        
[79] cluster_2.0.7-1  

 

I've been trying lots of guides and I can't get past through the annotation. When I type:

select(hgu133a.db, c("74694_s_at"), c("SYMBOL","ENTREZID", "GENENAME"))
'select()' returned 1:1 mapping between keys and columns
     PROBEID SYMBOL ENTREZID
1 74694_s_at RABEP2    79874
                                         GENENAME
1 rabaptin, RAB GTPase binding effector protein 2

I'm new to R, so I'm betting this has something to do with a very basic deal, hope you don't mind telling me.

Thank you for your time!

Maxi

 

annotationdbi deseq2 dge • 637 views
ADD COMMENTlink modified 14 months ago by James W. MacDonald52k • written 14 months ago by maxininfa0

"I know this isn't how the proper analysis is made, but this is just a fast way ..."

You can't save time by doing something nonsensical, and treating normalized microarray intensities as if they were counts is in that category. You're following a guide that isn't at all designed for the type of data you have, as James has noted.

 

ADD REPLYlink modified 14 months ago • written 14 months ago by Gordon Smyth39k
Answer: DGE matrix annotation
3
gravatar for James W. MacDonald
14 months ago by
United States
James W. MacDonald52k wrote:

First off, DESeq2 is not the right package to use for Affymetrix arrays! Did you get that idea somewhere? It's a bad idea, so you should stop doing that and use limma.

Secondly, you are using an old version of R/Bioconductor so you should update.

Third, if I am not mistaken, using row.names on a tibble returns the row names rather than a column called 'row'. I could be wrong, as I avoid as much of the tidyverse as possible; my go-to move when confronted with a tibble is to wrap it in as.data.frame, so I can interact with it like a civilized person. But to my jaundiced eye, it looks like the row names for that tibble are just 1:22283.

Anyway, if you get an error from mapIds that says the keys you are using aren't valid keys, that means that the keys you are using aren't the things you think they are. Your next move should be to look at the keys and try to figure out what the problem is. My guess is you are extracting 1:22282, which are not Affy IDs, and which is causing the error.

ADD COMMENTlink written 14 months ago by James W. MacDonald52k

Hi James!
1) I didn't know that there was such a restricted way to do things regarding this data, thank you for pointing it out.

2) I think I have this particulars versions because there are some packages that arent supported in newer versions (now I can't remember which, but it truly rings a bell in my brain)
3) Im such a newby that I never dug on what a tibble was, I simply followed the guide I found. I find your phrase "I can interact with it like a civilized person" made me laugh hard, my PhD lab partner and mentor always scold me on how caveman my scripting was (still is). You are right on that the row.names I'm getting are just 1:22283. When I run:

 

> (res[[1]])
   [1] "AFFX-TrpnX-M_at"             "AFFX-TrpnX-5_at"            
   [3] "AFFX-TrpnX-3_at"             "AFFX-ThrX-M_at" 

So as you can see, the column is there, it's just that I'm not accessing it right.

I will get into the data I'm using so I can deal with this, but if I can't I will follow your advice and try wrapping it in as.data.frame

Thank you for your time!

ADD REPLYlink written 14 months ago by maxininfa0

1.) Any time you are doing inferential statistics, the first thing you have to do is figure out the null distribution of your test statistic. In other words, you want to be able to say what the likely range and frequency of your test statistic will be, under the null distribution, because you will compare your observed results to that null distribution in order to say how likely it is that your observed statistic is something you expect to see (under the null) or not. For count data, like what you get from doing RNA-Seq, the canonical null distribution is the Poisson distribution, and for over-dispersed counts like RNA-Seq, you might use the negative binomial distribution. That is what DESeq uses. But you don't have count data! You have data, that upon taking logs, is reasonably 'hump-shaped', in which case it's conventional to say the null distribution is the t-distribution, which is what the limma package uses.

So the restriction here is that the underlying null distribution for DESeq2 isn't reasonable for microarray data, which are continuously distributed rather than being discrete counts.

2.) If there are packages that aren't supported in the new version of R, that implies that they are no longer supported, in which case you should strongly consider finding a package that is supported that will accomplish the task you want to accomplish. Using busted old unsupported packages is a recipe for disaster, IMO.

 

ADD REPLYlink written 14 months ago by James W. MacDonald52k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 156 users visited in the last hour