Question

Problem encountered when running GSVA for RNAseq

0

Entering edit mode

polly0806 • 0

@polly0806-6951

Last seen 9.5 years ago

United States

Dear all:

I'm a new R user but attempted to use GSVA to analyze a RNAseq gene read count table downloaded from GTEx (http://www.gtexportal.org/home/). I have previously used GSVA to analyze Affymetrix data without issue, but I have no clue about the issue encountered when analyzing the RNAseq read count table (annotation issue, maybe???). For preprocessing, I normalized using DEseq (LogGeoMean), and filtered out genes with low count. The gene IDs are Ensemble ID such as ENSG00000227232. Here is the info for the expressionset I made:

> ExprSetName
ExpressionSet (storageMode: lockedEnvironment)
assayData: 20831 features, 187 samples
  element names: exprs
protocolData: none
phenoData
  sampleNames: GTEX.N7MS.0007.SM.2D7W1 GTEX.N7MT.0007.SM.3GACQ ... GTEX.Y8E5.0006.SM.47JWQ (187 total)
  varLabels: SUBJID GENDER ... ScaleFactor (9 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: org.Hs.eg.db

However, when I tried to run GSVA, the resulting gene set list is empty with the following message:

> traceback()
5: stop("The gene set list is empty!  Filter may be too stringent.")
4: GSVA:::.gsva(exprs(expr), mapped.gset.idx.list, method, rnaseq,
       abs.ranking, no.bootstraps, bootstrap.percent, parallel.sz,
       parallel.type, mx.diff, tau, kernel, verbose)
3: .local(expr, gset.idx.list, annotation, ...)
2: gsva(ExprSetName, c2BroadSets, rnaseq = TRUE, abs.ranking = FALSE,
       min.sz = 1, max.sz = 1000, mx.diff = TRUE, parallel.sz = 2)
1: gsva(ExprSetName, c2BroadSets, rnaseq = TRUE, abs.ranking = FALSE,
       min.sz = 1, max.sz = 1000, mx.diff = TRUE, parallel.sz = 2)

The following is my session info:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252   

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] GSVAdata_0.99.11     hgu95a.db_2.10.1     BiocInstaller_1.12.1 genefilter_1.44.0    GSVA_1.10.3          GSEABase_1.24.0      graph_1.40.1         annotate_1.40.1      org.Hs.eg.db_2.10.1
[10] RSQLite_0.11.4       DBI_0.3.1            AnnotationDbi_1.24.0 Biobase_2.22.0       BiocGenerics_0.8.0 

loaded via a namespace (and not attached):
[1] IRanges_1.20.7  splines_3.1.1   stats4_3.1.1    survival_2.37-7 tools_3.1.1     XML_3.98-1.1    xtable_1.7-4  

> packageDescription('GSVA')$Maintainer
[1] "Justin Guinney <justin.guinney@sagebase.org>"

Thank you :)

gsva rnaseq • 3.1k views

ADD COMMENT • link updated 9.5 years ago by Robert Castelo ★ 3.3k • written 9.5 years ago by polly0806 • 0

score 2 · Answer 1 · 2014-10-30

Hi,

The ExpressionSet object you are showing has the organism-centric annotation package org.Hs.eg.db in its annotation slot. This annotation package is anchored at Entrez gene identifiers and since the feature names in your ExpressionSet object are Ensembl gene identifiers, the underlying machinery to match identifiers between gene sets and features in the expression data does not find any match.

One solution would be to translate your feature names in the ExpressionSet object from Ensembl gene identifiers to Entrez gene identifiers. For that purpose you could use the 'select()' function on the org.Hs.eg.db package as follows.

library(org.Hs.eg.db)
ensft <- featureNames(ExprSetName)
egft <- select(org.Hs.eg.db, keys=ensft, columns="ENTREZID", keytype="ENSEMBL")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows
head(egft)
          ENSEMBL ENTREZID
1 ENSG00000127720    84190
2 ENSG00000242018     <NA>
3 ENSG00000224440     <NA>
4 ENSG00000214453     <NA>
5 ENSG00000237787   152118
6 ENSG00000051596    84321

Since I do not have your data, I have used some other Ensembl gene identifiers which may recreate your situation.

In this case, note that the there is a warning message telling us that one Ensembl gene identifier may correspond to more than one Entrez gene identifier. Likewise, as you can see from the first few lines of the mapped identifiers, some Ensembl gene identifiers have no corresponding Entrez gene identifier. You will have to discard this latter subset of genes. With respect to the former, you will have to make a decision as to what of the multiple Entrez gene identifiers you want to choose. One option is to take one arbitrarily, but others in the Bioconductor community may give a better advice.

First, discard genes without Entrez gene identifier.

ensmissing <- egft$ENSEMBL[is.na(egft$ENTREZID)]
ExprSetName2 <- ExprSetName[!featureNames(ExprSetName) %in% ensmissing, ]

Second, pick arbitrarily one of the multiple Entrez gene identifers:

egft <- egft[!egft$ENSEMBL %in% ensmissing, ]
head(egft)
           ENSEMBL ENTREZID
1  ENSG00000127720    84190
5  ENSG00000237787   152118
6  ENSG00000051596    84321
11 ENSG00000135541    54806
13 ENSG00000143376    81609
17 ENSG00000172172    28998
egftxens <- split(egft$ENTREZID, egft$ENSEMBL)
ens2eg <- sapply(egftxens, "[", 1)
featureNames(ExprSetName2) <- ens2eg[featureNames(ExprSetName2)]

You will have to take a look to the intermediate objects generated with these instructions if you want to fully understand what are they doing. After them, try again to run GSVA and hopefully the problem is solved. You have not shown how the gene sets are stored, if you run into problems again please show how the gene sets object looks like.

cheers,

robert.

score 0 · Answer 2 · 2014-10-30

Hi Robert:

Thank you very much for your promote reply and finding the issue quickly:) I tried your codes, and something is not quite right (sorry I don't have much R programming experience so don't know what went wrong and how to fix it, something in the last step).

> dim(egft)
[1] 21814 2

> ensmissing <- egft$ENSEMBL[is.na(egft$ENTREZID)]
> ensmissing
[1] "ENSG00000243970" "ENSG00000271723" "ENSG00000236085" "ENSG00000171943" "ENSG00000196369" "ENSG00000253193" "ENSG00000201129"
[8] "ENSG00000213033" "ENSG00000213453" "ENSG00000212175" "ENSG00000257207" "ENSG00000202363" "ENSG00000253092" "ENSG00000200418"
[15] "ENSG00000200320" "ENSG00000215837" "ENSG00000215630" "ENSG00000177839" "ENSG00000271361" "ENSG00000239650" "ENSG00000207062"
[22] "ENSG00000223705" "ENSG00000176227" "ENSG00000251402" "ENSG00000198566" "ENSG00000204807" "ENSG00000189357" "ENSG00000186788"
[29] "ENSG00000159247" "ENSG00000272373" "ENSG00000189014" "ENSG00000221164" "ENSG00000242338" "ENSG00000250959" "ENSG00000201403"
[36] "ENSG00000254978" "ENSG00000214534" "ENSG00000134612" "ENSG00000210825" "ENSG00000264043" "ENSG00000253051" "ENSG00000197775"
[43] "ENSG00000255994" "ENSG00000212371" "ENSG00000185182" "ENSG00000186399" "ENSG00000254374" "ENSG00000188388" "ENSG00000259205"
[50] "ENSG00000185596" "ENSG00000257391" "ENSG00000244257" "ENSG00000214940" "ENSG00000169246" "ENSG00000196993" "ENSG00000271699"
[57] "ENSG00000200084" "ENSG00000238917" "ENSG00000154898" "ENSG00000264943" "ENSG00000252699" "ENSG00000199293" "ENSG00000267322"
[64] "ENSG00000252139" "ENSG00000214223" "ENSG00000200530" "ENSG00000270408" "ENSG00000233838" "ENSG00000238390" "ENSG00000264346"
[71] "ENSG00000217261" "ENSG00000215105"
> ExprSetName2 <- ExprSetName[!featureNames(ExprSetName) %in% ensmissing, ]
> ExprSetName2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 20759 features, 187 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GTEX.N7MS.0007.SM.2D7W1 GTEX.N7MT.0007.SM.3GACQ ... GTEX.Y8E5.0006.SM.47JWQ (187 total)
varLabels: SUBJID GENDER ... ScaleFactor (9 total)
varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: org.Hs.eg.db

The number of features in ExprSetName2 is correct. Originally I have 20831. After getting rid of 72 NAs, I got 20759. So I continued:
> egft <- egft[!egft$ENSEMBL %in% ensmissing, ]
> head(egft)
          ENSEMBL ENTREZID
1 ENSG00000227232    653635
2 ENSG00000233750    402483
3 ENSG00000237094 101059936
4 ENSG00000237094 101060494
5 ENSG00000228327 100288069
6 ENSG00000237491 100287934

> dim(egft)
[1] 21742 2

> egftxens <- split(egft$ENTREZID, egft$ENSEMBL)
> ens2eg <- sapply(egftxens, "[", 1)
> head(ens2eg)
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938
"7105" "64102" "8813" "57147" "55732" "2268"
> featureNames(ExprSetName2) <- ens2eg[featureNames(ExprSetName2)]
Error in `row.names<-.data.frame`(`*tmp*`, value = c("653635", "402483", :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘100286922’, ‘100287497’, ‘100288332’, ‘100463486’, ‘10...

For gene sets, I just directly used the c2BroadSets in the GSVAdata without any modification.

Thank you for your help again :)

Bao-Li

score 0 · Answer 3 · 2014-10-30

Hi,

something I did not think about (because the data I tried out did not show that problem) is that in this mapping nighmare one may also have multiple Ensembl gene identifiers mapping to a common Entrez gene identifier. This means that among those Ensembl gene identifiers with a common Entrez gene identifier, you need to discard all of them but one, i.e., discard Entrez duplicates. Again, different options are available, one of them being arbitrarily selecting the one to keep. However, in this case I would suggest to do something I believe is more sensible as keeping the gene with largest variability, measured by the interquartile range (IQR), for instance. This could be done as follows:

iqrs <- esApply(ExprSetName, 1, IQR)
ensxeg <- split(names(ens2eg, ens2eg)
ensxeg <- sapply(ensxeg, function(x, iqrs) x[which.max(iqrs[x])], iqrs)
ExprSetName3 <- ExprSetName2[ensxeg, ]
featureNames(ExprSetName3) <- ens2eg[featureNames(ens2eg)]

I cannot try this code, so hopefully it works.

cheers,

robert.

score 0 · Answer 4 · 2014-10-30

Hi Robert:

Sorry it's me again. Why is the following code not working??

> ensxeg <- split(names(ens2eg, ens2eg))
Error in names(ens2eg, ens2eg) :
2 arguments passed to 'names' which requires 1
> head(ens2eg)
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938
"7105" "64102" "8813" "57147" "55732" "2268"

Many thanks, Bao-Li

score 0 · Answer 5 · 2014-10-30

0

Entering edit mode

Robert Castelo ★ 3.3k

@rcastelo

Last seen 3 days ago

Barcelona/Universitat Pompeu Fabra

oops, sorry, there's a typo, it should say

ensxeg <- split(names(ens2eg), ens2eg)

robert.

ADD COMMENT • link 9.5 years ago Robert Castelo ★ 3.3k