Search
Question: GeneList issue ClusterProfiler
0
gravatar for bhgyu
5 months ago by
bhgyu20
bhgyu20 wrote:

Hi guys, 

I've been back and forth on this with all sorts of variations, but error keeps coming up. Basically trying to create a geneLists for analysis for clusterprofiler. My numeric are Fisher's P.values from a meta analysis and the names are Gene Symbols though I did also try ProbeIDs. 

Any chance you know why I'm getting the data.frame error. Oh I check manually if there were differing rows, but didn't find any. 

Thanks!

> genelist <- blood_sorted_001[,11]
> names(geneList) = as.character(blood_sorted_001[,1])
> geneList = sort(geneList, decreasing = TRUE)
> kk2 <- gseKEGG(geneList = geneList,
+                organism = 'hsa',
+                nPerm = 1000,
+                minGSSize= 120,
+                pvalueCutoff = 0.05,
+                verbose = FALSE)
Error in data.frame(ID = as.character(tmp_res$pathway), Description = Description,  : 
  arguments imply differing number of rows: 0, 1
In addition: Warning messages:
1: In min(p) : no non-missing arguments to min; returning Inf
2: In max(p) : no non-missing arguments to max; returning -Inf
3: In min(p) : no non-missing arguments to min; returning Inf
4: In max(p) : no non-missing arguments to max; returning -Inf 

ADD COMMENTlink modified 5 months ago • written 5 months ago by bhgyu20
1
gravatar for Guangchuang Yu
5 months ago by
Hong Kong
Guangchuang Yu800 wrote:

Already updated in DOSE (v >= 3.3.1).

For example, the ID in names(geneList) is EntrezGene and if we specify using ncbi-proteinid, clusterProfiler will throw the error with some sample ID for your reference:


> require(clusterProfiler)
> data(geneList)
> x = gseKEGG(geneList, organism='hsa', keyType='ncbi-proteinid')
preparing geneSet collections...
--> Expected input gene ID: NP_001152759,NP_115512,NP_940684,NP_005909,NP_000373,NP_004276
Error in check_gene_id(geneList, geneSets) :
  --> No gene can be mapped....
ADD COMMENTlink written 5 months ago by Guangchuang Yu800
0
gravatar for Guangchuang Yu
5 months ago by
Hong Kong
Guangchuang Yu800 wrote:

For KEGG analysis of hsa, the default ID type is Entrez geneID. Although you can also use ncbi-proteinid and uniprot, gene symbol or probeID are not supported.

So all gene symbols cannot be mapped to pathway and thus the internally output tmp_res has 0 row.

The warning msg is due to the calculation of the qvalue and the error is also from this step as clusterProfiler capture the error and return NA as qvalue. Then when updating the data.frame, qvalue has length 1 (the NA) while others have length 0 and thus throw the error.

ADD COMMENTlink modified 5 months ago • written 5 months ago by Guangchuang Yu800

> qvalue(numeric(0), lambda = 0.05, pi0.method = "bootstrap")
Error in if (pi0 <= 0) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In min(p) : no non-missing arguments to min; returning Inf
2: In max(p) : no non-missing arguments to max; returning -Inf
3: In min(p) : no non-missing arguments to min; returning Inf
4: In max(p) : no non-missing arguments to max; returning -Inf

So the problem here is you are using the wrong ID type.

For me, I will update the source code to throw more useful/informative error msg.

 

ADD REPLYlink written 5 months ago by Guangchuang Yu800
0
gravatar for bhgyu
5 months ago by
bhgyu20
bhgyu20 wrote:

Hi there, thanks for the quick reply! 

I did manage to get it working, but just wondering what the best p.value cutoff for Fisher's p.values is since I'm running a meta analysis so I cannot use single data set p values or logFCs, but have to work with the p-values from the combined analysis?

I have a list of over 4,000 genes. Can the result really be 0?

blood_sorted_001$ENTREZID = bitr(blood_sorted_001$Symbol, fromType="SYMBOL", toType="ENSEMBL", OrgDb="org.Hs.eg.db")
geneList = as.numeric(blood_sorted_001$Fishers.P.Value)

names(geneList) = as.character(blood_sorted_001$ENTREZID)
geneList = sort(geneList, decreasing = TRUE)
kk2 <- gseKEGG(geneList     = geneList,
+                organism     = 'hsa',
+                nPerm        = 1000,
+                minGSSize    = 120,
+                pvalueCutoff = 0.05,
+                verbose      = FALSE)
no term enriched under specific pvalueCutoff..

 

ADD COMMENTlink written 5 months ago by bhgyu20

So, you just copy and past codes from vignette without reading the text, right?

I used minGSSize=120 in vignette and do mention that setting to such large value is for speeding up the compilation of the vignette.

ADD REPLYlink written 5 months ago by Guangchuang Yu800

Yes, always, to test things first. I only specify ones I'm happy it all works. 

But my problem isn't the minSize, it was the p.value cutoff.

ADD REPLYlink written 5 months ago by bhgyu20

Would it be possible to resurrect this thread as I am also having problems with the geneList issue.  I am having trouble with the geneList issue even after carrying out the troubleshooting steps above. 

I get the error:

Error in data.frame(ID = as.character(tmp_res$pathway), Description = Description, : arguments imply differing number of rows: 0, 1
5.
stop(gettextf("arguments imply differing number of rows: %s", paste(unique(nrows), collapse = ", ")), domain = NA)
4.
data.frame(ID = as.character(tmp_res$pathway), Description = Description, setSize = tmp_res$size, enrichmentScore = tmp_res$ES, NES = tmp_res$NES, pvalue = tmp_res$pval, p.adjust = p.adj, qvalues = qvalues, stringsAsFactors = FALSE)
3.
.GSEA(geneList = geneList, exponent = exponent, nPerm = nPerm, minGSSize = minGSSize, maxGSSize = maxGSSize, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, verbose = verbose, seed = seed, USER_DATA = USER_DATA)
2.
GSEA_internal(geneList = geneList, exponent = exponent, nPerm = nPerm, minGSSize = minGSSize, maxGSSize = maxGSSize, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, verbose = verbose, USER_DATA = KEGG_DATA, seed = seed, by = by)
1.
gseKEGG(geneList = geneListadvsDDS, organism = "mouse", nPerm = 1000, minGSSize = 100, pvalueCutoff = 0.01, verbose = FALSE)

 

My code for the genelist and enrichment is as follows:

 

geneListadvsDDS = idsadvsDDSliver[,2]
names(geneListadvsDDS)=as.character(idsadvsDDSliver[,1])
geneListadvsDDS=sort(geneListadvsDDS, decreasing = TRUE)


AdvsDDSLiverGSEA <- gseKEGG(geneList =geneListadvsDDS,
               organism     = 'mouse',
              nPerm        = 1000,
               minGSSize    = 120,
               pvalueCutoff = 0.05,
               verbose      = FALSE)

 

I get the above error. When I set the minGSize down to 10 I no longer recieve the error however, I get a message saying that no terms are enriched under specific cutoff value. This is abit confusing as what I am trying to measure is enrichment of terms between adult and damage liver. 

 

When I carry out the same analysis for normal liver versus neontal liver the gseKEGG function works without a problem and I get enrichment for processes like cell cycle etc. I have over 1000 genes in each condition group and so I am confused as to why the example above is not showing any enrichment. 

 

The genelist is made from a toptable output with the first two columns being ID (ENTREZ ID) and second logFC.

 

Any help would be much appreciated! will be happy to provide further details if required. many thanks!

 

ADD REPLYlink written 2 days ago by A0

I don't know whether it is related to your problem, but please realize you specified organism='mouse' whereas the 3 letter abbreviation from KEGG is required... Thus organism='mmu'.

See ?gseKEGG

Also: could you show the output of head(geneListadvsDDS) and str(geneListadvsDDS)?

 

 

ADD REPLYlink modified 2 days ago • written 2 days ago by Guido Hooiveld2.1k

Many thanks! and great point! I had not realised, however, adding mmu results in the same output so I wonder if they are interchangeable and accepted the gseKEGG function ?

Output is as follows: head(geneListadvsDDS), 

15496     28248 100039008     28248     57266     12686 
 8.425410  8.354373  7.708415  7.176694  6.950060  6.772626

str(geneListadvsDDS) 

Named num [1:535] 8.43 8.35 7.71 7.18 6.95 ...
 - attr(*, "names")= chr [1:535] "15496" "28248" "100039008" "28248" ... 

(I was mistaken, there 535 differentially expressed genes!)

Also attaching the outputs for the adult versus neonatal liver, an output that works and returns enrichment for cell cycle processes which we expect. 

head(geneListadvsneoliver) 

100039008 /// 17840 /// 17841 /// 17842                                   17842                                   13095 
                              10.191673                                9.372920                                8.585623 
          100039008 /// 17840 /// 17841                                  321018                                   18405 
                               8.474880                                8.470735                                7.845729 

str(geneListadvsneoliver)

Named num [1:1761] 10.19 9.37 8.59 8.47 8.47 ...
 - attr(*, "names")= chr [1:1761] "100039008 /// 17840 /// 17841 /// 17842" "17842" "13095" "100039008 /// 17840 /// 17841" ...

One last thing, the minGSSize    = 120 confuses me slightly. For neonatal v adult liver which gives me enrichment scores, setting the size to say 10, produces 1 enrichment group (cell cycle). Increasing to 30 now gives me 4 enrichments groups,  cell cycle, IGF signalling etc. With a Gssize of 120, shouldn't it capture ALL of the groups that are present in genes up to 120 genes?

For example, adding 120 genes returns the error, no non-missing arguments to min; returning Infno non-missing arguments to max; returning -Infno non-missing arguments to min; returning Infno non-missing arguments to max; returning -InfError in data.frame(ID = as.character(tmp_res$pathway), Description = Description,  : 
  arguments imply differing number of rows: 0, 1

But then specifying 30 as the GSsize provides an enrichment result! I hope this is clear? I am so confused as I am not sure if the genelist is ok or not, and if I am actually getting back truly all the known KEGG pathways in which my genes are enriched! Many thanks!

ADD REPLYlink written 2 days ago by A0

When I use your very small gene list (test) gseKEGG is working fine; no unexpected errors are reported, except that no enriched gene sets are found when using minGSSize=5 (which of course is to be expected with such very small input list.

The type/structure of your gene list (Named num) is the same as one of my lists, so that should be fine as well.

My suggestions: lower minGSSize to e.g. 5 or 10 (or even 1), and increase pvalueCutoff  to 0.9 or 1. Although the min gene set size and p-value cutoff make no sense now, this will / should allow gene sets to be returned, confirming the analysis as such is working, You can then fine-tune these settings to your need. I always set nPerm to 10,000. Please note that using your very small input list with minGSSize=1 indeed gives (nonsense) results.

Also be sure to use the latest version of clusterProfiler/DOSE.

Finally, check whether gseKEGG runs with the sample dataset. If so, the the problem is specific to your input list.

minGSSize defines the minimum size of the KEGG gene set to be included.Gene sets smaller than this are excluded from the analysis. If you set it to low, the possibility of obtaining inflated scores is high. Remember: "...the GSEA normalization is not very accurate for extremely small or extremely large gene sets. For example, for gene sets with fewer than 10 genes, just 2 or 3 genes can generate significant results. Therefore, by default, GSEA ignores gene sets that contain fewer than 25 genes or more than 500 genes; defaults that are appropriate for datasets with 10,000 to 20,000 features. However, keep in mind the possibility of inflated scorings for very small gene sets and inaccurate normalization for large ones." Source. I usually set minGSSize to 15, and maxGSSize to 500.

Using your very limited list (which basically shows data format/input of your list is OK).

test <- c(8.43, 8.35, 7.71, 7.18)
names(test) <- c("15496", "28248", "100039008", "28248")

> AdvsDDSLiverGSEA <- gseKEGG(geneList =test,
+ organism     = 'mmu',
+ nPerm        = 1000,
+ minGSSize    = 5,
+ pvalueCutoff = 1,
+ verbose      = FALSE)
no term enriched under specific pvalueCutoff...
>

Now setting minGSSize to 1.
> AdvsDDSLiverGSEA <- gseKEGG(geneList =test,
+ organism     = 'mmu',
+ nPerm        = 10000,
+ minGSSize    = 1,
+ pvalueCutoff = 1,
+ verbose      = FALSE)
> AdvsDDSLiverGSEA
#
# Gene Set Enrichment Analysis
#...@geneList    Named num [1:4] 8.43 8.35 7.71 7.18
 - attr(*, "names")= chr [1:4] "15496" "28248" "100039008" "28248"
#...nPerm        10000
#...pvalues adjusted by 'BH' with cutoff <1
#...5 enriched terms found
'data.frame':   5 obs. of  11 variables:
 $ ID             : chr  "mmu00140" "mmu01100" "mmu04913" "mmu04925" ...
 $ Description    : chr  "Steroid hormone biosynthesis" "Metabolic pathways" "Ovarian steroidogenesis" "Aldosterone synthesis and secretion" ...
 $ setSize        : int  1 1 1 1 1
 $ enrichmentScore: num  1 1 1 1 0.667
 $ NES            : num  1.197 1.197 1.197 1.197 0.798
 $ pvalue         : num  0.506 0.506 0.506 0.506 0.506
 $ p.adjust       : num  0.506 0.506 0.506 0.506 0.506
 $ qvalues        : num  0.506 0.506 0.506 0.506 0.506
 $ rank           : int  4 4 4 4 4
 $ leading_edge   : chr  "tags=100%, list=100%, 
<<snip>>
>

 

 

gseKEGG using sample data set:

> data(geneList)
> test.run <- gseKEGG(geneList = geneList,
+ organism     = 'hsa',
+ nPerm        = 10000,
+ minGSSize    = 15,
+ pvalueCutoff = 0.05,
+ verbose      = FALSE)
> test.run
#
# Gene Set Enrichment Analysis
#
#...@organism    hsa
#...@setType     KEGG
#...@geneList    Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
 - attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm        10000
#...pvalues adjusted by 'BH' with cutoff <0.05
#...45 enriched terms found
'data.frame':   45 obs. of  11 variables:
 $ ID             : chr  "hsa04510" "hsa03030" "hsa03050" "hsa03008" ...
 $ Description    : chr  "Focal adhesion" "DNA replication" "Proteasome" 
<<snip>>
..

 

ADD REPLYlink modified 2 days ago • written 2 days ago by Guido Hooiveld2.1k

One more thing: I noticed that your input list is relatively small (~550 genes). I think GSE(A) analysis on such small list is not useful, but that you rather should perform an over-representation analysis (ORA) with e.g. the function enrichKEGG (when wanting to use info from the KEGG database).

 

ADD REPLYlink written 2 days ago by Guido Hooiveld2.1k

Thank you so so much! I really appreciate your help! it is indeed nice to know that the gentlest itself is ok. The main reason I am wanting to do this analysis is to look for directionality in the data and plot it, which is why I am using gseKEGG, so I can use the sseaplot functions. I will play around with the parameters to see how these help and add to the data. 

I will play around with the P.values too although, I would like to report results with statistical significance. I have done the enrichKEGG too and the results are promising although, again, I can not test directionality of the data, and I would like to make the ranked list and running score plots to easily visualise this!

ADD REPLYlink written 2 days ago by A0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 127 users visited in the last hour