Search
Question: GeneList issue ClusterProfiler
0
16 months ago by
bhgyu30
bhgyu30 wrote:

Hi guys,

I've been back and forth on this with all sorts of variations, but error keeps coming up. Basically trying to create a geneLists for analysis for clusterprofiler. My numeric are Fisher's P.values from a meta analysis and the names are Gene Symbols though I did also try ProbeIDs.

Any chance you know why I'm getting the data.frame error. Oh I check manually if there were differing rows, but didn't find any.

Thanks!

> genelist <- blood_sorted_001[,11]
> names(geneList) = as.character(blood_sorted_001[,1])
> geneList = sort(geneList, decreasing = TRUE)
> kk2 <- gseKEGG(geneList = geneList,
+                organism = 'hsa',
+                nPerm = 1000,
+                minGSSize= 120,
+                pvalueCutoff = 0.05,
+                verbose = FALSE)
Error in data.frame(ID = as.character(tmp_res$pathway), Description = Description, : arguments imply differing number of rows: 0, 1 In addition: Warning messages: 1: In min(p) : no non-missing arguments to min; returning Inf 2: In max(p) : no non-missing arguments to max; returning -Inf 3: In min(p) : no non-missing arguments to min; returning Inf 4: In max(p) : no non-missing arguments to max; returning -Inf ADD COMMENTlink modified 16 months ago • written 16 months ago by bhgyu30 1 16 months ago by Guangchuang Yu1.0k China/Guangzhou/Southern Medical University Guangchuang Yu1.0k wrote: Already updated in DOSE (v >= 3.3.1). For example, the ID in names(geneList) is EntrezGene and if we specify using ncbi-proteinid, clusterProfiler will throw the error with some sample ID for your reference:  > require(clusterProfiler) > data(geneList) > x = gseKEGG(geneList, organism='hsa', keyType='ncbi-proteinid') preparing geneSet collections... --> Expected input gene ID: NP_001152759,NP_115512,NP_940684,NP_005909,NP_000373,NP_004276 Error in check_gene_id(geneList, geneSets) : --> No gene can be mapped....  ADD COMMENTlink written 16 months ago by Guangchuang Yu1.0k 0 16 months ago by Guangchuang Yu1.0k China/Guangzhou/Southern Medical University Guangchuang Yu1.0k wrote: For KEGG analysis of hsa, the default ID type is Entrez geneID. Although you can also use ncbi-proteinid and uniprot, gene symbol or probeID are not supported. So all gene symbols cannot be mapped to pathway and thus the internally output tmp_res has 0 row. The warning msg is due to the calculation of the qvalue and the error is also from this step as clusterProfiler capture the error and return NA as qvalue. Then when updating the data.frame, qvalue has length 1 (the NA) while others have length 0 and thus throw the error. ADD COMMENTlink modified 16 months ago • written 16 months ago by Guangchuang Yu1.0k  > qvalue(numeric(0), lambda = 0.05, pi0.method = "bootstrap") Error in if (pi0 <= 0) { : missing value where TRUE/FALSE needed In addition: Warning messages: 1: In min(p) : no non-missing arguments to min; returning Inf 2: In max(p) : no non-missing arguments to max; returning -Inf 3: In min(p) : no non-missing arguments to min; returning Inf 4: In max(p) : no non-missing arguments to max; returning -Inf  So the problem here is you are using the wrong ID type. For me, I will update the source code to throw more useful/informative error msg. ADD REPLYlink written 16 months ago by Guangchuang Yu1.0k 0 16 months ago by bhgyu30 bhgyu30 wrote: Hi there, thanks for the quick reply! I did manage to get it working, but just wondering what the best p.value cutoff for Fisher's p.values is since I'm running a meta analysis so I cannot use single data set p values or logFCs, but have to work with the p-values from the combined analysis? I have a list of over 4,000 genes. Can the result really be 0? blood_sorted_001$ENTREZID = bitr(blood_sorted_001$Symbol, fromType="SYMBOL", toType="ENSEMBL", OrgDb="org.Hs.eg.db") geneList = as.numeric(blood_sorted_001$Fishers.P.Value)

names(geneList) = as.character(blood_sorted_001$ENTREZID) geneList = sort(geneList, decreasing = TRUE) kk2 <- gseKEGG(geneList = geneList, + organism = 'hsa', + nPerm = 1000, + minGSSize = 120, + pvalueCutoff = 0.05, + verbose = FALSE) no term enriched under specific pvalueCutoff.. ADD COMMENTlink written 16 months ago by bhgyu30 So, you just copy and past codes from vignette without reading the text, right? I used minGSSize=120 in vignette and do mention that setting to such large value is for speeding up the compilation of the vignette. ADD REPLYlink written 16 months ago by Guangchuang Yu1.0k Yes, always, to test things first. I only specify ones I'm happy it all works. But my problem isn't the minSize, it was the p.value cutoff. ADD REPLYlink written 16 months ago by bhgyu30 Would it be possible to resurrect this thread as I am also having problems with the geneList issue. I am having trouble with the geneList issue even after carrying out the troubleshooting steps above. I get the error: Error in data.frame(ID = as.character(tmp_res$pathway), Description = Description, : arguments imply differing number of rows: 0, 1
5.
stop(gettextf("arguments imply differing number of rows: %s", paste(unique(nrows), collapse = ", ")), domain = NA)
4.
data.frame(ID = as.character(tmp_res$pathway), Description = Description, setSize = tmp_res$size, enrichmentScore = tmp_res$ES, NES = tmp_res$NES, pvalue = tmp_res$pval, p.adjust = p.adj, qvalues = qvalues, stringsAsFactors = FALSE) 3. .GSEA(geneList = geneList, exponent = exponent, nPerm = nPerm, minGSSize = minGSSize, maxGSSize = maxGSSize, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, verbose = verbose, seed = seed, USER_DATA = USER_DATA) 2. GSEA_internal(geneList = geneList, exponent = exponent, nPerm = nPerm, minGSSize = minGSSize, maxGSSize = maxGSSize, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, verbose = verbose, USER_DATA = KEGG_DATA, seed = seed, by = by) 1. gseKEGG(geneList = geneListadvsDDS, organism = "mouse", nPerm = 1000, minGSSize = 100, pvalueCutoff = 0.01, verbose = FALSE) My code for the genelist and enrichment is as follows: geneListadvsDDS = idsadvsDDSliver[,2] names(geneListadvsDDS)=as.character(idsadvsDDSliver[,1]) geneListadvsDDS=sort(geneListadvsDDS, decreasing = TRUE) AdvsDDSLiverGSEA <- gseKEGG(geneList =geneListadvsDDS, organism = 'mouse', nPerm = 1000, minGSSize = 120, pvalueCutoff = 0.05, verbose = FALSE) I get the above error. When I set the minGSize down to 10 I no longer recieve the error however, I get a message saying that no terms are enriched under specific cutoff value. This is abit confusing as what I am trying to measure is enrichment of terms between adult and damage liver. When I carry out the same analysis for normal liver versus neontal liver the gseKEGG function works without a problem and I get enrichment for processes like cell cycle etc. I have over 1000 genes in each condition group and so I am confused as to why the example above is not showing any enrichment. The genelist is made from a toptable output with the first two columns being ID (ENTREZ ID) and second logFC. Any help would be much appreciated! will be happy to provide further details if required. many thanks! ADD REPLYlink written 11 months ago by A0 I don't know whether it is related to your problem, but please realize you specified organism='mouse' whereas the 3 letter abbreviation from KEGG is required... Thus organism='mmu'. See ?gseKEGG Also: could you show the output of head(geneListadvsDDS) and str(geneListadvsDDS)? ADD REPLYlink modified 11 months ago • written 11 months ago by Guido Hooiveld2.3k Many thanks! and great point! I had not realised, however, adding mmu results in the same output so I wonder if they are interchangeable and accepted the gseKEGG function ? Output is as follows: head(geneListadvsDDS), 15496 28248 100039008 28248 57266 12686 8.425410 8.354373 7.708415 7.176694 6.950060 6.772626 str(geneListadvsDDS) Named num [1:535] 8.43 8.35 7.71 7.18 6.95 ... - attr(*, "names")= chr [1:535] "15496" "28248" "100039008" "28248" ... (I was mistaken, there 535 differentially expressed genes!) Also attaching the outputs for the adult versus neonatal liver, an output that works and returns enrichment for cell cycle processes which we expect. head(geneListadvsneoliver) 100039008 /// 17840 /// 17841 /// 17842 17842 13095 10.191673 9.372920 8.585623 100039008 /// 17840 /// 17841 321018 18405 8.474880 8.470735 7.845729 str(geneListadvsneoliver) Named num [1:1761] 10.19 9.37 8.59 8.47 8.47 ... - attr(*, "names")= chr [1:1761] "100039008 /// 17840 /// 17841 /// 17842" "17842" "13095" "100039008 /// 17840 /// 17841" ... One last thing, the minGSSize = 120 confuses me slightly. For neonatal v adult liver which gives me enrichment scores, setting the size to say 10, produces 1 enrichment group (cell cycle). Increasing to 30 now gives me 4 enrichments groups, cell cycle, IGF signalling etc. With a Gssize of 120, shouldn't it capture ALL of the groups that are present in genes up to 120 genes? For example, adding 120 genes returns the error, no non-missing arguments to min; returning Infno non-missing arguments to max; returning -Infno non-missing arguments to min; returning Infno non-missing arguments to max; returning -InfError in data.frame(ID = as.character(tmp_res$pathway), Description = Description,  :
arguments imply differing number of rows: 0, 1

But then specifying 30 as the GSsize provides an enrichment result! I hope this is clear? I am so confused as I am not sure if the genelist is ok or not, and if I am actually getting back truly all the known KEGG pathways in which my genes are enriched! Many thanks!

When I use your very small gene list (test) gseKEGG is working fine; no unexpected errors are reported, except that no enriched gene sets are found when using minGSSize=5 (which of course is to be expected with such very small input list.

The type/structure of your gene list (Named num) is the same as one of my lists, so that should be fine as well.

My suggestions: lower minGSSize to e.g. 5 or 10 (or even 1), and increase pvalueCutoff  to 0.9 or 1. Although the min gene set size and p-value cutoff make no sense now, this will / should allow gene sets to be returned, confirming the analysis as such is working, You can then fine-tune these settings to your need. I always set nPerm to 10,000. Please note that using your very small input list with minGSSize=1 indeed gives (nonsense) results.

Also be sure to use the latest version of clusterProfiler/DOSE.

Finally, check whether gseKEGG runs with the sample dataset. If so, the the problem is specific to your input list.

minGSSize defines the minimum size of the KEGG gene set to be included.Gene sets smaller than this are excluded from the analysis. If you set it to low, the possibility of obtaining inflated scores is high. Remember: "...the GSEA normalization is not very accurate for extremely small or extremely large gene sets. For example, for gene sets with fewer than 10 genes, just 2 or 3 genes can generate significant results. Therefore, by default, GSEA ignores gene sets that contain fewer than 25 genes or more than 500 genes; defaults that are appropriate for datasets with 10,000 to 20,000 features. However, keep in mind the possibility of inflated scorings for very small gene sets and inaccurate normalization for large ones." Source. I usually set minGSSize to 15, and maxGSSize to 500.

Using your very limited list (which basically shows data format/input of your list is OK).

test <- c(8.43, 8.35, 7.71, 7.18)
names(test) <- c("15496", "28248", "100039008", "28248")

+ organism     = 'mmu',
+ nPerm        = 1000,
+ minGSSize    = 5,
+ pvalueCutoff = 1,
+ verbose      = FALSE)
no term enriched under specific pvalueCutoff...
>

Now setting minGSSize to 1.
> AdvsDDSLiverGSEA <- gseKEGG(geneList =test,
+ organism     = 'mmu',
+ nPerm        = 10000,
+ minGSSize    = 1,
+ pvalueCutoff = 1,
+ verbose      = FALSE)
#
# Gene Set Enrichment Analysis
#...@geneList    Named num [1:4] 8.43 8.35 7.71 7.18
- attr(*, "names")= chr [1:4] "15496" "28248" "100039008" "28248"
#...nPerm        10000
#...pvalues adjusted by 'BH' with cutoff <1
#...5 enriched terms found
'data.frame':   5 obs. of  11 variables:
$ID : chr "mmu00140" "mmu01100" "mmu04913" "mmu04925" ...$ Description    : chr  "Steroid hormone biosynthesis" "Metabolic pathways" "Ovarian steroidogenesis" "Aldosterone synthesis and secretion" ...
$setSize : int 1 1 1 1 1$ enrichmentScore: num  1 1 1 1 0.667
$NES : num 1.197 1.197 1.197 1.197 0.798$ pvalue         : num  0.506 0.506 0.506 0.506 0.506
$p.adjust : num 0.506 0.506 0.506 0.506 0.506$ qvalues        : num  0.506 0.506 0.506 0.506 0.506
$rank : int 4 4 4 4 4$ leading_edge   : chr  "tags=100%, list=100%,
<<snip>>
>

gseKEGG using sample data set:

> data(geneList)
> test.run <- gseKEGG(geneList = geneList,
+ organism     = 'hsa',
+ nPerm        = 10000,
+ minGSSize    = 15,
+ pvalueCutoff = 0.05,
+ verbose      = FALSE)
> test.run
#
# Gene Set Enrichment Analysis
#
#...@organism    hsa
#...@setType     KEGG
#...@geneList    Named num [1:12495] 4.57 4.51 4.42 4.14 3.88 ...
- attr(*, "names")= chr [1:12495] "4312" "8318" "10874" "55143" ...
#...nPerm        10000
#...pvalues adjusted by 'BH' with cutoff <0.05
#...45 enriched terms found
'data.frame':   45 obs. of  11 variables:
$ID : chr "hsa04510" "hsa03030" "hsa03050" "hsa03008" ...$ Description    : chr  "Focal adhesion" "DNA replication" "Proteasome"
<<snip>>
..


One more thing: I noticed that your input list is relatively small (~550 genes). I think GSE(A) analysis on such small list is not useful, but that you rather should perform an over-representation analysis (ORA) with e.g. the function enrichKEGG (when wanting to use info from the KEGG database).

Thank you so so much! I really appreciate your help! it is indeed nice to know that the gentlest itself is ok. The main reason I am wanting to do this analysis is to look for directionality in the data and plot it, which is why I am using gseKEGG, so I can use the sseaplot functions. I will play around with the parameters to see how these help and add to the data.

I will play around with the P.values too although, I would like to report results with statistical significance. I have done the enrichKEGG too and the results are promising although, again, I can not test directionality of the data, and I would like to make the ranked list and running score plots to easily visualise this!