crlmm support for Illumina arrays with no annotation package?
2
1
Entering edit mode
deanpett ▴ 30
@deanpett-22164
Last seen 4.3 years ago

I've been experiencing some trouble attempting to use the function crlmmIllumina for preprocessing and genotyping of Illumina QC Array-24 chip using the CRLMM algorithm. According to the current crlmm documentation, if there is no bioconductor annotation package for your array, you should still be able to import the data with a properly formatted anno data.frame.

From the help page for genotype.Illumina(): "In general, a chip specific annotation package is required to use the genotype.Illumina function. If this is not available (newer chip types or custom chips often don't have a chip-specific package available on Bioconductor), consider using cdfName='nopackage' and specifying anno and genome, which runs 'krlmm' on the samples available. Here anno is a data.frame read in from the relevant chip-specific manifest, which must have additional columns 'isSnp' which is a logical that indicates whether a probe is polymorphic or not, 'position', 'chromosome' and 'featureNames' that give the location on the chromosome and SNP name."

I've prepared my anno data.frame:

> head(manifest)
  chromosome  position    featureNames isSnp                           IlmnID            Name IlmnStrand   SNP AddressA_ID AlleleA_ProbeSeq AddressB_ID
1          1 159174749 1:159174749-C-T  TRUE 1:159174749-C-T-0_B_F_2304232049 1:159174749-C-T        BOT [T/C]    65600245               NA           0
2          1 159174749 1:159175193-A-G  TRUE 1:159175193-A-G-0_B_R_2304232052 1:159175193-A-G        BOT [T/C]    13658935               NA           0
3          1 159174749 1:159175211-C-T  TRUE 1:159175211-C-T-0_T_R_2304232054 1:159175211-C-T        TOP [A/G]    14755267               NA           0
4          1 159174749 1:159175253-G-A  TRUE 1:159175253-G-A-0_T_F_2304232055 1:159175253-G-A        TOP [A/G]    78702422               NA           0
5          1 159174749 1:159175495-G-A  TRUE 1:159175495-G-A-0_T_F_2304232061 1:159175495-G-A        TOP [A/G]    73657552               NA           0
6          1 159174749  1:159175540-TC  TRUE  1:159175540-TC-0_T_R_2299219123  1:159175540-TC        TOP [A/G]    19715188               NA           0
  AlleleB_ProbeSeq Chr   MapInfo  Ploidy      Species CustomerStrand IlmnStrand_1 IllumicodeSeq TopGenomicSeq
1               NA   1 159204959 diploid Homo sapiens            BOT          BOT            NA            NA
2               NA   1 159205403 diploid Homo sapiens            TOP          BOT            NA            NA
3               NA   1 159205421 diploid Homo sapiens            BOT          TOP            NA            NA
4               NA   1 159205463 diploid Homo sapiens            TOP          TOP            NA            NA
5               NA   1 159205705 diploid Homo sapiens            TOP          TOP            NA            NA
6               NA   1 159205750 diploid Homo sapiens            BOT          TOP            NA            NA

and i make my call to genotype.Illumina() as follows:

crlmmResult = genotype.Illumina(path = "../path/to/idat_files",
                                arrayNames= NULL,
                                sep = "_",
                                highDensity = F,
                                fileExt=list(green="Grn.idat", red="Red.idat"),
                                cdfName= 'nopackage',
                                call.method = "krlmm",
                                anno = manifest,
                                genome = "hg19")

Despite the documentation suggesting that anno should be a: "data.frame containing SNP annotation information from manifest and additional columns 'isSnp', 'position', 'chromosome' and 'featureNames'. For use when cdfName='nopackage'' it still throws the following error:

Instantiate CNSet container.
Initializing container for genotyping and copy number estimation
Processing sample stratum 1 of 1
Error in colnames(anno@data) : 
  trying to get slot "data" from an object (class "data.frame") that is not an S4 object 
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0         stringr_1.4.0         dplyr_0.8.3           purrr_0.3.3           readr_1.3.1           tidyr_1.0.0           tibble_2.1.3         
 [8] ggplot2_3.2.1         tidyverse_1.2.1       crlmm_1.43.0          preprocessCore_1.47.1 oligoClasses_1.47.0   ff_2.2-14             bit_1.1-14           
[15] Biobase_2.45.1        BiocGenerics_0.31.6  

loaded via a namespace (and not attached):
 [1] nlme_3.1-139                bitops_1.0-6                matrixStats_0.55.0          lubridate_1.7.4             httr_1.4.1                 
 [6] GenomeInfoDb_1.21.2         tools_3.6.0                 backports_1.1.5             utf8_1.1.4                  R6_2.4.0                   
[11] affyio_1.55.0               DBI_1.0.0                   lazyeval_0.2.2              colorspace_1.4-1            withr_2.1.2                
[16] tidyselect_0.2.5            base64_2.0                  compiler_3.6.0              cli_1.1.0                   rvest_0.3.4                
[21] xml2_1.2.2                  DelayedArray_0.11.8         scales_1.0.0                mvtnorm_1.0-11              askpass_1.1                
[26] illuminaio_0.27.1           XVector_0.25.0              pkgconfig_2.0.3             limma_3.41.18               rlang_0.4.0                
[31] readxl_1.3.1                rstudioapi_0.10             VGAM_1.1-1                  generics_0.0.2              jsonlite_1.6               
[36] BiocParallel_1.19.4         RCurl_1.95-4.12             magrittr_1.5                GenomeInfoDbData_1.2.1      Matrix_1.2-17              
[41] Rcpp_1.0.2                  munsell_0.5.0               S4Vectors_0.23.25           fansi_0.4.0                 lifecycle_0.1.0            
[46] stringi_1.4.3               SummarizedExperiment_1.15.9 zlibbioc_1.31.0             grid_3.6.0                  crayon_1.3.4               
[51] lattice_0.20-38             Biostrings_2.53.2           haven_2.1.1                 splines_3.6.0               hms_0.5.1                  
[56] zeallot_0.1.0               knitr_1.25                  beanplot_1.2                pillar_1.4.2                GenomicRanges_1.37.17      
[61] codetools_0.2-16            stats4_3.6.0                glue_1.3.1                  BiocManager_1.30.8          modelr_0.1.5               
[66] vctrs_0.2.0                 foreach_1.4.7               cellranger_1.1.0            gtable_0.3.0                openssl_1.4.1              
[71] assertthat_0.2.1            xfun_0.10                   broom_0.5.2                 RcppEigen_0.3.3.5.0         iterators_1.0.12           
[76] IRanges_2.19.17             ellipse_0.4.1      

This all suggests to me that anno cannot actually be just a data.frame and genotype.Illumina() in fact does in fact require a S4 annotation object created with a package for custom and/or currently unsupported arrays rather than accepting a data.frame as the annotation suggests.

I'd love some help to get down to the bottom of this, as I REALLY want to avoid using the super clunky GenomeStudio software so I can fully automate my genotyping process.

Thanks in advance, Dean

crlmm QCarray annotation • 2.5k views
ADD COMMENT
0
Entering edit mode

Hello! I'm trying to use crlmm in the same way and the algorithm throws me the same error.

Did you find any solution?

Thanks, Valeria

ADD REPLY
1
Entering edit mode

unfortunately, I haven't found any solutions yet. At this point, i think the only option is to write your own annotation package, which i have not yet found time to do. I wish I had a better answer, but it seems like crlmm is not actively supported right now.

best of luck, Dean

ADD REPLY
0
Entering edit mode

unfortunately, I haven't found any solutions yet. At this point, i think the only option is to write your own annotation package, which i have not yet found time to do. I wish I had a better answer, but it seems like crlmm is not actively supported right now.

best of luck, Dean

ADD REPLY
1
Entering edit mode
@james-w-macdonald-5106
Last seen 7 hours ago
United States

It looks to me like there is a small bug in RGtoXY, which is an internal function that does

    if (chipType == "nopackage" && !is.null(anno)) {
        pkgname = NULL
        if(sum(colnames(anno@data)=="AddressA_ID")!=1){

           message("annotation format is wrong, could not find a column with name: AddressA_ID")
         }else{
         anno2=anno@data

where it appears to expect an S4 object with a data slot. I'll do a PR and send it to the maintainers. In the meantime you may be able to hackify your way around this by doing something sweet like

> setClass("fakeo", representation(data = "data.frame"))
## normally you would then do
> anno <- new("fakeo", data = anno)
## and feed that to genotype.Illumina
## but I don't have your anno data.frame, so here's some extra fakeness
## to show that internally RGtoXY will get the data.frame like you want
> anno <- new("fakeo", data = data.frame(huh = letters, HUH = LETTERS))
> anno@data
   huh HUH
1    a   A
2    b   B
3    c   C
4    d   D
5    e   E
6    f   F
7    g   G
8    h   H
9    i   I
10   j   J
11   k   K
12   l   L
13   m   M
14   n   N
15   o   O
16   p   P
17   q   Q
18   r   R
19   s   S
20   t   T
21   u   U
22   v   V
23   w   W
24   x   X
25   y   Y
26   z   Z
ADD COMMENT
1
Entering edit mode

Actually, if either of you would do me a favor? Download the updated version of crlmm, and then install by doing

install.packages("crlmm_1.45.1.tar.gz", repos = NULL)

And then run genotype.Illumina as normal and let me know if it works.

ADD REPLY
1
Entering edit mode

testing it out right now...

ADD REPLY
0
Entering edit mode

Now works, but I've other warning message:

> genotype.Illumina(sampleSheet=NULL, arrayNames=NULL, arrayInfoColNames=NULL, path = datadir, call.method = "krlmm", cdfName="nopackage", anno = mod1, genome = "hg19", batch=NULL, saveDate = TRUE)

Instantiate CNSet container.
Initializing container for genotyping and copy number estimation
reading ~/bin/IDAT/Illumina//203175860223_R01C01_Grn.idat 
reading ~/bin/IDAT/Illumina//203175860223_R02C01_Grn.idat 
reading ~/bin/IDAT/Illumina//203175860223_R03C01_Grn.idat 
reading ~/bin/IDAT/Illumina//203175860223_R05C01_Grn.idat 
reading ~/bin/IDAT/Illumina//203175860223_R07C01_Grn.idat 
reading ~/bin/IDAT/Illumina//203175860223_R09C01_Grn.idat 
reading ~/bin/IDAT/Illumina//203175860223_R11C01_Grn.idat 
reading ~/bin/IDAT/Illumina//203175860233_R12C02_Grn.idat 
Processing sample stratum 1 of 1
Finished preprocessing.
Begin genotyping...
 -- Processing segment 1 out of 3
 -- Processing segment 2 out of 3
 -- Processing segment 3 out of 3
Leaving out non-variant SNPs
Start calculating 3-cluster parameters
Error in makePSOCKcluster(names = spec, ...) : 
  numeric 'names' must be >= 1
In addition: Warning message:
In data.frame(isSnp = isSnp, position = position, chromosome = chromosome,  :
  NAs introduced by coercion
ADD REPLY
1
Entering edit mode

What do you get when you do (after loading crlmm):

crlmm:::getNumberOfCores()

ADD REPLY
1
Entering edit mode

You appear to be getting jammed up at this step:

calculateParameters <- function(M, priormeans, numSNP, verbose) {
    if (verbose) message("Start calculating 3-cluster parameters")
    params3cluster <- calculateParameters3Cluster(M, priormeans, numSNP, verbose);
    if (verbose) message("Done calculating 3-cluster parameters")

where in calculateParameters3Cluster there is this step

 cl <- makeCluster(getNumberOfCores())

Which ends up calling parallel::makeCluster which does

> makeCluster
function (spec, type = getClusterOption("type"), ...) 
{
    switch(type, PSOCK = makePSOCKcluster(names = spec, ...), 
        FORK = makeForkCluster(nnodes = spec, ...), SOCK = snow::makeSOCKcluster(names = spec, 
            ...), MPI = snow::makeMPIcluster(count = spec, ...), 
        NWS = snow::makeNWScluster(names = spec, ...), stop("unknown cluster type"))
}

Which you can see is where your error is coming from. The 'names' argument is returned from crlmm:::getNumberOfCores, which in your case seems to be returning something wrong, like maybe zero?

> makeCluster(0)
Error in makePSOCKcluster(names = spec, ...) : 
  numeric 'names' must be >= 1
ADD REPLY
0
Entering edit mode
This is what I see:

> parallel::makeCluster
    function (spec, type = getClusterOption("type"), ...) 
    {
        switch(type, PSOCK = makePSOCKcluster(names = spec, ...), 
            FORK = makeForkCluster(nnodes = spec, ...), SOCK = snow::makeSOCKcluster(names = spec, 
                ...), MPI = snow::makeMPIcluster(count = spec, ...), 
            NWS = snow::makeNWScluster(names = spec, ...), stop("unknown cluster type"))
    }
         <bytecode: 0x5560da0080a0>
         <environment: namespace:parallel>

> makeCluster
    function (spec, type = getClusterOption("type"), ...) 
    {
        switch(type, PSOCK = makePSOCKcluster(names = spec, ...), 
            FORK = makeForkCluster(nnodes = spec, ...), SOCK = snow::makeSOCKcluster(names = spec, 
                ...), MPI = snow::makeMPIcluster(count = spec, ...), 
            NWS = snow::makeNWScluster(names = spec, ...), stop("unknown cluster type"))
    }
    <bytecode: 0x5560da0080a0>
    <environment: namespace:parallel>

> crlmm:::getNumberOfCores
function () 
{
    defaultCores = min(detectCores(), 8)
    return(getOption("krlmm.cores", defaultCores))
}
    <bytecode: 0x5560ca283080>
    <environment: namespace:crlmm>

> makeCluster(0)
Error in makePSOCKcluster(names = spec, ...) : 
  numeric 'names' must be >= 1
ADD REPLY
0
Entering edit mode

How can I fix the problem?

ADD REPLY
2
Entering edit mode

Apologies for the delay in replying and thanks to Jim for helping debug the issue. I've asked the developer of this code to look into it further. Hopefully we can replicate the problem on some of our own data and resolve the issue shortly. Best wishes, Matt

ADD REPLY
0
Entering edit mode

Thank you for your help. Kindly, Valeria

ADD REPLY
1
Entering edit mode

So what you have done is just show that your version of crlmm has the same code as mine, which is expected. You also show that running makeCluster with a value of zero gives the same error that I already showed you. These are expected results!

What would be helpful is to do

debug(crlmm:::getNumberOfCores)

and then run your code again. When you drop into the debugger, do

defaultCores <- min(detectCores(), 8)
getOption("krlmm.cores", defaultCores)

And let Matt know what you get for the default cores, and what the getOption call returns.

ADD REPLY
0
Entering edit mode

This is the result and what the get0ption call returns

> debug(crlmm:::getNumberOfCores)
> genotype.Illumina(sampleSheet=NULL, arrayNames=NULL, arrayInfoColNames=NULL, path = datadir, call.method = "krlmm", cdfName="nopackage", anno = mod1, genome = "hg19", batch=NULL, saveDate = TRUE)
    Instantiate CNSet container.
    Initializing container for genotyping and copy number estimation
    reading ~/bin/IDAT/Illumina//203175860223_R01C01_Grn.idat 
    reading ~/bin/IDAT/Illumina//203175860223_R02C01_Grn.idat 
    reading ~/bin/IDAT/Illumina//203175860223_R03C01_Grn.idat 
    reading ~/bin/IDAT/Illumina//203175860223_R05C01_Grn.idat 
    reading ~/bin/IDAT/Illumina//203175860223_R07C01_Grn.idat 
    reading ~/bin/IDAT/Illumina//203175860223_R09C01_Grn.idat 
    reading ~/bin/IDAT/Illumina//203175860223_R11C01_Grn.idat 
    reading ~/bin/IDAT/Illumina//203175860233_R12C02_Grn.idat 
    Processing sample stratum 1 of 1
    Finished preprocessing.
    Begin genotyping...
     -- Processing segment 1 out of 3
     -- Processing segment 2 out of 3
     -- Processing segment 3 out of 3
    Leaving out non-variant SNPs
    debugging in: getNumberOfCores()
    debug: {
        defaultCores = min(detectCores(), 8)
        return(getOption("krlmm.cores", defaultCores))
    }
    Browse[2]> defaultCores <- min(detectCores(), 8)
    Warning message:
    In data.frame(isSnp = isSnp, position = position, chromosome = chromosome,  :
      NAs introduced by coercion
    Browse[2]> getOption("krlmm.cores", defaultCores)
    [1] NA`enter code here`
ADD REPLY
0
Entering edit mode

I asked you for two things! The value for defaultCores and the output from the call to getOption. I don't see how you get that returned value from getOption, so you need to show the value for defaultCores.

ADD REPLY
0
Entering edit mode

Oh sorry, I misunderstood. Here's the output you requested (I hope):

>debug(crlmm:::getNumberOfCores())
debugging in: crlmm:::getNumberOfCores()
debug: {
    defaultCores = min(detectCores(), 8)
    return(getOption("krlmm.cores", defaultCores))
}
Browse[2]> defaultCores = min(detectCores(), 8)
Browse[2]> defaultCores
[1] 2
Browse[2]> return(getOption("krlmm.cores", defaultCores))
debug: defaultCores = min(detectCores(), 8)
Browse[2]> getOption("krlmm.cores", defaultCores)
[1] 2

Thank you so much for your help!

ADD REPLY
1
Entering edit mode

If the returned value from getOption("krlmm.cores", defaultCores) is 2, then the function should proceed as normal. In other words, the error you said you were getting:

Error in makePSOCKcluster(names = spec, ...) : 
  numeric 'names' must be >= 1

shouldn't occur, because 2 >= 1.

ADD REPLY
0
Entering edit mode

Apologies for the delay in replying and thank you so much for your help! I think that the problem is the format of manifest file. How it should be?

Thank you. Kindly, Valeria

ADD REPLY
0
Entering edit mode
> library(crlmm)
> crlmm:::getNumberOfCores()
[1] 2
ADD REPLY
0
Entering edit mode
@miloyanez20-22674
Last seen 4.2 years ago

Hello, thanks for your comments, have you managed to solve this? I would greatly appreciate it, since I have to genotype using this R library for Multi-EthnicGlobal arrays and had the same problem as Dean after leaving the manifest file in the same format. Then I updated to version 1.45.1 of crlmm, as discussed here, advances to the stage where it says "Finished preprocessing." and then it gives me the following error:

>   cnSet <- genotype.Illumina(sampleSheet=subset_samples,
+                              arrayNames=arrayNames,
+                              arrayInfoColNames=arrayInfo,
+                              sep = "_",
+                              cdfName="nopackage",
+                              batch=batch,
+                              call.method='krlmm',
+                              anno = Manifest_D1, 
+                              genome = "hg19",
+                              copynumber=FALSE)
Instantiate CNSet container.
path arg not set.  Assuming files are in local directory, or that complete path is provided
Initializing container for genotyping and copy number estimation
Processing sample stratum 1 of 1
'path' arg not set.  Assuming files are in local directory, or that complete path is provided in 'arrayNames'
Finished preprocessing.
Error in as.vmode.default(value, vmode) : 
  you can't coerce NULL to a different vmode
Además: There were 33 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In data.frame(isSnp = isSnp, position = position, chromosome = chromosome,  ... :
  NAs introducidos por coerción
2: In `[<-.ff_array`(`*tmp*`, snp.index, jj, value = c(`1` = 509L,  ... :
  number of elements to replace is not multiple of values for replacement
3: In `[<-.ff_array`(`*tmp*`, snp.index, jj, value = c(`1` = 509L,  ... :

There are 33 similar warnings I would greatly appreciate your help

Cristian

ADD COMMENT
0
Entering edit mode

Cristian,

I actually took a different approach outside of the bioconductor realm in order to genotype my array data.

I ended up analyzing my data with a different set of packages. I was mostly interested in crlmm as a means to avoid using the Illumina GenomeStudio GUI software. I recently discovered that Illumina provides IAAP, a command line utility that performs the GenCall algorithm that i needed to use. these results can then be parsed with the BeadArrayFiles python module. This has solved my problems so far. they also have software to convert IAAP binary output to vcf GTCtoVCF, but I haven't tested this personally.

Best of luck! -Dean

ADD REPLY
0
Entering edit mode

Please don't hijack posts like this. You have asked a question using the 'Add Answer' button, which doesn't make sense (you clicked on the button called 'Add Answer', which is surrounded by text that is intended to keep people from doing that if they aren't answering a question!).

In addition, your error is different from what the OP had. Rather than adding an answer (that is in fact a question) to an existing thread about something unrelated is sub-optimal. Instead, ask a new question. Also, after you get the error, do

traceback()

and include the output. You should also provide the output from head on your Manifest_D1 object.

ADD REPLY

Login before adding your answer.

Traffic: 652 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6