Question: Rsubread function buildindex cannot handle the GRCh38.p7 reference from NCBI
13 months ago, jan.soderman wrote:

Hi,

I have tried to use the buildindex() function of the Rsubread package in conjunction with the human GRCh38.p7 reference sequence from NCBI, but I get the following error:

Check the integrity of provided reference sequences ...                    ||
ERROR: repeated chromosome name 'gi' is observed in the FASTA file(s).
The index was NOT built.

The NCBI reference sequences have been downloaded using FileZilla from http://ftp-private.ncbi.nlm.nih.gov/genomes/H_sapiens. I do not get this error if I use the GRCh38.85 reference sequence from Ensembl. However, I would like to be able to use the GRCh38.p7 reference from NCBI instead, so that I can get Entrez Gene IDs and use the goana() function of the limma package.

I notice that the description lines of the FASTA files from NCBI are very different from those of the Ensembl file. Every description line of the NCBI FASTA file starts with gi, e.g.
>gi|568802037|ref|NT_187315.1| Homo sapiens chromosome 21 genomic scaffold, GRCh38.p7 Primary Assembly HSCHR21_CTG1034

whereas each description line of the Ensembl FASTA file has a unique start, e.g.
>Y dna:chromosome chromosome:GRCh38:Y:2781480:56887902:1 REF
>KI270728.1 dna:scaffold scaffold:GRCh38:KI270728.1:1:1872759:1 REF

I would be happy for any suggestions that solve my problem. I have found no solution by checking the relevant function / package help available in R, the Rsubread vignette, the Bioconductor support site, or by Googling the error.

Sincerely,
Jan

The output of sessionInfo() and traceback() is:
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.14.0    limma_3.28.20   Rsubread_1.22.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.6     digest_0.6.10   assertthat_0.1  jsonlite_1.0    magrittr_1.5    evaluate_0.9   
 [7] stringi_1.1.1   rmarkdown_1.1   tools_3.3.1     stringr_1.1.0   yaml_2.1.13     base64enc_0.1-3
[13] htmltools_0.3.5 knitr_1.14      tibble_1.2


> traceback()
No traceback available

13 months ago, James W. MacDonald (United States) wrote:

You will probably get better results using the assembled genome from NCBI, or the version that you can get from UCSC.
 


Or if you really want GRCh38.p7 instead of (the current) GRCh38.p9, you can get that here.

Reply written 13 months ago by James W. MacDonald

Many thanks for your help, James. Using your link to the latest version of the assembled human genome, I have now successfully indexed the sequence using the buildindex() function of the Rsubread package. Clearly I did not know my way around the NCBI website, but I have now discovered the assembly page for locating the latest assembly version.
Sincerely,
Jan

Reply written 12 months ago by jan.soderman

12 months ago, Wei Shi (Australia) wrote:

The buildindex() function uses separators including "|", " " and <TAB> to divide the name of a reference sequence into multiple substrings, and then uses the first substring to represent that reference sequence. As more than one of your reference sequences has a name starting with "gi|", you ended up with more than one reference sequence having the same name, which is not allowed by the buildindex() function.
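
To make the naming rule concrete, here is a minimal sketch in Python (not R, and not taken from the Rsubread source) that mimics it as described: split each description line on "|", space or TAB and keep the first substring. The function names index_name, repeated_names and accession_first are made up for this illustration, and the second description line in the example is a placeholder, not a real NCBI record. The accession_first rewrite is only one hypothetical workaround; the fix actually used in this thread was to download the assembled genome instead.

```python
import re
from collections import Counter

# Separators named in the answer: "|", space and TAB.
SEPARATORS = re.compile(r"[| \t]")


def index_name(header):
    """First substring of a FASTA description line, split on the separators."""
    return SEPARATORS.split(header.lstrip(">").strip(), maxsplit=1)[0]


def repeated_names(headers):
    """Names that more than one description line would map to."""
    counts = Counter(index_name(h) for h in headers)
    return sorted(name for name, n in counts.items() if n > 1)


def accession_first(header):
    """Hypothetical workaround: turn '>gi|N|ref|ACC| text' into '>ACC text'."""
    match = re.match(r">gi\|\d+\|\w+\|([^|]+)\|\s*(.*)", header)
    if match is None:
        return header  # not a gi-style line, leave it untouched
    return ">" + match.group(1) + " " + match.group(2)


ncbi_headers = [
    # Real description line quoted in the question.
    ">gi|568802037|ref|NT_187315.1| Homo sapiens chromosome 21 genomic scaffold, "
    "GRCh38.p7 Primary Assembly HSCHR21_CTG1034",
    # Made-up second line, only here so that there is something to collide with.
    ">gi|1|ref|EXAMPLE_0001.1| placeholder sequence",
]

print(repeated_names(ncbi_headers))                                # ['gi']
print(repeated_names([accession_first(h) for h in ncbi_headers]))  # []
print(accession_first(ncbi_headers[0]))
```

If description lines are rewritten along these lines, the sequence names in any annotation used downstream would presumably need to match the new names as well.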
