Question

Embedded nul \0 in string with Biostrings::readAAStringSet()

daniel.magnus.bader ▴ 40

Dear all,

I am creating a local organism-specific version of UniProt on our server. Using the Bioconductor Biostrings package, I ran into an "embedded nul in string" error with the readAAStringSet() function, but only on the TrEMBL download, not on the Swiss-Prot FASTA file.

Q1: Can you reproduce the error?

Q2: How can it be fixed?

Session info:

R 3.5.2
Biostrings 2.50.2

Below I provide the code to reproduce the error for a specific protein, followed by the error message. Note that the download and the index creation are time- and memory-consuming steps.

Best, Daniel

# Download uniprot trembl fasta sequences 
# to server with ~100GB memory
#
dl_link <- "ftp://ftp.uniprot.org//pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz"
# file size 65GB
file_db <- "uniprot_trembl.fasta.gz"

# Use a server for that step,
# takes up to 80GB RAM during generation
library(data.table)
library(Biostrings)
fai_trembl <- as.data.table(fasta.index(file_db))

# search for a specific human protein
# and retrieve "recno" index
fai_trembl[grepl("Q5HYB6_HUMAN", desc)]

# read the sequence of this protein from file 
# using the precomputed index 
readAAStringSet(fai_trembl[12538693, ])

# ERROR MESSAGE:
#
#  A AAStringSet instance of length 1
#    width seq                                                                                names               
#Error in .Call2("new_CHARACTER_from_XString", x, xs_dec_lkup(x), PACKAGE = "Biostrings") : 
#  embedded nul in string: 'PIAALGAKLNTWTYRWMAA\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0'

Biostrings • 1.7k views

ADD COMMENT • link 4.9 years ago • updated 4.5 years ago daniel.magnus.bader ▴ 40

Hi,

Such big files are insane!

Ok, so I have access to a server with a lot of memory (384GB) so that should be enough. However I started the download of the uniprot_trembl.fasta.gz file (using wget from the Unix command line) and I see an ETA of more than 5h! The download speed I get is only 2 MB/s which is very slow. We have fast internet here at my institution (e.g. I easily see download speeds of 30 MB/s or more when downloading from other places) so it seems that the bottleneck is on the uniprot FTP server side.

We cannot exclude that the file actually contains embedded \0 bytes. Maybe somehow they got introduced when the file was compressed, or the file got corrupted during your download. Do the Uniprot people provide md5sums somewhere for their files so we can check them? This is pretty standard practice for institutions that provide big files for download.

Right now the AA alphabet is not enforced for AAStringSet objects so the readAAStringSet() function will accept any byte value, even the \0 byte. \0 bytes in the file are treated like any other byte value so they would end up in the AAStringSet object. Here is how such an object can be created:

library(Biostrings)
nul <- as(as(raw(1), "XRaw"), "AAString")
aa <- AAStringSet(c(rep(nul, 3), AAString("PPPLK")))

Note that, strictly speaking, there is nothing wrong with the object itself in the sense that you can do most of the usual operations on it:

length(aa)
# [1] 1

width(aa)
# [1] 8

alphabetFrequency(aa)
#      A R N D C Q E G H I L K M F P S T W Y V U O B J Z X * - + . other
# [1,] 0 0 0 0 0 0 0 0 0 0 1 1 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0     3

as(matchPattern("PLK", aa[[1]]), "IRanges")
# IRanges object with 1 range and 0 metadata columns:
#           start       end     width
#       <integer> <integer> <integer>
#   [1]         6         8         3

countPattern(nul, aa[[1]])
# [1] 3

It's just that it cannot be displayed or coerced to an ordinary character vector (the show() method for these objects actually calls as.character() on the parts of the object that are displayed):

as.character(aa)
# Error in .Call2("new_CHARACTER_from_XStringSet", x, xs_dec_lkup(x), PACKAGE = "Biostrings") :
#   embedded nul in string: '\0\0\0PPPLK'

aa
#   A AAStringSet instance of length 1
#     width seq
# Error in XVector:::extract_character_from_XRaw_by_ranges(x, start, width,  : 
#   embedded nul in string: '\0\0\0PPPLK'

If we trim the first 3 bytes (the nul bytes), then the object can be displayed:

subseq(aa, start=4)
#   A AAStringSet instance of length 1
#     width seq
# [1]     5 PPPLK

FWIW you can replace the nulls with the letter of your choice with:

chartr(nul, "x", aa)
#   A AAStringSet instance of length 1
#     width seq
# [1]     8 xxxPPPLK

OK so that was only to show you that AAStringSet objects (like BStringSet objects) are actually allowed to contain embedded nuls.
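Building on that, here is a rough sketch of how records carrying such bytes could be flagged without ever printing them. It assumes that the "other" column of alphabetFrequency() counts every byte outside the AA alphabet, as the output above suggests. The second sequence ("MKV") is made up purely for illustration.

```r
library(Biostrings)

# Build an object like the one above and append a clean, made-up sequence.
nul <- as(as(raw(1), "XRaw"), "AAString")
aa  <- AAStringSet(c(rep(nul, 3), AAString("PPPLK")))
aa2 <- c(aa, AAStringSet("MKV"))

# Count bytes outside the AA alphabet per sequence; nul bytes land in "other".
n_other   <- alphabetFrequency(aa2)[, "other"]
has_other <- n_other > 0

# Indices of suspicious records, and the subset that is safe to display.
which(has_other)
clean <- aa2[!has_other]
as.character(clean)
```

On a real index table the same logical vector could be used to look up the affected header lines, which would tell us whether the problem is limited to a few records.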

Just to discard the possibility that these nul bytes are an artefact of the compression/decompression mechanism, do you think you can uncompress the file and try to reproduce the error on the uncompressed file? I know that decompressing such a big file is going to require a lot of resources and might take a long time but you seem to have access to some powerful hardware.
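If it helps, the file can also be scanned for \0 bytes directly, independently of Biostrings. The following is only a sketch in base R: the function name count_nul_bytes() and the chunk size are my own choices, and it relies on gzfile() being able to read both gzip-compressed and plain files, as documented in ?gzfile.

```r
# Count \0 bytes in a file by reading it in fixed-size raw chunks,
# so that the whole file never has to fit in memory.
count_nul_bytes <- function(path, chunk_size = 64 * 1024^2) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))
  total <- 0  # double, to avoid integer overflow on very large files
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    total <- total + sum(chunk == as.raw(0))
  }
  total
}

# Small self-check on a temporary file with two embedded nul bytes.
tmp <- tempfile(fileext = ".fasta")
writeBin(c(charToRaw(">seq1\nPIAAL"), as.raw(c(0, 0)), charToRaw("\n")), tmp)
count_nul_bytes(tmp)
```

Running it on both the .gz file and the uncompressed file would show whether the nul bytes are already present in the download or only appear after decompression.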

Thanks,

H.

ADD REPLY • link 4.9 years ago Hervé Pagès 16k

My download failed after a few hours with some error message I don't remember.

Were you able to download the file again, uncompress it, and reproduce the error on the uncompressed file?

H.

ADD REPLY • link 4.8 years ago Hervé Pagès 16k

Hello Herve,

Sorry for sleeping so long. I am just at the next update cycle right now. I wrote my own file parser and do not use Biostrings at the moment, but I did not check the sanity of the download again.

Uniprot offers this solution for download sanity: https://www.uniprot.org/help/metalink including md5 sums.
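For the check itself, something along these lines should do. This is only a sketch: verify_md5() is a hypothetical helper name, and the expected checksum has to be copied by hand from the metalink file (the value in the usage comment is a placeholder, not a real UniProt checksum).

```r
# Compare the MD5 checksum of a downloaded file with a published value.
# `expected_md5` must be taken from the RELEASE.metalink file.
verify_md5 <- function(path, expected_md5) {
  observed <- unname(tools::md5sum(path))
  identical(tolower(observed), tolower(expected_md5))
}

# Hypothetical usage; the checksum below is a placeholder.
# verify_md5("uniprot_trembl.fasta.gz", "0123456789abcdef0123456789abcdef")
```

tools::md5sum() reads the whole file, so on the 65GB download this will take a while.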

Once I have made sure that the download worked, I will give Biostrings another try. Compared to Rsamtools::indexFa(), I like that the complete FASTA header lines are returned in the index table, which I wanted to use to search for organism-specific entries.

Side note:

I found that Rsamtools::indexFa() checks for the following errors:

different line lengths
empty lines inserted between or after sequences

... but it only keeps the fasta-header-id, not the whole line :-(

Best, Daniel

ADD REPLY • link 4.5 years ago daniel.magnus.bader ▴ 40

The way I see it, you have to download the bundle with all sequence files again. The "RELEASE.metalink" file that contains the information on the *fasta.gz files of an older release, e.g. "2019_03", can be found here:

ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-2019_03/knowledgebase

The minimal bundle would be the uniprot_sprot-only2019_03.tar.gz file, I guess.

ADD REPLY • link 4.5 years ago daniel.magnus.bader ▴ 40