Search
Question: readVcf shape of 'skeleton' is not compatible with 'NROW(flesh)'
0
gravatar for Miss Agnieszka Aleksandra Golicz
3.4 years ago by
Australia

Hello,

I have a piece of code that have worked on three vcf files but failed on fourth. For the files the reference, samples and gff3 files are the same, only SNP calling algorithm is different.

The error I'm getting is:

Error in relist(v, part) :
  shape of 'skeleton' is not compatible with 'NROW(flesh)'

What does this error mean? Do you know what could be causing this? I just need to know what to look for. 

Best wishes,

Agnieszka

Piece of code:

compressVcf <- bgzip("all.vcf.h.sgs.flt", tempfile())
idx <- indexTabix(compressVcf, "vcf")
tab <- TabixFile(compressVcf, idx)
c <- "C1"
fn <- paste(c,".fasta", sep="")
gn <- paste(c,".gff3", sep="")
fltF <- paste("contig.info.",c, sep="")
fltT <- read.csv(fltF, sep=",", header=TRUE)
fltG <- with(fltT, GRanges(chr, IRanges(start, end)))
param <- ScanVcfParam(which=fltG)
txdb <- makeTxDbFromGFF(gn)
Prepare the 'metadata' data frame ... metadata: OK
Warning message:
In matchCircularity(seqlevels(gr), circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
vcf2<-readVcf(tab, paste("bolep", c), param=param)
Error in relist(v, part) :
  shape of 'skeleton' is not compatible with 'NROW(flesh)'

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantAnnotation_1.14.3 Rsamtools_1.20.4         Biostrings_2.36.1
 [4] XVector_0.8.0            GenomicFeatures_1.20.1   AnnotationDbi_1.30.1
 [7] Biobase_2.28.0           GenomicRanges_1.20.5     GenomeInfoDb_1.4.1
[10] IRanges_2.2.4            S4Vectors_0.6.0          BiocGenerics_0.14.0

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.3
 [4] BSgenome_1.36.1         tools_3.2.1             DBI_0.3.1
 [7] lambda.r_1.1.7          futile.logger_1.4.1     rtracklayer_1.28.5
[10] futile.options_1.0.0    bitops_1.0-6            RCurl_1.95-4.6
[13] biomaRt_2.24.0          RSQLite_1.0.0           XML_3.98-1.2

 

ADD COMMENTlink modified 3.4 years ago by Valerie Obenchain ♦♦ 6.6k • written 3.4 years ago by Miss Agnieszka Aleksandra Golicz40
0
gravatar for Valerie Obenchain
3.4 years ago by
Valerie Obenchain ♦♦ 6.6k
United States
Valerie Obenchain ♦♦ 6.6k wrote:

Hi Agnieszka,

I'm not sure what is causing that error. You could try using scanVcf() to see if it gives the same error. I'll use a sample file from VariantAnnotation:

> scn <- scanVcf(fl, param=param)
> names(scn[[1]])
[1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"     
[7] "GENO"  

It's possible one of these elements has a character that's not being parsed correctly and causing the dimensions to be off. Take a look at the length of the fields with lapply():

> lapply(scn[[1]], length)
$rowRanges
[1] 103

$REF
[1] 103

$ALT
[1] 103

$QUAL
[1] 103

$FILTER
[1] 103

$INFO
[1] 22

$GENO
[1] 3

INFO and GENO are lists of fields so we have to look at those separately. Check the dim in GENO (vs the length):

> lapply(scn[[1]]$GENO, dim)
$GT
[1] 103   5

$DS
[1] 103   5

$GL
[1] 103   5

If that doesn't help identify the problem I can take a look at the file.

Valerie

ADD COMMENTlink written 3.4 years ago by Valerie Obenchain ♦♦ 6.6k

Hi,

I have used scanVcf() and it seems to have worked:

> scn <- scanVcf(tab, param=param)
> names(scn[[1]])
[1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"
[7] "GENO"

> lapply(scn[[1]], length)
$rowRanges
[1] 413361

$REF
[1] 413361

$ALT
[1] 413361

$QUAL
[1] 413361

$FILTER
[1] 413361

$INFO
[1] 2

$GENO
[1] 0

> lapply(scn[[1]]$GENO, dim)
named list()

But the object created by scanVcf() is a list object to perform any further analysis I need CollapsedVCF

 

First few lines from vcf file:

C1      405     C10000008       A       G       41      .       DP=323;PL=Bole      GT:DP   0|0:135 1|1:34  0|0:25  .|.:00|0:67  0|0:43  0|0:12  1|1:3   1|1:2   1|1:2
C1      626     C10000009       G       A       40      .       DP=584;PL=Bole      GT:DP   0|0:269 0|0:89  1|1:40  .|.:00|0:97  0|0:52  0|0:18  0|0:5   0|0:10  0|0:4
C1      1675    C10000014       T       A       85      .       DP=757;PL=Bole      GT:DP   0|0:126 0|0:145 1|1:85  0|0:117      0|0:123 0|0:49  0|0:34  0|0:16  0|0:29  0|0:33
C1      2530    C10000015       C       T       40      .       DP=191;PL=Bole      GT:DP   1|1:3   0|0:27  0|0:9   0|0:33       1|1:36  0|0:26  1|1:1   0|0:11  0|0:32  0|0:13

 

ADD REPLYlink written 3.4 years ago by Miss Agnieszka Aleksandra Golicz40

I only suggested scanVcf() as a troubleshooting tool, not as a replacement for readVcf().

It looks like your GENO fields are not being read in (GT, DP). Do you have a GT and DP line in the header?

Can you make the file available via dropbox or other so I can take a look?

Valerie 

ADD REPLYlink written 3.4 years ago by Valerie Obenchain ♦♦ 6.6k

I have share the file with you via cloudstore. You should have a link in your email box.

Best wishes,

Agnieszka

ADD REPLYlink written 3.4 years ago by Miss Agnieszka Aleksandra Golicz40

Hi Valerie,

After consulting this thread: C: ReadVCF not reading samples

I have added FORMAT fields into the header.

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">

But readVcf() did not work until I also fixed PL from from header INFO field.

from: ##INFO=<ID=PL,Number=0,Type=String,Description="Panel">

to: ##INFO=<ID=PL,Number=1,Type=String,Description="Panel">

I think Number=0 suggested that the field was empty while that was not true. Must have been a bug in SNP caller.

The file that did not work looked like this:

##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=0,Type=String,Description="Panel">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1     S2     S3     S4     S5     S6 S7 S8     S9  S10
C1      47      C10000003       A       C       5       .       DP=12;PL=Bole       GT:DP   .|.:0   .|.:0   .|.:0   .|.:0   1|1:7   0|0:5 .|.:0    .|.:0   .|.:0   .|.:0

And the file that worked looked like this:

##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=1,Type=String,Description="Panel">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1     S2     S3     S4     S5     S6 S7 S8     S9  S10
C1      47      C10000003       A       C       5       .       DP=12;PL=Bole       GT:DP   .|.:0   .|.:0   .|.:0   .|.:0   1|1:7   0|0:5 .|.:0    .|.:0   .|.:0   .|.:0

If the FORMAT lines are necessary maybe it would be possible to emit a warning? From what I understand they are encouraged in the vcf 4.2 but not required.

Agnieszka

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Miss Agnieszka Aleksandra Golicz40

Hi,

Thanks for providing the file and the output of your troubleshooting. Very helpful.

As per the specs, Number = 0 should only be used when Type = flag, which represents a logical (presence / absence). I've added a check for this and an error will be thrown if Number == 0 for anything other than Type == flag.

You're right, we should have a better message indicating what fields are found in the header vs what's expected. While the headers are not mandatory, they make it possible to parse the data into correct data types and number of values. Without them we can only read in the data as a character, unparsed. This is probably not very useful or expected so the header lines are important.

I've added messages that show what fields were found in the header and will be parsed:

> fl <- system.file('extdata', 'ex2.vcf', package='VariantAnnotation')
> readVcf(fl, 'hg19')
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 6 ‘info’ fields: NS, DP, AF, AA, DB, H2 
found header lines for 4 ‘geno’ fields: GT, GQ, DP, HQ 
class: CollapsedVCF 
dim: 5 3 
...

When a field is requested in ScanVcfParam (e.g., INFO = "DP") and not found in the header, a warning is thrown. If there are no INFO header lines whatsoever, all INFO data are read in as a single unparsed column. 

On the TODO, is to parse the 'default' genotype (FORMAT) fields listed in 1.4.2 of the 4.2 spec:

https://samtools.github.io/hts-specs/VCFv4.2.pdf

This means if the file doesn't contain header lines for these fields, they will be parsed according to the Number and Type in the spec definition. If header lines are in the file, those take precedence. I'll try to get this done in the next couple of weeks.

Changes are in 1.14.4 (release) and 1.15.15 (devel).

Valerie

ADD REPLYlink written 3.4 years ago by Valerie Obenchain ♦♦ 6.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 367 users visited in the last hour