Question

readVcf shape of 'skeleton' is not compatible with 'NROW(flesh)'

0

Entering edit mode

Miss Agnieszka Aleksandra Golicz ▴ 40

@miss-agnieszka-aleksandra-golicz-6531

Last seen 11 months ago

Australia

Hello,

I have a piece of code that have worked on three vcf files but failed on fourth. For the files the reference, samples and gff3 files are the same, only SNP calling algorithm is different.

The error I'm getting is:

Error in relist(v, part) :
shape of 'skeleton' is not compatible with 'NROW(flesh)'

What does this error mean? Do you know what could be causing this? I just need to know what to look for.

Best wishes,

Agnieszka

Piece of code:

compressVcf <- bgzip("all.vcf.h.sgs.flt", tempfile())
idx <- indexTabix(compressVcf, "vcf")
tab <- TabixFile(compressVcf, idx)
c <- "C1"
fn <- paste(c,".fasta", sep="")
gn <- paste(c,".gff3", sep="")
fltF <- paste("contig.info.",c, sep="")
fltT <- read.csv(fltF, sep=",", header=TRUE)
fltG <- with(fltT, GRanges(chr, IRanges(start, end)))
param <- ScanVcfParam(which=fltG)
txdb <- makeTxDbFromGFF(gn)
Prepare the 'metadata' data frame ... metadata: OK
Warning message:
In matchCircularity(seqlevels(gr), circ_seqs) :
None of the strings in your circ_seqs argument match your seqnames.
vcf2<-readVcf(tab, paste("bolep", c), param=param)
Error in relist(v, part) :
shape of 'skeleton' is not compatible with 'NROW(flesh)'

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] VariantAnnotation_1.14.3 Rsamtools_1.20.4 Biostrings_2.36.1
[4] XVector_0.8.0 GenomicFeatures_1.20.1 AnnotationDbi_1.30.1
[7] Biobase_2.28.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1
[10] IRanges_2.2.4 S4Vectors_0.6.0 BiocGenerics_0.14.0

loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 GenomicAlignments_1.4.1 BiocParallel_1.2.3
[4] BSgenome_1.36.1 tools_3.2.1 DBI_0.3.1
[7] lambda.r_1.1.7 futile.logger_1.4.1 rtracklayer_1.28.5
[10] futile.options_1.0.0 bitops_1.0-6 RCurl_1.95-4.6
[13] biomaRt_2.24.0 RSQLite_1.0.0 XML_3.98-1.2

variantannotation • 2.7k views

ADD COMMENT • link updated 8.9 years ago by Valerie Obenchain ★ 6.8k • written 8.9 years ago by Miss Agnieszka Aleksandra Golicz ▴ 40

score 0 · Answer 1 · 2015-06-19

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Agnieszka,

I'm not sure what is causing that error. You could try using scanVcf() to see if it gives the same error. I'll use a sample file from VariantAnnotation:

> scn <- scanVcf(fl, param=param)
> names(scn[[1]])
[1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"     
[7] "GENO"

It's possible one of these elements has a character that's not being parsed correctly and causing the dimensions to be off. Take a look at the length of the fields with lapply():

> lapply(scn[[1]], length)
$rowRanges
[1] 103

$REF
[1] 103

$ALT
[1] 103

$QUAL
[1] 103

$FILTER
[1] 103

$INFO
[1] 22

$GENO
[1] 3

INFO and GENO are lists of fields so we have to look at those separately. Check the dim in GENO (vs the length):

> lapply(scn[[1]]$GENO, dim)
$GT
[1] 103   5

$DS
[1] 103   5

$GL
[1] 103   5

If that doesn't help identify the problem I can take a look at the file.

Valerie

ADD COMMENT • link 8.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi,

I have used scanVcf() and it seems to have worked:

> scn <- scanVcf(tab, param=param)
> names(scn[[1]])
[1] "rowRanges" "REF" "ALT" "QUAL" "FILTER" "INFO"
[7] "GENO"

> lapply(scn[[1]], length)
$rowRanges
[1] 413361

$REF
[1] 413361

$ALT
[1] 413361

$QUAL
[1] 413361

$FILTER
[1] 413361

$INFO
[1] 2

$GENO
[1] 0

> lapply(scn[[1]]$GENO, dim)
named list()

But the object created by scanVcf() is a list object to perform any further analysis I need CollapsedVCF

First few lines from vcf file:

C1 405 C10000008 A G 41 . DP=323;PL=Bole GT:DP 0|0:135 1|1:34 0|0:25 .|.:00|0:67 0|0:43 0|0:12 1|1:3 1|1:2 1|1:2
C1 626 C10000009 G A 40 . DP=584;PL=Bole GT:DP 0|0:269 0|0:89 1|1:40 .|.:00|0:97 0|0:52 0|0:18 0|0:5 0|0:10 0|0:4
C1 1675 C10000014 T A 85 . DP=757;PL=Bole GT:DP 0|0:126 0|0:145 1|1:85 0|0:117 0|0:123 0|0:49 0|0:34 0|0:16 0|0:29 0|0:33
C1 2530 C10000015 C T 40 . DP=191;PL=Bole GT:DP 1|1:3 0|0:27 0|0:9 0|0:33 1|1:36 0|0:26 1|1:1 0|0:11 0|0:32 0|0:13

ADD REPLY • link 8.9 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

I only suggested scanVcf() as a troubleshooting tool, not as a replacement for readVcf().

It looks like your GENO fields are not being read in (GT, DP). Do you have a GT and DP line in the header?

Can you make the file available via dropbox or other so I can take a look?

Valerie

ADD REPLY • link 8.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

I have share the file with you via cloudstore. You should have a link in your email box.

Best wishes,

Agnieszka

ADD REPLY • link 8.9 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

Hi Valerie,

After consulting this thread: C: ReadVCF not reading samples

I have added FORMAT fields into the header.

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">

But readVcf() did not work until I also fixed PL from from header INFO field.

from: ##INFO=<ID=PL,Number=0,Type=String,Description="Panel">

to: ##INFO=<ID=PL,Number=1,Type=String,Description="Panel">

I think Number=0 suggested that the field was empty while that was not true. Must have been a bug in SNP caller.

The file that did not work looked like this:

##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=0,Type=String,Description="Panel">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
C1 47 C10000003 A C 5 . DP=12;PL=Bole GT:DP .|.:0 .|.:0 .|.:0 .|.:0 1|1:7 0|0:5 .|.:0 .|.:0 .|.:0 .|.:0

And the file that worked looked like this:

##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=1,Type=String,Description="Panel">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
C1 47 C10000003 A C 5 . DP=12;PL=Bole GT:DP .|.:0 .|.:0 .|.:0 .|.:0 1|1:7 0|0:5 .|.:0 .|.:0 .|.:0 .|.:0

If the FORMAT lines are necessary maybe it would be possible to emit a warning? From what I understand they are encouraged in the vcf 4.2 but not required.

Agnieszka

ADD REPLY • link 8.9 years ago Miss Agnieszka Aleksandra Golicz ▴ 40

0

Entering edit mode

Hi,

Thanks for providing the file and the output of your troubleshooting. Very helpful.

As per the specs, Number = 0 should only be used when Type = flag, which represents a logical (presence / absence). I've added a check for this and an error will be thrown if Number == 0 for anything other than Type == flag.

You're right, we should have a better message indicating what fields are found in the header vs what's expected. While the headers are not mandatory, they make it possible to parse the data into correct data types and number of values. Without them we can only read in the data as a character, unparsed. This is probably not very useful or expected so the header lines are important.

I've added messages that show what fields were found in the header and will be parsed:

> fl <- system.file('extdata', 'ex2.vcf', package='VariantAnnotation')
> readVcf(fl, 'hg19')
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 6 ‘info’ fields: NS, DP, AF, AA, DB, H2 
found header lines for 4 ‘geno’ fields: GT, GQ, DP, HQ 
class: CollapsedVCF 
dim: 5 3 
...

When a field is requested in ScanVcfParam (e.g., INFO = "DP") and not found in the header, a warning is thrown. If there are no INFO header lines whatsoever, all INFO data are read in as a single unparsed column.

On the TODO, is to parse the 'default' genotype (FORMAT) fields listed in 1.4.2 of the 4.2 spec:

https://samtools.github.io/hts-specs/VCFv4.2.pdf

This means if the file doesn't contain header lines for these fields, they will be parsed according to the Number and Type in the spec definition. If header lines are in the file, those take precedence. I'll try to get this done in the next couple of weeks.

Changes are in 1.14.4 (release) and 1.15.15 (devel).

Valerie

ADD REPLY • link 8.9 years ago Valerie Obenchain ★ 6.8k