readVcf shape of 'skeleton' is not compatible with 'NROW(flesh)'
1
0
Entering edit mode
@miss-agnieszka-aleksandra-golicz-6531
Last seen 3.6 years ago
Australia

Hello,

I have a piece of code that have worked on three vcf files but failed on fourth. For the files the reference, samples and gff3 files are the same, only SNP calling algorithm is different.

The error I'm getting is:

Error in relist(v, part) :
  shape of 'skeleton' is not compatible with 'NROW(flesh)'

What does this error mean? Do you know what could be causing this? I just need to know what to look for. 

Best wishes,

Agnieszka

Piece of code:

compressVcf <- bgzip("all.vcf.h.sgs.flt", tempfile())
idx <- indexTabix(compressVcf, "vcf")
tab <- TabixFile(compressVcf, idx)
c <- "C1"
fn <- paste(c,".fasta", sep="")
gn <- paste(c,".gff3", sep="")
fltF <- paste("contig.info.",c, sep="")
fltT <- read.csv(fltF, sep=",", header=TRUE)
fltG <- with(fltT, GRanges(chr, IRanges(start, end)))
param <- ScanVcfParam(which=fltG)
txdb <- makeTxDbFromGFF(gn)
Prepare the 'metadata' data frame ... metadata: OK
Warning message:
In matchCircularity(seqlevels(gr), circ_seqs) :
  None of the strings in your circ_seqs argument match your seqnames.
vcf2<-readVcf(tab, paste("bolep", c), param=param)
Error in relist(v, part) :
  shape of 'skeleton' is not compatible with 'NROW(flesh)'

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] VariantAnnotation_1.14.3 Rsamtools_1.20.4         Biostrings_2.36.1
 [4] XVector_0.8.0            GenomicFeatures_1.20.1   AnnotationDbi_1.30.1
 [7] Biobase_2.28.0           GenomicRanges_1.20.5     GenomeInfoDb_1.4.1
[10] IRanges_2.2.4            S4Vectors_0.6.0          BiocGenerics_0.14.0

loaded via a namespace (and not attached):
 [1] zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.3
 [4] BSgenome_1.36.1         tools_3.2.1             DBI_0.3.1
 [7] lambda.r_1.1.7          futile.logger_1.4.1     rtracklayer_1.28.5
[10] futile.options_1.0.0    bitops_1.0-6            RCurl_1.95-4.6
[13] biomaRt_2.24.0          RSQLite_1.0.0           XML_3.98-1.2

 

variantannotation • 1.7k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 11 months ago
United States

Hi Agnieszka,

I'm not sure what is causing that error. You could try using scanVcf() to see if it gives the same error. I'll use a sample file from VariantAnnotation:

> scn <- scanVcf(fl, param=param)
> names(scn[[1]])
[1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"     
[7] "GENO"  

It's possible one of these elements has a character that's not being parsed correctly and causing the dimensions to be off. Take a look at the length of the fields with lapply():

> lapply(scn[[1]], length)
$rowRanges
[1] 103

$REF
[1] 103

$ALT
[1] 103

$QUAL
[1] 103

$FILTER
[1] 103

$INFO
[1] 22

$GENO
[1] 3

INFO and GENO are lists of fields so we have to look at those separately. Check the dim in GENO (vs the length):

> lapply(scn[[1]]$GENO, dim)
$GT
[1] 103   5

$DS
[1] 103   5

$GL
[1] 103   5

If that doesn't help identify the problem I can take a look at the file.

Valerie

ADD COMMENT
0
Entering edit mode

Hi,

I have used scanVcf() and it seems to have worked:

> scn <- scanVcf(tab, param=param)
> names(scn[[1]])
[1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"
[7] "GENO"

> lapply(scn[[1]], length)
$rowRanges
[1] 413361

$REF
[1] 413361

$ALT
[1] 413361

$QUAL
[1] 413361

$FILTER
[1] 413361

$INFO
[1] 2

$GENO
[1] 0

> lapply(scn[[1]]$GENO, dim)
named list()

But the object created by scanVcf() is a list object to perform any further analysis I need CollapsedVCF

 

First few lines from vcf file:

C1      405     C10000008       A       G       41      .       DP=323;PL=Bole      GT:DP   0|0:135 1|1:34  0|0:25  .|.:00|0:67  0|0:43  0|0:12  1|1:3   1|1:2   1|1:2
C1      626     C10000009       G       A       40      .       DP=584;PL=Bole      GT:DP   0|0:269 0|0:89  1|1:40  .|.:00|0:97  0|0:52  0|0:18  0|0:5   0|0:10  0|0:4
C1      1675    C10000014       T       A       85      .       DP=757;PL=Bole      GT:DP   0|0:126 0|0:145 1|1:85  0|0:117      0|0:123 0|0:49  0|0:34  0|0:16  0|0:29  0|0:33
C1      2530    C10000015       C       T       40      .       DP=191;PL=Bole      GT:DP   1|1:3   0|0:27  0|0:9   0|0:33       1|1:36  0|0:26  1|1:1   0|0:11  0|0:32  0|0:13

 

0
Entering edit mode

I only suggested scanVcf() as a troubleshooting tool, not as a replacement for readVcf().

It looks like your GENO fields are not being read in (GT, DP). Do you have a GT and DP line in the header?

Can you make the file available via dropbox or other so I can take a look?

Valerie 

ADD REPLY
0
Entering edit mode

I have share the file with you via cloudstore. You should have a link in your email box.

Best wishes,

Agnieszka

0
Entering edit mode

Hi Valerie,

After consulting this thread: C: ReadVCF not reading samples

I have added FORMAT fields into the header.

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">

But readVcf() did not work until I also fixed PL from from header INFO field.

from: ##INFO=<ID=PL,Number=0,Type=String,Description="Panel">

to: ##INFO=<ID=PL,Number=1,Type=String,Description="Panel">

I think Number=0 suggested that the field was empty while that was not true. Must have been a bug in SNP caller.

The file that did not work looked like this:

##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=0,Type=String,Description="Panel">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1     S2     S3     S4     S5     S6 S7 S8     S9  S10
C1      47      C10000003       A       C       5       .       DP=12;PL=Bole       GT:DP   .|.:0   .|.:0   .|.:0   .|.:0   1|1:7   0|0:5 .|.:0    .|.:0   .|.:0   .|.:0

And the file that worked looked like this:

##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=1,Type=String,Description="Panel">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1     S2     S3     S4     S5     S6 S7 S8     S9  S10
C1      47      C10000003       A       C       5       .       DP=12;PL=Bole       GT:DP   .|.:0   .|.:0   .|.:0   .|.:0   1|1:7   0|0:5 .|.:0    .|.:0   .|.:0   .|.:0

If the FORMAT lines are necessary maybe it would be possible to emit a warning? From what I understand they are encouraged in the vcf 4.2 but not required.

Agnieszka

0
Entering edit mode

Hi,

Thanks for providing the file and the output of your troubleshooting. Very helpful.

As per the specs, Number = 0 should only be used when Type = flag, which represents a logical (presence / absence). I've added a check for this and an error will be thrown if Number == 0 for anything other than Type == flag.

You're right, we should have a better message indicating what fields are found in the header vs what's expected. While the headers are not mandatory, they make it possible to parse the data into correct data types and number of values. Without them we can only read in the data as a character, unparsed. This is probably not very useful or expected so the header lines are important.

I've added messages that show what fields were found in the header and will be parsed:

> fl <- system.file('extdata', 'ex2.vcf', package='VariantAnnotation')
> readVcf(fl, 'hg19')
found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 6 ‘info’ fields: NS, DP, AF, AA, DB, H2 
found header lines for 4 ‘geno’ fields: GT, GQ, DP, HQ 
class: CollapsedVCF 
dim: 5 3 
...

When a field is requested in ScanVcfParam (e.g., INFO = "DP") and not found in the header, a warning is thrown. If there are no INFO header lines whatsoever, all INFO data are read in as a single unparsed column. 

On the TODO, is to parse the 'default' genotype (FORMAT) fields listed in 1.4.2 of the 4.2 spec:

https://samtools.github.io/hts-specs/VCFv4.2.pdf

This means if the file doesn't contain header lines for these fields, they will be parsed according to the Number and Type in the spec definition. If header lines are in the file, those take precedence. I'll try to get this done in the next couple of weeks.

Changes are in 1.14.4 (release) and 1.15.15 (devel).

Valerie

ADD REPLY

Login before adding your answer.

Traffic: 442 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6