Hello,
I have a piece of code that have worked on three vcf files but failed on fourth. For the files the reference, samples and gff3 files are the same, only SNP calling algorithm is different.
The error I'm getting is:
Error in relist(v, part) :
shape of 'skeleton' is not compatible with 'NROW(flesh)'
What does this error mean? Do you know what could be causing this? I just need to know what to look for.
Best wishes,
Agnieszka
Piece of code:
compressVcf <- bgzip("all.vcf.h.sgs.flt", tempfile())
idx <- indexTabix(compressVcf, "vcf")
tab <- TabixFile(compressVcf, idx)
c <- "C1"
fn <- paste(c,".fasta", sep="")
gn <- paste(c,".gff3", sep="")
fltF <- paste("contig.info.",c, sep="")
fltT <- read.csv(fltF, sep=",", header=TRUE)
fltG <- with(fltT, GRanges(chr, IRanges(start, end)))
param <- ScanVcfParam(which=fltG)
txdb <- makeTxDbFromGFF(gn)
Prepare the 'metadata' data frame ... metadata: OK
Warning message:
In matchCircularity(seqlevels(gr), circ_seqs) :
None of the strings in your circ_seqs argument match your seqnames.
vcf2<-readVcf(tab, paste("bolep", c), param=param)
Error in relist(v, part) :
shape of 'skeleton' is not compatible with 'NROW(flesh)'
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] VariantAnnotation_1.14.3 Rsamtools_1.20.4 Biostrings_2.36.1
[4] XVector_0.8.0 GenomicFeatures_1.20.1 AnnotationDbi_1.30.1
[7] Biobase_2.28.0 GenomicRanges_1.20.5 GenomeInfoDb_1.4.1
[10] IRanges_2.2.4 S4Vectors_0.6.0 BiocGenerics_0.14.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.14.0 GenomicAlignments_1.4.1 BiocParallel_1.2.3
[4] BSgenome_1.36.1 tools_3.2.1 DBI_0.3.1
[7] lambda.r_1.1.7 futile.logger_1.4.1 rtracklayer_1.28.5
[10] futile.options_1.0.0 bitops_1.0-6 RCurl_1.95-4.6
[13] biomaRt_2.24.0 RSQLite_1.0.0 XML_3.98-1.2
Hi,
I have used scanVcf() and it seems to have worked:
> scn <- scanVcf(tab, param=param)
> names(scn[[1]])
[1] "rowRanges" "REF" "ALT" "QUAL" "FILTER" "INFO"
[7] "GENO"
> lapply(scn[[1]], length)
$rowRanges
[1] 413361
$REF
[1] 413361
$ALT
[1] 413361
$QUAL
[1] 413361
$FILTER
[1] 413361
$INFO
[1] 2
$GENO
[1] 0
> lapply(scn[[1]]$GENO, dim)
named list()
But the object created by scanVcf() is a list object to perform any further analysis I need CollapsedVCF
First few lines from vcf file:
C1 405 C10000008 A G 41 . DP=323;PL=Bole GT:DP 0|0:135 1|1:34 0|0:25 .|.:00|0:67 0|0:43 0|0:12 1|1:3 1|1:2 1|1:2
C1 626 C10000009 G A 40 . DP=584;PL=Bole GT:DP 0|0:269 0|0:89 1|1:40 .|.:00|0:97 0|0:52 0|0:18 0|0:5 0|0:10 0|0:4
C1 1675 C10000014 T A 85 . DP=757;PL=Bole GT:DP 0|0:126 0|0:145 1|1:85 0|0:117 0|0:123 0|0:49 0|0:34 0|0:16 0|0:29 0|0:33
C1 2530 C10000015 C T 40 . DP=191;PL=Bole GT:DP 1|1:3 0|0:27 0|0:9 0|0:33 1|1:36 0|0:26 1|1:1 0|0:11 0|0:32 0|0:13
I only suggested scanVcf() as a troubleshooting tool, not as a replacement for readVcf().
It looks like your GENO fields are not being read in (GT, DP). Do you have a GT and DP line in the header?
Can you make the file available via dropbox or other so I can take a look?
Valerie
I have share the file with you via cloudstore. You should have a link in your email box.
Best wishes,
Agnieszka
Hi Valerie,
After consulting this thread: C: ReadVCF not reading samples
I have added FORMAT fields into the header.
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">
But readVcf() did not work until I also fixed PL from from header INFO field.
from: ##INFO=<ID=PL,Number=0,Type=String,Description="Panel">
to: ##INFO=<ID=PL,Number=1,Type=String,Description="Panel">
I think Number=0 suggested that the field was empty while that was not true. Must have been a bug in SNP caller.
The file that did not work looked like this:
##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=0,Type=String,Description="Panel">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
C1 47 C10000003 A C 5 . DP=12;PL=Bole GT:DP .|.:0 .|.:0 .|.:0 .|.:0 1|1:7 0|0:5 .|.:0 .|.:0 .|.:0 .|.:0
And the file that worked looked like this:
##fileformat=VCFv4.0
##filedate=20150318
##reference=C1.fasta
##phasing=allhomozygote
##filter="DP > 10"
##INFO=<ID=DP,Number=1,Type=Integer,Description="Read depth over all samples">
##INFO=<ID=PL,Number=1,Type=String,Description="Panel">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
C1 47 C10000003 A C 5 . DP=12;PL=Bole GT:DP .|.:0 .|.:0 .|.:0 .|.:0 1|1:7 0|0:5 .|.:0 .|.:0 .|.:0 .|.:0
If the FORMAT lines are necessary maybe it would be possible to emit a warning? From what I understand they are encouraged in the vcf 4.2 but not required.
Agnieszka
Hi,
Thanks for providing the file and the output of your troubleshooting. Very helpful.
As per the specs, Number = 0 should only be used when Type = flag, which represents a logical (presence / absence). I've added a check for this and an error will be thrown if Number == 0 for anything other than Type == flag.
You're right, we should have a better message indicating what fields are found in the header vs what's expected. While the headers are not mandatory, they make it possible to parse the data into correct data types and number of values. Without them we can only read in the data as a character, unparsed. This is probably not very useful or expected so the header lines are important.
I've added messages that show what fields were found in the header and will be parsed:
When a field is requested in ScanVcfParam (e.g., INFO = "DP") and not found in the header, a warning is thrown. If there are no INFO header lines whatsoever, all INFO data are read in as a single unparsed column.
On the TODO, is to parse the 'default' genotype (FORMAT) fields listed in 1.4.2 of the 4.2 spec:
https://samtools.github.io/hts-specs/VCFv4.2.pdf
This means if the file doesn't contain header lines for these fields, they will be parsed according to the Number and Type in the spec definition. If header lines are in the file, those take precedence. I'll try to get this done in the next couple of weeks.
Changes are in 1.14.4 (release) and 1.15.15 (devel).
Valerie