Question

parseCSQToGRanges() : mcols are all factors, despite some numeric fields

1

Entering edit mode

kevin.rue ▴ 350

@kevinrue-6757

Last seen 25 days ago

University of Oxford

Dear Bioconductor Package Maintainers and Developers,

This started out as question, but I feel like I ended up answering myself, hence posting as a tutorial. Feedback welcome!

Considering a VCF file with many ensemblVEP predictions, including a mixture of factors (e.g. "IMPACT"), and numeric fields (e.g. allele frequencies from ExAC project populations). The parseCSQToGRanges() function seems to turn every field into a factor in the resulting DataFrame object.

Obviously, this is trivial to fix by hand on a small scale (e.g. when plotting a couple of predictions in ggplot2). However, I am writing some generic code that would greatly benefit of knowing what is truly a factor, from what could be plotted on a continuous scale.

While I appreciate the difficulty of the task, considering the infinite number of fields and plugins that could appear in the CSQ/ANN field, I merely wonder:

Whether there are any foreseeable plan to address this point (i.e. automatically detect factor/numeric fields)
Independently of 1., which would be the recommended way to deal with the problem. I present my current idea (I didn't want to post without contributing anything!)
- define as numeric the predictions which can be converted to numeric without warning (e.g. as.numeric(as.character(csq$DISTANCE)) ), the remaining being considered factors (see below)

This solution (applicable to the example data set at the end) ended up looking like:

mcols_names <- colnames(mcols(csq))
convertibles <- sapply(mcols_names, function(x_name){
  sum(
    suppressWarnings(!is.na(as.numeric(as.character(mcols(csq)[,x_name] )))),
    na.rm = TRUE
    ) > 0
})
which(convertibles)

In words: "which of the predictions are left with at least one value not NA after converting the factor to character and then to numeric".

The only tricky one here is the STRAND, encoded 0/1 although it is a factor.

In any case, I find the code above a little clumsy, and I genuinely wonder whether anyone has something cleaner to suggest :)

For the sake of reporting a replicable situation, I copy-paste the sample code from the parseCSQToGRanges() help page, where the resulting csq object contains field DISTANCE, clearly numeric, although displayed as a factor.

library(ensemblVEP)
file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") 
vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE)))
 
## The returned 'CSQ' data are unparsed.
info(vep)$CSQ
 
## Parse into a GRanges and include the 'VCFRowID' column.
vcf <- readVcf(file, "hg19")
csq <- parseCSQToGRanges(vep, VCFRowID=rownames(vcf))
csq[1:4]

Best wishes,

Kevin

ensemblVEP • 1.4k views

ADD COMMENT • link updated 8.0 years ago by Martin Morgan 25k • written 8.0 years ago by kevin.rue ▴ 350

1

Entering edit mode

Hi Kevin,

There are no plans to enforce data type on the CSQ fields. AFAIK data type is not is not provided from ensembl and guessing at the type can problematic. For example, a field might contain identifiers that mix numbers and letters. These should be considered characters but your code would mark it as 'convertible' to numeric:

> sum(suppressWarnings(!is.na(as.numeric(c("1", "1B")))), na.rm = TRUE) > 0 [1] TRUE

CSQ data are received as a single string, field separated by '|' and all are parsed as characters. See the description in the 'Data format' section, option --vcf:

http://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html

You can see the unparsed CSQ (and probably have) with

file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE))) info(vep)

Previously all character fields were coerced to factors by default when creating the DataFrame. I've checked in a fix to version 1.13.1 (devel) that prevents this so the fields are now plain characters. This doesn't address the issue of the non-characters but I don't think we should enforce data type unless it's provided.

Valerie

ADD REPLY • link 8.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Thank you Valerie.

I appreciate the little fix in the devel branch.

I have to admit that this is not a critical issue, so for the time being I'll just leave it to the user of my code to select the appropriate fields when requesting a factor, a character or a numeric type. I was hoping to provide subsetted selections of fields using "is.<class>" methods, but common sense should be enough: a line must sometimes be drawn between "blame-the-programer" and "blame-the-user" scenarios ;)

Thanks for the answer!

Kevin

ADD REPLY • link 8.0 years ago kevin.rue ▴ 350

score 0 · Answer 1 · 2016-05-11

I thought maybe a better approach would be to let R do the guessing about type, using whatever logic it has built in to read.delim(). I started by unlisting the CSQ field to a plain-old character vector

csq <- unlist(info(vep)$CSQ)

The column headers can be added from the vcf object

headers <- sub(".*Format: ", "", info(header(vep))["CSQ", "Description"])
csq <- c(headers, csq)

The resulting character vector can be read in through a textConnetion()

input <- read.delim(textConnection(csq), sep="|", stringsAsFactors=FALSE)

(reading from a text connection used to be slow for large objects; a work-around would be writeLines(csq, fl <- tempfile()); read.delim(fl, ...)). This gave me

> input[3:4, 1:13]
  Allele           Consequence   IMPACT SYMBOL            Gene Feature_type
3      G upstream_gene_variant MODIFIER  PSMF1 ENSG00000125818   Transcript
4      T upstream_gene_variant MODIFIER  PSMF1 ENSG00000125818   Transcript
          Feature        BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position
3 ENST00000381899 protein_coding   NA           NA    NA            NA
4 ENST00000381899 protein_coding   NA           NA    NA            NA
> sapply(input, class)
            Allele        Consequence             IMPACT             SYMBOL 
       "character"        "character"        "character"        "character" 
              Gene       Feature_type            Feature            BIOTYPE 
       "character"        "character"        "character"        "character" 
              EXON             INTRON              HGVSc              HGVSp 
         "logical"        "character"          "logical"          "logical" 
     cDNA_position       CDS_position   Protein_position        Amino_acids 
         "logical"          "logical"          "logical"          "logical" 
            Codons Existing_variation           DISTANCE             STRAND 
         "logical"          "logical"          "integer"          "integer" 
             FLAGS      SYMBOL_SOURCE            HGNC_ID 
       "character"        "character"        "character"

(Columns with all NAs are of class logical(); some of these don't seem to make sense). I then coerced the plain-old-data.frame to a DataFrame, so that I could get back the original geometry (using relist()) and replace the CSQ field

info(vep)$CSQ <- relist(DataFrame(input), info(vep)$CSQ)

The main advantage is that the rules for guessing about column classes are consistent with R's.

(I think this has ended up more a question than tutorial, so I'll re-assign it).