parseCSQToGRanges() : mcols are all factors, despite some numeric fields
1
1
Entering edit mode
kevin.rue ▴ 300
@kevinrue-6757
Last seen 4 days ago
University of Oxford

Dear Bioconductor Package Maintainers and Developers,

This started out as question, but I feel like I ended up answering myself, hence posting as a tutorial. Feedback welcome!

Considering a VCF file with many ensemblVEP predictions, including a mixture of factors (e.g. "IMPACT"), and numeric fields (e.g. allele frequencies from ExAC project populations). The parseCSQToGRanges() function seems to turn every field into a factor in the resulting DataFrame object.

Obviously, this is trivial to fix by hand on a small scale (e.g. when plotting a couple of predictions in ggplot2). However, I am writing some generic code that would greatly benefit of knowing what is truly a factor, from what could be plotted on a continuous scale. 

While I appreciate the difficulty of the task, considering the infinite number of fields and plugins that could appear in the CSQ/ANN field, I merely wonder:

  1. Whether there are any foreseeable plan to address this point (i.e. automatically detect factor/numeric fields)
  2. Independently of 1., which would be the recommended way to deal with the problem. I present my current idea (I didn't want to post without contributing anything!)
    • ​define as numeric the predictions which can be converted to numeric without warning (e.g. as.numeric(as.character(csq$DISTANCE)) ), the remaining being considered factors (see below)

This solution (applicable to the example data set at the end) ended up looking like:

mcols_names <- colnames(mcols(csq))
convertibles <- sapply(mcols_names, function(x_name){
  sum(
    suppressWarnings(!is.na(as.numeric(as.character(mcols(csq)[,x_name] )))),
    na.rm = TRUE
    ) > 0
})
which(convertibles)

In words: "which of the predictions are left with at least one value not NA after converting the factor to character and then to numeric".

The only tricky one here is the STRAND, encoded 0/1 although it is a factor.

In any case, I find the code above a little clumsy, and I genuinely wonder whether anyone has something cleaner to suggest :)

For the sake of reporting a replicable situation, I copy-paste the sample code from the parseCSQToGRanges() help page, where the resulting csq object contains field DISTANCE, clearly numeric, although displayed as a factor.

 

library(ensemblVEP)
file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") 
vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE)))
 
## The returned 'CSQ' data are unparsed.
info(vep)$CSQ
 
## Parse into a GRanges and include the 'VCFRowID' column.
vcf <- readVcf(file, "hg19")
csq <- parseCSQToGRanges(vep, VCFRowID=rownames(vcf))
csq[1:4]

Best wishes,

Kevin

ensemblVEP • 622 views
ADD COMMENT
1
Entering edit mode

Hi Kevin,

There are no plans to enforce data type on the CSQ fields. AFAIK data type is not is not provided from ensembl and guessing at the type can problematic. For example, a field might contain identifiers that mix numbers and letters. These should be considered characters but your code would mark it as 'convertible' to numeric:

>  sum(suppressWarnings(!is.na(as.numeric(c("1", "1B")))), na.rm = TRUE) > 0
[1] TRUE


CSQ data are received as a single string, field separated by '|' and all are parsed as characters. See the description in the 'Data format' section, option --vcf:

http://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html

You can see the unparsed CSQ (and probably have) with

  file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
  vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE)))
  info(vep)


Previously all character fields were coerced to factors by default when creating the DataFrame. I've checked in a fix to version 1.13.1 (devel) that prevents this so the fields are now plain characters. This doesn't address the issue of the non-characters but I don't think we should enforce data type unless it's provided.

Valerie

ADD REPLY
0
Entering edit mode

Thank you Valerie.

I appreciate the little fix in the devel branch.

I have to admit that this is not a critical issue, so for the time being I'll just leave it to the user of my code to select the appropriate fields when requesting a factor, a character or a numeric type. I was hoping to provide subsetted selections of fields using "is.<class>" methods, but common sense should be enough: a line must sometimes be drawn between "blame-the-programer" and "blame-the-user" scenarios ;)

Thanks for the answer!

Kevin 

 

ADD REPLY
0
Entering edit mode
@martin-morgan-1513
Last seen 41 minutes ago
United States

I thought maybe a better approach would be to let R do the guessing about type, using whatever logic it has built in to read.delim(). I started by unlisting the CSQ field to a plain-old character vector

csq <- unlist(info(vep)$CSQ)

The column headers can be added from the vcf object

headers <- sub(".*Format: ", "", info(header(vep))["CSQ", "Description"])
csq <- c(headers, csq)

The resulting character vector can be read in through a textConnetion()

input <- read.delim(textConnection(csq), sep="|", stringsAsFactors=FALSE)

(reading from a text connection used to be slow for large objects; a work-around would be writeLines(csq, fl <- tempfile()); read.delim(fl, ...)). This gave me

> input[3:4, 1:13]
  Allele           Consequence   IMPACT SYMBOL            Gene Feature_type
3      G upstream_gene_variant MODIFIER  PSMF1 ENSG00000125818   Transcript
4      T upstream_gene_variant MODIFIER  PSMF1 ENSG00000125818   Transcript
          Feature        BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position
3 ENST00000381899 protein_coding   NA           NA    NA            NA
4 ENST00000381899 protein_coding   NA           NA    NA            NA
> sapply(input, class)
            Allele        Consequence             IMPACT             SYMBOL 
       "character"        "character"        "character"        "character" 
              Gene       Feature_type            Feature            BIOTYPE 
       "character"        "character"        "character"        "character" 
              EXON             INTRON              HGVSc              HGVSp 
         "logical"        "character"          "logical"          "logical" 
     cDNA_position       CDS_position   Protein_position        Amino_acids 
         "logical"          "logical"          "logical"          "logical" 
            Codons Existing_variation           DISTANCE             STRAND 
         "logical"          "logical"          "integer"          "integer" 
             FLAGS      SYMBOL_SOURCE            HGNC_ID 
       "character"        "character"        "character" 

(Columns with all NAs are of class logical(); some of these don't seem to make sense). I then coerced the plain-old-data.frame to a DataFrame, so that I could get back the original geometry (using relist()) and replace the CSQ field

info(vep)$CSQ <- relist(DataFrame(input), info(vep)$CSQ)

The main advantage is that the rules for guessing about column classes are consistent with R's.

(I think this has ended up more a question than tutorial, so I'll re-assign it).

 

ADD COMMENT
0
Entering edit mode

Thank you Martin. (apologies for the delay)

As mentioned in my reply to Valerie, instead of converting every column to "what it should be", I decided to let the user provide any field key, and then I only check the values in the context of what is supposed to be done with them (plot, table, etc.).

Thanks for the answer! 

Kevin 

ADD REPLY

Login before adding your answer.

Traffic: 439 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6