Question: parseCSQToGRanges() : mcols are all factors, despite some numeric fields
1
3.5 years ago by
kevin.rue230
University of Oxford
kevin.rue230 wrote:

Dear Bioconductor Package Maintainers and Developers,

This started out as question, but I feel like I ended up answering myself, hence posting as a tutorial. Feedback welcome!

Considering a VCF file with many ensemblVEP predictions, including a mixture of factors (e.g. "IMPACT"), and numeric fields (e.g. allele frequencies from ExAC project populations). The parseCSQToGRanges() function seems to turn every field into a factor in the resulting DataFrame object.

Obviously, this is trivial to fix by hand on a small scale (e.g. when plotting a couple of predictions in ggplot2). However, I am writing some generic code that would greatly benefit of knowing what is truly a factor, from what could be plotted on a continuous scale.

While I appreciate the difficulty of the task, considering the infinite number of fields and plugins that could appear in the CSQ/ANN field, I merely wonder:

1. Whether there are any foreseeable plan to address this point (i.e. automatically detect factor/numeric fields)
2. Independently of 1., which would be the recommended way to deal with the problem. I present my current idea (I didn't want to post without contributing anything!)
• ​define as numeric the predictions which can be converted to numeric without warning (e.g. as.numeric(as.character(csq$DISTANCE)) ), the remaining being considered factors (see below) This solution (applicable to the example data set at the end) ended up looking like: mcols_names <- colnames(mcols(csq)) convertibles <- sapply(mcols_names, function(x_name){ sum( suppressWarnings(!is.na(as.numeric(as.character(mcols(csq)[,x_name] )))), na.rm = TRUE ) > 0 }) which(convertibles) In words: "which of the predictions are left with at least one value not NA after converting the factor to character and then to numeric". The only tricky one here is the STRAND, encoded 0/1 although it is a factor. In any case, I find the code above a little clumsy, and I genuinely wonder whether anyone has something cleaner to suggest :) For the sake of reporting a replicable situation, I copy-paste the sample code from the parseCSQToGRanges() help page, where the resulting csq object contains field DISTANCE, clearly numeric, although displayed as a factor. library(ensemblVEP) file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE))) ## The returned 'CSQ' data are unparsed. info(vep)$CSQ

## Parse into a GRanges and include the 'VCFRowID' column.
vcf <- readVcf(file, "hg19")
csq <- parseCSQToGRanges(vep, VCFRowID=rownames(vcf))
csq[1:4]

Best wishes,

Kevin

ensemblvep • 523 views
modified 3.5 years ago by Martin Morgan ♦♦ 24k • written 3.5 years ago by kevin.rue230
1

Hi Kevin,

There are no plans to enforce data type on the CSQ fields. AFAIK data type is not is not provided from ensembl and guessing at the type can problematic. For example, a field might contain identifiers that mix numbers and letters. These should be considered characters but your code would mark it as 'convertible' to numeric:

>  sum(suppressWarnings(!is.na(as.numeric(c("1", "1B")))), na.rm = TRUE) > 0
[1] TRUE

CSQ data are received as a single string, field separated by '|' and all are parsed as characters. See the description in the 'Data format' section, option --vcf:

http://uswest.ensembl.org/info/docs/tools/vep/script/vep_options.html

You can see the unparsed CSQ (and probably have) with

file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vep <- ensemblVEP(file, param=VEPParam(dataformat=c(vcf=TRUE)))
info(vep)

Previously all character fields were coerced to factors by default when creating the DataFrame. I've checked in a fix to version 1.13.1 (devel) that prevents this so the fields are now plain characters. This doesn't address the issue of the non-characters but I don't think we should enforce data type unless it's provided.

Valerie

Thank you Valerie.

I appreciate the little fix in the devel branch.

I have to admit that this is not a critical issue, so for the time being I'll just leave it to the user of my code to select the appropriate fields when requesting a factor, a character or a numeric type. I was hoping to provide subsetted selections of fields using "is.<class>" methods, but common sense should be enough: a line must sometimes be drawn between "blame-the-programer" and "blame-the-user" scenarios ;)

Thanks for the answer!

Kevin

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by kevin.rue230
Answer: parseCSQToGRanges() : mcols are all factors, despite some numeric fields
0
3.5 years ago by
Martin Morgan ♦♦ 24k
United States
Martin Morgan ♦♦ 24k wrote:

I thought maybe a better approach would be to let R do the guessing about type, using whatever logic it has built in to read.delim(). I started by unlisting the CSQ field to a plain-old character vector

csq <- unlist(info(vep)$CSQ) The column headers can be added from the vcf object headers <- sub(".*Format: ", "", info(header(vep))["CSQ", "Description"]) csq <- c(headers, csq) The resulting character vector can be read in through a textConnetion() input <- read.delim(textConnection(csq), sep="|", stringsAsFactors=FALSE) (reading from a text connection used to be slow for large objects; a work-around would be writeLines(csq, fl <- tempfile()); read.delim(fl, ...)). This gave me > input[3:4, 1:13] Allele Consequence IMPACT SYMBOL Gene Feature_type 3 G upstream_gene_variant MODIFIER PSMF1 ENSG00000125818 Transcript 4 T upstream_gene_variant MODIFIER PSMF1 ENSG00000125818 Transcript Feature BIOTYPE EXON INTRON HGVSc HGVSp cDNA_position 3 ENST00000381899 protein_coding NA NA NA NA 4 ENST00000381899 protein_coding NA NA NA NA > sapply(input, class) Allele Consequence IMPACT SYMBOL "character" "character" "character" "character" Gene Feature_type Feature BIOTYPE "character" "character" "character" "character" EXON INTRON HGVSc HGVSp "logical" "character" "logical" "logical" cDNA_position CDS_position Protein_position Amino_acids "logical" "logical" "logical" "logical" Codons Existing_variation DISTANCE STRAND "logical" "logical" "integer" "integer" FLAGS SYMBOL_SOURCE HGNC_ID "character" "character" "character" (Columns with all NAs are of class logical(); some of these don't seem to make sense). I then coerced the plain-old-data.frame to a DataFrame, so that I could get back the original geometry (using relist()) and replace the CSQ field info(vep)$CSQ <- relist(DataFrame(input), info(vep)\$CSQ)

The main advantage is that the rules for guessing about column classes are consistent with R's.

(I think this has ended up more a question than tutorial, so I'll re-assign it).

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Martin Morgan ♦♦ 24k

Thank you Martin. (apologies for the delay)

As mentioned in my reply to Valerie, instead of converting every column to "what it should be", I decided to let the user provide any field key, and then I only check the values in the context of what is supposed to be done with them (plot, table, etc.).

Thanks for the answer!

Kevin