Question

summarizeVariants from VariantAnnotation reordering genes in rownames

0

Entering edit mode

Todd Creasy • 0

@todd-creasy-9699

Last seen 5.2 years ago

When I use summarizeVariants with a GRangesList of genes, the genes are then reordered in the RangedSummarizedExperiment object and the assay(counts) values are incorrect. Is this a bug or can I control this reordering?

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    
    genes <- genes(txdb)

    # get the gene symbol for corresponding ENTREZID
    gene.map <- select(org.Hs.eg.db, keys=elementMetadata(genes)$gene_id, keytype="ENTREZID", column="SYMBOL")
    genes.df <- merge(genes, gene.map, by.x="gene_id", by.y="ENTREZID")
    elementMetadata(genes)$SYMBOL <- genes.df$SYMBOL
    
    # get the gene symbols GRanges and push into a GRangesList
    grl <- split(genes, elementMetadata(genes)$SYMBOL)

    # subset by a few genes
    grl <- grl[c('STK11','TP53','GCLM','SLC6A3')]

    

   print(grl)

    

GRangesList object of length 4:
$STK11
GRanges object with 1 range and 2 metadata columns:
       seqnames             ranges strand |     gene_id      SYMBOL
          <Rle>          <IRanges>  <Rle> | <character> <character>
  6794    chr19 [1205798, 1228434]      + |        6794       STK11

$TP53
GRanges object with 1 range and 2 metadata columns:
       seqnames             ranges strand | gene_id SYMBOL
  7157    chr17 [7565097, 7590868]      - |    7157   TP53

$GCLM
GRanges object with 1 range and 2 metadata columns:
       seqnames               ranges strand | gene_id SYMBOL
  2730     chr1 [94352590, 94375012]      - |    2730   GCLM

...
<1 more element>
-------


    # read in the vcf file
    print("Read filtered vcf...")
    vcf <- readVcf(vcf.file, "hg19", row.names=TRUE)
    
    # change sample headers
    header(vcf)@samples <- sub("(.*)_S.*", "\\1", header(vcf)@samples)
    rownames(colData(vcf)) <- sub("(.*)_S.*", "\\1", rownames(colData(vcf)))
    
     

    # summarize variant counts for coding regions (this will take a long time)
    print("Summarize variant counts...")
    x <- summarizeVariants(grl, vcf, CodingVariants())
    print(x)

    seqinfo: 93 sequences (1 circular) from hg19 genome
[1] "Summarize variant counts..."
class: RangedSummarizedExperiment
dim: 4 118
metadata(1): header
assays(1): counts
rownames(4): GCLM TP53 STK11 SLC6A3
rowData names(0):
colnames(118): D02914TB04 D06699TB04_2nd ... D31681TB02 D33818TB02
colData names(1): Samples    

    print(assays(x)$counts)

variantannotation summarizevariants • 1.2k views

ADD COMMENT • link updated 8.6 years ago by Valerie Obenchain ★ 6.8k • written 8.6 years ago by Todd Creasy • 0

score 0 · Answer 1 · 2017-06-27

Hi,

I think results in rowRanges() and assays()$counts are consistent. The show method for RangedSummarizedExperiment displays the column names for assays()$counts instead of the row names and maybe that's the point of confusion. The contract for the class is that the rows of rowRanges() match the rows of assays().

Here is an example with a vcf from VariantAnnotation:

Create a GRangesList with chr22 ranges in desending order:

grl <- GRangesList(
  STK11 = GRanges("chr22", IRanges(50300166, width=1)),
  TP53 = GRanges("chr22", IRanges(50300101, width=1)),
  CGLM = GRanges("chr22", IRanges(50300078, width=1)))

Read variants from VCF:

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")

Rename seqlevels to match 'grl' and confirm the match:

seqlevels(vcf) <- paste0("chr", seqlevels(vcf)) 

> intersect(seqlevels(vcf), seqlevels(grl))
[1] "chr22"

Underneath, summarizeVariants() performs an overlap operation between the vcf file and annotation. The canonical order comes from the vcf file and is usually ascending order by chromosome. Results in rowRanges() and assays()$counts are ordered the same as the ranges in the vcf file.

res <- summarizeVariants(grl, vcf, CodingVariants())

> res
class: RangedSummarizedExperiment
dim: 3 5
metadata(1): header
assays(1): counts
rownames(3): CGLM TP53 STK11
rowData names(0):
colnames(5): HG00096 HG00097 HG00099 HG00100 HG00101
colData names(1): Samples
> names(rowRanges(res))
[1] "CGLM"  "TP53"  "STK11"
> assays(res)$counts

        HG00096 HG00097 HG00099 HG00100 HG00101
  CGLM        0       0       0       0       0
  TP53        0       0       0       0       0
  STK11       0       0       1       0       0

If this doesn't answer your question please show a small piece of output from rowRanges() and assays()$counts so I can see how they are inconsistent.

Valerie