Question

Restricting seqlevels to chr1-chr22, chrX, chrY in TxDb from UCSC returns more genes than without restriction

1

Entering edit mode

kajetan.bentele ▴ 10

@kajetanbentele-9724

Last seen 8.2 years ago

Hi,

I create a TxDb from UCSC for the genome hg19 and then retrieve all genes via 'genes()'. In case I restrict the seqlevels to chr1-chr22, chrX, chrY, more genes are returned for these chromosomes, which clearly should not be the case. See code below for reproducibility. Thanks a lot for any help sorting this out.

> library(GenomicFeatures)

> TxDb.refseq.hg19 <- makeTxDbFromUCSC(genome="hg19", tablename="refGene")

> my.chr <- c(paste0('chr', seq(1,22)), 'chrX', 'chrY')

> genes.all_contigs <- genes(TxDb.refseq.hg19)
> table.genes.all_contigs <- table(seqnames(genes.all_contigs))

> seqlevels(TxDb.refseq.hg19) <- my.chr
> genes.chr <- genes(TxDb.refseq.hg19)
> table.genes.chr <- table(seqnames(genes.chr))

> table.genes.all_contigs[my.chr] - table.genes.chr[my.chr]

  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
   -1     0    -4    -6     0  -229     0     0     0     0     0    -3     0     0     0     0   -23     0    -1    -1    -3     0     0     0

> sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3    Biobase_2.30.0          GenomicRanges_1.22.4    GenomeInfoDb_1.6.3     
[6] IRanges_2.4.7           S4Vectors_0.8.11        BiocGenerics_0.16.1    

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             zlibbioc_1.16.0            GenomicAlignments_1.6.3    BiocParallel_1.4.3         tools_3.2.3               
 [6] SummarizedExperiment_1.0.2 DBI_0.3.1                  lambda.r_1.1.7             futile.logger_1.4.1        rtracklayer_1.30.2        
[11] futile.options_1.0.0       bitops_1.0-6               RCurl_1.95-4.7             biomaRt_2.26.1             RSQLite_1.0.0             
[16] Biostrings_2.38.4          Rsamtools_1.22.0           XML_3.98-1.3

GenomicFeatures • 1.7k views

ADD COMMENT • link updated 8.2 years ago by Martin Morgan 25k • written 8.2 years ago by kajetan.bentele ▴ 10

score 3 · Accepted Answer · 2016-02-19

Thanks for giving a nice reproducible example to work from.

It looks to me that this has something to do with the placement of ambiguous features i.e. things that could be assigned to either one of the main chromosomes or an unplaced contig. When all the contigs are present genes() doesn't know where to assign a gene and it is dropped. However when you restrict the seq levels there's only one choice and so the gene is assigned there. This explains the huge jump on chromosome 6, where the HLA region has multiple haplotype contigs included in the reference, so there's lots of ambiguity.

The example below demonstrates a single instance of this behaviour. It turned out a bit longer than I expected, so if you just want the answer set the argument single.strand.genes.only = FALSE in your call to genes() and this behaviour will change.

Working from your example lets restrict our two gene sets to only chr21, where the difference is three genes:

chr21.a <- genes.all_contigs[seqnames(genes.all_contigs) == "chr21"]
chr21.b <- genes.chr[seqnames(genes.chr) == "chr21"]

We can find the EntrezID of the three genes by asking what's different in the lists.

newgenes <- chr21.b[ !mcols(chr21.b)[,"gene_id"] %in% mcols(chr21.a)[,"gene_id"] ]

We'll now work with just one example gene.

example_id <- mcols(newgenes)[1,'gene_id']

> example_id
[1] "100423008"

If we restrict ourselves to chr21 we get the gene assigned here

seqlevels(TxDb.refseq.hg19) <- "chr21"
genes(TxDb.refseq.hg19.chr21, vals = list("gene_id" = example_id))

GRanges object with 1 range and 1 metadata column:
            seqnames               ranges strand |     gene_id
               <Rle>            <IRanges>  <Rle> | <character>
  100423008    chr21 [15017096, 15017171]      - |   100423008
  -------
  seqinfo: 1 sequence from hg19 genome

Likewise if we limit ourselves to chrUn_gl000213 it is placed here

seqlevels(TxDb.refseq.hg19) <- "chrUn_gl000213"
genes(TxDb.refseq.hg19.chrUn, vals = list("gene_id" = example_id))

GRanges object with 1 range and 1 metadata column:
                  seqnames               ranges strand |     gene_id
                     <Rle>            <IRanges>  <Rle> | <character>
  100423008 chrUn_gl000213 [15017096, 15017171]      - |   100423008
  -------
  seqinfo: 1 sequence from hg19 genome

If we set the levels to both the gene goes missing again

seqlevels(TxDb.refseq.hg19) <- c("chr21", "chrUn_gl000213")
genes(TxDb.refseq.hg19.chrUn, vals = list("gene_id" = example_id))

GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     gene_id
      <Rle> <IRanges>  <Rle> | <character>
  -------
  seqinfo: 2 sequences from hg19 genome

If you want to see all the possibily assignments, you need to set the argument single.strand.genes.only = FALSE in your call to genes(). Now we get both returned:

genes(TxDb.refseq.hg19.chrUn, vals = list("gene_id"=example_id), single.strand.genes.only = FALSE)

GRangesList object of length 1:
$100423008 
GRanges object with 2 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
  [1]          chr21 [15017096, 15017171]      -
  [2] chrUn_gl000213 [  104742,   104817]      +

-------
seqinfo: 2 sequences from hg19 genome