Search
Question: Restricting seqlevels to chr1-chr22, chrX, chrY in TxDb from UCSC returns more genes than without restriction
1
gravatar for kajetan.bentele
21 months ago by
kajetan.bentele10 wrote:

Hi,

I create a TxDb from UCSC for the genome hg19 and then retrieve all genes via 'genes()'. In case I restrict the seqlevels to chr1-chr22, chrX, chrY, more genes are returned for these chromosomes, which clearly should not be the case.  See code below for reproducibility. Thanks a lot for any help sorting this out.

> library(GenomicFeatures)

> TxDb.refseq.hg19 <- makeTxDbFromUCSC(genome="hg19", tablename="refGene")

> my.chr <- c(paste0('chr', seq(1,22)), 'chrX', 'chrY')

> genes.all_contigs <- genes(TxDb.refseq.hg19)
> table.genes.all_contigs <- table(seqnames(genes.all_contigs))

> seqlevels(TxDb.refseq.hg19) <- my.chr
> genes.chr <- genes(TxDb.refseq.hg19)
> table.genes.chr <- table(seqnames(genes.chr))

> table.genes.all_contigs[my.chr] - table.genes.chr[my.chr]

  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY
   -1     0    -4    -6     0  -229     0     0     0     0     0    -3     0     0     0     0   -23     0    -1    -1    -3     0     0     0

> sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicFeatures_1.22.13 AnnotationDbi_1.32.3    Biobase_2.30.0          GenomicRanges_1.22.4    GenomeInfoDb_1.6.3     
[6] IRanges_2.4.7           S4Vectors_0.8.11        BiocGenerics_0.16.1    

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             zlibbioc_1.16.0            GenomicAlignments_1.6.3    BiocParallel_1.4.3         tools_3.2.3               
 [6] SummarizedExperiment_1.0.2 DBI_0.3.1                  lambda.r_1.1.7             futile.logger_1.4.1        rtracklayer_1.30.2        
[11] futile.options_1.0.0       bitops_1.0-6               RCurl_1.95-4.7             biomaRt_2.26.1             RSQLite_1.0.0             
[16] Biostrings_2.38.4          Rsamtools_1.22.0           XML_3.98-1.3             
ADD COMMENTlink modified 21 months ago by Martin Morgan ♦♦ 20k • written 21 months ago by kajetan.bentele10
3
gravatar for Mike Smith
21 months ago by
Mike Smith2.1k
EMBL Heidelberg / de.NBI
Mike Smith2.1k wrote:

Thanks for giving a nice reproducible example to work from.

It looks to me that this has something to do with the placement of ambiguous features i.e. things that could be assigned to either one of the main chromosomes or an unplaced contig.  When all the contigs are present genes() doesn't know where to assign a gene and it is dropped.  However when you restrict the seq levels there's only one choice and so the gene is assigned there.  This explains the huge jump on chromosome 6, where the HLA region has multiple haplotype contigs included in the reference, so there's lots of ambiguity.  

The example below demonstrates a single instance of this behaviour.  It turned out a bit longer than I expected, so if you just want the answer set the argument single.strand.genes.only = FALSE in your call to genes() and this behaviour will change.

Working from your example lets restrict our two gene sets to only chr21, where the difference is three genes:

chr21.a <- genes.all_contigs[seqnames(genes.all_contigs) == "chr21"]
chr21.b <- genes.chr[seqnames(genes.chr) == "chr21"]

We can find the EntrezID of the three genes by asking what's different in the lists. 

newgenes <- chr21.b[ !mcols(chr21.b)[,"gene_id"] %in% mcols(chr21.a)[,"gene_id"] ]

We'll now work with just one example gene.

example_id <- mcols(newgenes)[1,'gene_id']
> example_id
[1] "100423008"

If we restrict ourselves to chr21 we get the gene assigned here

seqlevels(TxDb.refseq.hg19) <- "chr21"
genes(TxDb.refseq.hg19.chr21, vals = list("gene_id" = example_id))
GRanges object with 1 range and 1 metadata column:
            seqnames               ranges strand |     gene_id
               <Rle>            <IRanges>  <Rle> | <character>
  100423008    chr21 [15017096, 15017171]      - |   100423008
  -------
  seqinfo: 1 sequence from hg19 genome

Likewise if we limit ourselves to chrUn_gl000213 it is placed here

seqlevels(TxDb.refseq.hg19) <- "chrUn_gl000213"
genes(TxDb.refseq.hg19.chrUn, vals = list("gene_id" = example_id))
GRanges object with 1 range and 1 metadata column:
                  seqnames               ranges strand |     gene_id
                     <Rle>            <IRanges>  <Rle> | <character>
  100423008 chrUn_gl000213 [15017096, 15017171]      - |   100423008
  -------
  seqinfo: 1 sequence from hg19 genome

If we set the levels to both the gene goes missing again

seqlevels(TxDb.refseq.hg19) <- c("chr21", "chrUn_gl000213")
genes(TxDb.refseq.hg19.chrUn, vals = list("gene_id" = example_id))
GRanges object with 0 ranges and 1 metadata column:
   seqnames    ranges strand |     gene_id
      <Rle> <IRanges>  <Rle> | <character>
  -------
  seqinfo: 2 sequences from hg19 genome

If you want to see all the possibily assignments, you need to set the argument single.strand.genes.only = FALSE in your call to genes(). Now we get both returned:

genes(TxDb.refseq.hg19.chrUn, vals = list("gene_id"=example_id), single.strand.genes.only = FALSE)
GRangesList object of length 1:
$100423008 
GRanges object with 2 ranges and 0 metadata columns:
            seqnames               ranges strand
               <Rle>            <IRanges>  <Rle>
  [1]          chr21 [15017096, 15017171]      -
  [2] chrUn_gl000213 [  104742,   104817]      +

-------
seqinfo: 2 sequences from hg19 genome
ADD COMMENTlink modified 21 months ago • written 21 months ago by Mike Smith2.1k

Hi Mike,

thanks a lot for this great and detailed answer!

Best, Kajetan


 

ADD REPLYlink written 21 months ago by kajetan.bentele10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 155 users visited in the last hour