Hello, I want to find gene on chromosome 6 using TxDb.Hsapiens.UCSC.hg19.knownGene package, and I use two different methods, but the genes found were not identical.
Following was the code:
load the library
library(TxDb.Hsapiens.UCSC.hg19.knownGene) TxDb.Hsapiens.UCSC.hg19.knownGene->txdb
and find gens at chromosome 6
transcriptsBy(txdb,by='gene')->ts ts[seqnames(ts)=='chr6']->onchr6 onchr6[elementNROWS(onchr6)>0]->onchr6 length(onchr6) ####1207 genes on chromosome 6 sum(duplicated(names(onchr6)))
And then I use seqlevels command to limit the txdb to the chromosome 6.
seqlevels(txdb)<-'chr6' transcriptsBy(txdb,by='gene')->ts2 runValue(unlist(seqnames(ts2))) ###on all chromosome 6
and I find genes found by the above two methods were not fully identical.
setdiff(names(onchr6),names(ts2))->missed length(missed) #####207 genes were missed ###for one example onchr6[["81696"]] ts2[["81696"]]
Do anyone point out any wrong or missing in my code? or this is a bug of the package?
> sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 locale: [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 LC_CTYPE=Chinese (Simplified)_People's Republic of China.936 [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936 LC_NUMERIC=C [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.24.5 AnnotationDbi_1.34.4 Biobase_2.32.0 [5] GenomicRanges_1.24.2 GenomeInfoDb_1.8.3 IRanges_2.6.1 S4Vectors_0.10.2 [9] BiocGenerics_0.18.0 loaded via a namespace (and not attached): [1] XML_3.98-1.4 Rsamtools_1.24.0 Biostrings_2.40.2 GenomicAlignments_1.8.4 bitops_1.0-6 DBI_0.4-1 [7] RSQLite_1.0.0 zlibbioc_1.18.0 XVector_0.12.0 BiocParallel_1.6.3 tools_3.3.1 biomaRt_2.28.0 [13] RCurl_1.95-4.8 rtracklayer_1.32.2 SummarizedExperiment_1.2.3
And indeed I found the 1000 gene found by both methods were located at chromosome 6
while the 207 genes missed by the 2nd methods were located at
chr6_apd_hap1 ,chr6_cox_hap2 ,chr6_dbb_hap3 ,
chr6_mann_hap4 ,chr6_mcf_hap5 ,chr6_qbl_hap6 ,chr6_ssto_hap7
besides chr6 (all 207 genes on chrormosome 6 and subsets of the 207 genes located at the chr6_XX_hapX )
And further problems might be occur:
we want to find some overlaps with particular genomic fragement
load the library
the fragment
and we want found genes overlapping with the above fragment:
and using seqlevel to limit to chromsomome 6 and then find overlaps:
Although we know that genes on chromosome 6 filtered by this methods were less than the real number of genes (1000 vs 1207 genes pointed as above explanations ). We still use these 1000 genes to get the overlaps, and further mistakes occured!:
only 2 genes (compared with the 164 genes detected above) were found to overlap the fragment. And this might be an obvious impossibility.
Any suggestions?