Is it possible to annotate a GRanges Object that has scaffolding names and not chromosome numbers?
Entering edit mode
Last seen 11 hours ago
United States

I have a .bed file that I am importing into R which converts it to a GRanges object:

GRanges object with 15693 ranges and 2 metadata columns:
                seqnames          ranges strand |             name     score
                   <Rle>       <IRanges>  <Rle> |      <character> <numeric>
      [1] Scaffold307346           1-165      + | Scaffold307346-1         1
      [2] Scaffold107276          94-216      + | Scaffold107276-1         1
      [3] Scaffold235249 1439844-1440040      + | Scaffold235249-1         1
      [4] Scaffold103774           1-153      + | Scaffold103774-1         1
      [5] Scaffold210168          43-176      + | Scaffold210168-1         1
      ...            ...             ...    ... .              ...       ...
  [15689]  Scaffold86967 3265701-3265897      + | Scaffold86967-39         1
  [15690]  Scaffold99337   311003-311199      + |  Scaffold99337-8         1
  [15691]  Scaffold86352   854221-854417      + | Scaffold86352-16         1
  [15692]  Scaffold92984         216-412      + |  Scaffold92984-1         1
  [15693]  Scaffold96151       1582-1766      + |  Scaffold96151-1         1
  seqinfo: 1849 sequences from an unspecified genome; no seqlengths

This file was created and aligned to X. laevis. It is possible to annotate the peaks? I tried running the following code:

E2F4_Frog <- annotatePeak(E2F4_Frog, 
                          TxDb = xlaevis.db, annoDb = "",
                          tssRegion = c(-5000, 5000))

and got the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'transcriptsBy' for signature '"ChipDb"'

sessionInfo( ) R version 4.2.2 (2022-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] bedr_1.0.7 clusterProfiler_4.4.4
[3] ChIPseeker_1.32.1 fastqcr_0.1.2
[5] rtracklayer_1.56.1
[7] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.20.2
[9] AnnotationFilter_1.20.0 xlaevis.db_3.2.3
[11] TxDb.Hsapiens.UCSC.hg38.knownGene_3.15.0 [13] GenomicFeatures_1.48.4 AnnotationDbi_1.58.0
[15] Rsamtools_2.12.0 Biostrings_2.64.1
[17] XVector_0.36.0 ChIPQC_1.32.2
[19] BiocParallel_1.30.4 DiffBind_3.6.5
[21] SummarizedExperiment_1.26.1 Biobase_2.56.0
[23] MatrixGenerics_1.8.1 matrixStats_0.63.0
[25] GenomicRanges_1.48.0 GenomeInfoDb_1.32.4
[27] IRanges_2.30.1 S4Vectors_0.34.0
[29] BiocGenerics_0.42.0 ggplot2_3.4.0
[31] BiocManager_1.30.19

loaded via a namespace (and not attached): [1] utf8_1.2.2 R.utils_2.12.2
[3] tidyselect_1.2.0 RSQLite_2.2.18
[5] htmlwidgets_1.6.1 grid_4.2.2
[7] scatterpie_0.1.8 munsell_0.5.0
[9] codetools_0.2-18 interp_1.1-3
[11] systemPipeR_2.2.2 withr_2.5.0
[13] colorspace_2.0-3 GOSemSim_2.22.0
[15] filelock_1.0.2 rstudioapi_0.14
[17] rJava_1.0-6 DOSE_3.22.1
[19] bbmle_1.0.25 GenomeInfoDbData_1.2.8
[21] mixsqp_0.3-48 hwriter_1.3.2.1
[23] polyclip_1.10-4 bit64_4.0.5
[25] farver_2.1.1 downloader_0.4
[27] treeio_1.20.2 coda_0.19-4
[29] vctrs_0.5.0 TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2
[31] generics_0.1.3 lambda.r_1.2.4
[33] BiocFileCache_2.4.0 R6_2.5.1
[35] apeglm_1.18.0 graphlayouts_0.8.4
[37] invgamma_1.1 RVenn_1.1.0
[39] locfit_1.5-9.7 gridGraphics_0.5-1
[41] bitops_1.0-7 cachem_1.0.6
[43] fgsea_1.22.0 DelayedArray_0.22.0
[45] assertthat_0.2.1 BiocIO_1.6.0
[47] scales_1.2.1 ggraph_2.1.0
[49] enrichplot_1.16.2 gtable_0.3.1
[51] tidygraph_1.2.2 xlsx_0.6.5
[53] rlang_1.0.6 splines_4.2.2
[55] lazyeval_0.2.2 yaml_2.3.6
[57] reshape2_1.4.4 TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2 [59] qvalue_2.28.0 tools_4.2.2
[61] ggplotify_0.1.0 ellipsis_0.3.2
[63] gplots_3.1.3 RColorBrewer_1.1-3
[65] Rcpp_1.0.9 plyr_1.8.8
[67] progress_1.2.2 zlibbioc_1.42.0
[69] purrr_1.0.1 RCurl_1.98-1.9
[71] prettyunits_1.1.1 deldir_1.0-6
[73] viridis_0.6.2 ashr_2.2-54
[75] chipseq_1.46.0 ggrepel_0.9.2
[77] magrittr_2.0.3 futile.options_1.0.1
[79] data.table_1.14.6 TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2
[81] DO.db_2.9 openxlsx_4.2.5.1
[83] truncnorm_1.0-8 mvtnorm_1.1-3
[85] SQUAREM_2021.1 amap_0.8-19
[87] ProtGenerics_1.28.0 TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
[89] patchwork_1.1.2 hms_1.1.2
[91] xlsxjars_0.6.1 XML_3.99-0.13
[93] VennDiagram_1.7.3 emdbook_1.3.12
[95] jpeg_0.1-10 gridExtra_2.3
[97] testthat_3.1.6 compiler_4.2.2
[99] biomaRt_2.52.0 bdsmatrix_1.3-6
[101] tibble_3.1.8 shadowtext_0.1.2
[103] KernSmooth_2.23-20 crayon_1.5.2
[105] R.oo_1.25.0 htmltools_0.5.4
[107] ggfun_0.0.9 ggVennDiagram_1.2.2
[109] tidyr_1.2.1 aplot_0.1.9
[111] DBI_1.1.3 formatR_1.13
[113] tweenr_2.0.2 dbplyr_2.3.0
[115] MASS_7.3-58.1 rappdirs_0.3.3
[117] boot_1.3-28 ShortRead_1.54.0
[119] Matrix_1.5-3 brio_1.1.3
[121] cli_3.4.1 R.methodsS3_1.8.2
[123] parallel_4.2.2 igraph_1.3.5
[125] pkgconfig_2.0.3 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[127] GenomicAlignments_1.32.1 numDeriv_2016.8-1.1
[129] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2 xml2_1.3.3
[131] ggtree_3.4.4 yulab.utils_0.0.6
[133] stringr_1.5.0 digest_0.6.31
[135] fastmatch_1.1-3 tidytree_0.4.2
[137] restfulr_0.0.15 GreyListChIP_1.28.1
[139] curl_5.0.0 gtools_3.9.4
[141] rjson_0.2.21 jsonlite_1.8.4
[143] nlme_3.1-160 lifecycle_1.0.3
[145] futile.logger_1.4.3 viridisLite_0.4.1
[147] limma_3.52.4 BSgenome_1.64.0
[149] fansi_1.0.3 pillar_1.8.1
[151] lattice_0.20-45 Nozzle.R1_1.1-1.1
[153] plotrix_3.8-2 KEGGREST_1.36.3
[155] fastmap_1.1.0 httr_1.4.4
[157] GO.db_3.15.0 glue_1.6.2
[159] zip_2.2.2 png_0.1-7
[161] bit_4.0.4 ggforce_0.4.1
[163] stringi_1.7.8 blob_1.2.3
[165] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 latticeExtra_0.6-30
[167] caTools_1.18.2 memoise_2.0.1
[169] dplyr_1.0.10 ape_5.6-2
[171] irlba_2.3.5.1


Annotation • 212 views
Entering edit mode
Last seen 8 hours ago
United States

The error you are getting is due to the fact that the 'TxDb' argument is pointing to a package intended for the annotation of the Affymetrix X. laevis array, rather than an actual TxDb object. It's easy enough to make the required TxDb object, except your bed file doesn't appear to have the expected seqnames (basically chromosome names). If you have a GFF file for X. laevis that has those same chromosome names, you could use makeTxDbFromGFF to make a TxDb object. I did that with the NCBI GFF:

> txdb <- makeTxDbFromGFF("")
Import genomic features from the file as a GRanges object ... trying URL ''
Content type 'application/x-gzip' length 28339040 bytes (27.0 MB)
downloaded 27.0 MB

Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  some transcripts have no "transcript_id" attribute ==> their name
  ("tx_name" column in the TxDb object) was set to NA
2: In .extract_transcripts_from_GRanges(tx_IDX, gr, mcols0$type, mcols0$ID,  :
  the transcript names ("tx_name" column in the TxDb object) imported
  from the "transcript_id" attribute are not unique
3: In .find_exon_cds(exons, cds) :
  The following transcripts have exons that contain more than one CDS
  (only the first CDS was kept for each exon): rna-NM_001086576.1,
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source:
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# Nb of transcripts: 85570
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2023-01-17 14:49:47 -0500 (Tue, 17 Jan 2023)
# GenomicFeatures version at creation time: 1.50.3
# RSQLite version at creation time: 2.2.19
> genes(txdb)
GRanges object with 44454 ranges and 1 metadata column:
              seqnames              ranges strand |     gene_id
                 <Rle>           <IRanges>  <Rle> | <character>
    1a11.L NC_054372.1 138156254-138160657      + |      1a11.L
  42sp43.L NC_054387.1     2482521-2487966      - |    42sp43.L
  42sp50.L NC_054381.1 153195418-153201763      + |    42sp50.L
      ATP6 NC_001573.1         10031-10711      + |        ATP6
      ATP8 NC_001573.1          9873-10040      + |        ATP8
       ...         ...                 ...    ... .         ...
  zyg11b.S NC_054378.1   76962468-76986657      + |    zyg11b.S
     zyx.S NC_054384.1       183650-192727      + |       zyx.S
   zzef1.S NC_054374.1   92745376-92838553      + |     zzef1.S
    zzz3.L NC_054377.1   75089720-75125144      + |      zzz3.L
    zzz3.S NC_054378.1   61210057-61260301      + |      zzz3.S
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

But as you can see, the seqnames are NCBI IDs. UCSC has these data as well, and you could use makeTxDbPackageFromUCSC, but they appear to have actual chromosome seqnames.

Entering edit mode

Thank you!

Entering edit mode

James - follow up question. I'm wondering if there is a way to avoid all of this. Ultimately I need to convert the genes where these peaks occur to human genes. Would it make more sense to somehow convert the genes in my .bed file to human genes (I'm not sure if this is possible with the bed file I currently have) and once it's converted to human, then annotate it? Would that be easier?

Entering edit mode

Ideallly you would use the same annotations throughout. I would tend not to use an annotation that is still calling things 'Scaffold' because that implies to me that it is a more preliminary version than the genome you can get from NCBI or UCSC.

In other words, you have a multi-step process. You first align all the reads to a genome, and then try to find regions where they pile up into peaks, usually with macs2, although you could use the Bioconductor csaw package as well. After you get the peaks, you then want to find out what genes those peaks are adjacent to, in order to say that a given peak infers that a particular gene is being affected.

To do this last step, you have to be able to match the peak location to the gene locations. If you have a GFF file based on the scaffolds you used originally, then you could use that. But if you don't have that, you would probably be better off to go back to square one and align the reads to either the UCSC or NCBI genome, because you will then have peak locations that can be located in reference to the gene locations for that genome.

Does that make sense?

Entering edit mode

It does...thank you. I'm grabbing this .bed file from a 2014 paper but I can get their .fastq file. Maybe it's better to get that and the USCS genome for x. laevis and run macs2 myself. If I do that will I have seqnames? Then I can easily make the TxDb using makeTxDbPackageFromUCSC(). Is that correct?

I really appreciate your help.

Entering edit mode

Oh, a 2014 paper? That's pretty old in this biz. If I were you, I'd get the FASTQ files and re-align using the genome from UCSC (you have to align to the genome before running macs2). Then you will have a bed file with genomic regions based on UCSC chromosomes, and the TxDb package you generate will then be useful for annotatePeak.

Entering edit mode

Thank you, thank you, thank you

Entering edit mode

I've gotten the xenLav2.fa file from USCS and indexed it. I aligned the .fastq file to it, then sorted and indexed the resulting .bam files. Finally I ran macs2 callpeaks and got my .narrowPeak file. Now I have to generate the TxDb file for xenopus laevis. I have:

TxDb.xlaevis.USCS <- makeTxDbPackageFromUCSC(version="0.01", 
                                             maintainer = "my email",
                                             author = "my email",

I'm not sure what to put for tablename. When I do:


I get the following:

            tablename         track           subtrack
1              refGene   NCBI RefSeq               <NA>
2          xenoRefGene  Other RefSeq               <NA>
3              genscan Genscan Genes               <NA>
4         augustusGene      AUGUSTUS               <NA>
5           ncbiRefSeq   NCBI RefSeq         RefSeq All
6    ncbiRefSeqCurated   NCBI RefSeq     RefSeq Curated
7  ncbiRefSeqPredicted   NCBI RefSeq   RefSeq Predicted
8      ncbiRefSeqOther   NCBI RefSeq       RefSeq Other
9     ncbiRefSeqSelect   NCBI RefSeq RefSeq Select+MANE
10      ncbiRefSeqHgmd   NCBI RefSeq        RefSeq HGMD
11 ncbiRefSeqPredicted   NCBI RefSeq   RefSeq Predicted

Any advice on which one to pick? I tried refGene and it runs (with some warnings) but how do I know if the results are correct? Do you think that's the correct one to use?

Once that runs I do the following:

E2F4_Frog_peak <- annotatePeak(E2F4_Frog_peak, TxDb = TxDb.xlaevis.USCS, annoDb = "", tssRegion = c(-5000, 5000))

but I'm not getting genes names...I'm getting the following error:

In getGeneAnno(annoDb,$geneId, type) :
  ID type not matched, gene annotation will not be added...

I really can't thank you enough for all the help. You've saved me hours and hours of trying to figure this out alone.

Entering edit mode

I believe you want the ncbiRefSeq track name.

Login before adding your answer.

Traffic: 274 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6