I have a .bed file that I am importing into R which converts it to a GRanges object:
GRanges object with 15693 ranges and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] Scaffold307346 1-165 + | Scaffold307346-1 1
[2] Scaffold107276 94-216 + | Scaffold107276-1 1
[3] Scaffold235249 1439844-1440040 + | Scaffold235249-1 1
[4] Scaffold103774 1-153 + | Scaffold103774-1 1
[5] Scaffold210168 43-176 + | Scaffold210168-1 1
... ... ... ... . ... ...
[15689] Scaffold86967 3265701-3265897 + | Scaffold86967-39 1
[15690] Scaffold99337 311003-311199 + | Scaffold99337-8 1
[15691] Scaffold86352 854221-854417 + | Scaffold86352-16 1
[15692] Scaffold92984 216-412 + | Scaffold92984-1 1
[15693] Scaffold96151 1582-1766 + | Scaffold96151-1 1
-------
seqinfo: 1849 sequences from an unspecified genome; no seqlengths
This file was created and aligned to X. laevis. It is possible to annotate the peaks? I tried running the following code:
E2F4_Frog <- annotatePeak(E2F4_Frog,
TxDb = xlaevis.db, annoDb = "org.Xl.eg.db",
tssRegion = c(-5000, 5000))
and got the following error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'transcriptsBy' for signature '"ChipDb"'
sessionInfo( ) R version 4.2.2 (2022-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] bedr_1.0.7 clusterProfiler_4.4.4
[3] ChIPseeker_1.32.1 fastqcr_0.1.2
[5] rtracklayer_1.56.1 org.Hs.eg.db_3.15.0
[7] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.20.2
[9] AnnotationFilter_1.20.0 xlaevis.db_3.2.3
[11] org.Xl.eg.db_3.15.0 TxDb.Hsapiens.UCSC.hg38.knownGene_3.15.0
[13] GenomicFeatures_1.48.4 AnnotationDbi_1.58.0
[15] Rsamtools_2.12.0 Biostrings_2.64.1
[17] XVector_0.36.0 ChIPQC_1.32.2
[19] BiocParallel_1.30.4 DiffBind_3.6.5
[21] SummarizedExperiment_1.26.1 Biobase_2.56.0
[23] MatrixGenerics_1.8.1 matrixStats_0.63.0
[25] GenomicRanges_1.48.0 GenomeInfoDb_1.32.4
[27] IRanges_2.30.1 S4Vectors_0.34.0
[29] BiocGenerics_0.42.0 ggplot2_3.4.0
[31] BiocManager_1.30.19
loaded via a namespace (and not attached):
[1] utf8_1.2.2 R.utils_2.12.2
[3] tidyselect_1.2.0 RSQLite_2.2.18
[5] htmlwidgets_1.6.1 grid_4.2.2
[7] scatterpie_0.1.8 munsell_0.5.0
[9] codetools_0.2-18 interp_1.1-3
[11] systemPipeR_2.2.2 withr_2.5.0
[13] colorspace_2.0-3 GOSemSim_2.22.0
[15] filelock_1.0.2 rstudioapi_0.14
[17] rJava_1.0-6 DOSE_3.22.1
[19] bbmle_1.0.25 GenomeInfoDbData_1.2.8
[21] mixsqp_0.3-48 hwriter_1.3.2.1
[23] polyclip_1.10-4 bit64_4.0.5
[25] farver_2.1.1 downloader_0.4
[27] treeio_1.20.2 coda_0.19-4
[29] vctrs_0.5.0 TxDb.Rnorvegicus.UCSC.rn4.ensGene_3.2.2
[31] generics_0.1.3 lambda.r_1.2.4
[33] BiocFileCache_2.4.0 R6_2.5.1
[35] apeglm_1.18.0 graphlayouts_0.8.4
[37] invgamma_1.1 RVenn_1.1.0
[39] locfit_1.5-9.7 gridGraphics_0.5-1
[41] bitops_1.0-7 cachem_1.0.6
[43] fgsea_1.22.0 DelayedArray_0.22.0
[45] assertthat_0.2.1 BiocIO_1.6.0
[47] scales_1.2.1 ggraph_2.1.0
[49] enrichplot_1.16.2 gtable_0.3.1
[51] tidygraph_1.2.2 xlsx_0.6.5
[53] rlang_1.0.6 splines_4.2.2
[55] lazyeval_0.2.2 yaml_2.3.6
[57] reshape2_1.4.4 TxDb.Dmelanogaster.UCSC.dm3.ensGene_3.2.2
[59] qvalue_2.28.0 tools_4.2.2
[61] ggplotify_0.1.0 ellipsis_0.3.2
[63] gplots_3.1.3 RColorBrewer_1.1-3
[65] Rcpp_1.0.9 plyr_1.8.8
[67] progress_1.2.2 zlibbioc_1.42.0
[69] purrr_1.0.1 RCurl_1.98-1.9
[71] prettyunits_1.1.1 deldir_1.0-6
[73] viridis_0.6.2 ashr_2.2-54
[75] chipseq_1.46.0 ggrepel_0.9.2
[77] magrittr_2.0.3 futile.options_1.0.1
[79] data.table_1.14.6 TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2
[81] DO.db_2.9 openxlsx_4.2.5.1
[83] truncnorm_1.0-8 mvtnorm_1.1-3
[85] SQUAREM_2021.1 amap_0.8-19
[87] ProtGenerics_1.28.0 TxDb.Mmusculus.UCSC.mm9.knownGene_3.2.2
[89] patchwork_1.1.2 hms_1.1.2
[91] xlsxjars_0.6.1 XML_3.99-0.13
[93] VennDiagram_1.7.3 emdbook_1.3.12
[95] jpeg_0.1-10 gridExtra_2.3
[97] testthat_3.1.6 compiler_4.2.2
[99] biomaRt_2.52.0 bdsmatrix_1.3-6
[101] tibble_3.1.8 shadowtext_0.1.2
[103] KernSmooth_2.23-20 crayon_1.5.2
[105] R.oo_1.25.0 htmltools_0.5.4
[107] ggfun_0.0.9 ggVennDiagram_1.2.2
[109] tidyr_1.2.1 aplot_0.1.9
[111] DBI_1.1.3 formatR_1.13
[113] tweenr_2.0.2 dbplyr_2.3.0
[115] MASS_7.3-58.1 rappdirs_0.3.3
[117] boot_1.3-28 ShortRead_1.54.0
[119] Matrix_1.5-3 brio_1.1.3
[121] cli_3.4.1 R.methodsS3_1.8.2
[123] parallel_4.2.2 igraph_1.3.5
[125] pkgconfig_2.0.3 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[127] GenomicAlignments_1.32.1 numDeriv_2016.8-1.1
[129] TxDb.Celegans.UCSC.ce6.ensGene_3.2.2 xml2_1.3.3
[131] ggtree_3.4.4 yulab.utils_0.0.6
[133] stringr_1.5.0 digest_0.6.31
[135] fastmatch_1.1-3 tidytree_0.4.2
[137] restfulr_0.0.15 GreyListChIP_1.28.1
[139] curl_5.0.0 gtools_3.9.4
[141] rjson_0.2.21 jsonlite_1.8.4
[143] nlme_3.1-160 lifecycle_1.0.3
[145] futile.logger_1.4.3 viridisLite_0.4.1
[147] limma_3.52.4 BSgenome_1.64.0
[149] fansi_1.0.3 pillar_1.8.1
[151] lattice_0.20-45 Nozzle.R1_1.1-1.1
[153] plotrix_3.8-2 KEGGREST_1.36.3
[155] fastmap_1.1.0 httr_1.4.4
[157] GO.db_3.15.0 glue_1.6.2
[159] zip_2.2.2 png_0.1-7
[161] bit_4.0.4 ggforce_0.4.1
[163] stringi_1.7.8 blob_1.2.3
[165] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 latticeExtra_0.6-30
[167] caTools_1.18.2 memoise_2.0.1
[169] dplyr_1.0.10 ape_5.6-2
[171] irlba_2.3.5.1
```
Thank you!
James - follow up question. I'm wondering if there is a way to avoid all of this. Ultimately I need to convert the genes where these peaks occur to human genes. Would it make more sense to somehow convert the genes in my .bed file to human genes (I'm not sure if this is possible with the bed file I currently have) and once it's converted to human, then annotate it? Would that be easier?
Ideallly you would use the same annotations throughout. I would tend not to use an annotation that is still calling things 'Scaffold' because that implies to me that it is a more preliminary version than the genome you can get from NCBI or UCSC.
In other words, you have a multi-step process. You first align all the reads to a genome, and then try to find regions where they pile up into peaks, usually with macs2, although you could use the Bioconductor csaw package as well. After you get the peaks, you then want to find out what genes those peaks are adjacent to, in order to say that a given peak infers that a particular gene is being affected.
To do this last step, you have to be able to match the peak location to the gene locations. If you have a GFF file based on the scaffolds you used originally, then you could use that. But if you don't have that, you would probably be better off to go back to square one and align the reads to either the UCSC or NCBI genome, because you will then have peak locations that can be located in reference to the gene locations for that genome.
Does that make sense?
It does...thank you. I'm grabbing this .bed file from a 2014 paper but I can get their .fastq file. Maybe it's better to get that and the USCS genome for x. laevis and run macs2 myself. If I do that will I have seqnames? Then I can easily make the TxDb using makeTxDbPackageFromUCSC(). Is that correct?
I really appreciate your help.
Oh, a 2014 paper? That's pretty old in this biz. If I were you, I'd get the FASTQ files and re-align using the genome from UCSC (you have to align to the genome before running macs2). Then you will have a bed file with genomic regions based on UCSC chromosomes, and the
TxDb
package you generate will then be useful forannotatePeak
.Thank you, thank you, thank you
I've gotten the xenLav2.fa file from USCS and indexed it. I aligned the .fastq file to it, then sorted and indexed the resulting .bam files. Finally I ran macs2 callpeaks and got my .narrowPeak file. Now I have to generate the TxDb file for xenopus laevis. I have:
I'm not sure what to put for tablename. When I do:
I get the following:
Any advice on which one to pick? I tried refGene and it runs (with some warnings) but how do I know if the results are correct? Do you think that's the correct one to use?
Once that runs I do the following:
but I'm not getting genes names...I'm getting the following error:
I really can't thank you enough for all the help. You've saved me hours and hours of trying to figure this out alone.
I believe you want the ncbiRefSeq track name.