Hello,
I am trying to annotate TP53 ChIP peaks with their nearest TSS and divide them into proximal and distal. For proximal, they should be within 2kb upstream to 500bp downstream, for distal they should be more than 2kb upstream to 50kb upstream.
I have tried two methods, but there are discrepancies between the results.
First method is, (1) annotate the peaks with annotatePeakInBatch function of ChIPpeakAnno, then filtering out the result based on my defined criteria of proximal and distal. Using this method I have got 9293 proximal peaks, and 7622 distal peaks. FYI, I had a total of 31000 peaks in my MACS outputs.
The code I have used is this,
gr1 <- toGRanges("HEPG2_P53_Binding_macs_output_peaks.narrowPeak", format = "narrowPeak", header = FALSE)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
anno <- toGRanges(txdb, format='gene')
annotation<- toGRanges(EnsDb.Hsapiens.v75, feature = "gene")
peak_annotation <- annotatePeakInBatch(gr1, anno, featureType = "TSS")
annotated_chippeaks <- annotatePeakInBatch(gr1, annotation, featureType = "TSS")
Using this code, I got output like this, feature seqnames start end width strand score signalValue pValue qValue peak start_position end_position feature_strand insideFeature distancetoFeature shortestDistance fromOverlappingOrNearest gene_name ENSG00000002822 chr7 2170055 2170212 158 106 8.74181 14.95189 10.63416 HepG2_TP53_Control_vs_Input_ChIPseq_peak_1300 1815793 2233243 - inside 63188 63031 NearestLocation MAD1L1 ENSG00000004487 chr1 23019302 23019460 159 76 3.90186 11.81663 7.60173 HepG2_TP53_Control_vs_Input_ChIPseq_peak_34 23019448 23083689 + overlapStart -146 12 NearestLocation KDM1A
Further, I filtered out the peaks based on distancetoFeature using Excel and got the corresponding distal and proximal peaks.
The second method is, I have annoPeaks function of ChIPpeakAnno and applied filter. Using this method, although the proximal peak number are almost consistent (+-10-15 peaks), but the distal peaks are very less. This is the code I have tried,
hepg2_treated <- toGRanges("HepG2-5FU_TP53_vs_Input_ChIPseq_peaks.narrowPeak", format = "narrowPeak")
#for proximal
anno_treated_prox <- annoPeaks(hepg2_ctreated, annoData = annoData,
output = "shortestDistance",
bindingRegion = c(-2000, 500))
hepg2_treated_anno <- annoPeaks(hepg2_ctreated, annoData = annoData,
output = "shortestDistance")
#for distal
distal_peaks <- subset(hepg2_treated_anno,
hepg2_treated_anno$distanceToSite > 2000 &
hepg2_treated_anno$distanceToSite <= 50000 &
hepg2_treated_anno$insideFeature %in% c("upstream"))
Using this code, I am getting around 9444 proximal peaks, and 2678 distal peaks.
Could you please let me know, which one seems more accurate?
sessionInfo( )
```
sessionInfo() R version 4.0.2 (2020-06-22) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=Finnish_Finland.1252 LC_CTYPE=Finnish_Finland.1252
[3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C
[5] LC_TIME=Finnish_Finland.1252
attached base packages:
[1] grid stats4 parallel stats graphics grDevices utils datasets methods
[10] base
other attached packages:
[1] org.Hs.eg.db_3.11.4 openxlsx_4.2.3
[3] genomation_1.20.0 rtracklayer_1.48.0
[5] TxDb.Hsapiens.UCSC.hg38.knownGene_3.10.0 EnsDb.Hsapiens.v86_2.99.0
[7] ensembldb_2.12.1 AnnotationFilter_1.12.0
[9] GenomicFeatures_1.40.1 AnnotationDbi_1.50.3
[11] Biobase_2.48.0 ChIPpeakAnno_3.22.4
[13] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2
[15] Biostrings_2.56.0 XVector_0.28.0
[17] IRanges_2.22.2 S4Vectors_0.26.1
[19] BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] ProtGenerics_1.20.0 bitops_1.0-6 matrixStats_0.58.0
[4] bit64_4.0.5 progress_1.2.2 httr_1.4.2
[7] tools_4.0.2 utf8_1.1.4 R6_2.5.0
[10] KernSmooth_2.23-17 colorspace_2.0-0 DBI_1.1.1
[13] lazyeval_0.2.2 seqPattern_1.20.0 ade4_1.7-19
[16] tidyselect_1.1.0 prettyunits_1.1.1 bit_4.0.4
[19] curl_4.3 compiler_4.0.2 VennDiagram_1.6.20
[22] graph_1.66.0 formatR_1.8 xml2_1.3.2
[25] DelayedArray_0.14.1 scales_1.1.1 readr_1.4.0
[28] RBGL_1.64.0 askpass_1.1 rappdirs_0.3.3
[31] stringr_1.4.0 Rsamtools_2.4.0 pkgconfig_2.0.3
[34] plotrix_3.8-4 dbplyr_2.1.1 fastmap_1.1.0
[37] limma_3.44.3 BSgenome_1.56.0 regioneR_1.20.1
[40] rlang_0.4.10 impute_1.62.0 rstudioapi_0.13
[43] RSQLite_2.2.4 generics_0.1.0 BiocParallel_1.22.0
[46] zip_2.1.1 dplyr_1.0.5 RCurl_1.98-1.2
[49] magrittr_2.0.1 GO.db_3.11.4 GenomeInfoDbData_1.2.3
[52] futile.logger_1.4.3 Matrix_1.4-1 munsell_0.5.0
[55] Rcpp_1.0.6 fansi_0.4.2 lifecycle_1.0.0
[58] stringi_1.5.3 MASS_7.3-51.6 SummarizedExperiment_1.18.2
[61] zlibbioc_1.34.0 plyr_1.8.6 BiocFileCache_1.12.1
[64] blob_1.2.1 crayon_1.4.1 lattice_0.20-41
[67] splines_4.0.2 multtest_2.44.0 hms_1.0.0
[70] pillar_1.5.1 seqinr_4.2-8 reshape2_1.4.4
[73] biomaRt_2.44.4 futile.options_1.0.1 XML_3.99-0.5
[76] glue_1.4.2 data.table_1.14.0 lambda.r_1.2.4
[79] BiocManager_1.30.10 idr_1.2 vctrs_0.3.6
[82] gtable_0.3.0 openssl_1.4.3 purrr_0.3.4
[85] assertthat_0.2.1 ggplot2_3.3.3 cachem_1.0.4
[88] gridBase_0.4-7 survival_3.1-12 tibble_3.1.0
[91] GenomicAlignments_1.24.0 memoise_2.0.0 ellipsis_0.3.1