Hello Matthew,
Thanks for the detailed description-this sounds like a classic challenge in sncRNA analysis, where piRNAs (and other tsRNAs/rsRNAs) often evade standard Ensembl/miRBase annotations due to their derivation from repetitive or non-canonical sources. SPORTS1.1 is a solid choice for initial quantification (it's great for rRNA/tRNA-derived fragments), but its reliance on RNACentral can indeed leave gaps, especially for piRNA clusters. Since you're already pulling some Ensembl IDs via biomaRt and using miRanda for targets, you're on the right track; the "missing" piece is likely just better integration of piRNA-specific resources and downstream functional mapping. I'll outline some targeted guidance below, focusing on R/Bioconductor where possible to keep it streamlined.
1. Enhancing Annotation of Your Sequences
piRNAs aren't always "gene-like" (many arise from transposon clusters), so biomaRt hits will be sparse. Layer in piRNA-specific databases to enrich your RNACentral IDs:
Use piRNABank or piRNA-Quest for lookup: These are piRNA-focused repos. You can fetch mature piRNA sequences programmatically.
- In R, use
AnnotationHub (Bioconductor) to pull related data, or script a simple fetch from piRNABank's FASTA (e.g., via Biostrings).
Recommended Bioconductor workflow with Seqpac: This package is tailored for sRNA-seq (including piRNAs) and handles poor annotations by integrating multiple sources. It can re-annotate your SPORTS output counts against RNACentral + piRNA banks.
# Install if needed
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("seqpac")
# Load your SPORTS count matrix (assume 'counts' is a matrix with rows=sequences, cols=samples)
library(seqpac)
# Create a count object (adapt to your FASTQ/BAM-derived seqs)
cnt_obj <- makeCountObject(counts = counts,
design = your_design_df, # e.g., cols: sample, condition
mrvi = FALSE) # Skip miRNA-specific if focusing on piRNAs
# Annotate with multi-source pipeline (includes RNACentral, piRNABank)
cnt_obj <- annotate(cnt_obj,
genome = "hg38", # Or your ref
pattern = "piRNA", # Filter for piRNA-like
db_source = c("rnacentral", "pirnabank"))
# Extract enriched annotations
annotated_seqs <- getAnnotations(cnt_obj)
head(annotated_seqs) # Now with piRNA cluster info, origins
This should flag more piRNA-like sequences (e.g., 24-31nt, ping-pong signatures if your BAMs show multimappers). For multimappers in your BAMs, Seqpac can re-align and assign fractional counts to clusters-similar to how you'd resolve ChIP peaks.
Alternative: WIND integration: Since you mentioned prior WIND use, its output (piRNA clusters from genomic loci) could be merged here. Align your SPORTS seqs back to WIND clusters using GenomicAlignments on your BAMs.
If many seqs still lack hits, they're likely novel/de novo piRNAs-treat them as such and focus on genomic context (see below).
2. Target Prediction Beyond miRanda
miRanda is fine for seed-based prediction (it'll work on your 20-150nt seqs), but piRNAs often act via imperfect base-pairing or cluster-specific mechanisms, not just 6-8mer seeds. To avoid over-prediction on short/high-quality seqs:
piRNA-specific predictors: Use pirScan (web tool, but scriptable) or piRTA (command-line) for targets. These score piRNA-mRNA pairs better than miRanda for transposon silencing.
- In R: Wrap via
system() calls, or use multiMiR (Bioconductor) which includes some piRNA targets from databases like piRTarBase.
BiocManager::install("multiMiR")
library(multiMiR)
# Assume 'pirna_seqs' is your vector of sequences
targets <- getMirnaTargets(mirna = pirna_seqs, # Or RNACentral IDs
species = "hsa",
method = c("mirdip", "diana", "targetscan"), # Extend with piRNA dbs if available
site.type = "all")
# Filter high-confidence (score > 0.8)
high_conf <- targets[targets$score >= 0.8, ]
Leverage your BAMs for context: Like ChIP-seq peak-to-gene assignment, use ChIPseeker (Bioconductor) to annotate piRNA alignments to nearest genes/transposons. This correlates "binding" (i.e., piRNA expression) to genomic features without pure sequence prediction.
BiocManager::install("ChIPseeker")
library(ChIPseeker)
library(TxDb.Hs.eg.db) # Or your TxDb
# Read BAM (your multimapper alignments)
peak_file <- "your_pirna.bam" # Convert to BED if needed via bedtools
peak <- readPeakFile(peak_file)
# Annotate to genes (promoter=1kb, etc.)
peakAnno <- annotatePeak(peak, TxDb=TxDb.Hs.eg.db,
annoDb="org.Hs.eg.db",
tssRegion = c(-1000, 1000))
plotAnnoPie(peakAnno) # Viz: % in promoters, exons, etc.
gene_ids <- unique(peakAnno@anno$geneId) # For GO below
This is especially useful for piRNAs, as many "target" via genomic proximity (e.g., clusters near genes).
3. Correlating Targets/Binding to Gene Functions
Once you have targets (from miRanda/multiMiR) or gene associations (from ChIPseeker), map to functions via enrichment:
GO/KEGG with clusterProfiler: Standard for this-takes your target gene list and tests for enriched terms.
BiocManager::install("clusterProfiler")
library(clusterProfiler)
# Your target genes (e.g., from above)
target_genes <- unique(high_conf$target_gene) # Or gene_ids
# GO enrichment (BP=biological process)
go_res <- enrichGO(gene = target_genes,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.05)
# Viz
dotplot(go_res, showCategory=10)
# KEGG if relevant
kegg_res <- enrichKEGG(gene = target_genes, organism='hsa')
For ncRNA-specific enrichment: Try NoRCE (Bioconductor)-it does cis-enrichment of piRNA targets against gene sets (e.g., pathways).
BiocManager::install("NoRCE")
library(NoRCE)
# Input: your piRNA seqs as GRanges (from BAM/alignments)
pirna_gr <- makeGRangesFromData(...) # Your alignments
enrichment <- NoRCE(pirna_gr, target_genes,
annotation = "org.Hs.eg.db",
type = "GO") # Or KEGG
This should give you statistically backed functions (e.g., "transposon silencing" or "germ cell development" for piRNAs).
Kevin
WOW! Thank you so much for the information. I will do this.