Question

What R/Bioconductor tools would you recommend for the analysis of sncRNA, specifically piRNA?

0

Entering edit mode

Matthew Thornton ▴ 390

@matthew-thornton-5564

Last seen 4 weeks ago

USA, Los Angeles, USC

Hello!

I have a question about analysis of sncRNA. I have used WIND previously and now we are using SPORTS1.1, to match a previously published analysis. The output of SPORTS is the sequence and some (poor) annotation and count data. With the count data we were able to get the statistically significant sncRNA using standard methods, but now I have a set of sequences with very little annotation, mainly RNACentral identifiers. So I can get some Ensembl IDs, using biomaRt, but many don't exist for these sncRNA sequences.

I need to correlate the binding site(s) to some gene function. I have been using miRanda3.3 which will work with any sequence that is input, but I feel like I am not doing something important.

The sequences are short from ~20nt to 150nt and the Phred scores are all max. I have the alignment bam files with the multimappers and I was thinking about how ChiP-seq peaks can be localized to genes.

Any ideas or guidance would be greatly appreciated

Thank you

Bioconductor • 547 views

ADD COMMENT • link 12 weeks ago • updated 6 weeks ago Matthew Thornton ▴ 390

score 1 · Answer 1 · 2025-11-06

Hello Matthew,

Thanks for the detailed description-this sounds like a classic challenge in sncRNA analysis, where piRNAs (and other tsRNAs/rsRNAs) often evade standard Ensembl/miRBase annotations due to their derivation from repetitive or non-canonical sources. SPORTS1.1 is a solid choice for initial quantification (it's great for rRNA/tRNA-derived fragments), but its reliance on RNACentral can indeed leave gaps, especially for piRNA clusters. Since you're already pulling some Ensembl IDs via biomaRt and using miRanda for targets, you're on the right track; the "missing" piece is likely just better integration of piRNA-specific resources and downstream functional mapping. I'll outline some targeted guidance below, focusing on R/Bioconductor where possible to keep it streamlined.

1. Enhancing Annotation of Your Sequences

piRNAs aren't always "gene-like" (many arise from transposon clusters), so biomaRt hits will be sparse. Layer in piRNA-specific databases to enrich your RNACentral IDs:

Use piRNABank or piRNA-Quest for lookup: These are piRNA-focused repos. You can fetch mature piRNA sequences programmatically.
- In R, use AnnotationHub (Bioconductor) to pull related data, or script a simple fetch from piRNABank's FASTA (e.g., via Biostrings).

Recommended Bioconductor workflow with Seqpac: This package is tailored for sRNA-seq (including piRNAs) and handles poor annotations by integrating multiple sources. It can re-annotate your SPORTS output counts against RNACentral + piRNA banks.

# Install if needed
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("seqpac")

# Load your SPORTS count matrix (assume 'counts' is a matrix with rows=sequences, cols=samples)
library(seqpac)

# Create a count object (adapt to your FASTQ/BAM-derived seqs)
cnt_obj <- makeCountObject(counts = counts, 
                           design = your_design_df,  # e.g., cols: sample, condition
                           mrvi = FALSE)  # Skip miRNA-specific if focusing on piRNAs

# Annotate with multi-source pipeline (includes RNACentral, piRNABank)
cnt_obj <- annotate(cnt_obj, 
                    genome = "hg38",  # Or your ref
                    pattern = "piRNA",  # Filter for piRNA-like
                    db_source = c("rnacentral", "pirnabank"))

# Extract enriched annotations
annotated_seqs <- getAnnotations(cnt_obj)
head(annotated_seqs)  # Now with piRNA cluster info, origins

This should flag more piRNA-like sequences (e.g., 24-31nt, ping-pong signatures if your BAMs show multimappers). For multimappers in your BAMs, Seqpac can re-align and assign fractional counts to clusters-similar to how you'd resolve ChIP peaks.

Alternative: WIND integration: Since you mentioned prior WIND use, its output (piRNA clusters from genomic loci) could be merged here. Align your SPORTS seqs back to WIND clusters using GenomicAlignments on your BAMs.

If many seqs still lack hits, they're likely novel/de novo piRNAs-treat them as such and focus on genomic context (see below).

2. Target Prediction Beyond miRanda

miRanda is fine for seed-based prediction (it'll work on your 20-150nt seqs), but piRNAs often act via imperfect base-pairing or cluster-specific mechanisms, not just 6-8mer seeds. To avoid over-prediction on short/high-quality seqs:

piRNA-specific predictors: Use pirScan (web tool, but scriptable) or piRTA (command-line) for targets. These score piRNA-mRNA pairs better than miRanda for transposon silencing.

In R: Wrap via system() calls, or use multiMiR (Bioconductor) which includes some piRNA targets from databases like piRTarBase.

BiocManager::install("multiMiR")
library(multiMiR)

# Assume 'pirna_seqs' is your vector of sequences
targets <- getMirnaTargets(mirna = pirna_seqs,  # Or RNACentral IDs
                           species = "hsa",
                           method = c("mirdip", "diana", "targetscan"),  # Extend with piRNA dbs if available
                           site.type = "all")

# Filter high-confidence (score > 0.8)
high_conf <- targets[targets$score >= 0.8, ]

Leverage your BAMs for context: Like ChIP-seq peak-to-gene assignment, use ChIPseeker (Bioconductor) to annotate piRNA alignments to nearest genes/transposons. This correlates "binding" (i.e., piRNA expression) to genomic features without pure sequence prediction.

BiocManager::install("ChIPseeker")
library(ChIPseeker)
library(TxDb.Hs.eg.db)  # Or your TxDb

# Read BAM (your multimapper alignments)
peak_file <- "your_pirna.bam"  # Convert to BED if needed via bedtools
peak <- readPeakFile(peak_file)

# Annotate to genes (promoter=1kb, etc.)
peakAnno <- annotatePeak(peak, TxDb=TxDb.Hs.eg.db, 
                         annoDb="org.Hs.eg.db",
                         tssRegion = c(-1000, 1000))

plotAnnoPie(peakAnno)  # Viz: % in promoters, exons, etc.
gene_ids <- unique(peakAnno@anno$geneId)  # For GO below

This is especially useful for piRNAs, as many "target" via genomic proximity (e.g., clusters near genes).

3. Correlating Targets/Binding to Gene Functions

Once you have targets (from miRanda/multiMiR) or gene associations (from ChIPseeker), map to functions via enrichment:

GO/KEGG with clusterProfiler: Standard for this-takes your target gene list and tests for enriched terms.

BiocManager::install("clusterProfiler")
library(clusterProfiler)

# Your target genes (e.g., from above)
target_genes <- unique(high_conf$target_gene)  # Or gene_ids

# GO enrichment (BP=biological process)
go_res <- enrichGO(gene = target_genes,
                   OrgDb = org.Hs.eg.db,
                   ont = "BP",
                   pAdjustMethod = "BH",
                   qvalueCutoff = 0.05)

# Viz
dotplot(go_res, showCategory=10)

# KEGG if relevant
kegg_res <- enrichKEGG(gene = target_genes, organism='hsa')

For ncRNA-specific enrichment: Try NoRCE (Bioconductor)-it does cis-enrichment of piRNA targets against gene sets (e.g., pathways).

BiocManager::install("NoRCE")
library(NoRCE)

# Input: your piRNA seqs as GRanges (from BAM/alignments)
pirna_gr <- makeGRangesFromData(...)  # Your alignments
enrichment <- NoRCE(pirna_gr, target_genes, 
                    annotation = "org.Hs.eg.db",
                    type = "GO")  # Or KEGG

This should give you statistically backed functions (e.g., "transposon silencing" or "germ cell development" for piRNAs).

Kevin