What R/Bioconductor tools would you recommend for the analysis of sncRNA, specifically piRNA?
1
0
Entering edit mode
@matthew-thornton-5564
Last seen 3 hours ago
USA, Los Angeles, USC

Hello!

I have a question about analysis of sncRNA. I have used WIND previously and now we are using SPORTS1.1, to match a previously published analysis. The output of SPORTS is the sequence and some (poor) annotation and count data. With the count data we were able to get the statistically significant sncRNA using standard methods, but now I have a set of sequences with very little annotation, mainly RNACentral identifiers. So I can get some Ensembl IDs, using biomaRt, but many don't exist for these sncRNA sequences.

I need to correlate the binding site(s) to some gene function. I have been using miRanda3.3 which will work with any sequence that is input, but I feel like I am not doing something important.

The sequences are short from ~20nt to 150nt and the Phred scores are all max. I have the alignment bam files with the multimappers and I was thinking about how ChiP-seq peaks can be localized to genes.

Any ideas or guidance would be greatly appreciated

Thank you

Bioconductor • 362 views
ADD COMMENT
1
Entering edit mode
Kevin Blighe ★ 4.0k
@kevin
Last seen 1 day ago
The Cave, 181 Longwood Avenue, Boston, …

Hello Matthew,

Thanks for the detailed description-this sounds like a classic challenge in sncRNA analysis, where piRNAs (and other tsRNAs/rsRNAs) often evade standard Ensembl/miRBase annotations due to their derivation from repetitive or non-canonical sources. SPORTS1.1 is a solid choice for initial quantification (it's great for rRNA/tRNA-derived fragments), but its reliance on RNACentral can indeed leave gaps, especially for piRNA clusters. Since you're already pulling some Ensembl IDs via biomaRt and using miRanda for targets, you're on the right track; the "missing" piece is likely just better integration of piRNA-specific resources and downstream functional mapping. I'll outline some targeted guidance below, focusing on R/Bioconductor where possible to keep it streamlined.

1. Enhancing Annotation of Your Sequences

piRNAs aren't always "gene-like" (many arise from transposon clusters), so biomaRt hits will be sparse. Layer in piRNA-specific databases to enrich your RNACentral IDs:

  • Use piRNABank or piRNA-Quest for lookup: These are piRNA-focused repos. You can fetch mature piRNA sequences programmatically.

    • In R, use AnnotationHub (Bioconductor) to pull related data, or script a simple fetch from piRNABank's FASTA (e.g., via Biostrings).
  • Recommended Bioconductor workflow with Seqpac: This package is tailored for sRNA-seq (including piRNAs) and handles poor annotations by integrating multiple sources. It can re-annotate your SPORTS output counts against RNACentral + piRNA banks.

    # Install if needed
    if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("seqpac")
    
    # Load your SPORTS count matrix (assume 'counts' is a matrix with rows=sequences, cols=samples)
    library(seqpac)
    
    # Create a count object (adapt to your FASTQ/BAM-derived seqs)
    cnt_obj <- makeCountObject(counts = counts, 
                               design = your_design_df,  # e.g., cols: sample, condition
                               mrvi = FALSE)  # Skip miRNA-specific if focusing on piRNAs
    
    # Annotate with multi-source pipeline (includes RNACentral, piRNABank)
    cnt_obj <- annotate(cnt_obj, 
                        genome = "hg38",  # Or your ref
                        pattern = "piRNA",  # Filter for piRNA-like
                        db_source = c("rnacentral", "pirnabank"))
    
    # Extract enriched annotations
    annotated_seqs <- getAnnotations(cnt_obj)
    head(annotated_seqs)  # Now with piRNA cluster info, origins
    

    This should flag more piRNA-like sequences (e.g., 24-31nt, ping-pong signatures if your BAMs show multimappers). For multimappers in your BAMs, Seqpac can re-align and assign fractional counts to clusters-similar to how you'd resolve ChIP peaks.

  • Alternative: WIND integration: Since you mentioned prior WIND use, its output (piRNA clusters from genomic loci) could be merged here. Align your SPORTS seqs back to WIND clusters using GenomicAlignments on your BAMs.

If many seqs still lack hits, they're likely novel/de novo piRNAs-treat them as such and focus on genomic context (see below).

2. Target Prediction Beyond miRanda

miRanda is fine for seed-based prediction (it'll work on your 20-150nt seqs), but piRNAs often act via imperfect base-pairing or cluster-specific mechanisms, not just 6-8mer seeds. To avoid over-prediction on short/high-quality seqs:

  • piRNA-specific predictors: Use pirScan (web tool, but scriptable) or piRTA (command-line) for targets. These score piRNA-mRNA pairs better than miRanda for transposon silencing.

    • In R: Wrap via system() calls, or use multiMiR (Bioconductor) which includes some piRNA targets from databases like piRTarBase.
    BiocManager::install("multiMiR")
    library(multiMiR)
    
    # Assume 'pirna_seqs' is your vector of sequences
    targets <- getMirnaTargets(mirna = pirna_seqs,  # Or RNACentral IDs
                               species = "hsa",
                               method = c("mirdip", "diana", "targetscan"),  # Extend with piRNA dbs if available
                               site.type = "all")
    
    # Filter high-confidence (score > 0.8)
    high_conf <- targets[targets$score >= 0.8, ]
    
  • Leverage your BAMs for context: Like ChIP-seq peak-to-gene assignment, use ChIPseeker (Bioconductor) to annotate piRNA alignments to nearest genes/transposons. This correlates "binding" (i.e., piRNA expression) to genomic features without pure sequence prediction.

    BiocManager::install("ChIPseeker")
    library(ChIPseeker)
    library(TxDb.Hs.eg.db)  # Or your TxDb
    
    # Read BAM (your multimapper alignments)
    peak_file <- "your_pirna.bam"  # Convert to BED if needed via bedtools
    peak <- readPeakFile(peak_file)
    
    # Annotate to genes (promoter=1kb, etc.)
    peakAnno <- annotatePeak(peak, TxDb=TxDb.Hs.eg.db, 
                             annoDb="org.Hs.eg.db",
                             tssRegion = c(-1000, 1000))
    
    plotAnnoPie(peakAnno)  # Viz: % in promoters, exons, etc.
    gene_ids <- unique(peakAnno@anno$geneId)  # For GO below
    

    This is especially useful for piRNAs, as many "target" via genomic proximity (e.g., clusters near genes).

3. Correlating Targets/Binding to Gene Functions

Once you have targets (from miRanda/multiMiR) or gene associations (from ChIPseeker), map to functions via enrichment:

  • GO/KEGG with clusterProfiler: Standard for this-takes your target gene list and tests for enriched terms.

    BiocManager::install("clusterProfiler")
    library(clusterProfiler)
    
    # Your target genes (e.g., from above)
    target_genes <- unique(high_conf$target_gene)  # Or gene_ids
    
    # GO enrichment (BP=biological process)
    go_res <- enrichGO(gene = target_genes,
                       OrgDb = org.Hs.eg.db,
                       ont = "BP",
                       pAdjustMethod = "BH",
                       qvalueCutoff = 0.05)
    
    # Viz
    dotplot(go_res, showCategory=10)
    
    # KEGG if relevant
    kegg_res <- enrichKEGG(gene = target_genes, organism='hsa')
    
  • For ncRNA-specific enrichment: Try NoRCE (Bioconductor)-it does cis-enrichment of piRNA targets against gene sets (e.g., pathways).

    BiocManager::install("NoRCE")
    library(NoRCE)
    
    # Input: your piRNA seqs as GRanges (from BAM/alignments)
    pirna_gr <- makeGRangesFromData(...)  # Your alignments
    enrichment <- NoRCE(pirna_gr, target_genes, 
                        annotation = "org.Hs.eg.db",
                        type = "GO")  # Or KEGG
    

This should give you statistically backed functions (e.g., "transposon silencing" or "germ cell development" for piRNAs).

Kevin

ADD COMMENT
0
Entering edit mode

WOW! Thank you so much for the information. I will do this.

ADD REPLY

Login before adding your answer.

Traffic: 998 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6