Hi, I think I might have found a bug in featureCounts from Rsubread (v2.12.3). I am trying to find reads overlapping exon junctions from a personalised reference, using Nanopore long read BAMs.
I am afraid I cannot share fully reproducible code as I am using my own reference, but this is is what I run and what I get
#Get exon-specific read counts with featureCounts
library(Rsubread)
sample <- commandArgs(trailingOnly = TRUE)
gtf_file <- "personalised.gtf"
#Generate a simplified annotation format (SAF) object from the gtf to get only unique exonic regions
gtf_df <- read.table(gtf_file, sep = "\t", header = FALSE, stringsAsFactors = FALSE)
saf_df <- gtf_df[gtf_df$V3 == "exon", ]
#Keep columns of interest only
saf_df <- saf_df[, c(9, 1, 4, 5, 7)]
#Rename
colnames(saf_df) <- c("GeneID", "Chr", "Start", "End", "Strand")
#Clean gene ID
saf_df$GeneID <- gsub(pattern = "(.+ gene_name )(.+)(;)",
replacement = "\\2",
x = saf_df$GeneID)
saf_df <- unique(saf_df)
featureCounts_out <- featureCounts(files = "full.bam",
annot.ext = saf_df, useMetaFeatures = FALSE, isGTFAnnotationFile = FALSE,
minOverlap = 3, isLongRead = TRUE,
isPairedEnd = FALSE, juncCounts = TRUE,
genome = "personalised_reference.fa",
countMultiMappingReads = FALSE, allowMultiOverlap = TRUE)
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 2.12.3
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| ||
|| full.bam ||
|| ||
|| Paired-end : no ||
|| Count read pairs : no ||
|| Annotation : R data.frame ||
|| Dir for temp files : . ||
|| Junction Counting : <output_file>.jcounts ||
|| Threads : 1 ||
|| Level : feature level ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : counted ||
|| Min overlapping bases : 3 ||
|| Long read mode : yes ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file .Rsubread_UserProvidedAnnotation_pid29999 ... ||
|| Features : 46 ||
|| Meta-features : 3 ||
|| Chromosomes/contigs : 6 ||
|| ||
|| Load FASTA contigs from personalised_reference.fa... ||
|| 5 contigs were loaded ||
|| ||
|| Process BAM file full.bam... ||
[1] 29999 segmentation fault R
Now, if I run it with the option juncCounts = FALSE, it does run, but I get the following error:
ERROR: sequence length in the BAM record is out of the expected region: 11207, 7084
From this post, I understand this means that a read is longer than the contig it is meant to align to. So I tried generating a bam without reads that long using
samtools view -h full.bam | awk 'length($10) < 11000 || $1 ~ /^@/' > test.bam
Using the bam without the long read then works with no issues, but I think it would be worth for future releases to either ignore such cases for junction counts or to output a more informative error.
Many thanks!
Carlos
> sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /flask/apps/eb/software/FlexiBLAS/3.0.4-GCC-11.2.0/lib64/libflexiblas.so.3.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rsubread_2.12.3
loaded via a namespace (and not attached):
[1] compiler_4.2.0 Matrix_1.5-1 tools_4.2.0 grid_4.2.0
[5] lattice_0.20-45