Question

Issues loading a BAM file resulted from featurecounts

1

Entering edit mode

Konstantinos Yeles ▴ 90

@konstantinos-yeles-8961

Last seen 5 days ago

Italy

Hello Bioc community, (this have been issued first on GitHub) I used Rsubread::featurescounts to quantify some BAM files with a gtf file and I would like to rerun the unassigned reads with a different gtf. For this reason I used the option reportReads = "BAM" to get the reads.

EDIT: feature counts code

nc_RNA_list[1] %>% map(~ featureCounts(files = bam_file,
                        annot.ext =  gtf_files[str_detect(gtf_files,str_glue("_{.x}_"))],
                        isGTFAnnotationFile =TRUE,
                        GTF.featureType = "exon",
                        GTF.attrType.extra = c("gene_type", "sRNA_id", "dna"),
                        nthreads = 8,
                        useMetaFeatures = TRUE,
                        allowMultiOverlap = TRUE,
                        largestOverlap = TRUE,
                        fraction = TRUE,
                        reportReads = "BAM",
                        reportReadsPath = as.character(str_glue("feature_counts_res_new/{.x}/")),
                        strandSpecific = 0,
                        verbose = TRUE) %>% write_rds(str_glue("feature_counts_{todate}_{.x}.rds"))
    )

When I tried to load them on R with plyranges::read_bam which internally uses GenomicAlignments::readGAlignments() I got this error:

 Error in value[[3L]](cond) : 
  failed to open BamFile: failed to open SAM/BAM file
  file: './feature_counts_res_new/rRNA/R1_001_trimmed.fq.gz_sorted.featureCounts.BAM'

I tried also with Rsamtools::indexBam(the_bam) and got:

Error in FUN(X[[i]], ...) : 'filename' is not a BAM file
  file: ./feature_counts_res_new/rRNA/R1_001_trimmed.fq.gz_sorted.featureCounts.BAM

I just need to extract the "XS:Z:Unassigned_NoFeatures" reads and save them to another bam file so as to rerun them.

Thank you for your time Konstantinos

sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] edgeR_3.28.1         limma_3.42.2         tximport_1.14.0      plyranges_1.6.10    
 [5] GenomicRanges_1.38.0 GenomeInfoDb_1.22.0  IRanges_2.20.2       S4Vectors_0.24.3    
 [9] BiocGenerics_0.32.0  data.table_1.12.8    forcats_0.5.0        stringr_1.4.0       
[13] dplyr_0.8.4          purrr_0.3.3          readr_1.3.1          tidyr_1.0.2         
[17] tibble_2.1.3         ggplot2_3.3.0        tidyverse_1.3.0     

loaded via a namespace (and not attached):
 [1] Biobase_2.46.0              httr_1.4.1                  jsonlite_1.6.1             
 [4] modelr_0.1.6                assertthat_0.2.1            BiocManager_1.30.10        
 [7] GenomeInfoDbData_1.2.2      cellranger_1.1.0            Rsamtools_2.2.3            
[10] pillar_1.4.3                backports_1.1.5             lattice_0.20-38            
[13] glue_1.3.1                  XVector_0.26.0              rvest_0.3.5                
[16] colorspace_1.4-1            Matrix_1.2-18               XML_3.99-0.3               
[19] pkgconfig_2.0.3             broom_0.5.5                 haven_2.2.0                
[22] zlibbioc_1.32.0             scales_1.1.0                BiocParallel_1.20.1        
[25] generics_0.0.2              withr_2.1.2                 SummarizedExperiment_1.16.1
[28] cli_2.0.2                   magrittr_1.5                crayon_1.3.4               
[31] readxl_1.3.1                fs_1.3.1                    fansi_0.4.1                
[34] nlme_3.1-142                xml2_1.2.2                  tools_3.6.2                
[37] hms_0.5.3                   lifecycle_0.2.0             matrixStats_0.55.0         
[40] munsell_0.5.0               reprex_0.3.0                locfit_1.5-9.1             
[43] DelayedArray_0.12.2         Biostrings_2.54.0           compiler_3.6.2             
[46] rlang_0.4.5                 grid_3.6.2                  RCurl_1.98-1.1             
[49] rstudioapi_0.11             bitops_1.0-6                gtable_0.3.0               
[52] DBI_1.1.0                   R6_2.4.1                    GenomicAlignments_1.22.1   
[55] lubridate_1.7.4             rtracklayer_1.46.0          utf8_1.1.4                 
[58] stringi_1.4.6               Rcpp_1.0.3                  vctrs_0.2.3                
[61] dbplyr_1.4.2                tidyselect_1.0.0

featurecount rsubread GenomicAlignments • 2.3k views

ADD COMMENT • link 5.3 years ago Konstantinos Yeles ▴ 90

1

Entering edit mode

You will need to show the Rsubread code you used as well.

ADD REPLY • link 5.3 years ago James W. MacDonald 68k

score 0 · Answer 1 · 2020-03-20

I found a workaround but it is not convenient enough as I cannot use it properly on multi mapped alignments and seems quite slow. I would like also to share some thoughts regarding the counting (possible multi counting from this approach).

"tRNA"%>% 
map(~ featureCounts(files = bam_file,
                    annot.ext =  "new_with_different_types.gtf",
                    isGTFAnnotationFile =TRUE,
                    GTF.featureType = .x, #use a custom #type instead of exon
                    GTF.attrType.extra = c("gene_type", "sRNA_id", "dna"),
                    nthreads = 2,
                    useMetaFeatures = TRUE,
                    allowMultiOverlap = TRUE,
                    largestOverlap = TRUE,
                    fraction = TRUE,
                    reportReads = "CORE", # use CORE instead of BAM to subset the input BAM afterwards
                    reportReadsPath = as.character(str_glue("feature_counts_res_new/{.x}/")),
                    strandSpecific = 0,
                    verbose = TRUE) %>% write_rds(str_glue("feature_counts_{todate}_{.x}.rds")) )

load the CORE file resulted from featurecounts

  my_reads <- data.table::fread("feature_counts_res_new/tRNA/R1_001_trimmed.fq.gz_sorted.bam.featureCounts",
                    header = FALSE,
                    colClasses = c("character","character","numeric","character"))

subset it

     my_reads <-  my_reads[V2 =="Unassigned_NoFeatures"][,.(V1)]

check how many reads are

     data.table::uniqueN(my_reads)

make the BAM file name for exporting

     con <- str_replace(string = bam_file,pattern = ".bam",replacement = "_not_rRNA.bam")

load and subset the input BAM file and then export it for further use.

     my_filter <- plyranges::read_bam(bam_file) %>%   
     select(qname,flag,rname,strand,
        pos,qwidth,mapq,cigar,
        mrnm,mpos,isize,seq ,qual) %>%
     filter(qname %in% my_reads$X1) %>%    
     rtracklayer::export(con,format = "BAM")

EDIT. Suggested function Rsamtools::filterBam used as found on StackOverflow

Creating a new subsetted bam using the filtered reads from the input bam

library(Rsamtools)
want <- my_reads$V1
filter_factory <- function(want) {
    list(KeepQname = function(x) x$qname %in% want)
}
filter <- FilterRules(filter_factory(want))
dest <- filterBam(bam_file, con,
                  filter=filter,
                  param=ScanBamParam(what="qname"))

Negatives:

Slow
Memory demanding

My strategy is to perform this operation to a list of ncRNAs (tRNA,snoRNA,miRNA,etc) sequentially as it has been used from various workflows for the analysis of ncRNAs.(examples: here and here) so as to get counts respect to the categories and then summarize them to a count table to perform DE.

My input BAM file has multi mapping alignments. With the aforementioned strategy, the reads that are multi mapped probably are counted multiple times for each run of feature counts? If that's true then I have to align my reads without multi mapping and then an aligned read should not pass to the next subsetted BAM file (and so not counted multiple times for each category)
Maybe this strategy is not optimal for the use of featurecounts, alternatively one gtf file should be used with all categories of ncRNAs. In this approach how could be possible to defend the underrepresentation of ncRNAs that fall in the same genomic regions?