Question

[BUG] Rsubread::featureCounts quirk in CORE files: different target list ordering for same target lists

0

Entering edit mode

isaac.vock • 0

@45e1494b

Last seen 3 months ago

United States

Hello,

I am trying to use featureCounts() to assign RNA-seq reads to the set of annotated transcript isoforms with which they are completely consistent. When outputting this assignment to the "CORE" format (tab-separated table with columns detailing the assignment of each individual sequencing read), I found a weird bug. Occasionally, featureCounts would assign reads to the identical set of isoforms, but would output different comma-separated target list strings. For example, I had a read which was assigned to "CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.11,CHS.46558.1,CHS.46558.16" and another read assigned to "CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.16,CHS.46558.11,CHS.46558.1". Notice that the isoform content is identical in these two cases, its just the order they are listed in the string that differs. I created a reproducible example where I took 100 reads from the bam file with one of these two target lists and 100 with the other target list. Notably, when running featureCounts as detailed below on this downsampled bam file, the reads get assigned to the same exact target list as in the full bam file. I am hosting the data necessary to run this reproducible example, as well as the output .featureCounts file from this example, on a personal, public github repo.


### Load dependencies
library(Rsubread)
library(data.table)


### Paths to run example
samfile_path <- "path/to/example.sam"
gtf_path <- "path/to/mini_annotation.gtf"
output_dir <- "directory/to/output/CORE/file/"



### Run FeatureCounts, assigning reads to set of transcript IDs with which they are completely consistent
featureCounts(file = samfile_path,
              annot.ext = gtf_path,
              isGTFAnnotationFile = TRUE,
              strandSpecific = 1,
              isPairedEnd = TRUE,
              countReadPairs = TRUE,
              nonOverlap = 0,
              reportReads = "CORE",
              reportReadsPath = output_dir,
              nthreads = 4,
              primaryOnly = TRUE,
              allowMultiOverlap = TRUE,
              GTF.featureType = "exon",
              GTF.attrType = "transcript_id")


### Assess results
core_file <- list.files(path = output_dir,
                   pattern = "featureCounts",
                   full.names = TRUE)[1]

results <- fread(core_file,
                 sep = "\t")

# Print counts for different target lists
results[,.N, by = V4]

                                                                          V4   N
1: CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.11,CHS.46558.1,CHS.46558.16 100
2: CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.16,CHS.46558.11,CHS.46558.1 100

### Get session information
sessionInfo( )

R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.10 Rsubread_2.16.0   

loaded via a namespace (and not attached):
  [1] tidyselect_1.2.0            dplyr_1.1.4                
  [3] blob_1.2.4                  filelock_1.0.3             
  [5] Biostrings_2.70.1           bitops_1.0-7               
  [7] fastmap_1.1.1               RCurl_1.98-1.13            
  [9] BiocFileCache_2.10.1        GenomicAlignments_1.38.0   
 [11] promises_1.2.1              XML_3.99-0.15              
 [13] digest_0.6.34               mime_0.12                  
 [15] lifecycle_1.0.4             ellipsis_0.3.2             
 [17] KEGGREST_1.42.0             RSQLite_2.3.5              
 [19] magrittr_2.0.3              compiler_4.3.1             
 [21] rlang_1.1.2                 progress_1.2.3             
 [23] tools_4.3.1                 utf8_1.2.4                 
 [25] yaml_2.3.7                  rtracklayer_1.62.0         
 [27] htmlwidgets_1.6.4           prettyunits_1.2.0          
 [29] S4Arrays_1.2.0              pkgbuild_1.4.3             
 [31] bit_4.0.5                   curl_5.2.0                 
 [33] DelayedArray_0.28.0         xml2_1.3.6                 
 [35] pkgload_1.3.4               abind_1.4-5                
 [37] BiocParallel_1.36.0         miniUI_0.1.1.1             
 [39] purrr_1.0.2                 BiocGenerics_0.48.1        
 [41] grid_4.3.1                  stats4_4.3.1               
 [43] fansi_1.0.5                 urlchecker_1.0.1           
 [45] profvis_0.3.8               xtable_1.8-4               
 [47] MASS_7.3-60                 biomaRt_2.58.0             
 [49] SummarizedExperiment_1.32.0 cli_3.6.1                  
 [51] crayon_1.5.2                remotes_2.4.2.1            
 [53] generics_0.1.3              rstudioapi_0.15.0          
 [55] httr_1.4.7                  rjson_0.2.21               
 [57] sessioninfo_1.2.2           DBI_1.2.1                  
 [59] cachem_1.0.8                stringr_1.5.1              
 [61] zlibbioc_1.48.0             parallel_4.3.1             
 [63] AnnotationDbi_1.64.1        XVector_0.42.0             
 [65] restfulr_0.0.15             matrixStats_1.1.0          
 [67] vctrs_0.6.4                 devtools_2.4.5             
 [69] Matrix_1.6-3                IRanges_2.36.0             
 [71] hms_1.1.3                   S4Vectors_0.40.1           
 [73] bit64_4.0.5                 glue_1.6.2                 
 [75] codetools_0.2-19            stringi_1.8.1              
 [77] later_1.3.2                 GenomeInfoDb_1.38.5        
 [79] BiocIO_1.12.0               GenomicRanges_1.54.1       
 [81] tibble_3.2.1                pillar_1.9.0               
 [83] rappdirs_0.3.3              htmltools_0.5.7            
 [85] GenomeInfoDbData_1.2.11     R6_2.5.1                   
 [87] dbplyr_2.4.0                shiny_1.8.0                
 [89] lattice_0.22-5              Biobase_2.62.0             
 [91] png_0.1-8                   Rsamtools_2.18.0           
 [93] memoise_2.0.1               httpuv_1.6.13              
 [95] Rcpp_1.0.12                 SparseArray_1.2.2          
 [97] usethis_2.2.2               MatrixGenerics_1.14.0      
 [99] fs_1.6.3                    pkgconfig_2.0.3

Other notes:

1) I originally discovered this error using the command line interface through the featureCounts Snakemake wrapper. That's just to say the bug is not specific to the R-interface. 2) It is an oddly infrequent error. Most instances where a set of reads is consistent with several annotated isoforms yields a single unique target list. In addition, in the case I used for the reproducible example, there is a pretty large imbalance in terms of reads getting assigned to one target list or the other for the gene I saw this quirk at. Assigning reads from the full bam file yielded 18,827 reads getting assigned to "CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.16,CHS.46558.11,CHS.46558.1" and only 3889 getting assigned to "CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.11,CHS.46558.1,CHS.46558.16". 3) Below is an IGV screenshot of the alignments that were assigned to one target list vs. the other. Interestingly, there is some striking separation along the length of the genes for the two sets of reads. The data is data simulated from a pipeline I am developing which combines polyester with some custom scripts to simulate NR-seq data (TimeLapse-seq, SLAM-seq, etc.), so there are an abundance of T-to-C or G-to-A mutations in some of the simulated reads:

enter image description here

4) The quirk reproduces on both a Linux and Windows operating system.

Thank you in advance, Isaac

featurecounts Rsubread • 1.1k views

ADD COMMENT • link 12 months ago • updated 11 months ago isaac.vock • 0

1

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

Yang Liao or Wei Shi will need to give a definitive answer, but my understanding is that there is nothing in the Rsubread documentation to say that the read assignments need to be listed in any particular order. So it is not an error or a bug for the order to vary between reads.

ADD COMMENT • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

Understood, though this quirk makes it quite a hassle to work with for downstream applications, which is why I would still greatly appreciate a new release to address it. For example, I would love to be able to take the CORE file and group reads by their target list to perform analyses on sets of reads mapping to the same set of transcripts. Since these strings can differ in some cases though this analysis is non-trivial as all strings need to be reordered to avoid splitting up reads into separate groups which should be analyzed together.

ADD REPLY • link 12 months ago isaac.vock • 0

1

Entering edit mode

It's quite easy to manipulate strings like this in R though. For example,

x <- c(
  "CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.11,CHS.46558.1,CHS.46558.16",
  "CHS.46558.6,CHS.46558.5,CHS.46558.2,CHS.46558.16,CHS.46558.11,CHS.46558.1"
)
> vapply(lapply(strsplit(x,split=","),sort),paste,"",collapse=",")
[1] "CHS.46558.1,CHS.46558.11,CHS.46558.16,CHS.46558.2,CHS.46558.5,CHS.46558.6"
[2] "CHS.46558.1,CHS.46558.11,CHS.46558.16,CHS.46558.2,CHS.46558.5,CHS.46558.6"

Or alternatively, would it be relevant for you to make an occurrence matrix of 0/1 for each read and transcript isoform?

Regarding featureCounts itself, my naive understanding is that the different computational threads run asynchronously, so the hits for different isoforms may arrive in the C code in different orders. The only way for featureCounts to enforce a consistent isoform order in the results would be to resort at the end, but that's something you can also do yourself almost as easily.

ADD REPLY • link 12 months ago Gordon Smyth 52k

0

Entering edit mode

I know, but doing that on anywhere from 10 to 100 million strings per processed bam file is a massive computational cost I would like to avoid if possible. I could do something like:

library(data.table)

core_file <- fread("path/to/core.featureCounts")

unique_strings <- unique(core_file$target_list)
unique_strings_sorted <- vapply(lapply(strsplit(unique_strings,split=","),sort),paste,"",collapse=",")
target_dict <- data.table(target_list = unique_strings, new_target_list = unique_strings_sorted)

core_file[target_dict, on = .(target_list), nomatch = NULL]

but I would need to do some benchmarking to see how much this strategy helps. Granted, the performance gain from only sorting the unique target lists depends on the annotation complexity, read length, paired-end vs. single-end, etc., so that's a lot of cases I need to investigate.

That being said, I'm still wondering how much of a hassle it would be on the Rsubread side to make the target list consistent? If it's a major hassle that would come with a performance hit for featureCounts, then I will accept the cost and do the string resorting as you suggested (and to answer your question, the reordering strategy as you laid it out is exactly what I would need to do. The occurrence matrix idea would be less useful in my case). If it would be an easy change though, I would love to see it implemented.

EDIT: I missed the last paragraph of your most recent reply explaining the potential cause of the behavior. Makes sense, I'll do the resorting myself, thank you very much for taking the time to field my queries!

ADD REPLY • link 12 months ago isaac.vock • 0

score 2 · Accepted Answer · 2024-01-29

2

Entering edit mode

Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 5 weeks ago

Australia/Melbourne

The reason why the order can vary is because the reads having the same target list may have slightly different mapping locations. The assignment algorithm in featureCounts groups exons into bins and compares read mapping coordinates against these bins. When the mapping locations of reads differ, the order to search these bins might be different even the reads map to the same set of transcripts.

We will add a target sorting procedure before saving them to a file. This will make the same target lists exactly the same. This should be done in a week or two.