Variance between runs of Rsubread
1
0
Entering edit mode
@isaac-virshup-11641
Last seen 7.8 years ago
University of Melbourne

Hello,

I've noticed some differences between the count tables generated by repeated runs of RSubread. Most of the differences are a read or two, but for a few particular genes (so far, only ones in repetitive areas) the discrepancies can be hundreds of counts.

Is this expected behavior? If so, has there been any profiling which explores where/ how much variance can be expected?

This behavior was first noticed with RSubread 1.20.6 from bioconductor using R v3.2.2, but can be replicated with the current release.

The command to align follows the following convention:

align(index=indexname,
    readfile1=readfile1,
    readfile2=readfile2,
    input_format="FASTQ",
    output_format="BAM",
    output_file=outfile,
    unique=TRUE,
    nBestLocations=1,
    maxMismatches=5,
    indels=5,
    nthreads=6)

Here is the command used to generate count tables:

featureCounts(inputfiles, 
    annot.ext=gtf, 
    isGTFAnnotationFile=T, 
    useMetaFeatures=T, 
    isPairedEnd=ispaired, 
    allowMultiOverlap=T, 
    reportReads=T, 
    minMQS=30, 
    nthreads=ncpu, 
    ignoreDup=T)

Though md5 sums of the bam files output by align change between runs. In all test cases R was built with gcc v4.4.7.

Thanks,

–Isaac

UPDATE

The discrepancies I'm asking about were from alignments made using a single install of RSubread. Essentially, I would like to know if the output of RSubread is deterministic, or if there is some stochasticity. My impression had been the program was deterministic, though results I'm seeing don't match that.

Here's a gist containing the contents of the $stats field for a couple runs using a single install of release Rsubread. For each run output looks similar, though counts for each entry do vary by a few.

rsubread alignment reproducibility • 2.9k views
ADD COMMENT
0
Entering edit mode

There could be slight differences in counting results between different Rsubread versions. But you need to provide more details, including your commands, versions and counting summaries, so that the differences you observed can be better understood.

ADD REPLY
0
Entering edit mode

I've updated the question to include commands and versions used. I'm not sure what you mean by counting summaries, would you mind clarifying?

ADD REPLY
0
Entering edit mode
Wei Shi ★ 3.6k
@wei-shi-2183
Last seen 3 days ago
Australia/Melbourne

The difference you observed is likely to be caused by the changes in the align() function. We made improvement on the reporting of multi-mapping reads (this is why you saw differences for genes in repetitive regions) and also on the reporting of mapping quality scores.

The 'stat' component included in the object returned by featureCounts includes counting summary, which gives the breakdown of counting results for different groups of reads (multimapping reads, reads passing MQS cutoff etc). This summary is helpful for you to understand what caused the counting differences between your different runs.

ADD COMMENT
0
Entering edit mode

Sorry, but I think there has a been a misunderstanding about my question. I'm seeing differences between alignments generated from a single install of RSubread. I'll add an update to my question to clarify this.

ADD REPLY
0
Entering edit mode

Could you run sessionInfo() command and provide its output?

ADD REPLY
0
Entering edit mode

Session info associated with the linked gist above:

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3

loaded via a namespace (and not attached):
[1] tools_3.3.1
ADD REPLY
0
Entering edit mode

Hi Isaac, could you provide the bam file so we can take a close look? Could you also let us know which genome you mapped your reads to?

ADD REPLY
0
Entering edit mode

The alignment was against GRCh37.69. How should I send you a link to the files?

ADD REPLY
0
Entering edit mode

You may send us the files via dropbox.

ADD REPLY
0
Entering edit mode

Not sure if there's quite enough space in my dropbox at the moment... I've uploaded them to CloudStor (should be similar) here: https://cloudstor.aarnet.edu.au/plus/index.php/s/gh2AdmTmhr6Vdox

ADD REPLY
0
Entering edit mode

Wei, have you had a chance to retrieve those files? I'd like to take those files down from my cloudstor drive.

Best,

–Isaac

ADD REPLY
0
Entering edit mode

Yes we have downloaded the files. Thanks.

ADD REPLY
0
Entering edit mode

Wei,

On the recommendation of a co-worker I tried running subread using a single thread to see if that produced deterministic output. I've run it a few times now and have been getting the same output – while multi-threaded output still varies by run. I ran this using conda installed R and Rsubread using the human genome from ensembl release 76. Hope this helps resolve this issue!

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.8 (Final)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.23.0
ADD REPLY
0
Entering edit mode

Hi Isaac, this will be fixed in the next version which should be released soon.

ADD REPLY
0
Entering edit mode

That's great! Thanks for looking into this.

Will the new version be available with the next release of bioconductor or will it be out separately?

ADD REPLY
0
Entering edit mode

We aim to release it in the next few weeks. Cheers, Wei

ADD REPLY

Login before adding your answer.

Traffic: 456 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6