Error: Input Files have different Amount of Reads
@abano-23787
Last seen 14 months ago

When I try to run the Rsubread commands subjunc or align in RStudio, I get the error "Input files have different amount of reads":

ERROR: two input files have different amounts of reads! The program has to terminate and no alignment results were generated!

Error in .load.delete.summary(output_file[i]) : Summary file BT549.bam.summary was not generated! The program terminated wrongly!

sessionInfo()
R version 4.0.1 (2020-06-06)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.1  Matrix_1.2-18   tools_4.0.1     tinytex_0.24    grid_4.0.1
[6] xfun_0.15       lattice_0.20-41

I get the same error message for several fastq files that I downloaded from GEO. Is it true that all these files have a data problem, or is this some other issue?

Thanks AB

SubRead Align Subjunc

Looking at the file names, it appears that you might have used fastq-dump to get these data? Or did you download the original-format files from Google or AWS and rename them? If the former, did you ensure you did things correctly? For example, did you run

zcat ~/Documents/HarvardLincs/SRR12060750_1.fastq.gz | wc -l
zcat ~/Documents/HarvardLincs/SRR12060750_2.fastq.gz | wc -l



and ensure that you get the same number of lines (and hence reads) from each file?
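For reference, the lines-to-reads arithmetic can be sketched on toy data (the two tiny FASTQ files below are made up for illustration; `gzip -dc` is used in place of `zcat` because `zcat` on macOS expects `.Z` files):

```shell
# A FASTQ record is exactly 4 lines (header, sequence, '+', qualities),
# so reads = lines / 4, and paired files must have equal read counts.
# Toy data, made up for illustration:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip > toy_1.fastq.gz
printf '@r1\nACGT\n+\nIIII\n' | gzip > toy_2.fastq.gz

for f in toy_1.fastq.gz toy_2.fastq.gz; do
  lines=$(gzip -dc "$f" | wc -l)
  echo "$f: $((lines / 4)) reads"
done
```

Here the first file reports 2 reads and the second only 1, which is exactly the kind of mismatch that makes the aligner bail out.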


Yes, I used the SRA toolkit to download the data on a cluster, then copied the files to my laptop to process them in RStudio.

I just ran the commands that you provided, and this is what I get:

[abano@sabine Harvard]$ zcat SRR12060750_1.fastq.gz | wc -l
101032420
[abano@sabine Harvard]$ zcat SRR12060750_2.fastq.gz | wc -l
101029168

So the two source files have different numbers of reads. Is there a way to fix this? What are my options?
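If re-downloading is not an option, one generic fallback (not suggested in this thread, and assuming mate IDs match exactly between files, with no /1 and /2 suffixes) is to keep only the read IDs present in both files. A minimal sketch on made-up toy data:

```shell
# Toy paired FASTQ files where the second file is missing one mate:
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' > toy_1.fastq
printf '@r1\nACGT\n+\nIIII\n' > toy_2.fastq

# Collect the IDs found in each file (every 4th line, '@' stripped) ...
awk 'NR % 4 == 1 {print substr($1, 2)}' toy_1.fastq | sort > ids1.txt
awk 'NR % 4 == 1 {print substr($1, 2)}' toy_2.fastq | sort > ids2.txt
# ... and keep only the IDs common to both:
comm -12 ids1.txt ids2.txt > shared.txt

# Emit only the records whose ID is in the shared set:
for f in toy_1 toy_2; do
  awk 'NR == FNR {keep[$1] = 1; next}
       FNR % 4 == 1 {p = (substr($1, 2) in keep)}
       p' shared.txt "$f.fastq" > "$f.paired.fastq"
done

wc -l toy_1.paired.fastq toy_2.paired.fastq   # 4 lines (1 read) in each
```

Dedicated re-pairing tools (e.g. BBMap's repair.sh) handle this more robustly on real data, but as the answer below explains, the cleaner fix in this case is to re-download correctly.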


@james-w-macdonald-5106
Last seen 7 hours ago
United States

You don't need to post the same comment three times. And this question is now off-topic for this site, having to do with getting data and stuff rather than using Bioconductor tools. Howeva...

You could have known about this issue by paying attention to the messages you get from fastq-dump:

$ fastq-dump --split-files SRR12060750
2020-07-02T15:36:38 fastq-dump.2.9.6 sys: error unknown while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )
##<snip of lots more errors that don't matter since fastq-dump will just keep chugging>



You see that last bit about Rejecting 813 reads? And do note that (101032420 - 101029168)/4 = 813
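That arithmetic is easy to verify in the shell, since each FASTQ record occupies exactly four lines:

```shell
# difference in line counts between the two files, at 4 lines per read
echo $(( (101032420 - 101029168) / 4 ))   # prints 813
```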

You could have avoided this by using --split-3:

  --split-3                        Legacy 3-file splitting for mate-pairs:
                                   First biological reads satisfying dumping
                                   conditions are placed in files *_1.fastq and
                                   *_2.fastq If only one biological read is
                                   present it is placed in *.fastq Biological
                                   reads and above are ignored.

Or you could just get the original FASTQ files from Google or AWS: see here, under the Data Access tab.