Hello,
I am working with RNA-Seq on very long reference chromosomes (T. aestivum). In the mapping stage I have used STAR to generate SAM files instead of BAM files because it seems that BAM files cannot represent reference positions greater than 2^29-1 (512Mb) and the chromosomes I have used are longer (~700-800Mb).
In the next stage, I want to use DESeq2 to obtain DE genes and, previously, I need to use summarizeOverlaps to calculate the counts. Unfortunately, summarizeOverlaps allowes only BAM files and, when SAM files are specified as parameters, errors are shown:
> se <- summarizeOverlaps(features=ebg, reads=bamfiles, + mode="Union", + singleEnd=FALSE, + ignore.strand=TRUE, + fragments=TRUE ) Error in value[[3L]](cond) : failed to open BamFile: SAM/BAM header missing or empty file: 'G:/Data_JIC/IWGSC/against_WGAv1.0/STARWGAv1.0/Max1.sam' In addition: Warning messages: 1: In doTryCatch(return(expr), name, parentenv, handler) : [bam_header_read] EOF marker is absent. The input is probably truncated. 2: In doTryCatch(return(expr), name, parentenv, handler) : [bam_header_read] invalid BAM binary header (this is not a BAM file).
The question is, how could I overcome this problem, please? I cannot convert SAM files into BAM files because I would lose information... Is there any way to input SAM files into summarizeOverlaps?
Hi,
This isn't a DESeq2 question, I would recommend adding the tag of the relevant package, GenomicRanges.
There are numerous ways to produce count tables for DESeq2, see the vignette or workflow.
Note that this isn't a GenomicRanges question either. The
summarizeOverlaps()
function is defined in the GenomicAlignments package.summarizeOverlaps()
usesscanBam()
from the Rsamtools package behind the scene to load the aligned reads. CallingscanBam()
on a SAM file wrapped in a BamFile object is what is causing the error you got:I don't think the Rsamtools package supports reading SAM files directly. I'm not quite convinced it would make a lot of sense to try to support this. The real problem I see is that the conversion from BAM to SAM looses information. Are you sure about this? Aren't there samtools options to control this? That's something that would be worth bringing to the SAMtools folks.
H.
Yes, you're right. I've been developing these tasks in R in Windows and I didn't want to change to Python (HTSeq) or Linux (Rsubreads). In the end, I have had to execute featureCounts in Linux, save the matrix into a file and, then in Windows, load the matrix and continue with DESeqDataSetFromMatrix.
In a previous version of my chromosomes, I received them split into two parts to avoid the BAM problem so I assume that BAM looses information in these specific cases. In addition, I found this link.
Thank you for your answers,