Search
Question: NGS with summarizeOverlaps and SAM files
0
13 months ago by
sergiogalvezrojas0 wrote:

Hello,

I am working with RNA-Seq on very long reference chromosomes (T. aestivum). In the mapping stage I have used STAR to generate SAM files instead of BAM files because it seems that BAM files cannot represent reference positions greater than 2^29-1 (512Mb) and the chromosomes I have used are longer (~700-800Mb).

In the next stage, I want to use DESeq2 to obtain DE genes and, previously, I need to use summarizeOverlaps to calculate the counts. Unfortunately, summarizeOverlaps allowes only BAM files and, when SAM files are specified as parameters, errors are shown:

> se <- summarizeOverlaps(features=ebg, reads=bamfiles,
+                         mode="Union",
+                         singleEnd=FALSE,
+                         ignore.strand=TRUE,
+                         fragments=TRUE )
Error in value[[3L]](cond) :
failed to open BamFile: SAM/BAM header missing or empty
file: 'G:/Data_JIC/IWGSC/against_WGAv1.0/STARWGAv1.0/Max1.sam'
1: In doTryCatch(return(expr), name, parentenv, handler) :
2: In doTryCatch(return(expr), name, parentenv, handler) :
[bam_header_read] invalid BAM binary header (this is not a BAM file).

The question is, how could I overcome this problem, please? I cannot convert SAM files into BAM files because I would lose information... Is there any way to input SAM files into summarizeOverlaps?

modified 13 months ago by James W. MacDonald46k • written 13 months ago by sergiogalvezrojas0
1

Hi,

This isn't a DESeq2 question, I would recommend adding the tag of the relevant package, GenomicRanges.

There are numerous ways to produce count tables for DESeq2, see the vignette or workflow.

1

Note that this isn't a GenomicRanges question either. The summarizeOverlaps() function is defined in the GenomicAlignments package. summarizeOverlaps() uses scanBam() from the Rsamtools package behind the scene to load the aligned reads. Calling scanBam() on a SAM file wrapped in a BamFile object is what is causing the error you got:

> library(Rsamtools)
> bamfile <- BamFile(system.file("extdata", "ex1.sam", package="Rsamtools"))
> scanBam(bamfile)
Error in value[[3L]](cond) :
failed to open BamFile: SAM/BAM header missing or empty
file: '/home/hpages/R/R-3.4.r72630/library/Rsamtools/extdata/ex1.sam'
1: In doTryCatch(return(expr), name, parentenv, handler) :
2: In doTryCatch(return(expr), name, parentenv, handler) :
[bam_header_read] invalid BAM binary header (this is not a BAM file).

I don't think the Rsamtools package supports reading SAM files directly. I'm not quite convinced it would make a lot of sense to try to support this. The real problem I see is that the conversion from BAM to SAM looses information. Are you sure about this? Aren't there samtools options to control this? That's something that would be worth bringing to the SAMtools folks.

H.

Yes, you're right. I've been developing these tasks in R in Windows and I didn't want to change to Python (HTSeq) or Linux (Rsubreads). In the end, I have had to execute featureCounts in Linux, save the matrix into a file and, then in Windows, load the matrix and continue with DESeqDataSetFromMatrix.

In a previous version of my chromosomes, I received them split into two parts to avoid the BAM problem so I assume that BAM looses information in these specific cases. In addition, I found this link.

0
13 months ago by
United States
James W. MacDonald46k wrote:

Your problem arises because you are aligning reads to a genome that has chromosomes that are too long for the BAM format. But that's only one way to do things. You could alternatively use either salmon or kallisto to do the (much faster) alignment against the wheat transcriptome, and then use tximport to import those reads for use with DESeq2. The tximport package has a detailed vignette for the last part, and you can use your google-fu to figure out how to get/use either salmon or kallisto.

Salmon tutorial from FASTQ:

https://combine-lab.github.io/salmon/getting_started/