NGS with summarizeOverlaps and SAM files
1
0
Entering edit mode
@sergiogalvezrojas-13044
Last seen 7.5 years ago

Hello,

I am working with RNA-Seq on very long reference chromosomes (T. aestivum). In the mapping stage I have used STAR to generate SAM files instead of BAM files because it seems that BAM files cannot represent reference positions greater than 2^29-1 (512Mb) and the chromosomes I have used are longer (~700-800Mb).

In the next stage, I want to use DESeq2 to obtain DE genes and, previously, I need to use summarizeOverlaps to calculate the counts. Unfortunately, summarizeOverlaps allowes only BAM files and, when SAM files are specified as parameters, errors are shown:

> se <- summarizeOverlaps(features=ebg, reads=bamfiles,
+                         mode="Union",
+                         singleEnd=FALSE,
+                         ignore.strand=TRUE,
+                         fragments=TRUE )
Error in value[[3L]](cond) : 
  failed to open BamFile: SAM/BAM header missing or empty
  file: 'G:/Data_JIC/IWGSC/against_WGAv1.0/STARWGAv1.0/Max1.sam'
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
2: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] invalid BAM binary header (this is not a BAM file).

The question is, how could I overcome this problem, please? I cannot convert SAM files into BAM files because I would lose information... Is there any way to input SAM files into summarizeOverlaps?

genomicalignments • 2.0k views
ADD COMMENT
1
Entering edit mode

Hi,

This isn't a DESeq2 question, I would recommend adding the tag of the relevant package, GenomicRanges.

There are numerous ways to produce count tables for DESeq2, see the vignette or workflow.

ADD REPLY
1
Entering edit mode

Note that this isn't a GenomicRanges question either. The summarizeOverlaps() function is defined in the GenomicAlignments package. summarizeOverlaps() uses scanBam() from the Rsamtools package behind the scene to load the aligned reads. Calling scanBam() on a SAM file wrapped in a BamFile object is what is causing the error you got:

> library(Rsamtools)
> bamfile <- BamFile(system.file("extdata", "ex1.sam", package="Rsamtools"))
> scanBam(bamfile)
Error in value[[3L]](cond) : 
  failed to open BamFile: SAM/BAM header missing or empty
  file: '/home/hpages/R/R-3.4.r72630/library/Rsamtools/extdata/ex1.sam'
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
2: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] invalid BAM binary header (this is not a BAM file).

 

I don't think the Rsamtools package supports reading SAM files directly. I'm not quite convinced it would make a lot of sense to try to support this. The real problem I see is that the conversion from BAM to SAM looses information. Are you sure about this? Aren't there samtools options to control this? That's something that would be worth bringing to the SAMtools folks.

H. 

ADD REPLY
0
Entering edit mode

Yes, you're right. I've been developing these tasks in R in Windows and I didn't want to change to Python (HTSeq) or Linux (Rsubreads). In the end, I have had to execute featureCounts in Linux, save the matrix into a file and, then in Windows, load the matrix and continue with DESeqDataSetFromMatrix.

In a previous version of my chromosomes, I received them split into two parts to avoid the BAM problem so I assume that BAM looses information in these specific cases. In addition, I found this link.

Thank you for your answers,

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 13 hours ago
United States

Your problem arises because you are aligning reads to a genome that has chromosomes that are too long for the BAM format. But that's only one way to do things. You could alternatively use either salmon or kallisto to do the (much faster) alignment against the wheat transcriptome, and then use tximport to import those reads for use with DESeq2. The tximport package has a detailed vignette for the last part, and you can use your google-fu to figure out how to get/use either salmon or kallisto.
 

ADD COMMENT
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 947 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6