Search
Question: NGS with summarizeOverlaps and SAM files
0
gravatar for sergiogalvezrojas
13 months ago by
sergiogalvezrojas0 wrote:

Hello,

I am working with RNA-Seq on very long reference chromosomes (T. aestivum). In the mapping stage I have used STAR to generate SAM files instead of BAM files because it seems that BAM files cannot represent reference positions greater than 2^29-1 (512Mb) and the chromosomes I have used are longer (~700-800Mb).

In the next stage, I want to use DESeq2 to obtain DE genes and, previously, I need to use summarizeOverlaps to calculate the counts. Unfortunately, summarizeOverlaps allowes only BAM files and, when SAM files are specified as parameters, errors are shown:

> se <- summarizeOverlaps(features=ebg, reads=bamfiles,
+                         mode="Union",
+                         singleEnd=FALSE,
+                         ignore.strand=TRUE,
+                         fragments=TRUE )
Error in value[[3L]](cond) : 
  failed to open BamFile: SAM/BAM header missing or empty
  file: 'G:/Data_JIC/IWGSC/against_WGAv1.0/STARWGAv1.0/Max1.sam'
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
2: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] invalid BAM binary header (this is not a BAM file).

The question is, how could I overcome this problem, please? I cannot convert SAM files into BAM files because I would lose information... Is there any way to input SAM files into summarizeOverlaps?

ADD COMMENTlink modified 13 months ago by James W. MacDonald46k • written 13 months ago by sergiogalvezrojas0
1

Hi,

This isn't a DESeq2 question, I would recommend adding the tag of the relevant package, GenomicRanges.

There are numerous ways to produce count tables for DESeq2, see the vignette or workflow.

ADD REPLYlink written 13 months ago by Michael Love18k
1

Note that this isn't a GenomicRanges question either. The summarizeOverlaps() function is defined in the GenomicAlignments package. summarizeOverlaps() uses scanBam() from the Rsamtools package behind the scene to load the aligned reads. Calling scanBam() on a SAM file wrapped in a BamFile object is what is causing the error you got:

> library(Rsamtools)
> bamfile <- BamFile(system.file("extdata", "ex1.sam", package="Rsamtools"))
> scanBam(bamfile)
Error in value[[3L]](cond) : 
  failed to open BamFile: SAM/BAM header missing or empty
  file: '/home/hpages/R/R-3.4.r72630/library/Rsamtools/extdata/ex1.sam'
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] EOF marker is absent. The input is probably truncated.
2: In doTryCatch(return(expr), name, parentenv, handler) :
  [bam_header_read] invalid BAM binary header (this is not a BAM file).

 

I don't think the Rsamtools package supports reading SAM files directly. I'm not quite convinced it would make a lot of sense to try to support this. The real problem I see is that the conversion from BAM to SAM looses information. Are you sure about this? Aren't there samtools options to control this? That's something that would be worth bringing to the SAMtools folks.

H. 

ADD REPLYlink written 13 months ago by Hervé Pagès ♦♦ 13k

Yes, you're right. I've been developing these tasks in R in Windows and I didn't want to change to Python (HTSeq) or Linux (Rsubreads). In the end, I have had to execute featureCounts in Linux, save the matrix into a file and, then in Windows, load the matrix and continue with DESeqDataSetFromMatrix.

In a previous version of my chromosomes, I received them split into two parts to avoid the BAM problem so I assume that BAM looses information in these specific cases. In addition, I found this link.

Thank you for your answers,

ADD REPLYlink written 13 months ago by sergiogalvezrojas0
0
gravatar for James W. MacDonald
13 months ago by
United States
James W. MacDonald46k wrote:

Your problem arises because you are aligning reads to a genome that has chromosomes that are too long for the BAM format. But that's only one way to do things. You could alternatively use either salmon or kallisto to do the (much faster) alignment against the wheat transcriptome, and then use tximport to import those reads for use with DESeq2. The tximport package has a detailed vignette for the last part, and you can use your google-fu to figure out how to get/use either salmon or kallisto.
 

ADD COMMENTlink written 13 months ago by James W. MacDonald46k

Salmon tutorial from FASTQ:

https://combine-lab.github.io/salmon/getting_started/

ADD REPLYlink written 13 months ago by Michael Love18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 118 users visited in the last hour