Read counts generated by SGSeq are always 0
1
1
Entering edit mode
@vakulmohanty-8232
Last seen 4.2 years ago
United States

Hello,

I'm running SGSeq on a set of bam files that have reads extracted from a region on chromosome 22 (29190548,29196560). When I run SGSseq on my bam files I always end up with read counts of 0 across all the features. I have sorted and indexed my bam files and the chromosome name is the same in my granges object and bam files. I will be grateful for any explanation for what I'm doing wrong.

I have my code below.

Thanking You,

Vakul

f1 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = FALSE)
f2 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = TRUE)
files = cbind(f1,f2)
files = as.data.frame(files)
colnames(files) = c("sample_name","file_bam")
files[,1] = as.character(files[,1])
files[,2] = as.character(files[,2])
#getting summary of information about bam files required by the package
sc = getBamInfo(files,cores =4)

#GRanges object of the region of intere
gr = GRanges(seqnames = "chr22",ranges = IRanges(29190548,29196560),strand = "*")

#TxDb Object
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene

txdb = keepSeqlevels(txdb,"chr22")

#extracting features
txf <- convertToTxFeatures(txdb)
txf_f <- txf[txf %over% gr]
sgfc <- analyzeFeatures(sc, features = txf_f)

Session Info

 

R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] XVector_0.8.0                           TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
 [3] GenomicFeatures_1.20.6                  AnnotationDbi_1.30.1                   
 [5] SGSeq_1.2.2                             DEXSeq_1.14.2                          
 [7] DESeq2_1.8.2                            RcppArmadillo_0.6.500.4.0              
 [9] Rcpp_0.12.3                             GenomicRanges_1.20.8                   
[11] GenomeInfoDb_1.4.3                      IRanges_2.2.9                          
[13] S4Vectors_0.6.6                         Biobase_2.28.0                         
[15] BiocGenerics_0.14.0                     BiocParallel_1.2.22                    

loaded via a namespace (and not attached):
 [1] genefilter_1.50.0       statmod_1.4.24          locfit_1.5-9.1          splines_3.2.1          
 [5] lattice_0.20-33         colorspace_1.2-6        rtracklayer_1.28.10     survival_2.38-3        
 [9] XML_3.98-1.4            foreign_0.8-66          DBI_0.3.1               RColorBrewer_1.1-2     
[13] lambda.r_1.1.7          plyr_1.8.3              stringr_1.0.0           zlibbioc_1.14.0        
[17] Biostrings_2.36.4       munsell_0.4.3           gtable_0.2.0            futile.logger_1.4.1    
[21] hwriter_1.3.2           latticeExtra_0.6-28     geneplotter_1.46.0      biomaRt_2.24.1         
[25] acepack_1.3-3.3         xtable_1.8-2            scales_0.4.0            Hmisc_3.17-2           
[29] annotate_1.46.1         Rsamtools_1.20.5        gridExtra_2.2.1         ggplot2_2.1.0          
[33] stringi_1.0-1           numDeriv_2014.2-1       grid_3.2.1              tools_3.2.1            
[37] bitops_1.0-6            magrittr_1.5            RCurl_1.95-4.8          RSQLite_1.0.0          
[41] Formula_1.2-1           cluster_2.0.3           futile.options_1.0.0    ABSOLUTE_1.0.6         
[45] rpart_4.1-10            igraph_1.0.1            GenomicAlignments_1.4.2 nnet_7.3-12

sgseq splicing alternative splicing • 1.5k views
ADD COMMENT
0
Entering edit mode
@leonard-goldstein-6845
Last seen 11 months ago
United States

Hi Vakul,

Your code looks fine. I assume your BAM files actually contain reads that map to the the XBP1 gene? Note that if you are working with paired-end data, only concordant read pairs will be considered (i.e. read pairs in the the expected orientation). Also, when obtaining counts for junctions or exons bins, SGSeq only considers reads (or read pairs) that are structurally compatible with the feature. For example, when obtaining counts for exon bins, overlapping reads have to spliced correctly (be compatible with the exon bin) to be considered. If these points don’t explain the zero counts, could you make the BAM files available, so I can have a look? Also I noticed you are working with Bioc 3.1. I suggest to update to the latest Bioc release. Thanks!

Leonard

ADD COMMENT
0
Entering edit mode

Hi Leonard, 

My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools.  I'll check for the features you mentioned in my BAM files and update my Bioc version.

If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them. 

Thanking You, 

Vakul 

ADD REPLY
0
Entering edit mode

Hi Leonard, 

My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools.  I'll check for the features you mentioned in my BAM files and update my Bioc version.

If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them. 

Thanking You, 

Vakul 

ADD REPLY
0
Entering edit mode

It’s probably easiest if you can send test files via dropbox or as an email attachment (assuming they are not too large if they only contain reads for XBP1). You can find my email address on the SGSeq landing page. 

 

ADD REPLY
0
Entering edit mode

Thank you,

I mailed you a text file of the reads from one of my BAM files.

ADD REPLY
0
Entering edit mode

Thanks for providing the test file. It looks like SGSeq, and the Bioc functions that SGSeq uses to read the data, cannot pair the mate reads in your file, because values in the QNAME field have suffix /1 and /2  (i.e the QNAME for mate reads are not identical). When stripping off the suffix everything works fine. Looking at the SAM specifications and online forums, it seems like the QNAME should be identical for mate reads. So the easiest solution would be to remove the suffixes in your file, or maybe your alignment program has an option that can remove them during read mapping? I hope this helps.

ADD REPLY
1
Entering edit mode

BamFile() has arguments qnamePrefixEnd, qnameSuffixStart, which can be used to strip the prefix. The idea would be to support input as BamFile() / BamFileList(); I don't know if that's what SGSeq does.

ADD REPLY
0
Entering edit mode

Hi Martin, thanks for the comments. Yes I saw BamFile() has these arguments, which should make it straightforward to strip suffixes. However I don’t want to hard-code this behavior. So supporting it will require changes in the API of several high-level functions. Probably not a good idea to do this in the release version, but I can do it in devel. Vakul, if you are happy to work with bioc-devel, or wait until the next release, this is an option. I can let you know when I have implemented it. 

ADD REPLY
0
Entering edit mode

I'll make the changes to my BAM files and give it another go.

Thank you for the help.

Also It will be awesome if you let me know about the changes Thanx!

ADD REPLY
0
Entering edit mode

Based on Martin’s suggestion I made the following change in SGSeq 1.5.9 (available via svn now). In the ‘sample_info’ data frame, column ‘file_bam’ can now be either a character vector as previously, or it can be a BamFileList object.

If you can switch to bioc-devel, this allows you to use your current BAM files. All you have to do is change the input in your example as follows:

library(Rsamtools)
files <- DataFrame(files)
files$file_bam < - BamFileList(files$file_bam, asMates = TRUE, qnameSuffixStart = “/“)

Then run getBamInfo() and analyzeFeatures() and you should get the correct counts. 

ADD REPLY
0
Entering edit mode

Thank you, 

I'll give it a shot!

ADD REPLY
0
Entering edit mode

Hi I got the R_devel and also found the svn page for SGSeq (https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/SGSeq/), I'm however not familiar with working on the bioc-devel I will be grateful on a pointer to how I can get the package and run it. 

Thank you.

 

ADD REPLY
0
Entering edit mode

Hi, in case you haven't seen it already, there are instructions on

https://www.bioconductor.org/install/

In order to switch to bioc-devel, run function useDevel() in the BiocInstaller package, then you can use biocLite() for installing packages as usual. You don't normally have to checkout packages via SVN. When developers commit changes for a package, the latest version is not available via biocLite() immediately (it usually becomes available on the following day, depending on the time that changes were committed to the repository). If you don't want to wait until then, you can check out the most recent version via SVN immediately. However, at this point SGSeq 1.5.9 is also available via biocLite().

ADD REPLY
0
Entering edit mode

Thank you that was really informative. 

I managed to obtain and run SGSeq 1.5.9., The code runs perfectly for one sample but fails when I input multiple BAM files. Is it possible for me to use one of the apply family of functions and coerce it into the required datastructure?  

files$file_bam = BamFileList(files$file_bam, asMates = TRUE, qnameSuffixStart = "/")

> sc = getBamInfo(files[1,])
TCGA-3C-AAAU-01A.bam complete.

> sc = getBamInfo(files)
TCGA-3C-AAAU-01A.bam complete.
TCGA-3C-AALK-01A.bam complete.
TCGA-4H-AAAK-01A.bam complete.
TCGA-5L-AAT0-01A.bam complete.
TCGA-5L-AAT1-01A.bam complete.
TCGA-5T-A9QA-01A.bam complete.
Error in DataFrame(df, check.names = FALSE) : 
  cannot coerce class "list" to a DataFrame

ADD REPLY
1
Entering edit mode

Sorry - this is a bug I introduced when adding support for BamFileLists. Please try again with SGSeq 1.5.10 (you can obtain it via svn now or biocLite() tomorrow). If you have any other problems please let me know. 

To answer your question - the 'sample_info' input is just a data.frame or DataFrame object. So you can run getBamInfo() for individual samples and then create the full data frame yourself. 

ADD REPLY
0
Entering edit mode

Thank you for all the help Leonard, SGSeq work like a charm now.

ADD REPLY

Login before adding your answer.

Traffic: 390 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6