Question: Read counts generated by SGSeq are always 0
1
gravatar for vakul.mohanty
3.5 years ago by
United States
vakul.mohanty10 wrote:

Hello,

I'm running SGSeq on a set of bam files that have reads extracted from a region on chromosome 22 (29190548,29196560). When I run SGSseq on my bam files I always end up with read counts of 0 across all the features. I have sorted and indexed my bam files and the chromosome name is the same in my granges object and bam files. I will be grateful for any explanation for what I'm doing wrong.

I have my code below.

Thanking You,

Vakul

f1 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = FALSE)
f2 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = TRUE)
files = cbind(f1,f2)
files = as.data.frame(files)
colnames(files) = c("sample_name","file_bam")
files[,1] = as.character(files[,1])
files[,2] = as.character(files[,2])
#getting summary of information about bam files required by the package
sc = getBamInfo(files,cores =4)

#GRanges object of the region of intere
gr = GRanges(seqnames = "chr22",ranges = IRanges(29190548,29196560),strand = "*")

#TxDb Object
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene

txdb = keepSeqlevels(txdb,"chr22")

#extracting features
txf <- convertToTxFeatures(txdb)
txf_f <- txf[txf %over% gr]
sgfc <- analyzeFeatures(sc, features = txf_f)

Session Info

 

R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] XVector_0.8.0                           TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
 [3] GenomicFeatures_1.20.6                  AnnotationDbi_1.30.1                   
 [5] SGSeq_1.2.2                             DEXSeq_1.14.2                          
 [7] DESeq2_1.8.2                            RcppArmadillo_0.6.500.4.0              
 [9] Rcpp_0.12.3                             GenomicRanges_1.20.8                   
[11] GenomeInfoDb_1.4.3                      IRanges_2.2.9                          
[13] S4Vectors_0.6.6                         Biobase_2.28.0                         
[15] BiocGenerics_0.14.0                     BiocParallel_1.2.22                    

loaded via a namespace (and not attached):
 [1] genefilter_1.50.0       statmod_1.4.24          locfit_1.5-9.1          splines_3.2.1          
 [5] lattice_0.20-33         colorspace_1.2-6        rtracklayer_1.28.10     survival_2.38-3        
 [9] XML_3.98-1.4            foreign_0.8-66          DBI_0.3.1               RColorBrewer_1.1-2     
[13] lambda.r_1.1.7          plyr_1.8.3              stringr_1.0.0           zlibbioc_1.14.0        
[17] Biostrings_2.36.4       munsell_0.4.3           gtable_0.2.0            futile.logger_1.4.1    
[21] hwriter_1.3.2           latticeExtra_0.6-28     geneplotter_1.46.0      biomaRt_2.24.1         
[25] acepack_1.3-3.3         xtable_1.8-2            scales_0.4.0            Hmisc_3.17-2           
[29] annotate_1.46.1         Rsamtools_1.20.5        gridExtra_2.2.1         ggplot2_2.1.0          
[33] stringi_1.0-1           numDeriv_2014.2-1       grid_3.2.1              tools_3.2.1            
[37] bitops_1.0-6            magrittr_1.5            RCurl_1.95-4.8          RSQLite_1.0.0          
[41] Formula_1.2-1           cluster_2.0.3           futile.options_1.0.0    ABSOLUTE_1.0.6         
[45] rpart_4.1-10            igraph_1.0.1            GenomicAlignments_1.4.2 nnet_7.3-12

ADD COMMENTlink modified 3.5 years ago by Leonard Goldstein80 • written 3.5 years ago by vakul.mohanty10
Answer: Read counts generated by SGSeq are always 0
0
gravatar for Leonard Goldstein
3.5 years ago by
United States
Leonard Goldstein80 wrote:

Hi Vakul,

Your code looks fine. I assume your BAM files actually contain reads that map to the the XBP1 gene? Note that if you are working with paired-end data, only concordant read pairs will be considered (i.e. read pairs in the the expected orientation). Also, when obtaining counts for junctions or exons bins, SGSeq only considers reads (or read pairs) that are structurally compatible with the feature. For example, when obtaining counts for exon bins, overlapping reads have to spliced correctly (be compatible with the exon bin) to be considered. If these points don’t explain the zero counts, could you make the BAM files available, so I can have a look? Also I noticed you are working with Bioc 3.1. I suggest to update to the latest Bioc release. Thanks!

Leonard

ADD COMMENTlink written 3.5 years ago by Leonard Goldstein80

Hi Leonard, 

My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools.  I'll check for the features you mentioned in my BAM files and update my Bioc version.

If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them. 

Thanking You, 

Vakul 

ADD REPLYlink written 3.5 years ago by vakul.mohanty10

Hi Leonard, 

My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools.  I'll check for the features you mentioned in my BAM files and update my Bioc version.

If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them. 

Thanking You, 

Vakul 

ADD REPLYlink written 3.5 years ago by vakul.mohanty10

It’s probably easiest if you can send test files via dropbox or as an email attachment (assuming they are not too large if they only contain reads for XBP1). You can find my email address on the SGSeq landing page. 

 

ADD REPLYlink written 3.5 years ago by Leonard Goldstein80

Thank you,

I mailed you a text file of the reads from one of my BAM files.

ADD REPLYlink written 3.5 years ago by vakul.mohanty10

Thanks for providing the test file. It looks like SGSeq, and the Bioc functions that SGSeq uses to read the data, cannot pair the mate reads in your file, because values in the QNAME field have suffix /1 and /2  (i.e the QNAME for mate reads are not identical). When stripping off the suffix everything works fine. Looking at the SAM specifications and online forums, it seems like the QNAME should be identical for mate reads. So the easiest solution would be to remove the suffixes in your file, or maybe your alignment program has an option that can remove them during read mapping? I hope this helps.

ADD REPLYlink written 3.5 years ago by Leonard Goldstein80
1

BamFile() has arguments qnamePrefixEnd, qnameSuffixStart, which can be used to strip the prefix. The idea would be to support input as BamFile() / BamFileList(); I don't know if that's what SGSeq does.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Martin Morgan ♦♦ 23k

Hi Martin, thanks for the comments. Yes I saw BamFile() has these arguments, which should make it straightforward to strip suffixes. However I don’t want to hard-code this behavior. So supporting it will require changes in the API of several high-level functions. Probably not a good idea to do this in the release version, but I can do it in devel. Vakul, if you are happy to work with bioc-devel, or wait until the next release, this is an option. I can let you know when I have implemented it. 

ADD REPLYlink written 3.5 years ago by Leonard Goldstein80

I'll make the changes to my BAM files and give it another go.

Thank you for the help.

Also It will be awesome if you let me know about the changes Thanx!

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by vakul.mohanty10

Based on Martin’s suggestion I made the following change in SGSeq 1.5.9 (available via svn now). In the ‘sample_info’ data frame, column ‘file_bam’ can now be either a character vector as previously, or it can be a BamFileList object.

If you can switch to bioc-devel, this allows you to use your current BAM files. All you have to do is change the input in your example as follows:

library(Rsamtools)
files <- DataFrame(files)
files$file_bam < - BamFileList(files$file_bam, asMates = TRUE, qnameSuffixStart = “/“)

Then run getBamInfo() and analyzeFeatures() and you should get the correct counts. 

ADD REPLYlink written 3.5 years ago by Leonard Goldstein80

Thank you, 

I'll give it a shot!

ADD REPLYlink written 3.5 years ago by vakul.mohanty10

Hi I got the R_devel and also found the svn page for SGSeq (https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/SGSeq/), I'm however not familiar with working on the bioc-devel I will be grateful on a pointer to how I can get the package and run it. 

Thank you.

 

ADD REPLYlink written 3.5 years ago by vakul.mohanty10

Hi, in case you haven't seen it already, there are instructions on

https://www.bioconductor.org/install/

In order to switch to bioc-devel, run function useDevel() in the BiocInstaller package, then you can use biocLite() for installing packages as usual. You don't normally have to checkout packages via SVN. When developers commit changes for a package, the latest version is not available via biocLite() immediately (it usually becomes available on the following day, depending on the time that changes were committed to the repository). If you don't want to wait until then, you can check out the most recent version via SVN immediately. However, at this point SGSeq 1.5.9 is also available via biocLite().

ADD REPLYlink written 3.5 years ago by Leonard Goldstein80

Thank you that was really informative. 

I managed to obtain and run SGSeq 1.5.9., The code runs perfectly for one sample but fails when I input multiple BAM files. Is it possible for me to use one of the apply family of functions and coerce it into the required datastructure?  

files$file_bam = BamFileList(files$file_bam, asMates = TRUE, qnameSuffixStart = "/")

> sc = getBamInfo(files[1,])
TCGA-3C-AAAU-01A.bam complete.

> sc = getBamInfo(files)
TCGA-3C-AAAU-01A.bam complete.
TCGA-3C-AALK-01A.bam complete.
TCGA-4H-AAAK-01A.bam complete.
TCGA-5L-AAT0-01A.bam complete.
TCGA-5L-AAT1-01A.bam complete.
TCGA-5T-A9QA-01A.bam complete.
Error in DataFrame(df, check.names = FALSE) : 
  cannot coerce class "list" to a DataFrame

ADD REPLYlink written 3.5 years ago by vakul.mohanty10
1

Sorry - this is a bug I introduced when adding support for BamFileLists. Please try again with SGSeq 1.5.10 (you can obtain it via svn now or biocLite() tomorrow). If you have any other problems please let me know. 

To answer your question - the 'sample_info' input is just a data.frame or DataFrame object. So you can run getBamInfo() for individual samples and then create the full data frame yourself. 

ADD REPLYlink written 3.5 years ago by Leonard Goldstein80

Thank you for all the help Leonard, SGSeq work like a charm now.

ADD REPLYlink written 3.4 years ago by vakul.mohanty10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 126 users visited in the last hour