Question: Read counts generated by SGSeq are always 0
1
3.5 years ago by
United States
vakul.mohanty10 wrote:

Hello,

I'm running SGSeq on a set of bam files that have reads extracted from a region on chromosome 22 (29190548,29196560). When I run SGSseq on my bam files I always end up with read counts of 0 across all the features. I have sorted and indexed my bam files and the chromosome name is the same in my granges object and bam files. I will be grateful for any explanation for what I'm doing wrong.

I have my code below.

Thanking You,

Vakul

f1 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = FALSE) f2 = list.files(path= "/media/trantor/9E76CA7076CA492B/out/sort",pattern = "bam$",full.names = TRUE) files = cbind(f1,f2) files = as.data.frame(files) colnames(files) = c("sample_name","file_bam") files[,1] = as.character(files[,1]) files[,2] = as.character(files[,2]) #getting summary of information about bam files required by the package sc = getBamInfo(files,cores =4)

#GRanges object of the region of intere gr = GRanges(seqnames = "chr22",ranges = IRanges(29190548,29196560),strand = "*")

#TxDb Object library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb = TxDb.Hsapiens.UCSC.hg19.knownGene

txdb = keepSeqlevels(txdb,"chr22")

#extracting features txf <- convertToTxFeatures(txdb) txf_f <- txf[txf %over% gr] sgfc <- analyzeFeatures(sc, features = txf_f)

Session Info

R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] XVector_0.8.0                           TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
[3] GenomicFeatures_1.20.6                  AnnotationDbi_1.30.1
[5] SGSeq_1.2.2                             DEXSeq_1.14.2
[9] Rcpp_0.12.3                             GenomicRanges_1.20.8
[11] GenomeInfoDb_1.4.3                      IRanges_2.2.9
[13] S4Vectors_0.6.6                         Biobase_2.28.0
[15] BiocGenerics_0.14.0                     BiocParallel_1.2.22

loaded via a namespace (and not attached):
[1] genefilter_1.50.0       statmod_1.4.24          locfit_1.5-9.1          splines_3.2.1
[5] lattice_0.20-33         colorspace_1.2-6        rtracklayer_1.28.10     survival_2.38-3
[9] XML_3.98-1.4            foreign_0.8-66          DBI_0.3.1               RColorBrewer_1.1-2
[13] lambda.r_1.1.7          plyr_1.8.3              stringr_1.0.0           zlibbioc_1.14.0
[17] Biostrings_2.36.4       munsell_0.4.3           gtable_0.2.0            futile.logger_1.4.1
[21] hwriter_1.3.2           latticeExtra_0.6-28     geneplotter_1.46.0      biomaRt_2.24.1
[25] acepack_1.3-3.3         xtable_1.8-2            scales_0.4.0            Hmisc_3.17-2
[29] annotate_1.46.1         Rsamtools_1.20.5        gridExtra_2.2.1         ggplot2_2.1.0
[33] stringi_1.0-1           numDeriv_2014.2-1       grid_3.2.1              tools_3.2.1
[37] bitops_1.0-6            magrittr_1.5            RCurl_1.95-4.8          RSQLite_1.0.0
[41] Formula_1.2-1           cluster_2.0.3           futile.options_1.0.0    ABSOLUTE_1.0.6
[45] rpart_4.1-10            igraph_1.0.1            GenomicAlignments_1.4.2 nnet_7.3-12

modified 3.5 years ago by Leonard Goldstein80 • written 3.5 years ago by vakul.mohanty10
0
3.5 years ago by
United States
Leonard Goldstein80 wrote:

Hi Vakul,

Your code looks fine. I assume your BAM files actually contain reads that map to the the XBP1 gene? Note that if you are working with paired-end data, only concordant read pairs will be considered (i.e. read pairs in the the expected orientation). Also, when obtaining counts for junctions or exons bins, SGSeq only considers reads (or read pairs) that are structurally compatible with the feature. For example, when obtaining counts for exon bins, overlapping reads have to spliced correctly (be compatible with the exon bin) to be considered. If these points don’t explain the zero counts, could you make the BAM files available, so I can have a look? Also I noticed you are working with Bioc 3.1. I suggest to update to the latest Bioc release. Thanks!

Leonard

Hi Leonard,

My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools.  I'll check for the features you mentioned in my BAM files and update my Bioc version.

If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them.

Thanking You,

Vakul

Hi Leonard,

My BAM files contain reads that map to the genomic region corresponding to XBP1, I extracted them using samtools.  I'll check for the features you mentioned in my BAM files and update my Bioc version.

If you could provide me a way to send you a couple of my BAM files I will be grateful if you could take a look at them.

Thanking You,

Vakul

It’s probably easiest if you can send test files via dropbox or as an email attachment (assuming they are not too large if they only contain reads for XBP1). You can find my email address on the SGSeq landing page.

Thank you,

I mailed you a text file of the reads from one of my BAM files.

Thanks for providing the test file. It looks like SGSeq, and the Bioc functions that SGSeq uses to read the data, cannot pair the mate reads in your file, because values in the QNAME field have suffix /1 and /2  (i.e the QNAME for mate reads are not identical). When stripping off the suffix everything works fine. Looking at the SAM specifications and online forums, it seems like the QNAME should be identical for mate reads. So the easiest solution would be to remove the suffixes in your file, or maybe your alignment program has an option that can remove them during read mapping? I hope this helps.

1

BamFile() has arguments qnamePrefixEnd, qnameSuffixStart, which can be used to strip the prefix. The idea would be to support input as BamFile() / BamFileList(); I don't know if that's what SGSeq does.

Hi Martin, thanks for the comments. Yes I saw BamFile() has these arguments, which should make it straightforward to strip suffixes. However I don’t want to hard-code this behavior. So supporting it will require changes in the API of several high-level functions. Probably not a good idea to do this in the release version, but I can do it in devel. Vakul, if you are happy to work with bioc-devel, or wait until the next release, this is an option. I can let you know when I have implemented it.

I'll make the changes to my BAM files and give it another go.

Thank you for the help.

Also It will be awesome if you let me know about the changes Thanx!

Based on Martin’s suggestion I made the following change in SGSeq 1.5.9 (available via svn now). In the ‘sample_info’ data frame, column ‘file_bam’ can now be either a character vector as previously, or it can be a BamFileList object.

If you can switch to bioc-devel, this allows you to use your current BAM files. All you have to do is change the input in your example as follows:

library(Rsamtools)
files <- DataFrame(files)
files$file_bam < - BamFileList(files$file_bam, asMates = TRUE, qnameSuffixStart = “/“)

Then run getBamInfo() and analyzeFeatures() and you should get the correct counts.

Thank you,

I'll give it a shot!

Hi I got the R_devel and also found the svn page for SGSeq (https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/SGSeq/), I'm however not familiar with working on the bioc-devel I will be grateful on a pointer to how I can get the package and run it.

Thank you.

Hi, in case you haven't seen it already, there are instructions on

https://www.bioconductor.org/install/

In order to switch to bioc-devel, run function useDevel() in the BiocInstaller package, then you can use biocLite() for installing packages as usual. You don't normally have to checkout packages via SVN. When developers commit changes for a package, the latest version is not available via biocLite() immediately (it usually becomes available on the following day, depending on the time that changes were committed to the repository). If you don't want to wait until then, you can check out the most recent version via SVN immediately. However, at this point SGSeq 1.5.9 is also available via biocLite().

Thank you that was really informative.

I managed to obtain and run SGSeq 1.5.9., The code runs perfectly for one sample but fails when I input multiple BAM files. Is it possible for me to use one of the apply family of functions and coerce it into the required datastructure?

files$file_bam = BamFileList(files$file_bam, asMates = TRUE, qnameSuffixStart = "/")

> sc = getBamInfo(files[1,])
TCGA-3C-AAAU-01A.bam complete.

> sc = getBamInfo(files)
TCGA-3C-AAAU-01A.bam complete.
TCGA-3C-AALK-01A.bam complete.
TCGA-4H-AAAK-01A.bam complete.
TCGA-5L-AAT0-01A.bam complete.
TCGA-5L-AAT1-01A.bam complete.
TCGA-5T-A9QA-01A.bam complete.
Error in DataFrame(df, check.names = FALSE) :
cannot coerce class "list" to a DataFrame

1

Sorry - this is a bug I introduced when adding support for BamFileLists. Please try again with SGSeq 1.5.10 (you can obtain it via svn now or biocLite() tomorrow). If you have any other problems please let me know.

To answer your question - the 'sample_info' input is just a data.frame or DataFrame object. So you can run getBamInfo() for individual samples and then create the full data frame yourself.