Question

SGSeq error on analyzeFeatures

0

Entering edit mode

Paolo Guarnieri ▴ 20

@paolo-guarnieri-5256

Last seen 9.4 years ago

United States

Dear all,

I am trying to test SGSeq with my own data, but I did not go that far:

I am simply trying to get your package working on a test set composed of 8 bam files (paried end) merged into 4, subset to only one chromosome.

Also the annotation is subset accordingly. I get an obscure "non-numeric argument to binary operator" error, that is not very helpful.

I tried with different bam files, generated with a different aligner, but I still get the same error. Different chromosome, produced the same results. I noticed that the fragment length column contained NAs and tried to replace with values, but no effect.

I would be grateful if someone could give me some pointers to move forward with the analysis.

Thanks,

Paolo Guarnieri, M.D.

Department of Systems Biology,

Herbert Irving Comprehensive Cancer Center,

Columbia University,

1130 St. Nicholas Ave, New York, NY 10032 (USA)

> myChr <- "chr6"
> txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
> txdb <- keepSeqlevels(txdb, myChr)
> txf_annotated <- convertToTxFeatures(txdb)
> threads <- 8
> mySamplesComplete <- getBamInfo(mySamplesMerged_si,BPPARAM = MulticoreParam(threads))

> mySamplesComplete

  sample_name                                                            file_bam paired_end read_length frag_length lib_size

1     SRSF1_1 /home/pg2296/Desktop/data_X/Datasets/CMF/rawdata/SRSF1_1.merged.bam      FALSE         100          NA   914422

2     SRSF1_2 /home/pg2296/Desktop/data_X/Datasets/CMF/rawdata/SRSF1_2.merged.bam      FALSE         100          NA   877404

3     SRSF1_3 /home/pg2296/Desktop/data_X/Datasets/CMF/rawdata/SRSF1_3.merged.bam      FALSE         100          NA   972719

4     SRSF1_4 /home/pg2296/Desktop/data_X/Datasets/CMF/rawdata/SRSF1_4.merged.bam      FALSE         100          NA   750876

> sgfc_dmd <- analyzeFeatures(sample_info = mySamplesComplete,

+                             features = txf_annotated,

+                             cores_per_sample = 2,

+                             BPPARAM = MulticoreParam(threads))

Process features...

Obtain counts...

Error in X/(E * 0.001) : non-numeric argument to binary operator

>

> txf_annotated

TxFeatures object with 186 ranges and 0 metadata columns:

        seqnames               ranges strand     type                           txName        geneName

           <Rle>            <IRanges>  <Rle> <factor>                  <CharacterList> <CharacterList>

    [1]     chrX [82814910, 82815024]      +        F                       uc009trg.1           13405

    [2]     chrX [82815024, 83136628]      +        J                       uc009trg.1           13405

    [3]     chrX [82948870, 82949147]      +        F            uc009trh.2,uc009tri.2           13405

    [4]     chrX [82949147, 83136628]      +        J            uc009trh.2,uc009tri.2           13405

    [5]     chrX [83043196, 83043478]      +        F                       uc009trj.1           13405

    ...      ...                  ...    ...      ...                              ...             ...

  [182]     chrX [85202471, 85205050]      +        L uc009tri.2,uc009trp.2,uc009trr.2           13405

  [183]     chrX [85249677, 85249744]      +        F                       uc009trs.2           71996

  [184]     chrX [85249744, 85269520]      +        J                       uc009trs.2           71996

  [185]     chrX [85269520, 85270291]      +        L                       uc009trs.2           71996

  [186]     chrX [85269613, 85270161]      +        U                       uc033jqz.1               

  -------

  seqinfo: 1 sequence from mm10 genome

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8  
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C               
[9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] XVector_0.6.0                            TxDb.Mmusculus.UCSC.mm10.knownGene_3.0.0
[3] GenomicFeatures_1.18.2                   AnnotationDbi_1.28.1                  
[5] Biobase_2.26.0                           SGSeq_1.0.6                           
[7] GenomicRanges_1.18.1                     GenomeInfoDb_1.2.2                    
[9] IRanges_2.0.0                            S4Vectors_0.4.0                       
[11] BiocGenerics_0.12.0                      BiocParallel_1.0.0                     

loaded via a namespace (and not attached):
[1] base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8              biomaRt_2.22.0       
[5] Biostrings_2.34.0       bitops_1.0-6            brew_1.0-6              checkmate_1.5.0      
[9] codetools_0.2-9         DBI_0.3.1               digest_0.6.4            fail_1.2             
[13] foreach_1.4.2           GenomicAlignments_1.2.1 igraph_0.7.1            iterators_1.0.7      
[17] RCurl_1.96-0            Rsamtools_1.18.1        RSQLite_1.0.0           rtracklayer_1.26.1   
[21] sendmailR_1.2-1         stringr_0.6.2           tools_3.1.2             XML_3.98-1.1         
[25] zlibbioc_1.12.0

SGSeq analyzeFeatures Splicing • 2.1k views

ADD COMMENT • link 9.4 years ago Paolo Guarnieri ▴ 20

score 1 · Answer 1 · 2014-12-20

Hi Paolo,

Thanks for providing your code and test files. It looks like the problem is caused by missing information in the BAM files.

Firstly, you mentioned the data are paired-end. However, your BAM files look like reads were aligned as single-end data. There is a recent post on the Bioconductor support website on how to test whether a BAM file is paired-end or single-end (https://support.bioconductor.org/p/63640/#63697). Running quickBamFlagSummary on your files suggests it is single-end, and getBamInfo in SGSeq arrives at the same conclusion.

Secondly, the BAM files are missing information from the custom tag XS:A and this is what caused the error. This tag is generated by many RNA-seq read aligners (e.g. GSNAP, TopHat) and indicates the direction of transcription (which can be inferred for spliced reads based on the splice site consensus sequences). This tag is required for use with SGSeq as the information is used for splice graph prediction, as well as assigning reads to features. Apologies this is not clear from the documentation. I have used SGSeq extensively with BAM files generated by GSNAP. So if you are re-aligning the data, I would recommend trying GSNAP.

Leonard

score 0 · Answer 2 · 2014-12-19

Hi Paolo,

Thanks for reporting the issue - an unhelpful error message indeed.

I noticed the error message can occur if there was a problem reading the BAM file. I will make sure this results in a more informative error message in the future. Is it possible that the chromosomes in your BAM files are named according to NCBI convention (“X” instead of “chrX”)? If this is the case, you can try

seqlevelsStyle(txf_annotated) <- "NCBI"

before running analyzeFeatures.

If this does not solve the problem, can you share one of the problematic BAM files with me (e.g. via an ftp site or dropbox), or point me to a publicly available BAM file that causes the problem? Then I can try to reproduce the error and fix it.

Leonard

score 0 · Answer 3 · 2014-12-19

Hi Leonard,

thanks for your reply. Unfortunately the chr is not the issue, or at least for I what I could check. I made sure it was in the right format. I checked BAM and annotation (that is why I removed the "NCBI" line). That was one of the things Brad recommended to check.

I can compile a self contained zip with script and sub-set bams for you to test if you want.

Thanks,

Paolo

score 0 · Answer 4 · 2014-12-22

Hi Leonard,

thanks for your reply.

You are right: my paired end reads were aligned with "olego" (wu et al. 2013 NAR) and the reads were treated as individual single ends and not as paired. And yes, it does not have that tag.

It was not clear to me, that tag was necessary, and even if it was clear, often I make assumptions on BAM files (aligned by others) and it would be good to add this property in the list of errors to trap and report.

I will realign with a compatible aligner and try again. Specifically we are also testing STAR and check if it is also "compatible".

Thanks,

Paolo

score 0 · Answer 5 · 2014-12-22

Hi Paolo,

I haven’t used STAR myself but the manual suggests there is an option that allows STAR to generate the XS tag. See section 4.2.2. SAM attributes and 4.2.3 Compatibility with Cufflinks/Cuffdiff in the STAR manual 2.4.0.1.

I will need to clarify the XS tag requirement in the documentation and also include relevant checks in the code. Thanks for making me aware of this. If you run into problems using SGSeq with alignments that include the XS tag, please let me know.

Leonard

score 0 · Answer 6 · 2014-12-23

Just got the alignments from STAR (STAR_2.4.0h1) and it works just fine, with no errors. the parameter you refer to is "--outSAMstrandField intronMotif", which enables that tag.

Now the function "getBamInfo" Recognizes fragment size and that is paired end.

We ran STAR it with these parameters (for reference):

STAR
--genomeDir /yourGenomeDir/mm10
--runThreadN yourNumOfCores
--outFilterMultimapNmax 1
--outSAMtype BAM SortedByCoordinate
--readFilesIn yourFileName.R1.fastq.gz yourFileName.R2.fastq.gz
--readFilesCommand zcat
--outSAMstrandField intronMotif
--outFileNamePrefix youtOutput/

Thanks again.

-P