Question: Issues with SGSeq: analyzeFeatures() returns errors on BAM files
0
gravatar for Sylvain Foisy
3.0 years ago by
Canada
Sylvain Foisy30 wrote:

Hi,

I am trying to run SGSeq over my data and I have issues with the analyzeFeatures method. I prepared both my data with getBamInfo() and built my annotations with importTranscripts() method using the GTF file that I used for alignment with tophat 2. At the analyzeFeatures() step, I get this:

> sgfc_ncbi<-analyzeFeatures(data.r,features=sgfdb,cores=8)

Obtain counts...

Erreur : 

Error in getSGFeatureCountsPerSample for MONOCYTES_169377:

Error in value[[3L]](cond) : 'scanBam' failed:

  record: 0

  error: 0

  file: /path/toward/myBamFiles/accepted_hits_properly_paired.bam

  index: /path/toward/myBamFiles/accepted_hits_properly_paired.bam

 

And so on for all my samples... These are paired-end BAM files, as checked via getBamInfo() and they do have the XS tag so I am at a lost right now.

Just in case:

> sessionInfo()

R version 3.2.0 (2015-04-16)

Platform: x86_64-unknown-linux-gnu (64-bit)

Running under: CentOS release 6.5 (Final)

 

locale:

 [1] LC_CTYPE=fr_CA.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=fr_CA.UTF-8        LC_COLLATE=fr_CA.UTF-8    

 [5] LC_MONETARY=fr_CA.UTF-8    LC_MESSAGES=fr_CA.UTF-8   

 [7] LC_PAPER=fr_CA.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=fr_CA.UTF-8 LC_IDENTIFICATION=C       

 

attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     

 

other attached packages:

[1] SGSeq_1.4.3          GenomicRanges_1.22.4 GenomeInfoDb_1.6.3  

[4] IRanges_2.4.8        S4Vectors_0.8.11     BiocGenerics_0.16.1 

 

loaded via a namespace (and not attached):

 [1] igraph_1.0.1               AnnotationDbi_1.32.3      

 [3] XVector_0.10.0             magrittr_1.5              

 [5] zlibbioc_1.16.0            GenomicAlignments_1.6.3   

 [7] BiocParallel_1.4.3         tools_3.2.0               

 [9] SummarizedExperiment_1.0.2 Biobase_2.30.0            

[11] DBI_0.4-1                  lambda.r_1.1.9            

[13] futile.logger_1.4.3        rtracklayer_1.30.4        

[15] futile.options_1.0.0       bitops_1.0-6              

[17] RUnit_0.4.31               biomaRt_2.26.1            

[19] RCurl_1.95-4.8             RSQLite_1.0.0             

[21] GenomicFeatures_1.22.13    Biostrings_2.38.4         

[23] Rsamtools_1.22.0           XML_3.98-1.4 

Any idea?

Best regards

Sylvain Foisy, Ph. D.

Montreal Heart Institute

Montreal, Qc

 

sgseq • 1.1k views
ADD COMMENTlink modified 22 months ago by oolongoni10 • written 3.0 years ago by Sylvain Foisy30
Answer: Issues with SGSeq: analyzeFeatures() returns errors on BAM files
2
gravatar for Leonard Goldstein
3.0 years ago by
United States
Leonard Goldstein80 wrote:

Hi Sylvain, 

I don't know for sure without seeing your data, but one situation that leads to this type of error is if the annotation and BAM files have incompatible chromosome names i.e. one might follow UCSC naming conventions (e.g. "chr1") and the other NCBI conventions (e.g. "1"). Can you check that these are compatible? Also note that you are working with an outdated Bioconductor release (although this should be unrelated to your problem).

Best regards

Leonard

 

 

ADD COMMENTlink written 3.0 years ago by Leonard Goldstein80

Hi Leonard,

Yes take 1, indeed my alignements were made with annotations from Illumina's iGenome data from NCBI; I used "seqlevelsStyle(txdb) <- "NCBI"" as specified elsewhere to go over this problem. Yes, take 2, you are right it ain't the latest but as of now, I am still stuck using this version for a while... I am trying the same code on a newer version on my Mac.

Is SGSeq using the BAI file created via samtools? I ask this because of the line "index: " which is pointing toward the BAM file, not its index... What is the scanBam method actually doing that might fail?

Best regards

S

ADD REPLYlink written 3.0 years ago by Sylvain Foisy30

Hi Leonard,

Ok, I looked into running elsewhere and well, I have errors of a different kind. I build a smaller version of my analysis (2 BAM per condition with only 2 conditions) and using the latest R/Bioconductor on my MacBook. Using the same pipeline, I now get these:

Obtain counts...
Erreur : 
Error in getSGFeatureCountsPerSample for MONOCYTES_169377:
Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file),  : 
  seqlevels(param) not in BAM header:
    seqlevels: ‘chr1_GL383518v1_alt’, ‘chr1_GL383519v1_alt’, ‘chr1_GL383520v2_alt’, ‘chr1_KI270759v1_alt’, ‘chr1_KI270761v1_alt’, ‘chr1_KI270762v1_alt’, ‘chr1_KI270763v1_alt’, ‘chr1_KI270765v1_alt’, ‘chr1_KI270766v1_alt’, ‘chr1_KI270892v1_alt’, ‘chr2_GL383521v1_alt’, ‘chr2_GL383522v1_alt’, ‘chr2_GL582966v2_alt’, ‘chr2_KI270767v1_alt’, ‘chr2_KI270768v1_alt’, ‘chr2_KI270769v1_alt’, ‘chr2_KI270770v1_alt’, ‘chr2_KI270772v1_alt’, ‘chr2_KI270773v1_alt’, ‘chr2_KI270774v1_alt’, ‘chr2_KI270776v1_alt’, ‘chr2_KI270893v1_alt’, ‘chr2_KI270894v1_alt’, ‘chr3_GL383526v1_alt’, ‘chr3_JH636055v2_alt’, ‘chr3_KI270777v1_alt’, ‘chr3_KI270779v1_alt’, ‘chr3_KI270780v1_alt’, ‘chr3_KI270782v1_alt’, ‘chr3_KI270783'

This tells you something interesting? FYI, I did the original alignments using tophat 2 with the Illumina's iGenomes human GRCh38-based indexes and GTF files.

Best regards

S

ADD REPLYlink written 3.0 years ago by Sylvain Foisy30

Hi Sylvain,

Yes the more recent Bioconductor installation gives a more informative error message. It tells you that the errors are due to chromosome names in your annotation that cannot be found in the BAM file header. If you remove the problematic chromosomes from your annotation, this should fix the problem, e.g. for your data you can use something like 

> seqlevels_in_bam <- seqlevels(BamFile(data.r$file_bam[1]))
> sgfdb2 <- keepSeqlevels(sgfdb, seqlevels_in_bam)

and then run analyzeFeatures() using sgfdb2.

Regarding your earlier question, the scanBam() function has an argument 'index' which can be used to specify the name of the index file. By default this is simply the name of the BAM file with extension '.bai' appended.

I hope this answers your questions, let me know if you run into any other problems.

Best wishes

Leonard

ADD REPLYlink written 3.0 years ago by Leonard Goldstein80

Hi Leonard,

Success!! Your last inputs made the thing work ;-) I guess that I will have to make some preprocessing first on this criteria.

Thanks for all your help. Can't promise I won't need additional help in a few hours/days ;-)

S

ADD REPLYlink written 3.0 years ago by Sylvain Foisy30
Answer: Issues with SGSeq: analyzeFeatures() returns errors on BAM files
1
gravatar for oolongoni
22 months ago by
oolongoni10
California
oolongoni10 wrote:

Just had a similar problem but figured it out thanks to this post! `seqlevels` in bam files did not match `hsapiens_gene_ensembl` annotations.

If using Ensembl annotation

> hsaEnsembl <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl")
> txf <- convertToTxFeatures(hsaEnsembl)
> Gene <- genes(edb, filter = list(GenenameFilter("CD19"), GeneBiotypeFilter("protein_coding")))
> txf_CD19 <- txf[txf %over% Gene]

But the chromosomes do not match

> seqlevels(BamFile(si$file_bam[1]))
[1] "1"          "2"          "3"          "4"          "5"          "6"          "7"          "8"          "9"          "10"      
[11] "11"         "12"         "13"         "14"         "15"         "16"         "17"         "18"         "19"         "20"      
[21] "21"         "22"         "X"          "Y"          "MT"         "GL000191.1" "GL000192.1" "GL000193.1" "GL000194.1" "GL000195.1"
[31] "GL000196.1" "GL000197.1" "GL000198.1" "GL000199.1" "GL000200.1" "GL000201.1" "GL000202.1" "GL000203.1" "GL000204.1" "GL000205.1"
[41] "GL000206.1" "GL000207.1" "GL000208.1" "GL000209.1" "GL000210.1" "GL000211.1" "GL000212.1" "GL000213.1" "GL000214.1" "GL000215.1"
[51] "GL000216.1" "GL000217.1" "GL000218.1" "GL000219.1" "GL000220.1" "GL000221.1" "GL000222.1" "GL000223.1" "GL000224.1" "GL000225.1"
[61] "GL000226.1" "GL000227.1" "GL000228.1" "GL000229.1" "GL000230.1" "GL000231.1" "GL000232.1" "GL000233.1" "GL000234.1" "GL000235.1"
[71] "GL000236.1" "GL000237.1" "GL000238.1" "GL000239.1" "GL000240.1" "GL000241.1" "GL000242.1" "GL000243.1" "GL000244.1" "GL000245.1"
[81] "GL000246.1" "GL000247.1" "GL000248.1" "GL000249.1"

> seqlevels(txf_CD19)[grepl("GL", seqlevels(txf_CD19))]
[1] "GL000008.2" "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000208.1" "GL000213.1" "GL000214.1" "GL000216.2" "GL000218.1"
[11] "GL000219.1" "GL000220.1" "GL000221.1" "GL000224.1" "GL000225.1" "GL000226.1"

Just keep only chromosomes 1-22,X,Y,MT

> seqlevels_in_bam <- seqlevels(BamFile(si$file_bam[1]))[1:25]
> txf2 <- keepSeqlevels(txf_CD19, seqlevels_in_bam)
>  sgfc_CD19 <- analyzeFeatures(si, features = txf2)

 

ADD COMMENTlink modified 22 months ago • written 22 months ago by oolongoni10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 179 users visited in the last hour