Issues with SGSeq: analyzeFeatures() returns errors on BAM files
3
1
Entering edit mode
@sylvain-foisy-5539
Last seen 5.2 years ago
Canada

Hi,

I am trying to run SGSeq over my data and I have issues with the analyzeFeatures method. I prepared both my data with getBamInfo() and built my annotations with importTranscripts() method using the GTF file that I used for alignment with tophat 2. At the analyzeFeatures() step, I get this:

> sgfc_ncbi<-analyzeFeatures(data.r,features=sgfdb,cores=8)

Obtain counts...

Erreur : 

Error in getSGFeatureCountsPerSample for MONOCYTES_169377:

Error in value[[3L]](cond) : 'scanBam' failed:

  record: 0

  error: 0

  file: /path/toward/myBamFiles/accepted_hits_properly_paired.bam

  index: /path/toward/myBamFiles/accepted_hits_properly_paired.bam

 

And so on for all my samples... These are paired-end BAM files, as checked via getBamInfo() and they do have the XS tag so I am at a lost right now.

Just in case:

> sessionInfo()

R version 3.2.0 (2015-04-16)

Platform: x86_64-unknown-linux-gnu (64-bit)

Running under: CentOS release 6.5 (Final)

 

locale:

 [1] LC_CTYPE=fr_CA.UTF-8       LC_NUMERIC=C              

 [3] LC_TIME=fr_CA.UTF-8        LC_COLLATE=fr_CA.UTF-8    

 [5] LC_MONETARY=fr_CA.UTF-8    LC_MESSAGES=fr_CA.UTF-8   

 [7] LC_PAPER=fr_CA.UTF-8       LC_NAME=C                 

 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

[11] LC_MEASUREMENT=fr_CA.UTF-8 LC_IDENTIFICATION=C       

 

attached base packages:

[1] stats4    parallel  stats     graphics  grDevices utils     datasets 

[8] methods   base     

 

other attached packages:

[1] SGSeq_1.4.3          GenomicRanges_1.22.4 GenomeInfoDb_1.6.3  

[4] IRanges_2.4.8        S4Vectors_0.8.11     BiocGenerics_0.16.1 

 

loaded via a namespace (and not attached):

 [1] igraph_1.0.1               AnnotationDbi_1.32.3      

 [3] XVector_0.10.0             magrittr_1.5              

 [5] zlibbioc_1.16.0            GenomicAlignments_1.6.3   

 [7] BiocParallel_1.4.3         tools_3.2.0               

 [9] SummarizedExperiment_1.0.2 Biobase_2.30.0            

[11] DBI_0.4-1                  lambda.r_1.1.9            

[13] futile.logger_1.4.3        rtracklayer_1.30.4        

[15] futile.options_1.0.0       bitops_1.0-6              

[17] RUnit_0.4.31               biomaRt_2.26.1            

[19] RCurl_1.95-4.8             RSQLite_1.0.0             

[21] GenomicFeatures_1.22.13    Biostrings_2.38.4         

[23] Rsamtools_1.22.0           XML_3.98-1.4 

Any idea?

Best regards

Sylvain Foisy, Ph. D.

Montreal Heart Institute

Montreal, Qc

 

sgseq • 3.5k views
ADD COMMENT
2
Entering edit mode
@leonard-goldstein-6845
Last seen 5 months ago
Australia

Hi Sylvain, 

I don't know for sure without seeing your data, but one situation that leads to this type of error is if the annotation and BAM files have incompatible chromosome names i.e. one might follow UCSC naming conventions (e.g. "chr1") and the other NCBI conventions (e.g. "1"). Can you check that these are compatible? Also note that you are working with an outdated Bioconductor release (although this should be unrelated to your problem).

Best regards

Leonard

 

 

ADD COMMENT
0
Entering edit mode

Hi Leonard,

Yes take 1, indeed my alignements were made with annotations from Illumina's iGenome data from NCBI; I used "seqlevelsStyle(txdb) <- "NCBI"" as specified elsewhere to go over this problem. Yes, take 2, you are right it ain't the latest but as of now, I am still stuck using this version for a while... I am trying the same code on a newer version on my Mac.

Is SGSeq using the BAI file created via samtools? I ask this because of the line "index: " which is pointing toward the BAM file, not its index... What is the scanBam method actually doing that might fail?

Best regards

S

ADD REPLY
0
Entering edit mode

Hi Leonard,

Ok, I looked into running elsewhere and well, I have errors of a different kind. I build a smaller version of my analysis (2 BAM per condition with only 2 conditions) and using the latest R/Bioconductor on my MacBook. Using the same pipeline, I now get these:

Obtain counts...
Erreur : 
Error in getSGFeatureCountsPerSample for MONOCYTES_169377:
Error in .io_bam(.scan_bamfile, file, reverseComplement, yieldSize(file),  : 
  seqlevels(param) not in BAM header:
    seqlevels: ‘chr1_GL383518v1_alt’, ‘chr1_GL383519v1_alt’, ‘chr1_GL383520v2_alt’, ‘chr1_KI270759v1_alt’, ‘chr1_KI270761v1_alt’, ‘chr1_KI270762v1_alt’, ‘chr1_KI270763v1_alt’, ‘chr1_KI270765v1_alt’, ‘chr1_KI270766v1_alt’, ‘chr1_KI270892v1_alt’, ‘chr2_GL383521v1_alt’, ‘chr2_GL383522v1_alt’, ‘chr2_GL582966v2_alt’, ‘chr2_KI270767v1_alt’, ‘chr2_KI270768v1_alt’, ‘chr2_KI270769v1_alt’, ‘chr2_KI270770v1_alt’, ‘chr2_KI270772v1_alt’, ‘chr2_KI270773v1_alt’, ‘chr2_KI270774v1_alt’, ‘chr2_KI270776v1_alt’, ‘chr2_KI270893v1_alt’, ‘chr2_KI270894v1_alt’, ‘chr3_GL383526v1_alt’, ‘chr3_JH636055v2_alt’, ‘chr3_KI270777v1_alt’, ‘chr3_KI270779v1_alt’, ‘chr3_KI270780v1_alt’, ‘chr3_KI270782v1_alt’, ‘chr3_KI270783'

This tells you something interesting? FYI, I did the original alignments using tophat 2 with the Illumina's iGenomes human GRCh38-based indexes and GTF files.

Best regards

S

ADD REPLY
1
Entering edit mode

Hi Sylvain,

Yes the more recent Bioconductor installation gives a more informative error message. It tells you that the errors are due to chromosome names in your annotation that cannot be found in the BAM file header. If you remove the problematic chromosomes from your annotation, this should fix the problem, e.g. for your data you can use something like 

> seqlevels_in_bam <- seqlevels(BamFile(data.r$file_bam[1]))
> sgfdb2 <- keepSeqlevels(sgfdb, seqlevels_in_bam)

and then run analyzeFeatures() using sgfdb2.

Regarding your earlier question, the scanBam() function has an argument 'index' which can be used to specify the name of the index file. By default this is simply the name of the BAM file with extension '.bai' appended.

I hope this answers your questions, let me know if you run into any other problems.

Best wishes

Leonard

ADD REPLY
0
Entering edit mode

Hi Leonard,

Success!! Your last inputs made the thing work ;-) I guess that I will have to make some preprocessing first on this criteria.

Thanks for all your help. Can't promise I won't need additional help in a few hours/days ;-)

S

ADD REPLY
2
Entering edit mode
oolongoni ▴ 40
@oolongoni-2289
Last seen 6.9 years ago
California

Just had a similar problem but figured it out thanks to this post! `seqlevels` in bam files did not match `hsapiens_gene_ensembl` annotations.

If using Ensembl annotation

> hsaEnsembl <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl")
> txf <- convertToTxFeatures(hsaEnsembl)
> Gene <- genes(edb, filter = list(GenenameFilter("CD19"), GeneBiotypeFilter("protein_coding")))
> txf_CD19 <- txf[txf %over% Gene]

But the chromosomes do not match

> seqlevels(BamFile(si$file_bam[1]))
[1] "1"          "2"          "3"          "4"          "5"          "6"          "7"          "8"          "9"          "10"      
[11] "11"         "12"         "13"         "14"         "15"         "16"         "17"         "18"         "19"         "20"      
[21] "21"         "22"         "X"          "Y"          "MT"         "GL000191.1" "GL000192.1" "GL000193.1" "GL000194.1" "GL000195.1"
[31] "GL000196.1" "GL000197.1" "GL000198.1" "GL000199.1" "GL000200.1" "GL000201.1" "GL000202.1" "GL000203.1" "GL000204.1" "GL000205.1"
[41] "GL000206.1" "GL000207.1" "GL000208.1" "GL000209.1" "GL000210.1" "GL000211.1" "GL000212.1" "GL000213.1" "GL000214.1" "GL000215.1"
[51] "GL000216.1" "GL000217.1" "GL000218.1" "GL000219.1" "GL000220.1" "GL000221.1" "GL000222.1" "GL000223.1" "GL000224.1" "GL000225.1"
[61] "GL000226.1" "GL000227.1" "GL000228.1" "GL000229.1" "GL000230.1" "GL000231.1" "GL000232.1" "GL000233.1" "GL000234.1" "GL000235.1"
[71] "GL000236.1" "GL000237.1" "GL000238.1" "GL000239.1" "GL000240.1" "GL000241.1" "GL000242.1" "GL000243.1" "GL000244.1" "GL000245.1"
[81] "GL000246.1" "GL000247.1" "GL000248.1" "GL000249.1"

> seqlevels(txf_CD19)[grepl("GL", seqlevels(txf_CD19))]
[1] "GL000008.2" "GL000009.2" "GL000194.1" "GL000195.1" "GL000205.2" "GL000208.1" "GL000213.1" "GL000214.1" "GL000216.2" "GL000218.1"
[11] "GL000219.1" "GL000220.1" "GL000221.1" "GL000224.1" "GL000225.1" "GL000226.1"

Just keep only chromosomes 1-22,X,Y,MT

> seqlevels_in_bam <- seqlevels(BamFile(si$file_bam[1]))[1:25]
> txf2 <- keepSeqlevels(txf_CD19, seqlevels_in_bam)
>  sgfc_CD19 <- analyzeFeatures(si, features = txf2)

 

ADD COMMENT
0
Entering edit mode
anjanhazra ▴ 10
@anjanhazra-22513
Last seen 4.9 years ago

I followed the instructions privided by Dr. Leonard to overcome the similar issue and got this. Can you please help out to get rid of this error?

sgfWRKY332 <- keepSeqlevels(sgfWRKY33, seqlevelsinbam) Error in keepSeqlevels(sgfWRKY33, seqlevelsinbam) : invalid seqlevels: Scaffold2, Scaffold4, Scaffold5, Scaffold7, Scaffold8, Scaffold9, Scaffold11, Scaffold12, Scaffold13, Scaffold14, Scaffold16, Scaffold17, Scaffold18, Scaffold22, Scaffold23, Scaffold26, Scaffold27, Scaffold30, Scaffold31, Scaffold33, Scaffold35, Scaffold36, Scaffold38, Scaffold39, Scaffold41, Scaffold42, Scaffold43, Scaffold44, Scaffold45, Scaffold46, Scaffold47, Scaffold48, Scaffold51, Scaffold52, Scaffold54, Scaffold56, Scaffold57, Scaffold58, Scaffold60, Scaffold61, Scaffold62, Scaffold64, Scaffold65, Scaffold66, Scaffold67, Scaffold70, Scaffold71, Scaffold72, Scaffold73, Scaffold74, Scaffold76, Scaffold77, Scaffold78, Scaffold79, Scaffold80, Scaffold81, Scaffold82, Scaffold83, Scaffold85, Scaffold86, Scaffold87, Scaffold88, Scaffold89, Scaffold90, Scaffold91, Scaffold92, Scaffold93, Scaffold94, Scaffold95, Scaffold96, Scaffold97, Scaffold98, Scaffold99, Scaffold100, Scaffold102, Scaffold103, Scaffold104, Scaffold105, Scaffold106, Scaff

ADD COMMENT
0
Entering edit mode

You are trying to keep seqlevels that are not in your object. Try intersecting with existing seqlevels:

sgfWRKY332 <- keepSeqlevels(sgfWRKY33, intersect(seqlevelsinbam, seqlevels(sgfWRKY33)))
ADD REPLY

Login before adding your answer.

Traffic: 805 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6