easyRNAseq says Your annotation is not in sync with your alignments!
Mayte • 0
@mayte-6759
Last seen 9.9 years ago
United States

Dear Bioconductors,

I was using easyRNASeq to get the count matrix from BAM files, which were obtained by mapping with STAR against hg19 as the reference.

CountsGenes_biomart <- easyRNASeq(
  filesDirectory=BamPath,
  filenames=fls.bam,
  nbCore=6,
  organism='Hsapiens',
  gapped=TRUE,
  annotationMethod="biomaRt",
  format="bam",
  count=c('genes'),
  outputFormat=c("edgeR"),
  summarization=c("geneModels"),
  conditions=conditions)

 

I got the following error:

Your annotation is not in sync with your alignments! Some annotation lie outside the sequences range reported in your BAM file. You may be using two different genome versions.

Is biomaRt annotating with something other than hg19? How can I use easyRNASeq then?

I know easyRNASeq() is deprecated, but I am still trying to get the gist of simpleRNASeq(). If someone could point me to a complete example of using this function, I would appreciate it.

Thanks a lot!!!

Hi Mayte,

Ensembl switched from using human assembly GRCh37 to GRCh38 in August 2014 with their release 76:

  http://lists.ensembl.org/pipermail/announce/2014-August/thread.html

This is probably why easyRNASeq() is complaining that "your annotation is not in sync with your alignments".

AFAIK the current release of Ensembl (release 78) is still using the GRCh38 assembly. Note that this assembly is the same as hg38 from UCSC except for the chromosome/scaffold names. 
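
A quick way to see this kind of mismatch is to compare the sequence names and lengths from the BAM header with the coordinates in the annotation. The following is only a minimal base-R sketch on made-up data: the sequence lengths, the annotation table, and the helper names `to_ucsc` and `check_annotation` are all hypothetical, and this is not how easyRNASeq performs its own check.

```r
# Toy check of annotation coordinates against reference sequence lengths.
# All names and numbers below are made up for illustration.

# Sequence lengths as they might be read from a BAM header
# (for example the `targets` element returned by Rsamtools::scanBamHeader()).
bam_seqlengths <- c(chr1 = 1000L, chr2 = 800L)

# Minimal annotation table: one row per feature, Ensembl-style names.
annotation <- data.frame(
  seqname = c("1", "1", "2", "3"),
  start   = c(100L, 950L, 10L, 5L),
  end     = c(200L, 1050L, 700L, 50L),
  stringsAsFactors = FALSE
)

# Ensembl-style names ("1") versus UCSC-style names ("chr1"):
# add the "chr" prefix only where it is missing.
to_ucsc <- function(x) ifelse(grepl("^chr", x), x, paste0("chr", x))

# Flag features on sequences absent from the BAM header, and features
# whose end coordinate lies beyond the reported sequence length.
check_annotation <- function(annotation, seqlengths) {
  seqname <- to_ucsc(annotation$seqname)
  known <- seqname %in% names(seqlengths)
  limit <- seqlengths[seqname]  # NA for unknown sequences
  out_of_range <- known & annotation$end > limit
  data.frame(annotation,
             ucsc_name = seqname,
             unknown_sequence = !known,
             out_of_range = out_of_range,
             stringsAsFactors = FALSE)
}

report <- check_annotation(annotation, bam_seqlengths)
print(report[report$unknown_sequence | report$out_of_range, ])
```

Any row flagged here would correspond to an annotation that "lies outside the sequences range" of the alignments. Note that a plain prefix does not cover every naming difference; the mitochondrial sequence, for instance, is "MT" in Ensembl but "chrM" in UCSC.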

Cheers,

H.

Mayte • 0
@mayte-6759
Last seen 9.9 years ago
United States

Hej Nico

Thanks for your answer. I took the time to figure out the simpleRNASeq function and took your advice on the annotation. I got an error message that I cannot figure out. I include the short code, the error message, and the sessionInfo below. I hope you can be as helpful as always!

Best

Mayte

 

library("easyRNASeq")
library(Rsamtools)
library(DESeq)
library(edgeR)
library(GenomicRanges)
library(parallel)
library(S4Vectors)

# List the sorted BAM files once and reuse the result.
fls.bam <- list.files(path=BamPath, recursive=FALSE, pattern="sorted\\.bam$", full.names=FALSE)

bamFiles <- getBamFileList(filenames=fls.bam)

annotParam <- AnnotParam(datasource="/mydir/Homo_sapiens.GRCh37.75.tran.gtf", type="gtf")

Counts <- simpleRNASeq(
  bamFiles=bamFiles,
  param=RnaSeqParam(annotParam=annotParam, countBy='genes'),
  verbose=TRUE,
  nnodes=6
)

==========================
simpleRNASeq version 2.2.0
==========================
Creating a SummarizedExperiment.
==========================
Processing the alignments.
==========================
Pre-processing 84 BAM files.
Validating the BAM files.
Extracted 93 reference sequences information.
Error in checkForRemoteErrors(val) :
  84 nodes produced errors; first error: could not find function "DataFrame"

 

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] edgeR_3.8.5          limma_3.22.4         DESeq_1.18.0
 [4] lattice_0.20-29      locfit_1.5-9.1       Biobase_2.26.0
 [7] Rsamtools_1.18.2     Biostrings_2.34.1    XVector_0.6.0
[10] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4   IRanges_2.0.1
[13] S4Vectors_0.4.0      BiocGenerics_0.12.1  easyRNASeq_2.2.0

loaded via a namespace (and not attached):
 [1] annotate_1.44.0         AnnotationDbi_1.28.1    base64enc_0.1-2
 [4] BatchJobs_1.5           BBmisc_1.8              BiocParallel_1.0.1
 [7] biomaRt_2.22.0          bitops_1.0-6            brew_1.0-6
[10] checkmate_1.5.1         codetools_0.2-10        DBI_0.3.1
[13] digest_0.6.8            fail_1.2                foreach_1.4.2
[16] genefilter_1.48.1       geneplotter_1.44.0      genomeIntervals_1.22.0
[19] GenomicAlignments_1.2.1 GenomicFeatures_1.18.3  grid_3.1.1
[22] hwriter_1.3.2           intervals_0.15.0        iterators_1.0.7
[25] latticeExtra_0.6-26     LSD_3.0                 plyr_1.8.1
[28] RColorBrewer_1.1-2      Rcpp_0.11.4             RCurl_1.95-4.5
[31] RSQLite_1.0.0           rtracklayer_1.26.2      sendmailR_1.2-1
[34] ShortRead_1.24.0        splines_3.1.1           stringr_0.6.2
[37] survival_2.37-7         tools_3.1.1             XML_3.98-1.1
[40] xtable_1.7-4

 

@nicolas-delhomme-6252
Last seen 6.1 years ago
Sweden

 

Hej Mayte!

Using simpleRNASeq is indeed recommended over using easyRNASeq, but you would in all likelihood get the same error using simpleRNASeq.

The likely reason you observe that error is that two different versions of the genome (despite both being hg19) are being used. It is not infrequent for EnsEMBL to update genomic coordinates, which is then reflected in the data you get from biomaRt. So, if your alignments were done against a genome fasta file that is not the latest one available at EnsEMBL, this is the most likely cause of the problem.

A workaround is to download the gff (or gtf) annotation file for hg19 from EnsEMBL or UCSC and use that as the annotation source for running easyRNASeq.

Note that doing this is the preferred way anyway (read the vignette section 7.3) to create annotations that will prevent the "double counting" caveat.
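
To illustrate the "double counting" caveat, assume it refers to reads that fall on exons shared by several transcripts of the same gene being counted once per overlapping exon. One common remedy is to collapse the overlapping exons of each gene into non-overlapping intervals before counting. The sketch below does that in base R on a made-up exon table; the data and the helper `merge_intervals` are hypothetical, and this is not the code easyRNASeq or its vignette uses.

```r
# Toy exon table: two transcripts of gene "g1" share overlapping exons.
exons <- data.frame(
  gene  = c("g1", "g1", "g1", "g2"),
  start = c(100L, 150L, 400L, 50L),
  end   = c(200L, 250L, 500L, 80L),
  stringsAsFactors = FALSE
)

# Merge overlapping or directly adjacent intervals of a single gene.
merge_intervals <- function(start, end) {
  ord <- order(start, end)
  start <- start[ord]
  end <- end[ord]
  # A new block begins whenever an interval starts beyond the furthest
  # end seen so far (plus one, so that adjacent intervals are joined).
  furthest <- cummax(end)
  new_block <- c(TRUE, start[-1] > furthest[-length(furthest)] + 1L)
  block <- cumsum(new_block)
  data.frame(start = as.integer(tapply(start, block, min)),
             end   = as.integer(tapply(end, block, max)))
}

# Apply the merge gene by gene and stack the results.
merged <- do.call(rbind, lapply(split(exons, exons$gene), function(g) {
  data.frame(gene = g$gene[1],
             merge_intervals(g$start, g$end),
             stringsAsFactors = FALSE)
}))
rownames(merged) <- NULL
print(merged)
```

With the toy data, the two overlapping exons of "g1" (100-200 and 150-250) collapse into a single interval 100-250, so a read in the shared region would be counted once rather than twice.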

If you have been using the most recent EnsEMBL fasta file for your alignments, having a peek at your data would help me reproduce the issue.

There are examples of using simpleRNASeq in its man page. I can otherwise help you get the command line right if you need. I'll revise the vignette asap to integrate more adequate examples.

Finally, if you could post your sessionInfo(), that would be very helpful too.

Cheers,

Nico

@nicolas-delhomme-6252
Last seen 6.1 years ago
Sweden

Thanks Hervé for your comment! That certainly explains the original issue. 

 

Mayte, I'll have a look at your error tomorrow. If need be, would you be able to give me an excerpt of two of your BAM files? Each only needs a few entries; 50,000 should be plenty. If I need the data, I'll contact you off-list with a way for you to upload it.

Cheers,

Nico

@nicolas-delhomme-6252
Last seen 6.1 years ago
Sweden

Hej Mayte!

Sorry for taking so long to answer.

I've fixed the issue in easyRNASeq version 2.2.1. It should be available from Bioc in a couple of days - or immediately from svn.

Cheers!

Nico
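
The thread does not say what was changed in version 2.2.1. As general background, a `could not find function` message reported through `checkForRemoteErrors` usually means that code running on the worker processes relied on a package that was attached only in the main session. The sketch below reproduces that class of error with the base `parallel` package; `tools::file_ext` merely stands in for `DataFrame`, and the file names are made up.

```r
library(parallel)
library(tools)  # attached in the main session only

files <- c("a_sorted.bam", "b_sorted.bam")  # hypothetical file names
cl <- makeCluster(2)

# Fails: the workers are fresh R sessions in which 'tools' is not attached,
# so the unqualified call cannot be resolved there.
failing <- tryCatch(
  parLapply(cl, files, function(f) file_ext(f)),
  error = function(e) conditionMessage(e)
)
print(failing)

# Works: a namespace-qualified call does not depend on the search path.
working <- parLapply(cl, files, function(f) tools::file_ext(f))
print(unlist(working))

stopCluster(cl)
```

When the cluster is created inside a package function, as with the `nnodes` argument here, the user cannot attach packages on the workers, which is why this kind of problem has to be fixed in the package itself.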
