Question

easyRNAseq says Your annotation is not in sync with your alignments!

Mayte • 0

@mayte-6759

Last seen 9.2 years ago

United States

Dear Bioconductors,

I was using easyRNASeq to get the count matrix from BAM files, which were obtained by mapping with STAR against hg19 as the reference.

CountsGenes_biomart <- easyRNASeq(
    filesDirectory = BamPath,
    filenames = fls.bam,
    nbCore = 6,
    organism = 'Hsapiens',
    gapped = TRUE,
    annotationMethod = "biomaRt",
    format = "bam",
    count = c('genes'),
    outputFormat = c("edgeR"),
    summarization = c("geneModels"),
    conditions = conditions
)

I got the following error:

Your annotation is not in sync with your alignments! Some annotation lie outside the sequences range reported in your BAM file. You may be using two different genome versions.

Is biomaRt annotating with something other than hg19? How can I use easyRNASeq then?

I know easyRNASeq() is deprecated, but I am still trying to get the gist of simpleRNASeq(). If someone can point me to a complete example of using this function, I would appreciate it.

Thanks a lot!!!

easyRNAseq • 1.9k views

Hi Mayte,

Ensembl switched from using human assembly GRCh37 to GRCh38 in August 2014 with their release 76:

http://lists.ensembl.org/pipermail/announce/2014-August/thread.html

This is probably why easyRNASeq() is complaining that "your annotation is not in sync with your alignments".

AFAIK the current release of Ensembl (release 78) is still using the GRCh38 assembly. Note that this assembly is the same as hg38 from UCSC except for the chromosome/scaffold names.

Cheers,

H.
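To make the error message concrete, here is a minimal base R sketch of the kind of check it describes: an annotation is "out of sync" when a feature ends beyond the length of its sequence as reported in the BAM header. This is not the easyRNASeq implementation. The function name `out_of_range`, the data frame layout and the feature coordinates are made up for illustration; only the two sequence lengths are real (chr1 and chr2 of GRCh37/hg19).

```r
# Sequence lengths as they might be read from a BAM header (@SQ lines).
# These are the GRCh37/hg19 lengths of chromosomes 1 and 2.
bam_seqlengths <- c("1" = 249250621, "2" = 243199373)

# Hypothetical annotation, one feature per row (made-up coordinates).
annotation <- data.frame(
  chrom = c("1", "1", "2"),
  end   = c(15000, 249300000, 200000),
  stringsAsFactors = FALSE
)

# Return the rows that end beyond the length of their sequence, or that
# sit on a sequence name missing from the BAM header.
out_of_range <- function(annotation, seqlengths) {
  limit <- seqlengths[annotation$chrom]   # NA when the name is unknown
  annotation[is.na(limit) | annotation$end > limit, , drop = FALSE]
}

bad <- out_of_range(annotation, bam_seqlengths)
print(bad)
```

Here the second feature is reported, because it ends past the end of chromosome 1 in GRCh37. Any row returned by such a check suggests that the annotation and the alignments come from different assemblies or use different sequence names.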

ADD REPLY • link 9.2 years ago Hervé Pagès 16k

Nicolas Delhomme ▴ 320

@nicolas-delhomme-6252

Last seen 5.4 years ago

Sweden

Hej Mayte!

Using simpleRNASeq is indeed recommended over using easyRNASeq, but you would in all likelihood get the same error with simpleRNASeq.

The likely reason for that error is that two different versions of the genome are being used (even though both are labelled hg19). It is not infrequent that EnsEMBL updates genomic coordinates, which is then reflected in the data you get from biomaRt. So, if your alignments were done against a genome fasta file that is not the latest one available at EnsEMBL, this is the most likely cause of the problem.

A workaround is to download the gff (or gtf) annotation file for hg19 from EnsEMBL or UCSC and use that as the annotation source for running easyRNASeq.

Note that this is the preferred way anyway (see section 7.3 of the vignette), as it lets you create annotations that avoid the "double counting" caveat.

If you have been using the most recent EnsEMBL fasta file for your alignments, being able to have a peek at your data would help me reproduce the issue.

There are examples of using simpleRNASeq in its man page. I can otherwise help you get the command line right if you need. I'll revise the vignette asap to integrate more adequate examples.

Finally, if you could post your sessionInfo(), that would be very helpful too.

Cheers,

Nico
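One thing to watch for with the gtf workaround: EnsEMBL and UCSC name the same sequences differently ("1" versus "chr1"), so the names in the annotation file must match those in the BAM header. Below is a minimal base R sketch of such a conversion. The helper `ensembl_to_ucsc` is hypothetical, not part of easyRNASeq, and it only handles the primary human chromosomes, leaving scaffold names untouched.

```r
# Convert EnsEMBL-style names ("1", "X", "MT") to UCSC-style names
# ("chr1", "chrX", "chrM"). Names that are not primary chromosomes
# (e.g. scaffolds) are returned unchanged.
ensembl_to_ucsc <- function(x) {
  primary <- x %in% c(as.character(1:22), "X", "Y", "MT")
  x[primary] <- paste0("chr", sub("^MT$", "M", x[primary]))
  x
}

converted <- ensembl_to_ucsc(c("1", "22", "X", "MT", "GL000192.1"))
print(converted)
```

For annotation objects that are already loaded in R, the `seqlevelsStyle()` accessor from the GenomeInfoDb package serves the same purpose and is probably the better choice; the sketch above only illustrates what the renaming amounts to.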

Mayte • 0

United States

Hej Nico

Thanks for your answer. I took the time to figure out the simpleRNASeq function and followed your advice on the annotation. I now get an error message that I cannot figure out. I include the short code, the error message and the sessionInfo below. Hope you can be as helpful as always!

Best

Mayte

library("easyRNASeq")
library(Rsamtools)
library(DESeq)
library(edgeR)
library(GenomicRanges)
library(parallel)
library(S4Vectors)

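# 'pattern' is a regular expression, not a glob: escape the dot and drop the leading '*'
fls.bam <- list.files(path = BamPath, recursive = FALSE, pattern = "sorted\\.bam$", full.names = FALSE)

# use full paths so the files are found regardless of the working directory
bamFiles <- getBamFileList(filenames = list.files(path = BamPath, recursive = FALSE, pattern = "sorted\\.bam$", full.names = TRUE))

annotParam <- AnnotParam(datasource = "/mydir/Homo_sapiens.GRCh37.75.tran.gtf", type = "gtf")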

Counts <- simpleRNASeq(
    bamFiles = bamFiles,
    param = RnaSeqParam(annotParam = annotParam, countBy = 'genes'),
    verbose = TRUE,
    nnodes = 6
)

==========================
simpleRNASeq version 2.2.0
==========================
Creating a SummarizedExperiment.
==========================
Processing the alignments.
==========================
Pre-processing 84 BAM files.
Validating the BAM files.
Extracted 93 reference sequences information.
Error in checkForRemoteErrors(val) :
  84 nodes produced errors; first error: could not find function "DataFrame"

sessionInfo()

R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] edgeR_3.8.5          limma_3.22.4         DESeq_1.18.0
 [4] lattice_0.20-29      locfit_1.5-9.1       Biobase_2.26.0
 [7] Rsamtools_1.18.2     Biostrings_2.34.1    XVector_0.6.0
[10] GenomicRanges_1.18.4 GenomeInfoDb_1.2.4   IRanges_2.0.1
[13] S4Vectors_0.4.0      BiocGenerics_0.12.1  easyRNASeq_2.2.0

loaded via a namespace (and not attached):
 [1] annotate_1.44.0         AnnotationDbi_1.28.1    base64enc_0.1-2
 [4] BatchJobs_1.5           BBmisc_1.8              BiocParallel_1.0.1
 [7] biomaRt_2.22.0          bitops_1.0-6            brew_1.0-6
[10] checkmate_1.5.1         codetools_0.2-10        DBI_0.3.1
[13] digest_0.6.8            fail_1.2                foreach_1.4.2
[16] genefilter_1.48.1       geneplotter_1.44.0      genomeIntervals_1.22.0
[19] GenomicAlignments_1.2.1 GenomicFeatures_1.18.3  grid_3.1.1
[22] hwriter_1.3.2           intervals_0.15.0        iterators_1.0.7
[25] latticeExtra_0.6-26     LSD_3.0                 plyr_1.8.1
[28] RColorBrewer_1.1-2      Rcpp_0.11.4             RCurl_1.95-4.5
[31] RSQLite_1.0.0           rtracklayer_1.26.2      sendmailR_1.2-1
[34] ShortRead_1.24.0        splines_3.1.1           stringr_0.6.2
[37] survival_2.37-7         tools_3.1.1             XML_3.98-1.1
[40] xtable_1.7-4

Nicolas Delhomme ▴ 320

@nicolas-delhomme-6252

Last seen 5.4 years ago

Sweden

Thanks Hervé for your comment! That certainly explains the original issue.

Mayte, I'll have a look at your error tomorrow. If need be, would you be able to give me an excerpt of two of your BAM files? Each only needs to contain a few entries; 50,000 should be enough. If I need the data, I'll contact you off the list with a way for you to upload the files.

Cheers,

Nico

Nicolas Delhomme ▴ 320

@nicolas-delhomme-6252

Last seen 5.4 years ago

Sweden

Hej Mayte!

Sorry for taking so long to answer.

I've fixed the issue in easyRNASeq version 2.2.1. It should be available from Bioc in a couple of days - or immediately from svn.

Cheers!

Nico
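The thread does not say what was changed in version 2.2.1. As general background, one common way to get a "could not find function" error from checkForRemoteErrors() is that the worker processes of a cluster start as fresh R sessions, which do not inherit the packages attached in the master session. The sketch below reproduces that situation using only the parallel and tools packages that ship with R; `file_ext()` stands in for `DataFrame()`, and nothing here is easyRNASeq code.

```r
library(parallel)

# Start two fresh worker processes. PSOCK workers only attach the default
# packages, not whatever is attached in the master session.
cl <- makeCluster(2)

# file_ext() lives in the 'tools' package, which is not attached by default,
# so calling it unqualified on the workers is expected to fail.
first_try <- tryCatch(
  parLapply(cl, c("a.bam", "b.bam"), function(f) file_ext(f)),
  error = function(e) conditionMessage(e)
)

# Attaching the package on every worker lets the same call succeed.
invisible(clusterEvalQ(cl, library(tools)))
second_try <- parLapply(cl, c("a.bam", "b.bam"), function(f) file_ext(f))

stopCluster(cl)

print(first_try)
print(unlist(second_try))
```

Calling the function with its namespace prefix (`tools::file_ext(f)`) inside the worker function is an alternative that does not depend on what is attached on the workers.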


score 1 · Accepted Answer · 2015-01-31
