How to combine different annotation (chromosomic, genic) when using easyRNASeq (was easyRNASeq error)
1
0
Entering edit mode
@delhommeemblde-3232
Last seen 9.6 years ago
Hi Wade, Here is a description on how to deal with the problem you reported in the thread: "easyRNASeq error". I just paste the R code below, check the comments for the explanation of the different steps. I'll add a section in the vignette about this. Cheers, Nico ## libs library(easyRNASeq) ## >= 1.1.10 ## set wd setwd("Your working dir") ## read in your data ## this is inefficient in comparison to a bam file, a you load as well the whole sequences in memory aln <- readAligned("data",type="SolexaExport",pattern="*.txt.gz") ## free some mem! gc() ## the names are different from what we expect (UCSC standards), as they have an additional .fa extension levels(chromosome(aln)) ## faster than a table on the same content: table(chromosome(aln)) ## they are different from what biomaRt will return as well, as these are 1:19, X, Y and MT plus others ## therefore it will be a problem from easyRNASeq as you have seen when you want to use biomaRt as an annotation source ## for this easyRNASeq requires the name of the organism, which circumvent the "custom" chromosome name map by-pass ## The solution is to fetch the annotation first obj <- fetchAnnotation(new('RNAseq', organismName="Mmusculus" ), method="biomaRt") ## the annotation are in the genomicAnnotation slot gAnnot <- genomicAnnotation(obj) ## shall we subset to the chromosome we're interested in ## there are only 1181 NT contigs, so let's keep them length(grep("NT_",space(gAnnot))) ## rename the chromosomes in the annotation ## that will ease our chr.map table generation ## note that it would not be necessary names(gAnnot) <- paste("chr",names(gAnnot),".fa",sep="") ## run easyRNASeq using the local annotation ## since we are within a script, I'll directly use the gAnnot object. ## However as you will see this will raise double counting warnings. ## double counting reads is not what ones want, that's why the better way ## would be to rework the obtained annotation to remove/clarify these cases ## and save the obtained annotation as an rda file on your file system ## save(gAnnot,file="myAnnotation.rda") ## easyRNASeq can use such an rda file as annotation as well ## first I need some space rm(aln,obj) gc() ## get the chromosome sizes ## Again, note that the use of the BSgenome ## is not mandatory. It's just easy as they are ## available in Bioconductor. Typing in your own ## chromosome size list is as valid. library(BSgenome.Mmusculus.UCSC.mm9) ## note that this might change to a vector soon, which is more intuitive ## it's historicaly due to the RangedData API chr.sizes<- as.list(seqlengths(Mmusculus)) ## create the chr.map chr.map <- data.frame(from=paste("chr",c(1:19,"X","Y"),".fa",sep=""),t o=paste("chr",c(1:19,"X","Y"),sep="")) ## get the geneModels count within an RNAseq object ## the advantage of using an RNAseq object is that ## you can access the geneModels annotation ## and tailor it to your needs as explained above ## Here we have several solutions ## using the entire set of annotation we got ## We need to define a set of Filter to ensure ## that the data is read properly ## The export file contains all the reads, so the one that do not pass the chastity filter have to be removed. ## In addition, some of the other reads are for internal QC, and they have no position. For that reason, we need ## to filter those out too. ## Reading in aln data is more resource exhausting that reading in bam files, as we are ## reading in the sequence and quality information as well, whereas we do not need them. ## As you will see, this generates a lot of warnings, ## because of the differing annotations rnaSeq <- easyRNASeq(filesDirectory="data", organism="custom", chr.map=chr.map, chr.sizes=chr.sizes, filter=compose( naPositionFilter(), chastityFilter()), readLength=50L, annotationMethod="env", annotationObject=gAnnot, format="aln", count="genes", summarization= "geneModels", filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", nbCore=2 ) ## one way to reduce the warnings is to select the chromosome we are interested in ## that's waht we do by adding the chr.sel selector rnaSeq <- easyRNASeq(filesDirectory="data", organism="custom", chr.map=chr.map, chr.sizes=chr.sizes, chr.sel=chr.map$from, filter=compose( naPositionFilter(), chastityFilter()), readLength=50L, annotationMethod="env", annotationObject=gAnnot, format="aln", count="genes", summarization= "geneModels", filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", nbCore=2 ) ## To further reduce the warnings, we can manipulate the RangedData object ## to remove the unnecessary annotation. It's quite straightforward. sel <- grep("NT_",names(gAnnot)) gAnnot <- RangedData(ranges=ranges(gAnnot)[-sel,],values=values(gAnnot)[-sel,]) colnames(gAnnot) <- gsub("values\\.","",colnames(gAnnot)) ## This will only generate two warnings, one that could be easily dealt with (the chrMT) ## The other one is about what was discussed previously. There's a need to adapt the annotation ## to avoid potential double read counting. rnaSeq <- easyRNASeq(filesDirectory="data", organism="custom", chr.map=chr.map, chr.sizes=chr.sizes, chr.sel=chr.map$from, filter=compose( naPositionFilter(), chastityFilter()), readLength=50L, annotationMethod="env", annotationObject=gAnnot, format="aln", count="genes", summarization= "geneModels", filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", nbCore=2 ) --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
RNASeq Annotation Organism biomaRt easyRNASeq RNASeq Annotation Organism biomaRt • 1.0k views
ADD COMMENT
0
Entering edit mode
Davis, Wade ▴ 350
@davis-wade-2803
Last seen 9.6 years ago
Hi Nico, Thanks for adding these convenient features to easyRNASeq. Everything works fine now. I'm sure other users will benefit from the detailed example below. Thanks again, Wade -----Original Message----- From: Nicolas Delhomme [mailto:delhomme@embl.de] Sent: Friday, March 09, 2012 7:28 AM To: Bioconductor mailing list Cc: Davis, Wade Subject: How to combine different annotation (chromosomic, genic) when using easyRNASeq (was easyRNASeq error) Hi Wade, Here is a description on how to deal with the problem you reported in the thread: "easyRNASeq error". I just paste the R code below, check the comments for the explanation of the different steps. I'll add a section in the vignette about this. Cheers, Nico ## libs library(easyRNASeq) ## >= 1.1.10 ## set wd setwd("Your working dir") ## read in your data ## this is inefficient in comparison to a bam file, a you load as well the whole sequences in memory aln <- readAligned("data",type="SolexaExport",pattern="*.txt.gz") ## free some mem! gc() ## the names are different from what we expect (UCSC standards), as they have an additional .fa extension levels(chromosome(aln)) ## faster than a table on the same content: table(chromosome(aln)) ## they are different from what biomaRt will return as well, as these are 1:19, X, Y and MT plus others ## therefore it will be a problem from easyRNASeq as you have seen when you want to use biomaRt as an annotation source ## for this easyRNASeq requires the name of the organism, which circumvent the "custom" chromosome name map by-pass ## The solution is to fetch the annotation first obj <- fetchAnnotation(new('RNAseq', organismName="Mmusculus" ), method="biomaRt") ## the annotation are in the genomicAnnotation slot gAnnot <- genomicAnnotation(obj) ## shall we subset to the chromosome we're interested in ## there are only 1181 NT contigs, so let's keep them length(grep("NT_",space(gAnnot))) ## rename the chromosomes in the annotation ## that will ease our chr.map table generation ## note that it would not be necessary names(gAnnot) <- paste("chr",names(gAnnot),".fa",sep="") ## run easyRNASeq using the local annotation ## since we are within a script, I'll directly use the gAnnot object. ## However as you will see this will raise double counting warnings. ## double counting reads is not what ones want, that's why the better way ## would be to rework the obtained annotation to remove/clarify these cases ## and save the obtained annotation as an rda file on your file system ## save(gAnnot,file="myAnnotation.rda") ## easyRNASeq can use such an rda file as annotation as well ## first I need some space rm(aln,obj) gc() ## get the chromosome sizes ## Again, note that the use of the BSgenome ## is not mandatory. It's just easy as they are ## available in Bioconductor. Typing in your own ## chromosome size list is as valid. library(BSgenome.Mmusculus.UCSC.mm9) ## note that this might change to a vector soon, which is more intuitive ## it's historicaly due to the RangedData API chr.sizes<- as.list(seqlengths(Mmusculus)) ## create the chr.map chr.map <- data.frame(from=paste("chr",c(1:19,"X","Y"),".fa",sep=""),t o=paste("chr",c(1:19,"X","Y"),sep="")) ## get the geneModels count within an RNAseq object ## the advantage of using an RNAseq object is that ## you can access the geneModels annotation ## and tailor it to your needs as explained above ## Here we have several solutions ## using the entire set of annotation we got ## We need to define a set of Filter to ensure ## that the data is read properly ## The export file contains all the reads, so the one that do not pass the chastity filter have to be removed. ## In addition, some of the other reads are for internal QC, and they have no position. For that reason, we need ## to filter those out too. ## Reading in aln data is more resource exhausting that reading in bam files, as we are ## reading in the sequence and quality information as well, whereas we do not need them. ## As you will see, this generates a lot of warnings, ## because of the differing annotations rnaSeq <- easyRNASeq(filesDirectory="data", organism="custom", chr.map=chr.map, chr.sizes=chr.sizes, filter=compose( naPositionFilter(), chastityFilter()), readLength=50L, annotationMethod="env", annotationObject=gAnnot, format="aln", count="genes", summarization= "geneModels", filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", nbCore=2 ) ## one way to reduce the warnings is to select the chromosome we are interested in ## that's waht we do by adding the chr.sel selector rnaSeq <- easyRNASeq(filesDirectory="data", organism="custom", chr.map=chr.map, chr.sizes=chr.sizes, chr.sel=chr.map$from, filter=compose( naPositionFilter(), chastityFilter()), readLength=50L, annotationMethod="env", annotationObject=gAnnot, format="aln", count="genes", summarization= "geneModels", filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", nbCore=2 ) ## To further reduce the warnings, we can manipulate the RangedData object ## to remove the unnecessary annotation. It's quite straightforward. sel <- grep("NT_",names(gAnnot)) gAnnot <- RangedData(ranges=ranges(gAnnot)[-sel,],values=values(gAnnot)[-sel,]) colnames(gAnnot) <- gsub("values\\.","",colnames(gAnnot)) ## This will only generate two warnings, one that could be easily dealt with (the chrMT) ## The other one is about what was discussed previously. There's a need to adapt the annotation ## to avoid potential double read counting. rnaSeq <- easyRNASeq(filesDirectory="data", organism="custom", chr.map=chr.map, chr.sizes=chr.sizes, chr.sel=chr.map$from, filter=compose( naPositionFilter(), chastityFilter()), readLength=50L, annotationMethod="env", annotationObject=gAnnot, format="aln", count="genes", summarization= "geneModels", filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz", outputFormat="RNAseq", nbCore=2 ) --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany
ADD COMMENT

Login before adding your answer.

Traffic: 647 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6