Question

EasyRNASeq issue (missing bai files)

0

Entering edit mode

delhomme@embl.de ★ 1.2k

@delhommeemblde-3232

Last seen 9.7 years ago

Dear Jorge, I've Cc'ed the Bioc mailing list as it can be of help to others. You're missing the index (.bam.bai) files for your bam files; i.e. you need to run 'samtools index TTGR1.bam' on the command line in your /Users/jbeira/Desktop/bams directory to create the TTGR1.bam.bai file. You may as well want to use the indexBam function of the Rsamtools package. This package is required by easyRNASeq, so the following would index all your bam files in your bams directory: library(easyRNASeq) setwd("/Users/jbeira/Desktop/bams") indexBam(files=dir(".",pattern="*\\.bam$")) Note that it is important for your bam files to be sorted first (check the samtools webpage for more info: http://samtools.sourceforge.net/) In your easyRNASeq call, you're missing the "conditions" argument that describes your samples (e.g. tumor, control). This is necessary if you want to produce a DESeq output. The conditions should be a named vector, the names being the actual filenames: e.g. in your case conditions=c("TTGR1.bam"="tumor"). Asking for a DESeq output with a single sample is not going to work, but I suppose you've got more than one sample :-); you can provide all of them at once in the filenames argument or use the pattern argument instead (as in the indexBam for example). Finally, if you are using easyRNASeq version 1.3.8 (the development version), you do not need to precise the chr.sizes argument, provided your bam files have an header (which they most probably have). The readLength would as well be determined from your data, but it does not harm providing it. Moreover, if the computer you're running on as enough memory and CPUs, you can process the input file in parallel using the nbCore argument (as of easyRNASeq version 1.3.8). The memory usage is roughly the size of the BAM files, i.e. if you have 12GB RAM, you could proceed 4 x 3GB bam files in parallel (in an ideal world, in practice I would go for 3 just in case) Cheers, Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delhomme at embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On Jul 18, 2012, at 5:24 PM, Jorge Beira wrote: > Dear Nicolas, > > I am trying to use the easyRNASeq package to obtain read counts so that I can proceed with my analysis for DESeq. However I must be doing something wrong in giving it the right arguments, since it gives me errors like > " Error in easyRNASeq(filesDirectory = getwd(), filenames = c("TTGR1.bam"), : > Index files (bai) are required. They are missing for the files: /Users/jbeira/Desktop/bams/TTGR1.bam " > > Info: I have my bam files in a folder "bams" in my Desktop, and I also added the Drosophila .gff file on the same directory. So the whole code I'm trying to run is: > > > setwd("~/Desktop/bams") > library("easyRNASeq") > count.table <- easyRNASeq( > filesDirectory=getwd(), > filenames=c("TTGR1.bam"), > organism="Dmelanogaster", > chr.sizes=as.list(seqlengths(Dmelanogaster)), > readLength=75L, > annotationMethod="gff", > annotationFile=system.file("dmel-all-r5.45.gff"), > format="bam", > count="exons", > outputFormat="DESeq") > > > If you could help me spotting where the problem is, it'd be great. Thanks a lot! > > Best wishes > > Jorge Beira > National Institute for Medical Research > and University College London, UK

GO PROcess DESeq easyRNASeq GO PROcess DESeq easyRNASeq • 2.4k views

ADD COMMENT • link updated 11.8 years ago by Jorge Beira ▴ 10 • written 11.8 years ago by delhomme@embl.de ★ 1.2k

score 0 · Answer 1 · 2012-07-20

Hi Nico, Thanks for your help - I hadn't replied back yet as I wanted to try to index them bam file first and see if other problems arised (as they did...!). I changed the arguments with the suggestions you made for the conditions and output, however the chrsize didn't really work, now I've got this error " Error in .readGffGtf(filename = filename, ignoreWarnings = ignoreWarnings, : The filename you provided: does not exists In addition: Warning message: In easyRNASeq(filesDirectory = getwd(), filenames = c("J05_orig_genome.sorted.bam"), : You enforce UCSC chromosome conventions, however the provided chromosome size list is not compliant. Correcting it." I guess I also am not doing something properly with the gff file, given the error, but since it didn't spit out any error about this before I don't know why this happens now... I tried to use library(BSgenome.Dmelanogaster.UCSC.dm3) to try to solve the chrsize issue but maybe that's what generated some incompatibility? Let me paste here my current code to see if you spot a problem: setwd("/Users/jbeira/Desktop/bams") library("easyRNASeq") count.table <- easyRNASeq( filesDirectory=getwd(), filenames=c("J05_orig_genome.sorted.bam"), organism="Dmelanogaster", readLength=75L, chr.sizes=as.list(seqlengths(Dmelanogaster)), annotationMethod="gff", annotationFile=system.file("dmel-all-r5.45.gff"), format="bam", conditions=c("J05_orig_genome.sorted.bam"="wt"), count="exons", outputFormat="RNAseq") --- You told me I didn't need the chrsize with the developer's version, but because of a proxy permission here at the institute I am not able to install the latest version of the package now, so if there's any way around this by defining the variable, that would be a good help. Sorry if this is getting more complicated, thanks for your help! Best Jorge On Wed Jul 18 16:49:24 2012, Nicolas Delhomme wrote: > Dear Jorge, > > I've Cc'ed the Bioc mailing list as it can be of help to others. > > You're missing the index (.bam.bai) files for your bam files; i.e. you need to run 'samtools index TTGR1.bam' on the command line in your /Users/jbeira/Desktop/bams directory to create the TTGR1.bam.bai file. > > You may as well want to use the indexBam function of the Rsamtools package. This package is required by easyRNASeq, so the following would index all your bam files in your bams directory: > > library(easyRNASeq) > > setwd("/Users/jbeira/Desktop/bams") > > indexBam(files=dir(".",pattern="*\\.bam$")) > > Note that it is important for your bam files to be sorted first (check the samtools webpage for more info: http://samtools.sourceforge.net/) > > In your easyRNASeq call, you're missing the "conditions" argument that describes your samples (e.g. tumor, control). This is necessary if you want to produce a DESeq output. The conditions should be a named vector, the names being the actual filenames: e.g. in your case conditions=c("TTGR1.bam"="tumor"). Asking for a DESeq output with a single sample is not going to work, but I suppose you've got more than one sample :-); you can provide all of them at once in the filenames argument or use the pattern argument instead (as in the indexBam for example). > > Finally, if you are using easyRNASeq version 1.3.8 (the development version), you do not need to precise the chr.sizes argument, provided your bam files have an header (which they most probably have). The readLength would as well be determined from your data, but it does not harm providing it. Moreover, if the computer you're running on as enough memory and CPUs, you can process the input file in parallel using the nbCore argument (as of easyRNASeq version 1.3.8). The memory usage is roughly the size of the BAM files, i.e. if you have 12GB RAM, you could proceed 4 x 3GB bam files in parallel (in an ideal world, in practice I would go for 3 just in case) > > Cheers, > > Nico > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delhomme at embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On Jul 18, 2012, at 5:24 PM, Jorge Beira wrote: > >> Dear Nicolas, >> >> I am trying to use the easyRNASeq package to obtain read counts so that I can proceed with my analysis for DESeq. However I must be doing something wrong in giving it the right arguments, since it gives me errors like >> " Error in easyRNASeq(filesDirectory = getwd(), filenames = c("TTGR1.bam"), : >> Index files (bai) are required. They are missing for the files: /Users/jbeira/Desktop/bams/TTGR1.bam " >> >> Info: I have my bam files in a folder "bams" in my Desktop, and I also added the Drosophila .gff file on the same directory. So the whole code I'm trying to run is: >> >> >> setwd("~/Desktop/bams") >> library("easyRNASeq") >> count.table <- easyRNASeq( >> filesDirectory=getwd(), >> filenames=c("TTGR1.bam"), >> organism="Dmelanogaster", >> chr.sizes=as.list(seqlengths(Dmelanogaster)), >> readLength=75L, >> annotationMethod="gff", >> annotationFile=system.file("dmel-all-r5.45.gff"), >> format="bam", >> count="exons", >> outputFormat="DESeq") >> >> >> If you could help me spotting where the problem is, it'd be great. Thanks a lot! >> >> Best wishes >> >> Jorge Beira >> National Institute for Medical Research >> and University College London, UK > >

score 0 · Answer 2 · 2012-07-20

Hi Jorge, On Jul 20, 2012, at 12:27 PM, Jorge Beira wrote: > > Hi Nico, > > Thanks for your help - I hadn't replied back yet as I wanted to try to index them bam file first and see if other problems arised (as they did...!). > > I changed the arguments with the suggestions you made for the conditions and output, however the chrsize didn't really work, now I've got this error " Error in .readGffGtf(filename = filename, ignoreWarnings = ignoreWarnings, : > The filename you provided: does not exists Well, the error is clear enough :-) This: system.file("dmel- all-r5.45.gff") does not seem to be a valid file path. What does this returns: file.exists(system.file("dmel-all-r5.45.gff")) > In addition: Warning message: > In easyRNASeq(filesDirectory = getwd(), filenames = c("J05_orig_genome.sorted.bam"), : > You enforce UCSC chromosome conventions, however the provided chromosome size list is not compliant. Correcting it." This warning is expected if your chromosome size list does not contain UCSC compliant names (i.e. for the fruit-fly chr2L, chr2R, etc.), however this should be dealt with smoothly; i.e. you can ignore this warning in your case. I should really avoid reporting those warnings then, I'll add that to my TODO list. > > I guess I also am not doing something properly with the gff file, given the error, but since it didn't spit out any error about this before I don't know why this happens now... Well, it stopped before getting to that point so of course you would not see it. So there's progress :-) > I tried to use library(BSgenome.Dmelanogaster.UCSC.dm3) to try to solve the chrsize issue but maybe that's what generated some incompatibility? That should not, no. As I said above, I need a better handling of the warnings. > > Let me paste here my current code to see if you spot a problem: > > setwd("/Users/jbeira/Desktop/bams") > library("easyRNASeq") > count.table <- easyRNASeq( > filesDirectory=getwd(), > filenames=c("J05_orig_genome.sorted.bam"), > organism="Dmelanogaster", > readLength=75L, > chr.sizes=as.list(seqlengths(Dmelanogaster)), > annotationMethod="gff", > annotationFile=system.file("dmel-all-r5.45.gff"), > format="bam", > conditions=c("J05_orig_genome.sorted.bam"="wt"), > count="exons", > outputFormat="RNAseq") > You do not need the "conditions" if you are not asking for a "DESeq" or "edgeR" outputFormat. If you want just a count table (an exons x samples matrix), then you do not need to precise the outputFormat at all. Did you look at the package vignette and help pages that defines the arguments and their usage? ?easyRNASeq vignette("easyRNASeq") If you did, let me know what was unclear, so that I can rectify them. > > --- You told me I didn't need the chrsize with the developer's version, but because of a proxy permission here at the institute I am not able to install the latest version of the package now, so if there's any way around this by defining the variable, that would be a good help. Importing this new feature from the development version into the release one would introduce too many changes to the release version, so I can't do that. You can always create your own named list instead of using as.list(seqlengths(Dmelanogaster)). > > Sorry if this is getting more complicated, thanks for your help! > No worries. One thing that could be helpful is that you post the results of the "sessionInfo()" command when posting on this list. This way, I would know directly if we are talking about the release or development version of the package. You'd get faster and more accurate answers then :-) Cheers, Nico > Best > Jorge > > On Wed Jul 18 16:49:24 2012, Nicolas Delhomme wrote: >> Dear Jorge, >> >> I've Cc'ed the Bioc mailing list as it can be of help to others. >> >> You're missing the index (.bam.bai) files for your bam files; i.e. you need to run 'samtools index TTGR1.bam' on the command line in your /Users/jbeira/Desktop/bams directory to create the TTGR1.bam.bai file. >> >> You may as well want to use the indexBam function of the Rsamtools package. This package is required by easyRNASeq, so the following would index all your bam files in your bams directory: >> >> library(easyRNASeq) >> >> setwd("/Users/jbeira/Desktop/bams") >> >> indexBam(files=dir(".",pattern="*\\.bam$")) >> >> Note that it is important for your bam files to be sorted first (check the samtools webpage for more info: http://samtools.sourceforge.net/) >> >> In your easyRNASeq call, you're missing the "conditions" argument that describes your samples (e.g. tumor, control). This is necessary if you want to produce a DESeq output. The conditions should be a named vector, the names being the actual filenames: e.g. in your case conditions=c("TTGR1.bam"="tumor"). Asking for a DESeq output with a single sample is not going to work, but I suppose you've got more than one sample :-); you can provide all of them at once in the filenames argument or use the pattern argument instead (as in the indexBam for example). >> >> Finally, if you are using easyRNASeq version 1.3.8 (the development version), you do not need to precise the chr.sizes argument, provided your bam files have an header (which they most probably have). The readLength would as well be determined from your data, but it does not harm providing it. Moreover, if the computer you're running on as enough memory and CPUs, you can process the input file in parallel using the nbCore argument (as of easyRNASeq version 1.3.8). The memory usage is roughly the size of the BAM files, i.e. if you have 12GB RAM, you could proceed 4 x 3GB bam files in parallel (in an ideal world, in practice I would go for 3 just in case) >> >> Cheers, >> >> Nico >> >> --------------------------------------------------------------- >> Nicolas Delhomme >> >> Genome Biology Computational Support >> >> European Molecular Biology Laboratory >> >> Tel: +49 6221 387 8310 >> Email: nicolas.delhomme at embl.de >> Meyerhofstrasse 1 - Postfach 10.2209 >> 69102 Heidelberg, Germany >> --------------------------------------------------------------- >> >> >> >> >> >> On Jul 18, 2012, at 5:24 PM, Jorge Beira wrote: >> >>> Dear Nicolas, >>> >>> I am trying to use the easyRNASeq package to obtain read counts so that I can proceed with my analysis for DESeq. However I must be doing something wrong in giving it the right arguments, since it gives me errors like >>> " Error in easyRNASeq(filesDirectory = getwd(), filenames = c("TTGR1.bam"), : >>> Index files (bai) are required. They are missing for the files: /Users/jbeira/Desktop/bams/TTGR1.bam " >>> >>> Info: I have my bam files in a folder "bams" in my Desktop, and I also added the Drosophila .gff file on the same directory. So the whole code I'm trying to run is: >>> >>> >>> setwd("~/Desktop/bams") >>> library("easyRNASeq") >>> count.table <- easyRNASeq( >>> filesDirectory=getwd(), >>> filenames=c("TTGR1.bam"), >>> organism="Dmelanogaster", >>> chr.sizes=as.list(seqlengths(Dmelanogaster)), >>> readLength=75L, >>> annotationMethod="gff", >>> annotationFile=system.file("dmel-all-r5.45.gff"), >>> format="bam", >>> count="exons", >>> outputFormat="DESeq") >>> >>> >>> If you could help me spotting where the problem is, it'd be great. Thanks a lot! >>> >>> Best wishes >>> >>> Jorge Beira >>> National Institute for Medical Research >>> and University College London, UK >> >> > >

score 0 · Answer 3 · 2012-07-20

Hi Jorge, On Jul 20, 2012, at 2:16 PM, Jorge Beira wrote: > > Hi Nico, > > Thanks for your patient reply :) > > Update: I could get the proxy problem solved now, and install the latest easyRNASeq package Good. > > So, sessionInfo() returns the following: > >> sessionInfo() > R version 2.15.1 (2012-06-22) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] easyRNASeq_1.3.8 ShortRead_1.14.4 latticeExtra_0.6-19 RColorBrewer_1.0-5 lattice_0.20-6 > [6] Rsamtools_1.8.5 DESeq_1.8.3 locfit_1.5-8 BSgenome_1.24.0 GenomicRanges_1.8.7 > [11] Biostrings_2.24.1 IRanges_1.14.4 edgeR_2.6.9 limma_3.12.1 BiocInstaller_1.4.7 > [16] biomaRt_2.12.0 Biobase_2.16.0 genomeIntervals_1.12.0 BiocGenerics_0.2.0 intervals_0.13.3 > > loaded via a namespace (and not attached): > [1] annotate_1.34.1 AnnotationDbi_1.18.1 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.38.0 geneplotter_1.34.0 > [7] grid_2.15.1 hwriter_1.3 RCurl_1.91-1 RSQLite_0.11.1 splines_2.15.1 stats4_2.15.1 > [13] survival_2.36-14 tools_2.15.1 XML_3.9-4 xtable_1.7-0 zlibbioc_1.2.0 > > I'm really surprised you could install version 1.3.8 actually. It requires edgeR >= 2.7.23 (yours is 2.6.9), and IRanges >= 1.15.18 (yours is 1.14.4). How did you do the installation? Anyway, if you want to use the development version of the easyRNASeq package, you should use the development version of every package (knowing that they are then more unstable and might be broken for a couple of days from time to time). If you want to use the development version, you should do the following: library(BiocInstaller) useDevel(TRUE) chooseBioCmirror() ## pick the closest one update.packages(ask=FALSE) If not, you're better off reverting to easyRNASeq version 1.2.3. > > ------- > > getting back to the issues: > > indeed when I query file.exists(system.file("dmel-all-r5.45.gff")) is turns FALSE -- but now this is very strange, then, since I know the file is in the bams directory together with the bam files so why is it not reading it?.... In any case looks like the major problem is here before I can proceed. Any idea on this one?... > system.file is not appropriate here, look at the help about it for more details (it basically look up files in your R installation directories), i.e. in R type: ?system.file Just using annotationFile="dmel-all-r5.45.gff" should do it. > My current code is as follows: > > setwd("/Users/jbeira/Desktop/bams") > library("easyRNASeq") > library(BSgenome.Dmelanogaster.UCSC.dm3) > count.table <- easyRNASeq( > filesDirectory=getwd(), > filenames=c("J05_orig_genome.sorted.bam"), > format="bam", > count="exons", > organism="Dmelanogaster", > chr.sizes=as.list(seqlengths(Dmelanogaster)), > readLength=75L, > annotationMethod="gff", > annotationFile=system.file("dmel-all-r5.45.gff")) > > > Just thinking of some info in case it explains some incompatibilities: > > I aligned my reads to the "dmel-all-chromosome-r5.45.fasta" version, however now I'm using the gff file "dmel-all-r5.45.gff" -- does this suggest any issue to you? No, that should be fine, but we can double check. What does the following gives you (in a terminal, not in R) grep ">" dmel-all-chromosome-r5.45.fasta Cheers, Nico > > Thanks once again, > > Jorge > > > > On Fri Jul 20 11:27:42 2012, Jorge Beira wrote: >> >> Hi Nico, >> >> Thanks for your help - I hadn't replied back yet as I wanted to try to >> index them bam file first and see if other problems arised (as they >> did...!). >> >> I changed the arguments with the suggestions you made for the >> conditions and output, however the chrsize didn't really work, now >> I've got this error " Error in .readGffGtf(filename = filename, >> ignoreWarnings = ignoreWarnings, : >> The filename you provided: does not exists >> In addition: Warning message: >> In easyRNASeq(filesDirectory = getwd(), filenames = >> c("J05_orig_genome.sorted.bam"), : >> You enforce UCSC chromosome conventions, however the provided >> chromosome size list is not compliant. Correcting it." >> >> I guess I also am not doing something properly with the gff file, >> given the error, but since it didn't spit out any error about this >> before I don't know why this happens now... >> I tried to use library(BSgenome.Dmelanogaster.UCSC.dm3) to try to >> solve the chrsize issue but maybe that's what generated some >> incompatibility? >> >> Let me paste here my current code to see if you spot a problem: >> >> setwd("/Users/jbeira/Desktop/bams") >> library("easyRNASeq") >> count.table <- easyRNASeq( >> filesDirectory=getwd(), >> filenames=c("J05_orig_genome.sorted.bam"), >> organism="Dmelanogaster", >> readLength=75L, >> chr.sizes=as.list(seqlengths(Dmelanogaster)), >> annotationMethod="gff", >> annotationFile=system.file("dmel-all-r5.45.gff"), >> format="bam", >> conditions=c("J05_orig_genome.sorted.bam"="wt"), >> count="exons", >> outputFormat="RNAseq") >> >> >> --- You told me I didn't need the chrsize with the developer's >> version, but because of a proxy permission here at the institute I am >> not able to install the latest version of the package now, so if >> there's any way around this by defining the variable, that would be a >> good help. >> >> Sorry if this is getting more complicated, thanks for your help! >> >> Best >> Jorge >> >> On Wed Jul 18 16:49:24 2012, Nicolas Delhomme wrote: >>> Dear Jorge, >>> >>> I've Cc'ed the Bioc mailing list as it can be of help to others. >>> >>> You're missing the index (.bam.bai) files for your bam files; i.e. >>> you need to run 'samtools index TTGR1.bam' on the command line in >>> your /Users/jbeira/Desktop/bams directory to create the TTGR1.bam.bai >>> file. >>> >>> You may as well want to use the indexBam function of the Rsamtools >>> package. This package is required by easyRNASeq, so the following >>> would index all your bam files in your bams directory: >>> >>> library(easyRNASeq) >>> >>> setwd("/Users/jbeira/Desktop/bams") >>> >>> indexBam(files=dir(".",pattern="*\\.bam$")) >>> >>> Note that it is important for your bam files to be sorted first >>> (check the samtools webpage for more info: >>> http://samtools.sourceforge.net/) >>> >>> In your easyRNASeq call, you're missing the "conditions" argument >>> that describes your samples (e.g. tumor, control). This is necessary >>> if you want to produce a DESeq output. The conditions should be a >>> named vector, the names being the actual filenames: e.g. in your case >>> conditions=c("TTGR1.bam"="tumor"). Asking for a DESeq output with a >>> single sample is not going to work, but I suppose you've got more >>> than one sample :-); you can provide all of them at once in the >>> filenames argument or use the pattern argument instead (as in the >>> indexBam for example). >>> >>> Finally, if you are using easyRNASeq version 1.3.8 (the development >>> version), you do not need to precise the chr.sizes argument, provided >>> your bam files have an header (which they most probably have). The >>> readLength would as well be determined from your data, but it does >>> not harm providing it. Moreover, if the computer you're running on as >>> enough memory and CPUs, you can process the input file in parallel >>> using the nbCore argument (as of easyRNASeq version 1.3.8). The >>> memory usage is roughly the size of the BAM files, i.e. if you have >>> 12GB RAM, you could proceed 4 x 3GB bam files in parallel (in an >>> ideal world, in practice I would go for 3 just in case) >>> >>> Cheers, >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delhomme at embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> --------------------------------------------------------------- >>> >>> >>> >>> >>> >>> On Jul 18, 2012, at 5:24 PM, Jorge Beira wrote: >>> >>>> Dear Nicolas, >>>> >>>> I am trying to use the easyRNASeq package to obtain read counts so >>>> that I can proceed with my analysis for DESeq. However I must be >>>> doing something wrong in giving it the right arguments, since it >>>> gives me errors like >>>> " Error in easyRNASeq(filesDirectory = getwd(), filenames = >>>> c("TTGR1.bam"), : >>>> Index files (bai) are required. They are missing for the files: >>>> /Users/jbeira/Desktop/bams/TTGR1.bam " >>>> >>>> Info: I have my bam files in a folder "bams" in my Desktop, and I >>>> also added the Drosophila .gff file on the same directory. So the >>>> whole code I'm trying to run is: >>>> >>>> >>>> setwd("~/Desktop/bams") >>>> library("easyRNASeq") >>>> count.table <- easyRNASeq( >>>> filesDirectory=getwd(), >>>> filenames=c("TTGR1.bam"), >>>> organism="Dmelanogaster", >>>> chr.sizes=as.list(seqlengths(Dmelanogaster)), >>>> readLength=75L, >>>> annotationMethod="gff", >>>> annotationFile=system.file("dmel-all-r5.45.gff"), >>>> format="bam", >>>> count="exons", >>>> outputFormat="DESeq") >>>> >>>> >>>> If you could help me spotting where the problem is, it'd be great. >>>> Thanks a lot! >>>> >>>> Best wishes >>>> >>>> Jorge Beira >>>> National Institute for Medical Research >>>> and University College London, UK >>> >>> >> >> > >

score 0 · Answer 4 · 2012-07-20

On Jul 20, 2012, at 3:57 PM, Jorge Beira wrote: > > Hi Nico, > It looks like I'm making some progress slowly... so it's good. > > On Fri Jul 20 14:43:49 2012, Nicolas Delhomme wrote: > >> >> I'm really surprised you could install version 1.3.8 actually. It requires edgeR >= 2.7.23 (yours is 2.6.9), and IRanges >= 1.15.18 (yours is 1.14.4). How did you do the installation? >> > > I installed it with the command " source("http://bioconductor.org/biocLite.R") biocLite("easyRNASeq")" and then it asked me if I wanted to update the dependencies to which I said yes... as long as it works I'm happy ;) This is extremely strange. Since your version of BiocInstaller is 1.4.7 it should never have installed easyRNASeq 1.3.8. Can you open a fresh R session, load easyRNASeq and report the sessionInfo? > > >> Anyway, if you want to use the development version of the easyRNASeq package, you should use the development version of every package (knowing that they are then more unstable and might be broken for a couple of days from time to time). If you want to use the development version, you should do the following: >> >> library(BiocInstaller) >> useDevel(TRUE) >> chooseBioCmirror() ## pick the closest one >> update.packages(ask=FALSE) >> >> If not, you're better off reverting to easyRNASeq version 1.2.3. >> >>> >>> ------- >>> >>> getting back to the issues: >>> >>> indeed when I query file.exists(system.file("dmel-all-r5.45.gff")) is turns FALSE -- but now this is very strange, then, since I know the file is in the bams directory together with the bam files so why is it not reading it?.... In any case looks like the major problem is here before I can proceed. Any idea on this one?... >>> >> >> system.file is not appropriate here, look at the help about it for more details (it basically look up files in your R installation directories), i.e. in R type: >> >> ?system.file >> >> Just using annotationFile="dmel-all-r5.45.gff" should do it. >> >>> I aligned my reads to the "dmel-all-chromosome-r5.45.fasta" version, however now I'm using the gff file "dmel-all-r5.45.gff" -- does this suggest any issue to you? >> >> No, that should be fine, but we can double check. What does the following gives you (in a terminal, not in R) >> >> grep ">" dmel-all-chromosome-r5.45.fasta >> > > > This command gives me what I paste below, so from the looks of it there's all the info...: Good, that looks as expected. > > (What if I align it to the transcriptome, could I still use easyRNASeq or would there be problems to decide what sequences correspond to each feature...? It's just something I've been wondering about) I would not do that for several reasons. 1) Yes you could use easyRNASeq, but you would have to define your own gff/gtf file, because the one you use uses genomic coordinates and if you align to the transcriptome, you would get transcriptomic coordinates. 2) Aligning against the transcriptome is fine if that is the only reference you have. Drosophila has a well known genome so it's much better to align against that. Indeed, it might be that a read aligns perfectly to the genome (i.e. in an enhancer (see the eRNAs paper from Kim et al, Nature, 2010)) and only incorrectly to the transcriptome. By aligning to the transcriptome only, such reads would be incorrectly assigned and would bias your results. If you're worried about reads spanning EEJ and/or multiple mapping reads, you should rather use an aligner that can handle that (tophat, RSEM,..) rather than using the transcriptome as a reference for your alignment. Cheers, Nico > > thanks! > J > > grep ">" dmel-all-chromosome-r5.45.fasta > >> YHet type=chromosome_arm; loc=YHet:1..347038; ID=YHet; dbxref=REFSEQ:NW_001848860,GB:CM000461; MD5=478fbc07ea1b1c03b3d8d04abacf51a5; length=347038; release=r5.45; species=Dmel; >> dmel_mitochondrion_genome type=chromosome; loc=dmel_mitochondrion_genome:1..19517; ID=dmel_mitochondrion_genome; dbxref=GB:NC_001709; MD5=61af8db53361cd5744f41f773d21c3d4; length=19517; release=r5.45; species=Dmel; >> 2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ:NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=23011544; release=r5.45; species=Dmel; >> X type=chromosome_arm; loc=X:1..22422827; ID=X; dbxref=REFSEQ:NC_004354,GB:AE014298; MD5=a83251de102926ba52e3e71c72f1d9b0; length=22422827; release=r5.45; species=Dmel; >> 3L type=chromosome_arm; loc=3L:1..24543557; ID=3L; dbxref=REFSEQ:NT_037436,GB:AE014296; MD5=ec7148cae3daabbd2a226eaa6e85d7c2; length=24543557; release=r5.45; species=Dmel; >> 4 type=chromosome_arm; loc=4:1..1351857; ID=4; dbxref=REFSEQ:NC_004353,GB:AE014135; MD5=515edfcc517bc4e39ae5c696cfd44021; length=1351857; release=r5.45; species=Dmel; >> 2R type=chromosome_arm; loc=2R:1..21146708; ID=2R; dbxref=REFSEQ:NT_033778,GB:AE013599; MD5=1589a9447d4dc94c048aa48ea5b8099d; length=21146708; release=r5.45; species=Dmel; >> 3R type=chromosome_arm; loc=3R:1..27905053; ID=3R; dbxref=REFSEQ:NT_033777,GB:AE014297; MD5=6b279651b3b268f11e0dd1d87ded0ccc; length=27905053; release=r5.45; species=Dmel; >> Uextra type=chromosome_arm; loc=Uextra:1..29004656; ID=Uextra; MD5=3d47647f1cde286a947d03cc17f0bad3; length=29004656; release=r5.45; species=Dmel; >> 2RHet type=chromosome_arm; loc=2RHet:1..3288761; ID=2RHet; dbxref=REFSEQ:NW_001848856,GB:CM000457; MD5=88c0ac39ebe4d9ef5a8f58cd746c9810; length=3288761; release=r5.45; species=Dmel; >> 2LHet type=chromosome_arm; loc=2LHet:1..368872; ID=2LHet; dbxref=REFSEQ:NW_001848855,GB:CM000456; MD5=5217e6a4295a824c31e43d5a9da9038b; length=368872; release=r5.45; species=Dmel; >> 3LHet type=chromosome_arm; loc=3LHet:1..2555491; ID=3LHet; dbxref=REFSEQ:NW_001848857,GB:CM000458; MD5=4902d32327726bf385aa55a75399bdfc; length=2555491; release=r5.45; species=Dmel; >> 3RHet type=chromosome_arm; loc=3RHet:1..2517507; ID=3RHet; dbxref=REFSEQ:NW_001848858,GB:CM000459; MD5=752e488dbea78e592aca03745eb732ea; length=2517507; release=r5.45; species=Dmel; >> U type=chromosome_arm; loc=U:1..10049037; ID=U; MD5=4b72bf19979c8466d5b66acca66f1804; length=10049037; release=r5.45; species=Dmel; >> XHet type=chromosome_arm; loc=XHet:1..204112; ID=XHet; dbxref=REFSEQ:NW_001848859,GB:CM000460; MD5=dcd15c551e34903f6171f53f553b55ee; length=204112; release=r5.45; species=Dmel; > > >

score 0 · Answer 5 · 2012-07-20

Hi Jorge, On Jul 20, 2012, at 6:54 PM, Jorge Beira wrote: > > > Hi Nico, > > I started R again to look into it as you suggested, and when I loaded the easyRNASeq library, I got this (don't know if this is the info you wanted): > >> library(easyRNASeq) > Loading required package: parallel > Loading required package: genomeIntervals > Loading required package: intervals > Loading required package: BiocGenerics > > Attaching package: ?BiocGenerics? > > The following object(s) are masked from ?package:stats?: > > xtabs > > The following object(s) are masked from ?package:base?: > > anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, get, intersect, lapply, Map, mapply, mget, order, paste, > pmax, pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, rownames, sapply, setdiff, table, tapply, union, > unique > > Loading required package: Biobase > Welcome to Bioconductor > > Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', > and for packages 'citation("pkgname")'. > > Loading required package: biomaRt > Error: package ?edgeR? 2.6.9 was found, but >= 2.7.23 is required by ?easyRNASeq? > In addition: Warning messages: > 1: replacing previous import ?coerce? when loading ?intervals? > 2: replacing previous import ?initialize? when loading ?intervals? > > > That's more like it. > > -------- > However, then I re-installed the package and got: > >> library(easyRNASeq) > Error: package ?edgeR? 2.6.9 was found, but >= 2.7.23 is required by ?easyRNASeq? >> source("http://bioconductor.org/biocLite.R") biocLite("easyRNASeq") > Error: unexpected symbol in "source("http://bioconductor.org/biocLite.R") biocLite" >> source("http://bioconductor.org/biocLite.R") > BiocInstaller version 1.4.7, ?biocLite for help >> biocLite("easyRNASeq") > BioC_mirror: http://bioconductor.org > Using R version 2.15, BiocInstaller version 1.4.7. > Warning: unable to access index for repository http://brainarray.mbn i.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.15 > Installing package(s) 'easyRNASeq' > Warning: unable to access index for repository http://brainarray.mbn i.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.15 > trying URL 'http://www.bioconductor.org/packages/2.10/bioc/bin/macos x/leopard/contrib/2.15/easyRNASeq_1.2.3.tgz' > Content type 'application/x-gzip' length 586109 bytes (572 Kb) > opened URL > ================================================== > downloaded 572 Kb > > > The downloaded binary packages are in > /var/folders/Vu/VuRnTG0pHVChk0UeyLVkr++++TM/-Tmp-//RtmplXgRF0/ downloaded_packages > Warning: unable to access index for repository http://brainarray.mbn i.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.15 > > > > ------------------------- > Exactly what should have happened the first time. Very weird you got 1.3.8 installed. Anyway, now you're set :-) > > I got it to work now, using: > > setwd("/Users/jbeira/Desktop/bams") > library("easyRNASeq") > library(BSgenome.Dmelanogaster.UCSC.dm3) > count.table <- easyRNASeq( > filesDirectory=getwd(), > filenames=c("J05_orig_genome.sorted.bam"), > format="bam", > count="exons", > organism="Dmelanogaster", > chr.sizes=as.list(seqlengths(Dmelanogaster)), > readLength=75L, > annotationMethod="gff", > annotationFile="dmel-all-r5.45.gff") > > > However the processing is too much for my machine... and I've tried it in the lab one (24GB RAM) and still couldn't do it. Says it couldn't allocate a 1.2 Gb sized vector. The thing is, I am starting with a 20.5 GB .sorted.bam file -- I found it a bit too big, but don't know why really, but now I'm wondering if there's something wrong with my alignment / sorting as normally BAMs should be smaller (before sorting was 6.9 GB). Any ideas now? Wow, that's huge! It's no surprise you get memory issue, i.e. while reading the BAM file, easyRNASeq will need as much memory as your bam file => 20GB+. It's on my TODO list to use the new streaming functionalities that Martin provided for reading BAM files, but haven't got the time to put my hands on that yet. If you're unsorted bam file is 6.9GB, there's no reason that it get 3x larger when sorted. Are you sure, it is a bam file? I.e. what does this command reports (in the terminal) file J05_orig_genome.sorted.bam I suspect that either you asked for an uncompressed bam file as ouptut or that you got a sam file as the output of your sort command for some reason. Check the samtools webpage to make sure you have been using the proper parameters. Cheers, Nico > > At least I got it to the running phase...! thanks for your help so far!! Hope the next bit isn't too hard to get around. > > Thanks > > Jorge > > > > On Fri Jul 20 16:15:22 2012, Nicolas Delhomme wrote: >> >> On Jul 20, 2012, at 3:57 PM, Jorge Beira wrote: >> >>> >>> Hi Nico, >>> It looks like I'm making some progress slowly... so it's good. >>> >>> On Fri Jul 20 14:43:49 2012, Nicolas Delhomme wrote: >>> >>>> >>>> I'm really surprised you could install version 1.3.8 actually. It requires edgeR >= 2.7.23 (yours is 2.6.9), and IRanges >= 1.15.18 (yours is 1.14.4). How did you do the installation? >>>> >>> >>> I installed it with the command " source("http://bioconductor.org/biocLite.R") biocLite("easyRNASeq")" and then it asked me if I wanted to update the dependencies to which I said yes... as long as it works I'm happy ;) >> >> This is extremely strange. Since your version of BiocInstaller is 1.4.7 it should never have installed easyRNASeq 1.3.8. Can you open a fresh R session, load easyRNASeq and report the sessionInfo? >> >>> >>> >>>> Anyway, if you want to use the development version of the easyRNASeq package, you should use the development version of every package (knowing that they are then more unstable and might be broken for a couple of days from time to time). If you want to use the development version, you should do the following: >>>> >>>> library(BiocInstaller) >>>> useDevel(TRUE) >>>> chooseBioCmirror() ## pick the closest one >>>> update.packages(ask=FALSE) >>>> >>>> If not, you're better off reverting to easyRNASeq version 1.2.3. >>>> >>>>> >>>>> ------- >>>>> >>>>> getting back to the issues: >>>>> >>>>> indeed when I query file.exists(system.file("dmel- all-r5.45.gff")) is turns FALSE -- but now this is very strange, then, since I know the file is in the bams directory together with the bam files so why is it not reading it?.... In any case looks like the major problem is here before I can proceed. Any idea on this one?... >>>>> >>>> >>>> system.file is not appropriate here, look at the help about it for more details (it basically look up files in your R installation directories), i.e. in R type: >>>> >>>> ?system.file >>>> >>>> Just using annotationFile="dmel-all-r5.45.gff" should do it. >>>> >>>>> I aligned my reads to the "dmel-all-chromosome-r5.45.fasta" version, however now I'm using the gff file "dmel-all-r5.45.gff" -- does this suggest any issue to you? >>>> >>>> No, that should be fine, but we can double check. What does the following gives you (in a terminal, not in R) >>>> >>>> grep ">" dmel-all-chromosome-r5.45.fasta >>>> >>> >>> >>> This command gives me what I paste below, so from the looks of it there's all the info...: >> >> Good, that looks as expected. >> >>> >>> (What if I align it to the transcriptome, could I still use easyRNASeq or would there be problems to decide what sequences correspond to each feature...? It's just something I've been wondering about) >> >> I would not do that for several reasons. >> >> 1) Yes you could use easyRNASeq, but you would have to define your own gff/gtf file, because the one you use uses genomic coordinates and if you align to the transcriptome, you would get transcriptomic coordinates. >> >> 2) Aligning against the transcriptome is fine if that is the only reference you have. Drosophila has a well known genome so it's much better to align against that. Indeed, it might be that a read aligns perfectly to the genome (i.e. in an enhancer (see the eRNAs paper from Kim et al, Nature, 2010)) and only incorrectly to the transcriptome. By aligning to the transcriptome only, such reads would be incorrectly assigned and would bias your results. If you're worried about reads spanning EEJ and/or multiple mapping reads, you should rather use an aligner that can handle that (tophat, RSEM,..) rather than using the transcriptome as a reference for your alignment. >> >> Cheers, >> >> Nico >> >>> >>> thanks! >>> J >>> >>> grep ">" dmel-all-chromosome-r5.45.fasta >>> >>>> YHet type=chromosome_arm; loc=YHet:1..347038; ID=YHet; dbxref=REFSEQ:NW_001848860,GB:CM000461; MD5=478fbc07ea1b1c03b3d8d04abacf51a5; length=347038; release=r5.45; species=Dmel; >>>> dmel_mitochondrion_genome type=chromosome; loc=dmel_mitochondrion_genome:1..19517; ID=dmel_mitochondrion_genome; dbxref=GB:NC_001709; MD5=61af8db53361cd5744f41f773d21c3d4; length=19517; release=r5.45; species=Dmel; >>>> 2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ:NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=23011544; release=r5.45; species=Dmel; >>>> X type=chromosome_arm; loc=X:1..22422827; ID=X; dbxref=REFSEQ:NC_004354,GB:AE014298; MD5=a83251de102926ba52e3e71c72f1d9b0; length=22422827; release=r5.45; species=Dmel; >>>> 3L type=chromosome_arm; loc=3L:1..24543557; ID=3L; dbxref=REFSEQ:NT_037436,GB:AE014296; MD5=ec7148cae3daabbd2a226eaa6e85d7c2; length=24543557; release=r5.45; species=Dmel; >>>> 4 type=chromosome_arm; loc=4:1..1351857; ID=4; dbxref=REFSEQ:NC_004353,GB:AE014135; MD5=515edfcc517bc4e39ae5c696cfd44021; length=1351857; release=r5.45; species=Dmel; >>>> 2R type=chromosome_arm; loc=2R:1..21146708; ID=2R; dbxref=REFSEQ:NT_033778,GB:AE013599; MD5=1589a9447d4dc94c048aa48ea5b8099d; length=21146708; release=r5.45; species=Dmel; >>>> 3R type=chromosome_arm; loc=3R:1..27905053; ID=3R; dbxref=REFSEQ:NT_033777,GB:AE014297; MD5=6b279651b3b268f11e0dd1d87ded0ccc; length=27905053; release=r5.45; species=Dmel; >>>> Uextra type=chromosome_arm; loc=Uextra:1..29004656; ID=Uextra; MD5=3d47647f1cde286a947d03cc17f0bad3; length=29004656; release=r5.45; species=Dmel; >>>> 2RHet type=chromosome_arm; loc=2RHet:1..3288761; ID=2RHet; dbxref=REFSEQ:NW_001848856,GB:CM000457; MD5=88c0ac39ebe4d9ef5a8f58cd746c9810; length=3288761; release=r5.45; species=Dmel; >>>> 2LHet type=chromosome_arm; loc=2LHet:1..368872; ID=2LHet; dbxref=REFSEQ:NW_001848855,GB:CM000456; MD5=5217e6a4295a824c31e43d5a9da9038b; length=368872; release=r5.45; species=Dmel; >>>> 3LHet type=chromosome_arm; loc=3LHet:1..2555491; ID=3LHet; dbxref=REFSEQ:NW_001848857,GB:CM000458; MD5=4902d32327726bf385aa55a75399bdfc; length=2555491; release=r5.45; species=Dmel; >>>> 3RHet type=chromosome_arm; loc=3RHet:1..2517507; ID=3RHet; dbxref=REFSEQ:NW_001848858,GB:CM000459; MD5=752e488dbea78e592aca03745eb732ea; length=2517507; release=r5.45; species=Dmel; >>>> U type=chromosome_arm; loc=U:1..10049037; ID=U; MD5=4b72bf19979c8466d5b66acca66f1804; length=10049037; release=r5.45; species=Dmel; >>>> XHet type=chromosome_arm; loc=XHet:1..204112; ID=XHet; dbxref=REFSEQ:NW_001848859,GB:CM000460; MD5=dcd15c551e34903f6171f53f553b55ee; length=204112; release=r5.45; species=Dmel; >>> >>> >>> >> >> > >