Entering edit mode
delhomme@embl.de
★
1.2k
@delhommeemblde-3232
Last seen 10.3 years ago
Dear Wade,
Thanks for the nice words, that's good to hear! Although I have to say
it got me a little worried, because the more people praise, the more
difficult the questions to come ;-)
I added the bioconductor mailing list in Cc, so that it benefits
others.
The first error, I was surprised to see. From the parallel package
documentation, I did not expect it, but it got well extended since I
started using it and is now obvious that the MakeForkCluster would not
work on windows. As I don't have access to any Windows machine, I
would like to give you the code that I'll write to replace that per
email (off list) so that you can try it and tell me if it fixes the
issue. Would that be OK?
For the second error, you are correct. Note that it is NOT necessary
to use the BSgenome for easyRNASeq, that's just that as they are
available in Bioconductor it's easy to use them for exemplifying, but
they are not mandatory. I need to state that more explicitly in the
vignette.
Back to your issue, I'm expecting in the alignment file and in the
chr.sizes object chromosome names that follow UCSC conventions, e.g.
chr1 to chr19 and chrX, chrY and chrM for the M.musculus for example.
If it's not what I get, I try to "guess" them but obviously, since
every sequencing facility has its own convention, it is not a
straightforward business. For that reason, I've implemented a pass-by.
To use it, we need to do the following
1) create a data.frame that contains the mapping from your different
input. As you're using biomaRt as an annotationMethod, we will have
three sets of chromosome names:
a) biomaRT: 1:19,"X","Y"
b) your alignment file: chr1.fa to chr19.fa, chrX.fa, chrY.fa, etc...
c) the chromosome file: chr1 to chr19, chrX, chrY, etc...
that we need to combine together. Since c) follows the UCSC
convention, we'll use that as our standard. I'm not sure what you want
to do about the NM, QC, RM and splice_sites-auto.fa, so I ignore them.
From the name, you probably don't want to ignore that last one, but
I'm not sure how you can use it. Here we go
chr.map <- data.frame(from=c(1:19,"X","Y",paste("chr",c(1:19,"X","Y"),
".fa",sep="")),to=rep(paste("chr",c(1:19,"X","Y"),sep=""),2))
2) rerun you command with the chr.map argument and a "custom" organism
argument
rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
organism="custom",
chr.sizes=chr.sizes,
filter=myfilt,
readLength=50L,
annotationMethod="biomaRt",
format="aln",
count="exons",
pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
outputFormat="RNAseq",
chr.map=chr.map
)
Note that I haven't tried that myself, I always had a least two set of
chromosomes in agreement, so it might fail. If so, it would be great
if I could get access to an excerpt of your data to try it out.
Finally, note that there's another way to by-pass this issue, if you
are not using biomaRt but a local annotation source (gff, gtf, rda,
etc...). If this file would contain the same chromosome names as your
alignment file, you could do the following:
1) change the chromosome name in the chr.sizes to add a ".fa" at the
end
chr.sizes <- as.list(seqlengths(Mmusculus))
names(chr.sizes) <- paste(names(chr.sizes),"fa",sep=".")
2) rerun your command with that new chr.sizes
and precise validity.check=FALSE in the easyRNASeq command
rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
organism="Mmusculus",
chr.sizes=chr.sizes,
filter=myfilt,
readLength=50L,
annotationMethod="biomaRt",
format="aln",
count="exons",
pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
outputFormat="RNAseq",
validity.check=FALSE
)
You need to be careful about what you're doing there, since you shut
off all the consistency check on the chromosome naming.
Cheers,
Nico
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------
On 29 Feb 2012, at 22:07, Davis, Wade wrote:
> Dear Nicolas,
> First , I want to thank you for the hard work you have put in making
the easyRNASeq package. I am very familiar with R and BioConductor,
and NGS statistical models. But I have been learning over the past
month how to deal with raw NGS for the first time, and the number of
different ways and packages is a bit overwhelming. Fortunately, I
discovered your package, and it has simplified things greatly and
helped me understand the workflow better to getting to a count table.
So once again, thank you! I am sure you will have many users in the
coming months.
>
> I have 2 questions. The first relates to the error below which seems
to come from the summarization method I am using, which must involve
trying to do some work in parallel. However, since the default nbCore
is 1, I am surprised it still tries to use multiple processers (I am
on a 8 core machine). Is there anything I can do about this? I really
want to use the geneModels..
> My second question follows the output below.
>
> Thanks for your help and time!
>
> J. Wade Davis, PhD
> Associate Professor, Department of Health Management and Informatics
> Co-Director, Biostatistics and Research Design Unit
> 187 Galena DC 018.0
> University of Missouri
> Columbia, MO 65212
> Phone: (573) 882-0770
> Fax: (573) 884-4196
> www.biostat.missouri.edu
>
>
>
> > # run this code only the first time this way to get the
annotations;
> > # run the subsequent code for future use....
> > rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
> + organism="Mmusculus",
> + chr.sizes=chr.sizes,
> + filter=myfilt,
> + readLength=50L,
> + annotationMethod="biomaRt",
> + # format="SolexaExport",
> + format="aln",
> + #withId=TRUE,
> + count="genes",
> + summarization= "geneModels",
> + #count="exons", #generates error message
> + #filenames=fastqfilenames[1],
> +
pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
> + outputFormat="RNAseq"
> + )
> Checking arguments...
> Fetching annotations...
> Computing gene models...
> Error in makeForkCluster(nnodes = nbCore) :
> fork clusters are not supported on Windows
> > sessionInfo()
> R Under development (unstable) (2012-02-28 r58513)
> Platform: x86_64-pc-mingw32/x64 (64-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets
methods base
>
> other attached packages:
> [1] RnaSeqTutorial_0.0.7
BSgenome.Dmelanogaster.UCSC.dm3_1.3.17
BSgenome.Mmusculus.UCSC.mm9_1.3.17 easyRNASeq_1.1.7
> [5] ShortRead_1.13.13 latticeExtra_0.6-20
RColorBrewer_1.0-5 Rsamtools_1.7.37
> [9] DESeq_1.7.6 locfit_1.5-6
lattice_0.20-0 akima_0.5-7
> [13] BSgenome_1.23.2 GenomicRanges_1.7.28
Biostrings_2.23.6 IRanges_1.13.26
> [17] edgeR_2.5.9 limma_3.11.15
biomaRt_2.11.1 Biobase_2.15.3
> [21] BiocGenerics_0.1.7 genomeIntervals_1.11.0
intervals_0.13.3
>
> loaded via a namespace (and not attached):
> [1] annotate_1.33.2 AnnotationDbi_1.17.22 bitops_1.0-4.1
DBI_0.2-5 genefilter_1.37.0 geneplotter_1.33.1
grid_2.15.0
> [8] hwriter_1.3 RCurl_1.91-1.1 RSQLite_0.11.1
splines_2.15.0 survival_2.36-12 tools_2.15.0
XML_3.9-4.1
> [15] xtable_1.7-0 zlibbioc_1.1.1
> > library(parallel)
> > ?makeForkClsuter
> No documentation for ?makeForkClsuter? in specified packages and
libraries:
> you could try ???makeForkClsuter?
>
>
> My second question is about a different error message, which seems
very close to one you dealt with on the mailing list a few weeks ago,
but the person never said if it was resolved:
>
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/
38841/focus=38858
>
> Here is my code and the error:
>
>
> rnaSeq2 <- easyRNASeq(filesDirectory=extdataDir,
> + organism="Mmusculus",
> + chr.sizes=chr.sizes,
> + filter=myfilt,
> + readLength=50L,
> + annotationMethod="biomaRt",
> + format="aln",
> + count="exons",
> +
pattern="^1-Feb_ATCACG_L003_R1_001_export.txt.gz$",
> + outputFormat="RNAseq"
> + )
> Checking arguments...
> Fetching annotations...
> Summarizing counts...
> Processing 1-Feb_ATCACG_L003_R1_001_export.txt.gz
> Error in FUN(c("chr1.fa", "chr10.fa", "chr11.fa", "chr12.fa",
"chr13.fa", :
> (list) object cannot be coerced to type 'integer'
> In addition: Warning message:
> In easyRNASeq(filesDirectory = extdataDir, organism = "Mmusculus",
:
> There are 429316 features/exons defined in your annotation that
overlap! This implies that some reads will be counted more than once!
Is that really what you want?
>
> What easyRNASeq is using as my chromosome names is different from my
Illumina file (see below). I?m sure this is the source of my error, I
just don?t know how to change the names and fix this. The file I am
dealing with from the sequencing core is aligned I am told as such
?The alignment was run in a splice-aware fashion using as input the
mm9 build of the mouse genome from UCSC as provided by Illumina.?
>
>
> From my Illumina export/fastQ file read in using ReadAlign:
>
>
> table(chromosome(aln))
>
> chr1.fa chr10.fa chr11.fa
chr12.fa chr13.fa chr14.fa
> 285307 221943 395265
115402 108108 127673
> chr15.fa chr16.fa chr17.fa
chr18.fa chr19.fa chr2.fa
> 159310 143801 201871
83037 156612 304649
> chr3.fa chr4.fa chr5.fa
chr6.fa chr7.fa chr8.fa
> 168498 222071 234236
229869 297021 194308
> chr9.fa chrX.fa chrY.fa
NM QC RM
> 264232 124942 3
1009813 5877 1074323
> splice_sites-auto.fa
> 804227
>
>
>
>
> as.list(seqlengths(Mmusculus))
>
>
> $chr1
> [1] 197195432
>
> $chr2
> [1] 181748087
>
> $chr3
> [1] 159599783
>
> $chr4
> [1] 155630120
>
> $chr5
> [1] 152537259
>
> $chr6
> [1] 149517037
>
> $chr7
> [1] 152524553
>
> $chr8
> [1] 131738871
>
> $chr9
> [1] 124076172
>
> $chr10
> [1] 129993255
>
> $chr11
> [1] 121843856
>
> $chr12
> [1] 121257530
>
> $chr13
> [1] 120284312
>
> $chr14
> [1] 125194864
>
> $chr15
> [1] 103494974
>
> $chr16
> [1] 98319150
>
> $chr17
> [1] 95272651
>
> $chr18
> [1] 90772031
>
> $chr19
> [1] 61342430
>
> $chrX
> [1] 166650296
>
> $chrY
> [1] 15902555
>
> $chrM
> [1] 16299
>
> $chr1_random
> [1] 1231697
>
> $chr3_random
> [1] 41899
>
> $chr4_random
> [1] 160594
>
> $chr5_random
> [1] 357350
>
> $chr7_random
> [1] 362490
>
> $chr8_random
> [1] 849593
>
> $chr9_random
> [1] 449403
>
> $chr13_random
> [1] 400311
>
> $chr16_random
> [1] 3994
>
> $chr17_random
> [1] 628739
>
> $chrX_random
> [1] 1785075
>
> $chrY_random
> [1] 58682461
>
> $chrUn_random
> [1] 5900358
>
>
>
>