Entering edit mode
刘鹏飞
▴
80
@-6181
Last seen 10.2 years ago
Hi,
Now I am want to use goseq for doing GO analysis of my bacteria RNA-
seq
data, but now I am stucked by the following issue:
1, makeTranscriptDbFromGFF
First download gff from NCBI(
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Pelotomaculum_thermopropio
nicum_SI_uid58877/NC_009454.gff),
try to use it to build TranscriptDb in order to get the lengthData
which
needed to import into goseq
>library(GenomicFeatures)
>gffFile="/home/liupf/NC_009454.gff"
>txdb <- makeTranscriptDbFromGFF(file="NC_009454.gff", format="gff3",
dataSource=NA, species="Pelotomaculum thermopropionicum")
failed:
Error in .prepareGFF3TXS(data, useGenesAsTranscripts) :
No Transcript information found in gff file
In addition: Warning message:
In as.data.frame(data, stringsAsFactors = FALSE) :
Arguments in '...' ignored
Then I follow the discussion in the mailinglist:
makeTranscriptDbFromGFF
fails on NCBI Bacteria genomes, use, the following way:
$perl bp_genbank2gff3.pl NC_009454.gbk
>txdb <- makeTranscriptDbFromGFF(file="NC_009454.gbk.gff",
format="gff3",
dataSource=NA, species="Pelotomaculum thermopropionicum")
Failed :
Error in .prepareGFF3TXS(data, useGenesAsTranscripts) :
Unexpected transcript duplicates
In addition: Warning message:
In as.data.frame(data, stringsAsFactors = FALSE) :
Arguments in '...' ignored
I know that for the first way, there is no mRNA and exon information,
as
mentioned here (
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053146.html) and
add
following like things would work:
NC_011025.1 RefSeq mRNA 107 1471 . + 0
ID=mRNA0;Parent=gene0
NC_011025.1 RefSeq exon 107 1471 . + 0
ID=exon0;Parent=mRNA0
Ouoted: "Taking a quick look at the GFF in question ... I don't see
any mRNA features.... they appear to be implicit .... which is not
well formed GFF3 (c.f. http://www.sequenceontology.org/gff3.shtml)
That is your first problem. In particular, the first gene in a file by
itself is rejected. Adding the mRNA and exon lines as below, and the
first gene is now accepted by makeTranscriptDbFromGFF"
But by doing so, The TranscriptDb only contain the first gene.
The second way may work on R-2.15 or before, but failed on R 3.0.0 (
https://stat.ethz.ch/pipermail/bioconductor/2013-June/053148.html) and
here
I got same error on R 3.0.2
this topic has been discussed before, but for bacteria genome, it
still
seems unsolved,
is there any solutions on this issue now?
Really thanks for this, for my work was stucked here now!
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GenomicFeatures_1.14.0 AnnotationDbi_1.24.0 Biobase_2.22.0
[4] GenomicRanges_1.14.1 XVector_0.2.0 IRanges_1.20.0
[7] BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] biomaRt_2.18.0 Biostrings_2.30.0 bitops_1.0-6 BSgenome_1.30.0
[5] DBI_0.2-7 RCurl_1.95-4.1 Rsamtools_1.14.1 RSQLite_0.11.4
[9] rtracklayer_1.22.0 stats4_3.0.2 tools_3.0.2 XML_3.98-1.1
[13] zlibbioc_1.8.0
--
Pengfei Liu, PhD Candidate
Lab of Microbial Ecology
College of Resources and Environmental Sciences
China Agricultural University
No.2 Yuanmingyuanxilu, Beijing, 100193
P.R. China
Tel: +86-10-62731358
Fax: +86-10-62731016
E-mail: liupfskygre@gmail.com
[[alternative HTML version deleted]]