makeTxDbFromGFF returns object with wrong seqids (wrong chromosome names)
1
0
Entering edit mode
jggomeza • 0
@jggomeza-12779
Last seen 7.7 years ago

Greetings, 

I'm working on RNA-Seq analysis from dog (Canis Familiaris) samples and I'm using the last genome annotation data available (CanFam3.1) from NCBI (https://www.ncbi.nlm.nih.gov/genome?LinkName=assembly_genome&from_uid=317138)

When I use the following code:

txdb <- makeTxDbFromGFF(gtffile, format = "gff3",  circ_seqs = DEFAULT_CIRC_SEQS )

A txdb object is created successfully and R returns the following:

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... The following orphan exon were dropped (showing only the 6 first):
        seqid   start     end strand       ID   Parent     Name
1 NC_006590.3 2233573 2233690      + id182056 id182055 id182056
2 NC_006590.3 2233883 2234287      + id182057 id182055 id182057
3 NC_006590.3 2235773 2235792      + id182058 id182055 id182058
4 NC_006590.3 2262018 2262086      + id182061 id182060 id182061
5 NC_006590.3 2264839 2264917      + id182062 id182060 id182062
6 NC_006590.3 2265047 2265381      + id182063 id182060 id182063The following transcripts have exons that contain more than one
  CDS (only the first CDS was kept for each exon): rna27051,
  rna32073, rna36992, rna47992Named parameters not used in query: internal_chrom_id, chrom, length, is_circularNamed parameters not used in query: internal_id, name, type, chrom, strand, start, endNamed parameters not used in query: internal_id, name, chrom, strand, start, endNamed parameters not used in query: internal_id, name, chrom, strand, start, endNamed parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_idNamed parameters not used in query: gene_id, internal_tx_idOK

The following is the problem

When I ask for seqids to verify if they match the chromosome names in my BAM files I get some weird number which happens to be the NCBI chromosome accession number:

head(seqlevels (txdb))

[1] "NC_002008.4" "NC_006583.3" "NC_006584.3" "NC_006585.3" "NC_006586.3"
[6] "NC_006587.3"
 

Is there anyway I can modify the GFF and remove the pertinent column? or can I ask makeTxDbFromGFF to use different columns to make TxDb object?

Thank you!

 

genomicfeatures txdb canis familiaris rna-seq • 1.4k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 20 hours ago
United States

The GTF from NCBI just has the RefSeq chromosome identifiers, so I don't know if there is an easy way to get the conventional chromosomal names using their GTF. But you don't have to use that anyway. If you want NCBI-based TxDb packages, you can use makeTxDbFromUCSC:

> txdb <- makeTxDbFromUCSC("canFam3", "refGene")
Download the refGene table ... OK
Download the hgFixed.refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
  UCSC data anomaly in 116 transcript(s): the cds cumulative length is
  not a multiple of 3 for transcripts 'NM_001286967' 'NM_001002949'
  'NM_001003268' 'NM_001003372' 'NM_001012395' 'NM_001003021'
  'NM_001097552' 'NM_001197156' 'NM_001003011' 'NM_001097552'
  'NM_001003148' 'NM_001003195' 'NM_001003184' 'NM_001247976'
  'NM_001003321' 'NM_001284483' 'NM_001003140' 'NM_001003300'
  'NM_001197035' 'NM_001314111' 'NM_001252229' 'NM_001284472'
  'NM_001003125' 'NM_001003083' 'NM_001251942' 'NM_001003207'
  'NM_001002958' 'NM_001006646' 'NM_001161713' 'NM_001252035'
  'NM_001006650' 'NM_001197118' 'NM_001003077' 'NM_001003341'
  'NM_001006123' 'NM_001002995' 'NM_001002939' 'NM_001002960'
  'NM_001130437' 'NM_001113711' 'NM_001003356' 'NM_001008717'
  'NM_001003114' 'NM_001003082' 'NM_001252581' 'NM_001097559'
  'NM_001002978' 'NM_001284440' 'NM_001110179' 'NM_001005252'
  'NM_001003966' 'NM_001197187' 'NM_001131049' 'NM_001003343'
  'NM_001002979' 'NM_001003103' 'NM_001048084' 'NM_001005 [... truncated]
2: RSQLite::dbGetPreparedQuery() is deprecated, please switch to DBI::dbGetQuery(params = bind.data).
3: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
4: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end
6: Named parameters not used in query: internal_id, name, chrom, strand, start, end
7: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
8: Named parameters not used in query: gene_id, internal_tx_id
> seqinfo(txdb)
Seqinfo object with 3268 sequences (1 circular) from canFam3 genome:
  seqnames       seqlengths isCircular  genome
  chr1            122678785       <NA> canFam3
  chr2             85426708       <NA> canFam3
  chr3             91889043       <NA> canFam3
  chr4             88276631       <NA> canFam3
  chr5             88915250       <NA> canFam3
  ...                   ...        ...     ...
  chrUn_JH374189       8038       <NA> canFam3
  chrUn_JH374190       5797       <NA> canFam3
  chrUn_JH374191       6845       <NA> canFam3
  chrUn_JH374192       7721       <NA> canFam3
  chrUn_JH374193       5700       <NA> canFam3
ADD COMMENT

Login before adding your answer.

Traffic: 611 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6