Question

Uniprot IDs and NCBI taxonomy IDs to coding sequences

0

Entering edit mode

Peter • 0

@9aaaabdf

Last seen 3.0 years ago

United States

I have a list of uniprot protein IDs and NCBI taxonomy IDs for the organism that contains the protein. For each combination, I want the DNA sequence that codes for the protein (CDS) in the organism.

I'm trying to find the most straight forward approach that can be automated. I'm good with the idea of reading the uniprot protein and NCBI taxonomy IDs into R as a table or a data frame.

I just don't see an easy way to use UniprotR or biodbUniprot to get the CDS, and then don't know a good way to use the uniprot IDs with ensembledb or rentrez to get the CDS.

Thanks!

ensembldb biodbUniprot • 3.2k views

ADD COMMENT • link updated 3.0 years ago by James W. MacDonald 68k • written 3.0 years ago by Peter • 0

score 0 · Answer 1 · 2023-01-25

0

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

It all depends on what you mean by CDS. That's dependent on the transcript, and most genes have multiple transcripts. But anyway, here's a start.

> library(BSgenome.Hsapiens.UCSC.hg19)  
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> library(UniProt.ws)
> up <- UniProt.ws()
## a couple UniProt IDs
> protids <- c("P38398","P04637")
## convert to NCBI Gene IDs
> ncbid <- select(up, protids, "xref_geneid", "UniProtKB")
## get CDS by gene
> cds <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "gene")
## subset to the two we want - note that the results from UniProt have trailing semicolons
> my.cds <- cds[gsub(";", "",ncbid$GeneID)] 
## get the sequences
> seqs <- getSeq(Hsapiens, my.cds)
> seqs
DNAStringSetList of length 2
[["672"]] CAATTGGGCAGATGTGTGAGGCACCTGTGGTGACCCGAGAGTGGGTGTTGGACAGTGTAGCACTCTA...
[["7157"]] GCCAAGGCAGGTGGATCACCTGAGGTCAGGAGTTCAAGACAGCCTGGCCAACATAGCGAAATCCCA...
> seqs[[1]]
DNAStringSet object of length 36:
     width seq
 [1]   125 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...CTGATACCCCAGATCCCCCACAGCCACTACTGA
 [2]    19 CAATTGGGCAGATGTGTGA
 [3]    61 GGTGTCCACCCAATTGTGGTTGTGCAGCCAGATGCCTGGACAGAGGACAATGGCTTCCATG
 [4]    74 ATCAACTGGAATGGATGGTACAGCTGTGTGGTG...GTGAAGGAGCTTTCATCATTCACCCTTGGCACA
 [5]    31 ATGAGTGACAGCAAGAAAACCTGGCTGCAAT
 ...   ... ...
[32]    78 ATTTTGCATGCTGAAACTTCTCAACCAGAAGAA...GTGTCCTTTATGTAAGAATGATATAACCAAAAG
[33]     2 AT
[34]    54 TCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAA
[35]    80 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGCTATGCAGAAAATCTTAGAGTGTCCCATCTG
[36]     4 ATGG

Or maybe you want the CDS by transcript

## CDS by transcript, need the names
> cds2 <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, use.names = TRUE)
## map NCBI ID to UCSC transcript ID
> txids <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, gsub(";", "",ncbid$GeneID), "TXNAME","GENEID")
'select()' returned 1:many mapping between keys and columns
## remove those not available
> txids <- txids$TXNAME[txids$TXNAME %in% names(cds2)]
> my.cds2 <- cds2[txids]
> seqs2 <- getSeq(Hsapiens, my.cds2)
> seqs2
DNAStringSetList of length 34
[["uc010whl.2"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc002icp.4"]] AT ...
[["uc010whm.2"]] ATGG ...
[["uc002icu.3"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc010cyx.3"]] ATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGAT...
[["uc002icq.3"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc002ict.3"]] ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAA...
[["uc010whn.2"]] ATGCACAGTTGCTCTGGGAGTCTTCAGAATAGAAACTACCCATCTCAAGAGGAGCTCATT...
[["uc010who.3"]] ATGAGTGACAGCAAGAAAACCTGGCTGCAAT ...
[["uc010whp.2"]] ATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGAT...
...
<24 more elements>
> seqs2[[1]]
DNAStringSet object of length 22:
     width seq
 [1]    80 ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTA...TGCTATGCAGAAAATCTTAGAGTGTCCCATCTG
 [2]    54 TCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAA
 [3]    78 ATTTTGCATGCTGAAACTTCTCAACCAGAAGAA...GTGTCCTTTATGTAAGAATGATATAACCAAAAG
 [4]    89 GAGCCTACAAGAAAGTACGAGATTTAGTCAACT...TTTGTGCTTTTCAGCTTGACACAGGTTTGGAGT
 [5]   140 ATGCAAACAGCTATAATTTTGCAAAAAAGGAAA...CTTCTACAGAGTGAACCCGAAAATCCTTCCTTG
 ...   ... ...
[18]    84 CATGATTTTGAAGTCAGAGGAGATGTGGTCAAT...CCAAAGCGAGCAAGAGAATCCCAGGACAGAAAG
[19]    55 ATCTTCAGGGGGCTAGAAATCTGTTGCTATGGGCCCTTCACCAACATGCCCACAG
[20]    74 ATCAACTGGAATGGATGGTACAGCTGTGTGGTG...GTGAAGGAGCTTTCATCATTCACCCTTGGCACA
[21]    61 GGTGTCCACCCAATTGTGGTTGTGCAGCCAGATGCCTGGACAGAGGACAATGGCTTCCATG
[22]   125 CAATTGGGCAGATGTGTGAGGCACCTGTGGTGA...CTGATACCCCAGATCCCCCACAGCCACTACTGA
>

ADD COMMENT • link 3.0 years ago James W. MacDonald 68k

0

Entering edit mode

Thanks! I'll look at these. It seems like this will at least get me going in the right direction.

I want the DNA that is the CDS for the protein. Not the transcript sequence or the gene sequence. I don't want the 5' UTRs, 3' UTRs, or introns.

The uniprot protein IDs are for a specific protein sequence. The human PKLR gene generates two isoforms. The uniprot IDs are P30613-1 and P30613-2.

ADD REPLY • link 3.0 years ago Peter • 0

0

Entering edit mode

That's what the cdsBy function provides (hence the name). You can get all the parts of the gene and see that none overlap.

## PKLR
> txids <- select(TxDb.Hsapiens.UCSC.hg19.knownGene, "5313", "TXNAME", "GENEID")[,2]
## Note that UCSC says there are three transcripts, one of which has no UTR. We'll ignore that one here
> cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx",use.names = TRUE)[txids[1:2]]
GRangesList object of length 2:
$uc001fka.4
GRanges object with 11 ranges and 3 metadata columns:
       seqnames              ranges strand |    cds_id    cds_name exon_rank
          <Rle>           <IRanges>  <Rle> | <integer> <character> <integer>
   [1]     chr1 155270672-155270678      - |     19636        <NA>         1
   [2]     chr1 155269889-155270071      - |     19635        <NA>         2
   [3]     chr1 155265456-155265547      - |     19633        <NA>         3
   [4]     chr1 155265228-155265359      - |     19632        <NA>         4
   [5]     chr1 155264907-155265093      - |     19631        <NA>         5
   [6]     chr1 155264273-155264543      - |     19630        <NA>         6
   [7]     chr1 155264026-155264176      - |     19629        <NA>         7
   [8]     chr1 155263229-155263381      - |     19628        <NA>         8
   [9]     chr1 155262968-155263134      - |     19627        <NA>         9
  [10]     chr1 155261547-155261728      - |     19626        <NA>        10
  [11]     chr1 155260363-155260469      - |     19625        <NA>        11
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$uc001fkb.4
GRanges object with 11 ranges and 3 metadata columns:
       seqnames              ranges strand |    cds_id    cds_name exon_rank
          <Rle>           <IRanges>  <Rle> | <integer> <character> <integer>
   [1]     chr1 155271087-155271186      - |     19637        <NA>         1
   [2]     chr1 155269889-155270071      - |     19635        <NA>         2
   [3]     chr1 155265456-155265547      - |     19633        <NA>         3
   [4]     chr1 155265228-155265359      - |     19632        <NA>         4
   [5]     chr1 155264907-155265093      - |     19631        <NA>         5
   [6]     chr1 155264273-155264543      - |     19630        <NA>         6
   [7]     chr1 155264026-155264176      - |     19629        <NA>         7
   [8]     chr1 155263229-155263381      - |     19628        <NA>         8
   [9]     chr1 155262968-155263134      - |     19627        <NA>         9
  [10]     chr1 155261547-155261728      - |     19626        <NA>        10
  [11]     chr1 155260363-155260469      - |     19625        <NA>        11
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

> threeUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene, use.names = TRUE)[txids[1:2]]
GRangesList object of length 2:
$uc001fka.4
GRanges object with 1 range and 3 metadata columns:
      seqnames              ranges strand |   exon_id   exon_name exon_rank
         <Rle>           <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 155259084-155260362      - |     22835        <NA>        11
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$uc001fkb.4
GRanges object with 1 range and 3 metadata columns:
      seqnames              ranges strand |   exon_id   exon_name exon_rank
         <Rle>           <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 155259084-155260362      - |     22835        <NA>        11
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

> fiveUTRsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene, use.names = TRUE)[txids[1:2]]
GRangesList object of length 2:
$uc001fka.4
GRanges object with 1 range and 3 metadata columns:
      seqnames              ranges strand |   exon_id   exon_name exon_rank
         <Rle>           <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 155270679-155270792      - |     22846        <NA>         1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$uc001fkb.4
GRanges object with 1 range and 3 metadata columns:
      seqnames              ranges strand |   exon_id   exon_name exon_rank
         <Rle>           <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     chr1 155271187-155271225      - |     22847        <NA>         1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

> intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene, use.names = TRUE)[txids[1:2]]
GRangesList object of length 2:
$uc001fka.4
GRanges object with 10 ranges and 0 metadata columns:
       seqnames              ranges strand
          <Rle>           <IRanges>  <Rle>
   [1]     chr1 155260470-155261546      -
   [2]     chr1 155261729-155262967      -
   [3]     chr1 155263135-155263228      -
   [4]     chr1 155263382-155264025      -
   [5]     chr1 155264177-155264272      -
   [6]     chr1 155264544-155264906      -
   [7]     chr1 155265094-155265227      -
   [8]     chr1 155265360-155265455      -
   [9]     chr1 155265548-155269888      -
  [10]     chr1 155270072-155270671      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

$uc001fkb.4
GRanges object with 10 ranges and 0 metadata columns:
       seqnames              ranges strand
          <Rle>           <IRanges>  <Rle>
   [1]     chr1 155260470-155261546      -
   [2]     chr1 155261729-155262967      -
   [3]     chr1 155263135-155263228      -
   [4]     chr1 155263382-155264025      -
   [5]     chr1 155264177-155264272      -
   [6]     chr1 155264544-155264906      -
   [7]     chr1 155265094-155265227      -
   [8]     chr1 155265360-155265455      -
   [9]     chr1 155265548-155269888      -
  [10]     chr1 155270072-155271086      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

## check overlap between introns and CDS

> pcds <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx",use.names = TRUE)[txids[1:2]]
> icds <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene, use.names = TRUE)[txids[1:2]]

> findOverlaps(unlist(pcds), unlist(icds))
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1          20
  -------
  queryLength: 22 / subjectLength: 20
> unlist(pcds)[1]
GRanges object with 1 range and 3 metadata columns:
             seqnames              ranges strand |    cds_id    cds_name
                <Rle>           <IRanges>  <Rle> | <integer> <character>
  uc001fka.4     chr1 155270672-155270678      - |     19636        <NA>
             exon_rank
             <integer>
  uc001fka.4         1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
> unlist(icds)[20]
GRanges object with 1 range and 0 metadata columns:
             seqnames              ranges strand
                <Rle>           <IRanges>  <Rle>
  uc001fkb.4     chr1 155270072-155271086      -
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

You can eyeball the UTR ranges and see that they don't overlap the CDS ranges. We can use findOverlaps to check the introns vs the exons, and there is one overlap of an exon with an intron between the two transcripts. If you look at the UCSC genome browser you will note that the two transcripts are identical except for a very small exon (7 nt long) in one transcript that is in the middle of an intron for the other transcript, coming right after the 5' UTR. That small exon technically overlaps the intron for the other transcript, but otherwise the CDS is disjoint from all the things you don't want.

ADD REPLY • link 3.0 years ago James W. MacDonald 68k