Question

Missing values when extracting 5'UTR using GenomicFeatures?

1

Entering edit mode

polly.hsu ▴ 10

@pollyhsu-7029

Last seen 9.5 years ago

United States

Hi everyone,

I am trying to get the range information for all 5'UTR in Arabidopsis using GenomicFeatures and have 2 questions.

1. I found some of the genes that contain 5'UTR are missing in my results below. For example, AT1G01020 and AT1G01030 both contain 5'UTR, but they are missing in the list. Does anyone have a clue why this might happen?

2. Is there a way we can have a list of 5'UTR included in all of the gene models for a given gene, rather than for a given gene model? I mean, like for exons, we can use "exonsBy(txdb,by="gene")". Is there anything similar to that for 5'-UTR?

Many thanks!

- Polly

# the gff file is downloaded here (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_gff3).

> txdb <- makeTranscriptDbFromGFF("TAIR10_GFF3_genes.gff",format="gff")

> fiveUTR <-fiveUTRsByTranscript(txdb, use.names=T)

> head(fiveUTR)

GRangesList object of length 6:

$AT1G01010.1

GRanges object with 1 range and 3 metadata columns:

seqnames ranges strand | exon_id exon_name exon_rank

[1] Chr1 [3631, 3759] + | 1 <NA> 1

$AT1G01040.1

GRanges object with 1 range and 3 metadata columns:

seqnames ranges strand | exon_id exon_name exon_rank

[1] Chr1 [23146, 23518] + | 7 <NA> 1

$AT1G01040.2

GRanges object with 1 range and 3 metadata columns:

seqnames ranges strand | exon_id exon_name exon_rank

[1] Chr1 [23416, 23518] + | 8 <NA> 1

...

<3 more elements>

-------

seqinfo: 7 sequences (1 circular) from an unspecified genome; no seqlengths

> sessionInfo()

R version 3.1.1 (2014-07-10)

Platform: x86_64-redhat-linux-gnu (64-bit)

locale:

[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C

[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8

[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8

[7] LC_PAPER=en_US.UTF-8 LC_NAME=C

[9] LC_ADDRESS=C LC_TELEPHONE=C

[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

[1] stats4 parallel stats graphics grDevices utils datasets

[8] methods base

other attached packages:

[1] rtracklayer_1.26.1 GenomicFeatures_1.18.0 AnnotationDbi_1.28.0

[4] Biobase_2.26.0 GenomicAlignments_1.2.0 Rsamtools_1.18.0

[7] Biostrings_2.34.0 XVector_0.6.0 GenomicRanges_1.18.1

[10] GenomeInfoDb_1.2.0 IRanges_2.0.0 S4Vectors_0.4.0

[13] BiocGenerics_0.12.0

loaded via a namespace (and not attached):

[1] base64enc_0.1-2 BatchJobs_1.4 BBmisc_1.7 BiocParallel_1.0.0

[5] biomaRt_2.22.0 bitops_1.0-6 brew_1.0-6 checkmate_1.5.0

[9] codetools_0.2-8 DBI_0.3.1 digest_0.6.4 fail_1.2

[13] foreach_1.4.2 iterators_1.0.7 RCurl_1.95-4.3 RSQLite_0.11.4

[17] sendmailR_1.2-1 stringr_0.6.2 tools_3.1.1 XML_3.98-1.1

[21] zlibbioc_1.12.0

GenomicFeatures 5'UTR • 1.4k views

ADD COMMENT • link updated 9.5 years ago by Hervé Pagès 16k • written 9.5 years ago by polly.hsu ▴ 10

score 0 · Answer 1 · 2014-11-12

Hi,

The man page for fiveUTRsByTranscript() says when 'use.names=TRUE', the names on the output are transcript names (not ids). The name prefix is the gene name and the suffix is the transcript number.

> nms <- names(fiveUTR)
> head(nms)
[1] "AT1G01010.1" "AT1G01040.1" "AT1G01040.2" "AT1G01110.1" "AT1G01160.1"
[6] "AT1G01160.2"

To find a gene name in fiveUTR you need to use the name prefix.

> mat <- matrix(unlist(strsplit(nms, ".", fixed=TRUE)), ncol=2, byrow=TRUE)
> head(mat)
[,1] [,2]
[1,] "AT1G01010" "1"
[2,] "AT1G01040" "1"
[3,] "AT1G01040" "2"
[4,] "AT1G01110" "1"
[5,] "AT1G01160" "1"
[6,] "AT1G01160" "2"

Match the prefix to the desired gene name(s):

> prefix <- mat[, 1]
> which(prefix == "AT1G01020")
[1] 3602 3603

Use the index to subset fiveUTR:

> fiveUTR[which(prefix == "AT1G01020")]
GRangesList object of length 2:
$AT1G01020.1
GRanges object with 1 range and 3 metadata columns:
      seqnames       ranges strand |   exon_id   exon_name exon_rank
         <Rle>    <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     Chr1 [8667, 8737]      - |     20856        <NA>         1

$AT1G01020.2
GRanges object with 1 range and 3 metadata columns:
      seqnames       ranges strand | exon_id exon_name exon_rank
  [1]     Chr1 [8667, 8737]      - |   20856      <NA>         1

Your second question (I think) is how to group the 5'UTRs from the same gene together. Again we can use the prefix.

Rep out the gene name to match the number of UTR elements:

> repGeneName <- rep(prefix, elementLengths(fiveUTR))
> fiveUTRByGene <- split(unlist(fiveUTR), factor(repGeneName))
> fiveUTRByGene
GRangesList object of length 19737:
$AT1G01010
GRanges object with 1 range and 3 metadata columns:
              seqnames       ranges strand |   exon_id   exon_name exon_rank
                 <Rle>    <IRanges>  <Rle> | <integer> <character> <integer>
  AT1G01010.1     Chr1 [3631, 3759]      + |         1        <NA>         1

$AT1G01020
GRanges object with 2 ranges and 3 metadata columns:
              seqnames       ranges strand | exon_id exon_name exon_rank
  AT1G01020.1     Chr1 [8667, 8737]      - |   20856      <NA>         1
  AT1G01020.2     Chr1 [8667, 8737]      - |   20856      <NA>         1

Names on the outer list elements are gene names and the inner list element names are 'gene.transcript.

> fiveUTRByGene["AT1G01020"]
GRangesList object of length 1:
$AT1G01020
GRanges object with 2 ranges and 3 metadata columns:
              seqnames       ranges strand |   exon_id   exon_name exon_rank
                 <Rle>    <IRanges>  <Rle> | <integer> <character> <integer>
  AT1G01020.1     Chr1 [8667, 8737]      - |     20856        <NA>         1
  AT1G01020.2     Chr1 [8667, 8737]      - |     20856        <NA>         1

Valerie

score 0 · Answer 2 · 2014-11-13

Hi Polly,

Here is an alternative to Val's approach that doesn't require you to know anything about the fancy transcript naming scheme they use at TAIR (i.e. <tx name> = <gene id>.<tx number>) so it will work on any TxDb object. It takes advantage of the mapping between gene ids and tx names that you can extract from any TxDb object with:

gene2tx <- mcols(transcripts(txdb, columns=c("gene_id", "tx_name")))
gene2tx$gene_id <- as.character(gene2tx$gene_id)

Then:

> gene2tx
DataFrame with 35386 rows and 2 columns
          gene_id     tx_name
      <character> <character>
1       AT1G01010 AT1G01010.1
2       AT1G01040 AT1G01040.1
3       AT1G01040 AT1G01040.2
4       AT1G01073 AT1G01073.1
5       AT1G01110 AT1G01110.2
...           ...         ...
35382   ATMG01320 ATMG01320.1
35383   ATMG01330 ATMG01330.1
35384   ATMG01360 ATMG01360.1
35385   ATMG01370 ATMG01370.1
35386   ATMG01410 ATMG01410.1

Then:

> fiveUTR[gene2tx$tx_name[gene2tx$gene_id %in% "AT1G01020"]]
GRangesList object of length 2:
$AT1G01020.1 
GRanges object with 1 range and 3 metadata columns:
      seqnames       ranges strand |   exon_id   exon_name exon_rank
         <Rle>    <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     Chr1 [8667, 8737]      - |     20856        <NA>         1

$AT1G01020.2 
GRanges object with 1 range and 3 metadata columns:
      seqnames       ranges strand | exon_id exon_name exon_rank
  [1]     Chr1 [8667, 8737]      - |   20856      <NA>         1

-------
seqinfo: 7 sequences (1 circular) from an unspecified genome; no seqlengths

> fiveUTR[gene2tx$tx_name[gene2tx$gene_id %in% "AT1G01030"]]
GRangesList object of length 1:
$AT1G01030.1 
GRanges object with 2 ranges and 3 metadata columns:
      seqnames         ranges strand |   exon_id   exon_name exon_rank
         <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
  [1]     Chr1 [13335, 13714]      - |     20858        <NA>         1
  [2]     Chr1 [12941, 13173]      - |     20857        <NA>         2

-------
seqinfo: 7 sequences (1 circular) from an unspecified genome; no seqlengths

That answers your first question.

To group the 5'UTRs by gene:

tx2gene <- gene2tx$gene_id
names(tx2gene) <- gene2tx$tx_name
f <- rep.int(tx2gene[names(fiveUTR)], elementLengths(fiveUTR))
fiveUTR_by_gene <- split(unlist(fiveUTR), f)  # same as Val's fiveUTRByGene object

Hope this helps.

H.