Question

How to get The evolutionary conservation scores for lncRNAs by the phastCons100way.UCSC.hg38 (3.7.1) R package

0

Entering edit mode

rania.alanany • 0

@35004539

Last seen 1 day ago

Qatar

Hi, I want to get The evolutionary conservation scores for lncRNAs by the phastCons100way.UCSC.hg38 (3.7.1) R package. I already manually downloaded the lncrna annotation gtf file from gencode V34, GRCh38), and I extracted the exons for the lncrna.


>gtf_data_df <- as.data.frame(gtf_data)

# Filter for long non-coding RNAs

> gtf_data_df_final= gtf_data_df %>% filter (gene_type == "lncRNA" & type == "exon")
> head(gtf_data_df_final)
  seqnames start   end width strand source type score phase           gene_id
1     chr1 29554 30039   486      + HAVANA exon    NA    NA ENSG00000243485.5
2     chr1 30564 30667   104      + HAVANA exon    NA    NA ENSG00000243485.5
3     chr1 30976 31097   122      + HAVANA exon    NA    NA ENSG00000243485.5
4     chr1 30267 30667   401      + HAVANA exon    NA    NA ENSG00000243485.5
5     chr1 30976 31109   134      + HAVANA exon    NA    NA ENSG00000243485.5
6     chr1 35721 36081   361      - HAVANA exon    NA    NA ENSG00000237613.2
  gene_type   gene_name level    hgnc_id   tag          havana_gene
1    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
2    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
3    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
4    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
5    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
6    lncRNA     FAM138A     2 HGNC:32334 basic OTTHUMG00000000960.1
      transcript_id transcript_type transcript_name transcript_support_level
1 ENST00000473358.1          lncRNA MIR1302-2HG-202                        5
2 ENST00000473358.1          lncRNA MIR1302-2HG-202                        5
3 ENST00000473358.1          lncRNA MIR1302-2HG-202                        5
4 ENST00000469289.1          lncRNA MIR1302-2HG-201                        5
5 ENST00000469289.1          lncRNA MIR1302-2HG-201                        5
6 ENST00000417324.1          lncRNA     FAM138A-201                        1
     havana_transcript exon_number           exon_id  ont
1 OTTHUMT00000002840.1           1 ENSE00001947070.1 <NA>
2 OTTHUMT00000002840.1           2 ENSE00001922571.1 <NA>
3 OTTHUMT00000002840.1           3 ENSE00001827679.1 <NA>
4 OTTHUMT00000002841.1           1 ENSE00001841699.1 <NA>
5 OTTHUMT00000002841.1           2 ENSE00001890064.1 <NA>
6 OTTHUMT00000002842.1           1 ENSE00001656588.1 <NA>

sessionInfo( )

txdbmaker GenomicScores phastCons100way.UCSC.hg38 GenomicFeatures • 86 views

ADD COMMENT • link updated 2 hours ago by Robert Castelo ★ 3.4k • written 3 days ago by rania.alanany • 0

score 0 · Answer 1 · 2024-11-18

It's just a GScores object that you can use like any other S4Vector type thing. You extract using the gscores function, which takes a GRanges object as the second argument (and which you apparently had prior to converting to a data.frame, which is not something you normally want to do.)

Something like this should work

gtf_data_lncrna <- subset(gtf_data, gene_type == "lncRNA" & type == "exon")
gtf_data_lncrna$conservation <- gscores(phastCons100way.UCSC.hg38, gtf_data_lncrna)$default

You tagged GenomicScores and GenomicFeatures, so I would imagine you already understand how GRanges and other S4Vectors work, but if not, you should read the vignettes for those packages and GenomicRanges as well.

As an aside, you are mixing Ensembl and UCSC data here, and that might be OK or it might not be. There are fundamental differences between how the two annotation services (EBI/EMBL and NCBI) identify things (they have worked together for like the last five years or so just to come up with a single transcript per gene in human that they can agree on, so...). You might be better off getting the lncRNA locations from UCSC as well.

score 0 · Answer 2 · 2024-11-19

The phastCons100way.UCSC.hg38 annotation package, storing conservation scores of the human genome version hg38, should be queried with the function gscores() from the GenomicScores package, which is automatically loaded when you load phastCons100way.UCSC.hg38. The function gscores() expects at least two parameters: (1) an annotation package with the scores, such as phastCons100way.UCSC.hg38, and a GenomicRanges, or GRanges, object with the positions for which you want to retrieve the scores, in your case the conservation scores. These positions are often derived from some sort of gene or transcript annotations. The gscores() function, does not know whether those gene or transcript annotations form part of protein-coding genes or lncRNAs, and one should take care of giving the positions/annotations for which one wants to retrieve the scores.

From what you describe that you want to do I assume you want to download the following annotation GTF file:

download.file("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.long_noncoding_RNAs.gtf.gz",
              "gencode.v34.long_noncoding_RNAs.gtf.gz")

Assuming that you want to retrieve conservation scores for all of the annotations in this GTF file, the easiest way to import those annotations into a GRanges object is as follows:

library(txdbmaker)

txdb <- makeTxDbFromGFF("gencode.v34.long_noncoding_RNAs.gtf.gz")
exbytx <- exonsBy(txdb, by="tx")
length(exbytx)
[1] 48479

There are 48,479 transcripts in this GTF file, and the object exbytx is a GRangesList object that contains one GRanges object with the exon annotations for each of these 48,479 transcripts. Because gscores() expects a GRanges object, we will unlist exbytx to get all the exon annotations in a single GRanges object, then call gscores() and put back those exon annotations along their conservation scores in a GRangesList object back.

library(phastCons100way.UCSC.hg38)

allexons <- unlist(exbytx)
exonscores <- gscores(phastCons100way.UCSC.hg38, allexons) ## first time this should take about 20 seconds
exbytx <- relist(exonscores, exbytx)
exbytx[1]
GRangesList object of length 1:
$`1`
GRanges object with 3 ranges and 4 metadata columns:
    seqnames      ranges strand |   exon_id         exon_name exon_rank
       <Rle>   <IRanges>  <Rle> | <integer>       <character> <integer>
  1     chr1 29554-30039      + |         1 ENSE00001947070.1         1
  2     chr1 30564-30667      + |         3 ENSE00001922571.1         2
  1     chr1 30976-31097      + |         4 ENSE00001827679.1         3
        default
      <numeric>
  1 0.000205761
  2 0.162500000
  1 0.042622951
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome

The metadata column default contains the conservation scores. This resulting object is missing a lot of the available data in the GTF object that is not imported by makeTxDbFromGFF(). If you want to import all those data, because they might be important for your downstream analyses with the conservation scores, you could instead use the import() function from the rtracklayer package, which also provides a GRanges object, but including all the available data from the GTF file:

library(rtracklayer)

allfeatures <- import("gencode.v34.long_noncoding_RNAs.gtf.gz")
length(allfeatures)
[1] 243490
allfeatures[1:3]
GRanges object with 3 ranges and 19 metadata columns:
      seqnames      ranges strand |   source       type     score     phase
         <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>
  [1]     chr1 29554-31109      + |   HAVANA gene              NA      <NA>
  [2]     chr1 29554-31097      + |   HAVANA transcript        NA      <NA>
  [3]     chr1 29554-30039      + |   HAVANA exon              NA      <NA>
                gene_id   gene_type   gene_name       level     hgnc_id
            <character> <character> <character> <character> <character>
  [1] ENSG00000243485.5      lncRNA MIR1302-2HG           2  HGNC:52482
  [2] ENSG00000243485.5      lncRNA MIR1302-2HG           2  HGNC:52482
  [3] ENSG00000243485.5      lncRNA MIR1302-2HG           2  HGNC:52482
              tag          havana_gene     transcript_id transcript_type
      <character>          <character>       <character>     <character>
  [1]  ncRNA_host OTTHUMG00000000959.1              <NA>            <NA>
  [2]       basic OTTHUMG00000000959.1 ENST00000473358.1          lncRNA
  [3]       basic OTTHUMG00000000959.1 ENST00000473358.1          lncRNA
      transcript_name transcript_support_level    havana_transcript exon_number
          <character>              <character>          <character> <character>
  [1]            <NA>                     <NA>                 <NA>        <NA>
  [2] MIR1302-2HG-202                        5 OTTHUMT00000002840.1        <NA>
  [3] MIR1302-2HG-202                        5 OTTHUMT00000002840.1           1
                exon_id         ont
            <character> <character>
  [1]              <NA>        <NA>
  [2]              <NA>        <NA>
  [3] ENSE00001947070.1        <NA>
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

As you can see, here we get directly a GRanges object with all feature annotations (genes, transcripts, exons), and thus we can feed that directly to gscores():

featurescores <- gscores(phastCons100way.UCSC.hg38, allfeatures) ## now this should take about 10 seconds
featurescores2[1:3]
GRanges object with 3 ranges and 20 metadata columns:
      seqnames      ranges strand |   source       type     score     phase
         <Rle>   <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>
  [1]     chr1 29554-31109      + |   HAVANA gene              NA      <NA>
  [2]     chr1 29554-31097      + |   HAVANA transcript        NA      <NA>
  [3]     chr1 29554-30039      + |   HAVANA exon              NA      <NA>
                gene_id   gene_type   gene_name       level     hgnc_id
            <character> <character> <character> <character> <character>
  [1] ENSG00000243485.5      lncRNA MIR1302-2HG           2  HGNC:52482
  [2] ENSG00000243485.5      lncRNA MIR1302-2HG           2  HGNC:52482
  [3] ENSG00000243485.5      lncRNA MIR1302-2HG           2  HGNC:52482
              tag          havana_gene     transcript_id transcript_type
      <character>          <character>       <character>     <character>
  [1]  ncRNA_host OTTHUMG00000000959.1              <NA>            <NA>
  [2]       basic OTTHUMG00000000959.1 ENST00000473358.1          lncRNA
  [3]       basic OTTHUMG00000000959.1 ENST00000473358.1          lncRNA
      transcript_name transcript_support_level    havana_transcript exon_number
          <character>              <character>          <character> <character>
  [1]            <NA>                     <NA>                 <NA>        <NA>
  [2] MIR1302-2HG-202                        5 OTTHUMT00000002840.1        <NA>
  [3] MIR1302-2HG-202                        5 OTTHUMT00000002840.1           1
                exon_id         ont     default
            <character> <character>   <numeric>
  [1]              <NA>        <NA> 0.046465296
  [2]              <NA>        <NA> 0.046826425
  [3] ENSE00001947070.1        <NA> 0.000205761
  -------
  seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome

You can here either work further with the GRanges object, for instance, let's say you want the mean exon conservation score per transcript:

mask <- featurescores$type == "exon"
phastconsbytx <- aggregate(featurescores$default[mask], list(tx=featurescores$transcript_id[mask]), mean)
head(phastconsbytx)
                  tx           x
1 ENST00000229465.10 0.083043475
2  ENST00000230113.5 0.004694493
3  ENST00000235290.7 0.473864691
4  ENST00000242109.5 0.164317433
5  ENST00000244820.2 0.005057060
6  ENST00000244906.6 0.005229695

You can convert the GRanges object into a data.frame object, and dump it into a CSV file, or whatever you need:

head(as.data.frame(featurescores), 3)
  seqnames start   end width strand source       type score phase
1     chr1 29554 31109  1556      + HAVANA       gene    NA    NA
2     chr1 29554 31097  1544      + HAVANA transcript    NA    NA
3     chr1 29554 30039   486      + HAVANA       exon    NA    NA
            gene_id gene_type   gene_name level    hgnc_id        tag
1 ENSG00000243485.5    lncRNA MIR1302-2HG     2 HGNC:52482 ncRNA_host
2 ENSG00000243485.5    lncRNA MIR1302-2HG     2 HGNC:52482      basic
3 ENSG00000243485.5    lncRNA MIR1302-2HG     2 HGNC:52482      basic
           havana_gene     transcript_id transcript_type transcript_name
1 OTTHUMG00000000959.1              <NA>            <NA>            <NA>
2 OTTHUMG00000000959.1 ENST00000473358.1          lncRNA MIR1302-2HG-202
3 OTTHUMG00000000959.1 ENST00000473358.1          lncRNA MIR1302-2HG-202
  transcript_support_level    havana_transcript exon_number           exon_id
1                     <NA>                 <NA>        <NA>              <NA>
2                        5 OTTHUMT00000002840.1        <NA>              <NA>
3                        5 OTTHUMT00000002840.1           1 ENSE00001947070.1
   ont      default
1 <NA> 0.0464652956
2 <NA> 0.0468264249
3 <NA> 0.0002057613