The phastCons100way.UCSC.hg38
annotation package, storing conservation scores of the human genome version hg38
, should be queried with the function gscores()
from the GenomicScores
package, which is automatically loaded when you load phastCons100way.UCSC.hg38
. The function gscores()
expects at least two parameters: (1) an annotation package with the scores, such as phastCons100way.UCSC.hg38
, and a GenomicRanges
, or GRanges
, object with the positions for which you want to retrieve the scores, in your case the conservation scores. These positions are often derived from some sort of gene or transcript annotations. The gscores()
function, does not know whether those gene or transcript annotations form part of protein-coding genes or lncRNAs, and one should take care of giving the positions/annotations for which one wants to retrieve the scores.
From what you describe that you want to do I assume you want to download the following annotation GTF file:
download.file("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.long_noncoding_RNAs.gtf.gz",
"gencode.v34.long_noncoding_RNAs.gtf.gz")
Assuming that you want to retrieve conservation scores for all of the annotations in this GTF file, the easiest way to import those annotations into a GRanges
object is as follows:
library(txdbmaker)
txdb <- makeTxDbFromGFF("gencode.v34.long_noncoding_RNAs.gtf.gz")
exbytx <- exonsBy(txdb, by="tx")
length(exbytx)
[1] 48479
There are 48,479 transcripts in this GTF file, and the object exbytx
is a GRangesList
object that contains one GRanges
object with the exon annotations for each of these 48,479 transcripts. Because gscores()
expects a GRanges
object, we will unlist exbytx
to get all the exon annotations in a single GRanges
object, then call gscores()
and put back those exon annotations along their conservation scores in a GRangesList
object back.
library(phastCons100way.UCSC.hg38)
allexons <- unlist(exbytx)
exonscores <- gscores(phastCons100way.UCSC.hg38, allexons) ## first time this should take about 20 seconds
exbytx <- relist(exonscores, exbytx)
exbytx[1]
GRangesList object of length 1:
$`1`
GRanges object with 3 ranges and 4 metadata columns:
seqnames ranges strand | exon_id exon_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
1 chr1 29554-30039 + | 1 ENSE00001947070.1 1
2 chr1 30564-30667 + | 3 ENSE00001922571.1 2
1 chr1 30976-31097 + | 4 ENSE00001827679.1 3
default
<numeric>
1 0.000205761
2 0.162500000
1 0.042622951
-------
seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome
The metadata column default
contains the conservation scores. This resulting object is missing a lot of the available data in the GTF object that is not imported by makeTxDbFromGFF()
. If you want to import all those data, because they might be important for your downstream analyses with the conservation scores, you could instead use the import()
function from the rtracklayer
package, which also provides a GRanges
object, but including all the available data from the GTF file:
library(rtracklayer)
allfeatures <- import("gencode.v34.long_noncoding_RNAs.gtf.gz")
length(allfeatures)
[1] 243490
allfeatures[1:3]
GRanges object with 3 ranges and 19 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] chr1 29554-31109 + | HAVANA gene NA <NA>
[2] chr1 29554-31097 + | HAVANA transcript NA <NA>
[3] chr1 29554-30039 + | HAVANA exon NA <NA>
gene_id gene_type gene_name level hgnc_id
<character> <character> <character> <character> <character>
[1] ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482
[2] ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482
[3] ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482
tag havana_gene transcript_id transcript_type
<character> <character> <character> <character>
[1] ncRNA_host OTTHUMG00000000959.1 <NA> <NA>
[2] basic OTTHUMG00000000959.1 ENST00000473358.1 lncRNA
[3] basic OTTHUMG00000000959.1 ENST00000473358.1 lncRNA
transcript_name transcript_support_level havana_transcript exon_number
<character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] MIR1302-2HG-202 5 OTTHUMT00000002840.1 <NA>
[3] MIR1302-2HG-202 5 OTTHUMT00000002840.1 1
exon_id ont
<character> <character>
[1] <NA> <NA>
[2] <NA> <NA>
[3] ENSE00001947070.1 <NA>
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
As you can see, here we get directly a GRanges
object with all feature annotations (genes, transcripts, exons), and thus we can feed that directly to gscores()
:
featurescores <- gscores(phastCons100way.UCSC.hg38, allfeatures) ## now this should take about 10 seconds
featurescores2[1:3]
GRanges object with 3 ranges and 20 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] chr1 29554-31109 + | HAVANA gene NA <NA>
[2] chr1 29554-31097 + | HAVANA transcript NA <NA>
[3] chr1 29554-30039 + | HAVANA exon NA <NA>
gene_id gene_type gene_name level hgnc_id
<character> <character> <character> <character> <character>
[1] ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482
[2] ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482
[3] ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482
tag havana_gene transcript_id transcript_type
<character> <character> <character> <character>
[1] ncRNA_host OTTHUMG00000000959.1 <NA> <NA>
[2] basic OTTHUMG00000000959.1 ENST00000473358.1 lncRNA
[3] basic OTTHUMG00000000959.1 ENST00000473358.1 lncRNA
transcript_name transcript_support_level havana_transcript exon_number
<character> <character> <character> <character>
[1] <NA> <NA> <NA> <NA>
[2] MIR1302-2HG-202 5 OTTHUMT00000002840.1 <NA>
[3] MIR1302-2HG-202 5 OTTHUMT00000002840.1 1
exon_id ont default
<character> <character> <numeric>
[1] <NA> <NA> 0.046465296
[2] <NA> <NA> 0.046826425
[3] ENSE00001947070.1 <NA> 0.000205761
-------
seqinfo: 455 sequences (1 circular) from Genome Reference Consortium GRCh38 genome
You can here either work further with the GRanges
object, for instance, let's say you want the mean exon conservation score per transcript:
mask <- featurescores$type == "exon"
phastconsbytx <- aggregate(featurescores$default[mask], list(tx=featurescores$transcript_id[mask]), mean)
head(phastconsbytx)
tx x
1 ENST00000229465.10 0.083043475
2 ENST00000230113.5 0.004694493
3 ENST00000235290.7 0.473864691
4 ENST00000242109.5 0.164317433
5 ENST00000244820.2 0.005057060
6 ENST00000244906.6 0.005229695
You can convert the GRanges
object into a data.frame
object, and dump it into a CSV file, or whatever you need:
head(as.data.frame(featurescores), 3)
seqnames start end width strand source type score phase
1 chr1 29554 31109 1556 + HAVANA gene NA NA
2 chr1 29554 31097 1544 + HAVANA transcript NA NA
3 chr1 29554 30039 486 + HAVANA exon NA NA
gene_id gene_type gene_name level hgnc_id tag
1 ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482 ncRNA_host
2 ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482 basic
3 ENSG00000243485.5 lncRNA MIR1302-2HG 2 HGNC:52482 basic
havana_gene transcript_id transcript_type transcript_name
1 OTTHUMG00000000959.1 <NA> <NA> <NA>
2 OTTHUMG00000000959.1 ENST00000473358.1 lncRNA MIR1302-2HG-202
3 OTTHUMG00000000959.1 ENST00000473358.1 lncRNA MIR1302-2HG-202
transcript_support_level havana_transcript exon_number exon_id
1 <NA> <NA> <NA> <NA>
2 5 OTTHUMT00000002840.1 <NA> <NA>
3 5 OTTHUMT00000002840.1 1 ENSE00001947070.1
ont default
1 <NA> 0.0464652956
2 <NA> 0.0468264249
3 <NA> 0.0002057613