How to extract phastcons score for all protein coding genes and lncRNAs using GenomicScores?
1
0
Entering edit mode
Biologist ▴ 90
@biologist-9801
Last seen 2.4 years ago

Hi all,

I'm actually looking for phastcons scores for both protein-coding genes and known lncRNAs from Gencode. Then I got to know that conservation scores can be extracted from the UCSC browser.

I found an R package GenomicScores to extract the phastcons scores.

In the tutorial of the GenomicScore I see that information can be exxtracted only by each chromosome location.

But I would like to know, whether there is any way to get the phastcons score for all protein coding genes and known lncRNAs from phastCons100way.UCSC.hg38

library(GenomicRanges)
library(phastCons100way.UCSC.hg38)
gsco <- phastCons100way.UCSC.hg38
gscores(gsco, GRanges(seqnames="chr7", IRanges(start=117232380, width=1)))

GRanges object with 1 range and 1 metadata column:
seqnames    ranges strand |   default
<Rle> <IRanges>  <Rle> | <numeric>
[1]     chr7 117232380      * |       0.5
-------
seqinfo: 1 sequence from Genome Reference Consortium GRCh38 genome; no seqlengths


Yes from the above example I'm able to get the phastcons score for the specific chromosome location. But I'm interested in extracting the information for all protein coding genes and known lncRNAs.

May I get some help here soon. thanks a lot in advance.

1
Entering edit mode

Is this also by you: https://www.biostars.org/p/427723/ ?

1
Entering edit mode

yes, I will close that in biostars

1
Entering edit mode
Robert Castelo ★ 2.9k
@rcastelo
Last seen 19 days ago
Barcelona/Universitat Pompeu Fabra

There are multiple ways in which you can do this, I'll show you one. first, you need to fetch the gene annotations you want. You've mentioned Gencode. One version of Gencode annotations is available in the TxDb.Hsapiens.UCSC.hg38.knownGene annotation package:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene


next, you need to fetch coordinates of the genes, you have not mentioned whether you want to calculate the average conservation spanning the gene boundaries or, for instance, the average conservation of the spliced exons. i'm going to assume you want the latter:

exonsbygene <- exonsBy(txdb, by="gene")
class(exonsbygene)
[1] "CompressedGRangesList"
attr(,"package")
[1] "GenomicRanges"
length(exonsbygene)
[1] 27363


these exons by gene coordinates are stored in a list-like data structure with 27,363 elements, thus indicating we have exon annotations for 27,363 genes. let's load now the annotation package with the phastCons conservation scores and check whether all sequence names (whole chromosomes and chromosome alternate sequences and patches in GRCh38) in the gene annotations are present in the phastCons package:

library(phastCons100way.UCSC.hg38)
phast <- phastCons100way.UCSC.hg38
[1] FALSE
[1] 595
[1] 140
[1] "chr1_KN196472v1_fix" "chr1_KN196473v1_fix" "chr1_KN196474v1_fix"
[4] "chr1_KN538360v1_fix" "chr1_KN538361v1_fix" "chr1_KQ031383v1_fix"


so, not all sequence names in the gene annotations are present in the phastCons package, concretely for 140 of the alternate and sequences and patches. we have to remove the annotations from those sequences because otherwise GenomicScores will prompt an error. we do this by restricting annotations to those occurring in "standard" sequence names, corresponding to whole chromosomes:

exonsbygene <- keepStandardChromosomes(exonsbygene, pruning.mode="coarse")
length(exonsbygene)
[1] 25906


so, we have reduced the number of gene annotations from 27,363 to 25,906. still, this like-like data structure cannot be fed directly into the gscores() function of the GenomicScores package because it expects a GRanges object. for this reason, we will first "unlist" this list-like data structure into a GRanges object:

allexons <- unlist(exonsbygene)
length(allexons)
[1] 521670
GRanges object with 6 ranges and 2 metadata columns:
seqnames            ranges strand |   exon_id   exon_name
<Rle>         <IRanges>  <Rle> | <integer> <character>
1    chr19 58345178-58347029      - |    573605        <NA>
1    chr19 58345183-58347029      - |    573606        <NA>
1    chr19 58346854-58347029      - |    573607        <NA>
1    chr19 58346858-58347029      - |    573608        <NA>
1    chr19 58346860-58347029      - |    573609        <NA>
1    chr19 58347353-58347634      - |    573610        <NA>
-------
seqinfo: 25 sequences (1 circular) from hg38 genome


now extract the average conservation scores for each of those exons and put them as an additional metadata column in the GRanges object (this took a couple of minutes in my laptop):

allexons$phast <- gscores(phast, allexons)$default
GRanges object with 6 ranges and 3 metadata columns:
seqnames            ranges strand |   exon_id   exon_name
<Rle>         <IRanges>  <Rle> | <integer> <character>
1    chr19 58345178-58347029      - |    573605        <NA>
1    chr19 58345183-58347029      - |    573606        <NA>
1    chr19 58346854-58347029      - |    573607        <NA>
1    chr19 58346858-58347029      - |    573608        <NA>
1    chr19 58346860-58347029      - |    573609        <NA>
1    chr19 58347353-58347634      - |    573610        <NA>
phast
<numeric>
1 0.0209503239740821
1 0.0204656199242014
1 0.0318181818181818
1 0.0325581395348837
1 0.0329411764705882
1  0.119148936170213
-------
seqinfo: 25 sequences (1 circular) from hg38 genome


if we want to have the conservation scores organized by gene as in the annotations, we can do the following:

exonsbygene <- relist(allexons, exonsbygene)


i'm hitting the 5000 character limit in my answer, so i can't give further details on how to go on from here.

0
Entering edit mode

Thank you very much @Robert Castelo.

To print number of exons for each gene I gave it like below:

head(sapply(exonsbygene,length))

1        10       100      1000 100009613 100009667
29         4        25        24         5         1


I see the ids instead how can I get transcript / gene name?

And May I know how can I save the final output in a dataframe?

1
Entering edit mode

hi, a much faster way to get the lengths of vector-like elements in a list is using the function lengths():

head(lengths(exonsbygene))
1        10       100      1000 100009613 100009667
29         4        25        24         5         1


regarding the ids, to get the gene HGNC symbol you can use the org.Hs.eg.db gene-centric annotation package, as follows:

library(org.Hs.eg.db)

exonsbygene2 <- exonsbygene
names(exonsbygene2) <- select(org.Hs.eg.db, keys=names(exonsbygene), columns="SYMBOL")$SY MBOL head(names(exonsbygene2)) [1] "A1BG" "NAT2" "ADA" "CDH2" "LINC02584" "POU5F1P5"  to get transcript-level information you need to modify the script upstream, replacing gene-level annotations by transcript-level annotations. this can be done using the argument by="tx" in the call to the function exonsBy(). please consult the manual page of exonsBy() if you want more details on this. regarding dumping the output into a data.frame object, it depends what do you want to put in there, let's say you want to have the gene symbol and the mean conservation by gene: dtf <- data.frame(GeneSymbol=names(exonsbygene2), Conservation=tapply(allexons$phast, names(allexons), mean),
stringsAsFactors=FALSE)
GeneSymbol Conservation
1               A1BG   0.06276106
10              NAT2   0.31622833
1000            CDH2   0.73190676
100009613  LINC02584   0.10303927
100009667   POU5F1P5   0.52204262


note that the rownames are the Entrez identfiers, you can write this data.frame object into a text file using the write.csv() function.

0
Entering edit mode

Hi Robert,

thanks a lot for the information. Need small help please. The above what you gave is phastCons conservation scores for the human genome (hg38) [ Gencode known coding and known lncRNAs] calculated from multiple alignments with other 99 vertebrate species. I see that it is exonsbygene.

I'm interested in calculating two conservation scores for each transcript, one was based on the average value of the PhastCons scores in the exonic regions, whereas the other was based on those in the intronic regions.

So, for this, I should use like below:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
exonsbytranscript <- exonsBy(txdb, by="tx")
intronsbytranscript <- intronsByTranscript(txdb)


Am I right the above way to start with to get the average value of the PhastCons scores in the exonic regions, whereas the other was based on those in the intronic regions?

Similarly, I have a FASTA file with human novel lncRNA transcript sequences and I wanted to calculate phastCons score for those lncRNAs.

Any idea how I can do that? please let me know. thanq

0
Entering edit mode

hi,

yes, so exonsBy(txdb, by="tx") and intronsByTranscript(txdb) will give you two list-like data structures, which in parallel will correspond to exons and introns of every transcript. you may want to have a link from transcripts to genes, which you can obtain with the function call transcriptsBy(txdb, by="gene"). you'll find two different identifiers associated with transcripts, tx_id and tx_name. here tx_id is a unique internal identifier while tx_name is the transcript identifier from the data source, in this case Gencode.

regarding your second question, you'll have to align in one way or another those transcript sequences in the FASTA file to the human genome in order to get genome coordinates, which are necessary to fetch the conservation scores. i don't know if there is a Bioconductor package for such a task, i would recommend post a new question for this, but outside Bioconductor i guess you could use BLAT.

0
Entering edit mode

Hi Robert,

Thanks a lot again for the reply. I did like below, but I see ann error in getting the transcript ids or names.

txdb2 <- makeTxDbFromGFF("gencode.v27.annotation.gtf", format="gtf", organism="Homo sapiens")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon.
This information was ignored.

exonsbytrans <- exonsBy(txdb2, by="tx")


To get the number of exons for each transscript

head(sapply(exonsbytrans,length))
1 2 3 4 5 6
3 6 3 2 1 1


Above I got the ids. So, to get the transcript ids I did like below, but there is an error.

library(org.Hs.eg.db)
exonsbytrans2 <- exonsbytrans
names(exonsbytrans2) <- select(org.Hs.eg.db, keys=names(exonsbytrans), columns="TXID")
Error in .testForValidCols(x, cols) :
Invalid columns: TXID. Please use the columns method to see a listing of valid arguments.


But for number of exons for each gene, I see the Gene ids.

exonsbygene <- exonsBy(txdb2, by="gene")
ENSG00000000003.14  ENSG00000000005.5 ENSG00000000419.12 ENSG00000000457.13 ENSG00000000460.16
20                 10                 27                 29                 70
ENSG00000000938.12
24

1
Entering edit mode

hi,

you're querying the wrong object. try to call the select() function on the txdb object. the TXID column only exists in TxDb.* packages, the gene-centric org.Hs.eg.db package has no information about transcripts.

0
Entering edit mode

Sure, but here may I know what is the number 1,2,3,4,5,6?

head(sapply(exonsbytrans,length))
1 2 3 4 5 6
3 6 3 2 1 1


Is it exon rank or id? Asking this because I have to mention the keytype in select()

exonsbytrans2 <- exonsbytrans
names(exonsbytrans2) <- select(txdb2, keys=names(exonsbytrans), columns="TXID")

Error in .select(x, keys, columns, keytype, ...) :
argument "keytype" is missing, with no default

1
Entering edit mode

if you do

lst <- list(a=1:3, b=4:5, c=6:10)
sapply(lst, length)
a b c
3 2 5


as you can see, you're getting a named vector, where values are the result of calling length() to each element of the list on which you do sapply(), while the names of the vector are the names of the elements of the list. so, in your case, the names of the vector are probably de internal transcript identifiers tx_id and the values of the vector are the number of exons by transcripts. if you call keytypes(txdb2) you'll get the available values for that argument and you'll see you probably need to set keytype="TXID".

0
Entering edit mode

I guess I'm doing something wrong here.

keytypes(txdb2)
[1] "CDSID"    "CDSNAME"  "EXONID"   "EXONNAME" "GENEID"   "TXID"     "TXNAME"


exonsbytrans

GRangesList object of length 200401:
$1 GRanges object with 3 ranges and 3 metadata columns: seqnames ranges strand | exon_id exon_name exon_rank <Rle> <IRanges> <Rle> | <integer> <character> <integer> [1] chr1 11869-12227 + | 1 ENSE00002234944.1 1 [2] chr1 12613-12721 + | 5 ENSE00003582793.1 2 [3] chr1 13221-14409 + | 8 ENSE00002312635.1 3$2
GRanges object with 6 ranges and 3 metadata columns:
seqnames      ranges strand | exon_id         exon_name exon_rank
[1]     chr1 12010-12057      + |       2 ENSE00001948541.1         1
[2]     chr1 12179-12227      + |       3 ENSE00001671638.2         2
[3]     chr1 12613-12697      + |       4 ENSE00001758273.2         3
[4]     chr1 12975-13052      + |       6 ENSE00001799933.2         4
[5]     chr1 13221-13374      + |       7 ENSE00001746346.2         5
[6]     chr1 13453-13670      + |       9 ENSE00001863096.1         6

\$3
GRanges object with 3 ranges and 3 metadata columns:
seqnames      ranges strand | exon_id         exon_name exon_rank
[1]     chr1 29554-30039      + |      10 ENSE00001947070.1         1
[2]     chr1 30564-30667      + |      13 ENSE00001922571.1         2
[3]     chr1 30976-31097      + |      14 ENSE00001827679.1         3

...
<200398 more elements>
-------
seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths


I gave the keytype="TXID"

exonsbytrans2 <- exonsbytrans
names(exonsbytrans2) <- select(txdb2, keys=names(exonsbytrans), columns = names(exonsbytrans), keytype = "TXID")
Error in .testForValidCols(x, cols) :
Invalid columns: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,2

1
Entering edit mode

you should read carefully the error message, which says "Error in .testForValidCols(x, cols) : Invalid columns", so while you may be giving a correct keytype argument, you are not giving a correct columns argument. i think you should more carefully read the help page and documentation about the select() function. in the columns argument you give the names of the columns you want to extract. to know the columns available you should type columns(txdb2). then, you should also be aware that select() returns a data.frame object and as such, you probably want to fetch the values of a particular column of that data.frame. i'm afraid all this has nothing to do with the original question about extracting phastCons scores, so if you still get stuck with managing data.frames and the select() interface, i strongly suggest you to post a new question, trying to specify as much as possible what is your goal and what is the step where you get stuck.