How to get evolutionary conservation scores for lncRNAs with the phastCons100way.UCSC.hg38 (3.7.1) R package
@35004539

Hi, I want to get evolutionary conservation scores for lncRNAs with the phastCons100way.UCSC.hg38 (3.7.1) R package. I have already manually downloaded the lncRNA annotation GTF file from GENCODE (v34, GRCh38) and extracted the lncRNA exons.


> library(dplyr)
> gtf_data_df <- as.data.frame(gtf_data)
> # Filter for long non-coding RNA exons
> gtf_data_df_final <- gtf_data_df %>% filter(gene_type == "lncRNA" & type == "exon")
> head(gtf_data_df_final)
  seqnames start   end width strand source type score phase           gene_id
1     chr1 29554 30039   486      + HAVANA exon    NA    NA ENSG00000243485.5
2     chr1 30564 30667   104      + HAVANA exon    NA    NA ENSG00000243485.5
3     chr1 30976 31097   122      + HAVANA exon    NA    NA ENSG00000243485.5
4     chr1 30267 30667   401      + HAVANA exon    NA    NA ENSG00000243485.5
5     chr1 30976 31109   134      + HAVANA exon    NA    NA ENSG00000243485.5
6     chr1 35721 36081   361      - HAVANA exon    NA    NA ENSG00000237613.2
  gene_type   gene_name level    hgnc_id   tag          havana_gene
1    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
2    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
3    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
4    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
5    lncRNA MIR1302-2HG     2 HGNC:52482 basic OTTHUMG00000000959.1
6    lncRNA     FAM138A     2 HGNC:32334 basic OTTHUMG00000000960.1
      transcript_id transcript_type transcript_name transcript_support_level
1 ENST00000473358.1          lncRNA MIR1302-2HG-202                        5
2 ENST00000473358.1          lncRNA MIR1302-2HG-202                        5
3 ENST00000473358.1          lncRNA MIR1302-2HG-202                        5
4 ENST00000469289.1          lncRNA MIR1302-2HG-201                        5
5 ENST00000469289.1          lncRNA MIR1302-2HG-201                        5
6 ENST00000417324.1          lncRNA     FAM138A-201                        1
     havana_transcript exon_number           exon_id  ont
1 OTTHUMT00000002840.1           1 ENSE00001947070.1 <NA>
2 OTTHUMT00000002840.1           2 ENSE00001922571.1 <NA>
3 OTTHUMT00000002840.1           3 ENSE00001827679.1 <NA>
4 OTTHUMT00000002841.1           1 ENSE00001841699.1 <NA>
5 OTTHUMT00000002841.1           2 ENSE00001890064.1 <NA>
6 OTTHUMT00000002842.1           1 ENSE00001656588.1 <NA>

sessionInfo( )
txdbmaker GenomicScores phastCons100way.UCSC.hg38 GenomicFeatures
@james-w-macdonald-5106

It's just a GScores object, which you can use like any other S4Vectors-based object. You extract scores with the gscores function, which takes a GRanges object as its second argument (you apparently had one before converting to a data.frame, which is not something you normally want to do).

Something like this should work

library(GenomicScores)
library(phastCons100way.UCSC.hg38)
gtf_data_lncrna <- subset(gtf_data, gene_type == "lncRNA" & type == "exon")
gtf_data_lncrna$conservation <- gscores(phastCons100way.UCSC.hg38, gtf_data_lncrna)$default

You tagged GenomicScores and GenomicFeatures, so I would imagine you already understand how GRanges and other S4Vectors-based objects work, but if not, you should read the vignettes for those packages and for GenomicRanges as well.
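If you then want one score per lncRNA transcript rather than one per exon, one option is a width-weighted mean of the per-exon values. The sketch below is base R only and rests on a few assumptions: that the conservation column holds the mean score of each exon (which, as far as I know, is what gscores returns for multi-base ranges by default), and that you have a data.frame with transcript_id, width and conservation columns. The function name transcript_conservation is hypothetical, and the conservation values in the example are made up for illustration; only the IDs and widths are taken from the output in the question.

```r
# Hypothetical helper: collapse per-exon mean scores into one
# width-weighted mean per transcript. Exons with an NA score are dropped,
# so a transcript whose exons are all NA will not appear in the result.
transcript_conservation <- function(exons) {
  exons <- exons[!is.na(exons$conservation), , drop = FALSE]
  id <- as.character(exons$transcript_id)
  weighted_sum <- tapply(exons$conservation * exons$width, id, sum)
  total_width <- tapply(exons$width, id, sum)
  data.frame(
    transcript_id = names(total_width),
    scored_width = as.vector(total_width),            # bases that had a score
    conservation = as.vector(weighted_sum / total_width),
    stringsAsFactors = FALSE
  )
}

# Example input: IDs and widths from the question, scores are invented.
exons <- data.frame(
  transcript_id = c(rep("ENST00000473358.1", 3),
                    rep("ENST00000469289.1", 2),
                    "ENST00000417324.1"),
  width = c(486, 104, 122, 401, 134, 361),
  conservation = c(0.10, 0.30, 0.00, 0.20, 0.40, NA)  # made-up values
)

print(transcript_conservation(exons))
```

With the real data you could pass as.data.frame(gtf_data_lncrna) after the gscores step, since that table already carries the width and transcript_id columns. Whether a width-weighted mean is the right summary for your analysis is a judgment call; a median or the fraction of bases above some threshold would be equally defensible.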

As an aside, you are mixing Ensembl and UCSC data here, and that might be OK or it might not be. There are fundamental differences between how the two annotation services (EBI/EMBL and NCBI) identify things (they have worked together for like the last five years or so just to come up with a single transcript per gene in human that they can agree on, so...). You might be better off getting the lncRNA locations from UCSC as well.
