Most efficient way to remove genes that are in a string of another vector
@bioinformatics-11531

Hey!

I have a DESeq2 object (I have not run the DESeq pipeline yet).

Because I want to compare RNA-seq data from different sources, I want to remove the genes that are not protein-coding.

This is to get cleaner data (some of the RNA is rRNA).

So I downloaded the gene information for Homo sapiens from the NCBI FTP server.

Now I have the DESeq2 genes, for example:

assay(dds):

                    SRR001      SRR002     SRR003
ENSG00000000001       1111       2222       3333
ENSG00000000002          2          3          1
ENSG00000000003       2222       1111       1220

The data from NCBI:

#tax_id  dbXrefs                                                      gene-type
9606     MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726    protein-coding
9605     MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000003|HPRD:00726    nonprotein

 

Now what I want is an efficient way to remove the genes that are not protein-coding.

So: take the row name of a gene, find it in the dbXrefs string of the NCBI data, check whether that entry is protein-coding, and if it is not, remove that row.

I am used to Python, so this gives me a bit of a headache.

I hope someone can help me!

Regards,

Ben
r deseq2 • 1.1k views
@mikelove

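In R, you can extract parts of a string using a regular expression and the sub() function. Capturing "[^|]+" (everything up to the next "|") also works when the Ensembl entry is the last one in the string or is followed by several other entries.

e.g.:

> sub(".*Ensembl:([^|]+).*","\\1","MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726")
[1] "ENSG00000000002"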
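One thing to watch for: when the pattern does not match, sub() returns its input unchanged, so a dbXrefs string without an Ensembl entry would come back as the whole string rather than as a missing value. A minimal sketch of one way to guard against that, using only base R. The helper name extract_ensembl and the toy strings are made up for illustration and are not part of the NCBI file or of DESeq2:

```r
# Toy dbXrefs strings modelled on the NCBI example above (values are illustrative).
xrefs <- c(
  "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726",
  "HGNC:HGNC:7|Ensembl:ENSG00000000003",   # Ensembl entry is last, no trailing "|"
  "MIM:100000|HGNC:HGNC:9"                 # no Ensembl entry at all
)

# Hypothetical helper: returns the Ensembl gene ID, or NA when none is present.
extract_ensembl <- function(x) {
  has_id <- grepl("Ensembl:", x, fixed = TRUE)
  # Capture everything after "Ensembl:" up to the next "|" (or end of string).
  id <- sub(".*Ensembl:([^|]+).*", "\\1", x)
  # sub() leaves non-matching strings untouched, so blank those out explicitly.
  id[!has_id] <- NA_character_
  id
}

ensembl_ids <- extract_ensembl(xrefs)
print(ensembl_ids)
```

Returning NA for entries without an Ensembl ID means they can never accidentally match a row name of the dds later on.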

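Then you can use the merge() function to add the columns from your data.frame.

If you name that data.frame with NCBI information 'dat', then:

dat$gene <- sub(".*Ensembl:([^|]+).*","\\1",dat$dbXrefs)

rowdata <- merge(data.frame(gene=rownames(dds)), dat, all.x=TRUE)

# merge() can reorder rows, so put them back in the order of the dds rows
rowdata <- rowdata[match(rownames(dds), rowdata$gene), ]
rownames(rowdata) <- NULL

mcols(dds) <- DataFrame(mcols(dds), rowdata)

Now you should have merged the protein coding information onto the row metadata of the dds. Genes are the rows of the dds, so this information belongs in mcols(dds), not in colData(dds), which describes the samples. You should then double check that the rows are properly aligned, e.g. that mcols(dds)$gene is identical to rownames(dds).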
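To finish the job from the question (dropping the rows that are not protein-coding), a logical vector over the rows is enough. Below is a rough sketch in base R that uses a plain count matrix as a stand-in for assay(dds), so it runs without DESeq2 installed. The column name gene_type is an assumption about how the NCBI table was read in; adjust it to whatever your data.frame actually calls that column:

```r
# Stand-in for assay(dds): a count matrix with Ensembl IDs as row names.
counts <- matrix(
  c(1111, 2, 2222, 2222, 3, 1111, 3333, 1, 1220),
  nrow = 3,
  dimnames = list(
    c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003"),
    c("SRR001", "SRR002", "SRR003")
  )
)

# Stand-in for the NCBI table; 'gene_type' is an assumed column name.
dat <- data.frame(
  dbXrefs = c(
    "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726",
    "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000003|HPRD:00726"
  ),
  gene_type = c("protein-coding", "nonprotein"),
  stringsAsFactors = FALSE
)
dat$gene <- sub(".*Ensembl:([^|]+).*", "\\1", dat$dbXrefs)

# Look up the type of every count row; genes missing from 'dat' get NA.
gene_type <- dat$gene_type[match(rownames(counts), dat$gene)]

# %in% turns NA into FALSE, so unannotated genes are dropped as well.
keep <- gene_type %in% "protein-coding"

filtered <- counts[keep, , drop = FALSE]
print(filtered)

# With a real DESeqDataSet the same logical vector subsets the rows:
#   dds <- dds[keep, ]
```

Because match() and %in% are vectorised, this does one lookup over all genes instead of looping over rows, which is the efficient part. Whether genes that are absent from the NCBI table should be dropped or kept is a choice; here they are dropped.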
