Question

Most efficient way to remove genes that are in a string of another vector

bioinformatics ▴ 20

Hey!

I have a DESeq2 object (I have not run the DESeq pipeline yet).

Because I want to compare RNA-seq data from different sources, I want to remove the genes that are not protein-coding.

This is because I want cleaner data (some of the RNA is rRNA).

So I downloaded the gene information for Homo sapiens from the NCBI FTP server.

Now I have the DESeq2 genes, for example:

assay(dds):

                    SRR001      SRR002     SRR003
ENSG00000000001       1111       2222       3333
ENSG00000000002          2          3          1
ENSG00000000003       2222       1111       1220

The data from NCBI:

#tax_id        dbXrefs                                              gene-type

9606    MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726   protein-coding 
9605    MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000003|HPRD:00726   nonprotein

Now what I want is an efficient way to remove the genes that are not protein-coding.

So: take the rowname of a gene, find it in the dbXrefs string of the NCBI data, check whether it is protein-coding, and if it is not, remove that row.

I am used to Python, so this does give me a little headache.

I hope someone can help me!

Regards,

Ben

r deseq2 • 1.1k views

updated 8.2 years ago by Michael Love 43k • written 8.2 years ago by bioinformatics ▴ 20

score 1 · Answer 1 · 2016-10-11

In R, you can extract part of a string using a regular expression and the sub() function.

e.g.:

> sub(".*Ensembl:(.*)\\|.*","\\1","MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726")
[1] "ENSG00000000002"
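One caveat with this pattern: it relies on a "|" following the Ensembl entry, and the greedy (.*) runs up to the last "|" in the string. If the Ensembl ID is the last cross-reference, or more than one entry follows it, the result is not a clean ID. A slightly more defensive variant, as a rough sketch (the helper name extract_ensembl is made up for illustration, and it assumes human gene IDs of the form ENSG followed by digits):

```r
# Hypothetical helper: pull the Ensembl gene ID out of each dbXrefs string.
# Assumes IDs look like "ENSG" followed by digits (human Ensembl gene IDs).
extract_ensembl <- function(xrefs) {
  # TRUE where an Ensembl entry is present at all
  hit <- grepl("Ensembl:ENSG[0-9]+", xrefs)
  # Capture only the ID itself, regardless of what comes before or after it
  ids <- sub(".*Ensembl:(ENSG[0-9]+).*", "\\1", xrefs)
  # sub() returns the input unchanged when nothing matches, so blank those out
  ids[!hit] <- NA_character_
  ids
}

xrefs <- c(
  "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726",  # as in the post
  "MIM:1|Ensembl:ENSG00000000003",                              # Ensembl entry is last
  "MIM:2|HGNC:HGNC:7",                                          # no Ensembl entry
  "Ensembl:ENSG00000000004|A:1|B:2"                             # several entries follow
)
ids <- extract_ensembl(xrefs)
```

Returning NA for entries without an Ensembl ID makes them easy to spot and drop before any matching against rownames(dds).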

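Then you can use the merge() function to add the columns from your data.frame.

If you name that data.frame with NCBI information 'dat', then:

dat$gene <- sub(".*Ensembl:(.*)\\|.*","\\1",dat$dbXrefs)

rowdata <- merge(data.frame(gene=rownames(dds)), dat, all.x=TRUE)

# merge() sorts its result by 'gene', so restore the row order of dds
rowdata <- rowdata[match(rownames(dds), rowdata$gene), ]

# gene-level annotation goes in the row metadata (mcols); colData describes the samples
mcols(dds) <- DataFrame(mcols(dds), rowdata, row.names=rownames(dds))

Now you should have merged the protein-coding information onto the row metadata of the dds. You should then double check that the rows are properly aligned, e.g. that all(rownames(dds) == mcols(dds)$gene) is TRUE.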
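To get from the annotation to the actual filtering the question asks for, here is a minimal base-R sketch on toy data. It uses a plain count matrix in place of assay(dds), and the column name gene_type is an assumption (the post shows the header as "gene-type"; check what your file actually contains after reading it in). It uses match() rather than merge(), which is a swapped-in alternative that keeps the rows in the order of the count matrix.

```r
# Toy count matrix standing in for assay(dds); values taken from the question.
counts <- matrix(
  c(1111, 2, 2222,    # SRR001
    2222, 3, 1111,    # SRR002
    3333, 1, 1220),   # SRR003
  nrow = 3,
  dimnames = list(
    c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003"),
    c("SRR001", "SRR002", "SRR003")
  )
)

# Toy NCBI table; 'gene_type' is an assumed column name.
dat <- data.frame(
  dbXrefs = c("MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726",
              "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000003|HPRD:00726"),
  gene_type = c("protein-coding", "nonprotein"),
  stringsAsFactors = FALSE
)
dat$gene <- sub(".*Ensembl:(ENSG[0-9]+).*", "\\1", dat$dbXrefs)

# For each row of the counts, find the matching annotation row (NA if absent).
idx <- match(rownames(counts), dat$gene)
gene_type <- dat$gene_type[idx]

# Keep only genes annotated as protein-coding; unannotated genes are dropped too.
keep <- !is.na(gene_type) & gene_type == "protein-coding"
filtered <- counts[keep, , drop = FALSE]
```

With a real DESeqDataSet, the same logical vector built from rownames(dds) can be used to subset the object directly, as in dds <- dds[keep, ]. Whether genes missing from the NCBI table should be dropped or kept is a judgment call; this sketch drops them.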