Hey!
I have a DESeq object (I have not run the DESeq pipeline on it yet).
Because I want to compare RNA-seq data from different sources, I want to remove the genes that are not protein-coding.
The reason is that I want cleaner data (some of the RNA is rRNA).
So I downloaded the gene information for Homo sapiens from the NCBI FTP server.
Now I have the DESeq genes, for example:
assay(dds):
                  SRR001  SRR002  SRR003
ENSG00000000001     1111    2222    3333
ENSG00000000002        2       3       1
ENSG00000000003     2222    1111    1220
The data from NCBI:
#tax_id  dbXrefs                                                     gene-type
9606     MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726   protein-coding
9605     MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000003|HPRD:00726   nonprotein
Now what I want is an efficient way to remove the genes that are not protein-coding.
That is: take each gene's row name, find it in the dbXrefs string of the NCBI data, check whether its gene-type is protein-coding, and if not, remove that row.
I am used to Python, so doing this in R gives me a bit of a headache.
I hope someone can help me!
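To make the logic concrete, here is roughly what I have in mind, written as a small Python sketch rather than R. The dictionary and list are toy stand-ins for assay(dds) and the NCBI table, using the example values above. The function names are made up for illustration. One assumption I am not sure about: genes that do not appear in the NCBI table at all (like ENSG00000000001 here) are dropped as well.

```python
import re

# Toy stand-in for assay(dds): gene ID -> counts per sample (values from the example).
counts = {
    "ENSG00000000001": [1111, 2222, 3333],
    "ENSG00000000002": [2, 3, 1],
    "ENSG00000000003": [2222, 1111, 1220],
}

# Toy stand-in for the NCBI table: (tax_id, dbXrefs, gene-type).
ncbi_rows = [
    ("9606", "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000002|HPRD:00726", "protein-coding"),
    ("9605", "MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000000003|HPRD:00726", "nonprotein"),
]

# Pulls the Ensembl gene ID out of the pipe-separated dbXrefs string.
ENSEMBL_ID = re.compile(r"Ensembl:(ENSG\d+)")


def protein_coding_ids(rows):
    """Collect the Ensembl gene IDs whose gene-type is 'protein-coding'."""
    keep = set()
    for _tax_id, db_xrefs, gene_type in rows:
        if gene_type != "protein-coding":
            continue
        keep.update(ENSEMBL_ID.findall(db_xrefs))
    return keep


def filter_counts(count_table, keep):
    """Keep only rows whose gene ID is in `keep`.

    Assumption: genes missing from the NCBI table are dropped too.
    """
    return {gene: row for gene, row in count_table.items() if gene in keep}


keep = protein_coding_ids(ncbi_rows)
filtered = filter_counts(counts, keep)
print(filtered)  # {'ENSG00000000002': [2, 3, 1]}
```

Building the set of protein-coding IDs once and then doing a single membership test per row avoids searching the NCBI strings again for every gene, which is the part I would like to reproduce efficiently in R.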
Regards,
Ben