Merging annotation with differential expression output
AL • 0
@38b25ea9
Last seen 4 days ago
Japan

I have an output file containing upregulated genes from a non-model organism, ordered by adjusted p-value:

geneID
Rp.chr4.1864
Rp.chr1.1957
Rp.chr4.2000
Rp.chrX.1597
Rp.chr4.1782
Rp.chr4.1865

and a second file containing the same gene IDs along with their best hits from different databases:

geneID  Nr  Nt  SwissProt   KOG eggNOG  Interpro    GO  KEGG
Rp.chr1.0001    protein BUD31 homolog   PREDICTED: Megachile rotundata protein BUD31 homolog (LOC100880403), transcript variant X3, mRNA    Protein BUD31 homolog   KOG3404: G10 protein/predicted nuclear transcription regulator  G10 protein IPR001748: G10 protein; IPR018230: BUD31/G10-related, conserved site    GO:0000398: mRNA splicing, via spliceosome; GO:0005634: nucleus; GO:0010467: gene expression    K12873: BUD31,G10;bud site selection protein 31
Rp.chr1.0002    putative ATP synthase subunit f, mitochondrial  Riptortus pedestris mRNA for conserved hypothetical protein, complete cds, sequence id: Rped-0111   Putative ATP synthase subunit f, mitochondrial  KOG4092: Mitochondrial F1F0-ATP synthase, subunit f Mitochondrial F1F0-ATP synthase, subunit f  IPR019344: Mitochondrial F1-F0 ATP synthase subunit F, predicted    GO:0000276: mitochondrial proton-transporting ATP synthase complex, coupling factor F(o); GO:0005622: intracellular; GO:0005623: cell; GO:0005737: cytoplasm; GO:0005739: mitochondrion; GO:0005740: mitochondrial envelope; GO:0005743: mitochondrial inner membrane; GO:0005753: mitochondrial proton-transporting ATP synthase complex; GO:1902600: proton transmembrane transport   K02130: ATPeF0F,ATP5J2;F-type H+-transporting ATPase subunit f
Rp.chr1.0003    hypothetical protein EVAR_64278_1   PREDICTED: Bombyx mandarina uncharacterized LOC114246253 (LOC114246253), transcript variant X2, mRNA    -   -   DNA helicase activity   -   -   -
Rp.chr1.0005    Retrovirus-related Pol polyprotein from type-1 retrotransposable element R1 2   -   -   -   Reverse transcriptase (RNA-dependent DNA polymerase)    IPR000477: Reverse transcriptase domain -   -
Rp.chr1.0006    -   -   -   -   -   IPR005135: Endonuclease/exonuclease/phosphatase; IPR036691: Endonuclease/exonuclease/phosphatase superfamily    -   -
Rp.chr1.0007    piggyBac transposable element-derived protein 4-like; hypothetical protein AGLY_017479  -   -   -   DDE superfamily endonuclease    IPR029526: PiggyBac transposable element-derived protein    -   -
Rp.chr1.0008    hypothetical protein GE061_11589    -   -   -       -   -   -

I want a command that selects all upregulated genes from file1 and outputs the annotation from file2 next to the matching geneID, e.g.:

geneID  Nr  Nt  SwissProt   KOG eggNOG  Interpro    GO  KEGG
Rp.chr4.1864    hexamerin   Riptortus clavatus mRNA for cyanoprotein alpha subunit precursor, complete cds  -   -   Hemocyanin, all-alpha domain    IPR000896: Hemocyanin/hexamerin middle domain; IPR005203: Hemocyanin, C-terminal; IPR005204: Hemocyanin, N-terminal; IPR008922: Uncharacterised domain, di-copper centre; IPR013788: Hemocyanin/hexamerin; IPR014756: Immunoglobulin E-set; IPR036697: Hemocyanin, N-terminal domain superfamily; IPR037020: Hemocyanin, C-terminal domain superfamily  -   -
Rp.chr1.1957    cuticle protein 7-like  -   Cuticle protein 19  -   pupal cuticle protein   IPR000618: Insect cuticle protein   GO:0005576: extracellular region; GO:0007275: multicellular organism development; GO:0008010: structural constituent of chitin-based larval cuticle; GO:0031012: extracellular matrix; GO:0040003: chitin-based cuticle development -
Rp.chr4.2000    prophenoloxidase    PREDICTED: Acyrthosiphon pisum phenoloxidase 1 (LOC100160034), mRNA Hemocyanin F chain; Phenoloxidase 1 -   Common central domain of tyrosinase IPR000896: Hemocyanin/hexamerin middle domain; IPR002227: Tyrosinase copper-binding domain; IPR005203: Hemocyanin, C-terminal; IPR005204: Hemocyanin, N-terminal; IPR008922: Uncharacterised domain, di-copper centre; IPR013788: Hemocyanin/hexamerin; IPR014756: Immunoglobulin E-set; IPR036697: Hemocyanin, N-terminal domain superfamily; IPR037020: Hemocyanin, C-terminal domain superfamily GO:0004503: monophenol monooxygenase activity; GO:0005576: extracellular region; GO:0005615: extracellular space; GO:0006583: melanin biosynthetic process from tyrosine; GO:0035011: melanotic encapsulation of foreign target; GO:0036263: L-DOPA monooxygenase activity; GO:0036264: dopamine monooxygenase activity; GO:0042417: dopamine metabolic process; GO:0050830: defense response to Gram-positive bacterium; GO:0050832: defense response to fungus; GO:0055114: oxidation-reduction process   -
Rp.chrX.1597    chitooligosaccharidolytic beta-N-acetylglucosaminidase isoform X1   Riptortus pedestris mRNA for beta-hexosaminidase, partial cds, sequence id: Rped-0394, expressed in midgut  Probable beta-hexosaminidase fdl; Chitooligosaccharidolytic beta-N-acetylglucosaminidase    KOG2499: Beta-N-acetylhexosaminidase    beta-acetyl hexosaminidase like IPR015883: Glycoside hydrolase family 20, catalytic domain; IPR017853: Glycoside hydrolase superfamily; IPR025705: Beta-hexosaminidase; IPR029018: Beta-hexosaminidase-like, domain 2; IPR029019: Beta-hexosaminidase, eukaryotic type, N-terminal  GO:0005623: cell; GO:0005886: plasma membrane; GO:0005975: carbohydrate metabolic process; GO:0006032: chitin catabolic process; GO:0006491: N-glycan processing; GO:0006517: protein deglycosylation; GO:0016063: rhodopsin biosynthetic process; GO:0016231: beta-N-acetylglucosaminidase activity; GO:0048069: eye pigmentation; GO:0071944: cell periphery  K12373: HEXA_B;hexosaminidase [EC:3.2.1.52]
Rp.chr4.1782    hypothetical protein GE061_16316    -   -   -       -   -   -
Rp.chr4.1865    hexamerin   Riptortus clavatus mRNA for cyanoprotein beta subunit precursor, complete cds   -   -   Hemocyanin, all-alpha domain    IPR000896: Hemocyanin/hexamerin middle domain; IPR005203: Hemocyanin, C-terminal; IPR005204: Hemocyanin, N-terminal; IPR008922: Uncharacterised domain, di-copper centre; IPR013788: Hemocyanin/hexamerin; IPR014756: Immunoglobulin E-set; IPR036697: Hemocyanin, N-terminal domain superfamily; IPR037020: Hemocyanin, C-terminal domain superfamily  -   -

I've tried several options in WSL and R, such as join, awk, and grep, or something like:

comm -1 -3 <(sort gene_list.csv) <(sort upregulated_genes.csv) > upreg_genes.csv

or

df1<- read.csv("gene_list.csv")
df2<- read.csv("upregulated_genes.csv")

exporttab <- merge(x=df1, y=df2, by.x='geneID', by.y='gene_list', fill=-9999)

write.csv(exporttab, "known_genes.csv", row.names=FALSE)

However, I can't get anything to work and have run out of options online. Please help.

annotation RNASeqData • 126 views
@james-w-macdonald-5106
Last seen 1 day ago
United States

This is just a general R question that you should be able to answer easily by using Google. Or you could ask on Stack Overflow, Biostars, or R-help, which are more general forums than this one (which is intended for Bioconductor-specific questions).

But anyway, it's a simple matter of matching.

> tomatch <- data.frame(ID = LETTERS)
> matchable <- data.frame(ID = sample(LETTERS, 26), prot = paste0("protein", 1:26))
> matchable
   ID      prot
1   E  protein1
2   A  protein2
3   F  protein3
4   O  protein4
5   U  protein5
6   V  protein6
7   M  protein7
8   K  protein8
9   S  protein9
10  I protein10
11  P protein11
12  G protein12
13  Z protein13
14  L protein14
15  W protein15
16  J protein16
17  N protein17
18  B protein18
19  Y protein19
20  D protein20
21  X protein21
22  Q protein22
23  C protein23
24  H protein24
25  T protein25
26  R protein26
> matched <- data.frame(ID = tomatch[,1], prot = matchable[match(tomatch[,1], matchable[,1]),2])
> matched
   ID      prot
1   A  protein2
2   B protein18
3   C protein23
4   D protein20
5   E  protein1
6   F  protein3
7   G protein12
8   H protein24
9   I protein10
10  J protein16
11  K  protein8
12  L protein14
13  M  protein7
14  N protein17
15  O  protein4
16  P protein11
17  Q protein22
18  R protein26
19  S  protein9
20  T protein25
21  U  protein5
22  V  protein6
23  W protein15
24  X protein21
25  Y protein19
26  Z protein13

You could use merge in this case as well, but then you may need to use match to reorder, so I mostly use match directly.
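Applied to the two tables in the question, the same idea might look like the sketch below. The small data frames are made-up stand-ins for the real files, and `Rp.chr4.1999` is a hypothetical ID added only to show that unlisted genes are dropped. The pasted annotation table looks tab-separated (an assumption), in which case `read.delim()` would probably be a better fit than `read.csv()`.

```r
# Toy stand-ins for the two files. With the real data you would likely read
# them with read.delim(path, quote = "", stringsAsFactors = FALSE), assuming
# the files are tab-separated.
upregulated <- data.frame(
  geneID = c("Rp.chr4.1864", "Rp.chr1.1957", "Rp.chr4.2000"),
  stringsAsFactors = FALSE
)
annotation <- data.frame(
  geneID = c("Rp.chr1.1957", "Rp.chr4.1864", "Rp.chr4.1999", "Rp.chr4.2000"),
  Nr = c("cuticle protein 7-like", "hexamerin", "-", "prophenoloxidase"),
  stringsAsFactors = FALSE
)

# match() gives, for each upregulated ID, its row number in the annotation
# table (NA if the ID has no annotation row).
idx <- match(upregulated$geneID, annotation$geneID)

# Indexing with idx keeps the order of the upregulated list,
# i.e. the adjusted p-value ordering.
annotated <- annotation[idx, , drop = FALSE]
annotated$geneID <- upregulated$geneID  # keep the ID even for unmatched genes
rownames(annotated) <- NULL

# Alternative: merge() sorts by the key, so match() is used to restore order.
merged <- merge(upregulated, annotation, by = "geneID", all.x = TRUE)
merged <- merged[match(upregulated$geneID, merged$geneID), ]
rownames(merged) <- NULL

print(annotated)
```

The result could then be written out with something like `write.table(annotated, "annotated_upregulated.tsv", sep = "\t", quote = FALSE, row.names = FALSE)`, where the output file name is just a placeholder.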
