I want to use ENCprime
and found an updated implementation in https://rdrr.io/bioc/coRdon/man/codonUsage.html
However, in the original implementation a background sequence set is required: https://github.com/jnovembre/ENCprime/blob/master/doc/ENCprimedoc.pdf
I would like to use my protein coding genes as the background sequences (i.e. ExBackgroundSeqs.fasta) and ribosomal proteins as my query sequences (e.g. ExSeqs.fasta).
I have my ribosomal proteins in a separate file. For example,
all_proteins.fasta
ribosomal_proteins.fasta
How can I do this using coRdon
's implementation?
Here's my attempt:
library(coRdon)
# Load in the sequences
> ribosomal_proteins = readSet(file="ribosomal.fasta")
> all_proteins = readSet(file="all_proteins.fasta")
# Create codonTable objects
> cT_ribosomal = codonTable(x=ribosomal_proteins)
> cT_all = codonTable(x=all_proteins)
# There are 10 ribosomal sequences
> length(cT_ribosomal@len)
[1] 10
# There are 500 background sequences
> length(cT_all@len)
[1] 500
# Use ENCprime using:
# * ribosomal sequences as the query seqeuences
# * all proteins as the background sequences
# * Genetic code 11
# 1. Am I doing this correctly?
> df_encprime = ENCprime(cTobject=cT_ribosomal, subsets=list(all_proteins=cT_all), self=TRUE, id_or_name2="11", ribosomal=FALSE)
# 2. What is the difference between self and all_proteins here?
# 3. Are these values ENCprime values? The original program has a lot of outputs (https://github.com/jnovembre/ENCprime/blob/master/doc/ENCprimedoc.pdf)
> head(df_encprime)
self all_proteins
[1,] 61.00000 61.00000
[2,] 50.35134 48.54671
[3,] 48.15243 38.09560
[4,] 61.00000 61.00000
[5,] 47.96728 46.81694
[6,] 61.00000 61.00000
# 4. There are 10 rows here which is the same number of ribosomal proteins?
> dim(df_encprime)
[1] 10 2
# 5. If I want to have the Ncp values of the ribosomal proteins using the rest of the proteins as background sequences, am I doing this correctly? If so, would I use the `self` or the `all_proteins` column?
When I actually run ENCprime
from https://github.com/jnovembre/ENCprime
ENCprime/bin/SeqCount -c ribosomal.fasta 10
ENCprime/bin/SeqCount -n all_proteins.fasta 500
ENCprime/bin/ENCprime ribosomal.fasta.codcnt all_proteins.fasta.acgtfreq 11 ExResults 1 -q
Here's the output:
Name Nc Ncp ScaledChi SumChi df p B_KM n_codons
contig_413247_104_232_-: 59.7500 61.0000 1.4217 58.2882 43 0.0599 0.9152 41
contig_413248_1_282_-: 33.0645 40.4597 0.8027 74.6483 43 0.0020 0.7128 93
contig_413249_1_378_-: 49.8205 28.5148 1.5820 197.7461 43 0.0000 1.1962 125
contig_413250_1_201_-: 44.6197 57.6868 0.7850 51.8128 43 0.1678 0.8350 66
contig_413251_1_338_-: 55.7272 45.8360 0.6035 66.9923 43 0.0110 0.5787 111
contig_413252_6_263_-: 47.1154 46.4176 0.6758 56.7674 43 0.0777 0.6558 84
contig_413253_1_212_+: 36.0447 49.1137 0.6980 48.1599 43 0.2720 0.7760 69
contig_413254_86_154_+: 61.0000 61.0000 1.6488 34.6249 43 0.8149 1.0468 21
contig_413255_1_209_-: 50.7438 54.6340 0.6736 46.4805 43 0.3310 0.6667 69
contig_413256_66_167_-: 61.0000 61.0000 1.1630 37.2152 43 0.7195 0.9912 32
Totals: 56.0227 52.0516 0.2115 150.3532 43 0.0000 0.4008 711
They are different so I don't think I'm doing this correctly.