My project's goal is to understand how DNA sequence specifies gene expression changes in a fungus under stress. There are two datasets: the count data contains the rna_count values, and the metadata contains information about the samples. The metadata consists of temperature (30 and 37), media (YPD, egta, cfw, cfw-egta), and strain (KN99 and flc1). The objectives are to explore the data, find genes that are differentially expressed across the relevant conditions, and find sequence motifs associated with the genes that are differentially expressed across the design conditions. My questions:
- The rna_count data is un-normalised. Should I normalise it before doing EDA? If I use the original counts, I can't compare across samples or across genes, right?
- If I have to normalise the data, there are many methods, such as log transformation, TPM, DESeq2, edgeR, and others. How do I justify choosing one of them? Is there an evaluation metric for deciding which method is best in my case?
- If I want to account for the variance between replicates, can I use DESeq2 or another package?
- If I use DESeq2, many experimental designs are possible. How do I decide which design is best, and what evaluation metric should I use?
- How do I find sequence motifs?
- Please provide a workflow from beginning to end for my case, because I am not a biologist and I have never used a biology package before.
Thanks