Question

Workflow to Analyse DNA Sequence specify gene expression in a fungus under stress

Ferdinand David • 0

The goal of my project is to understand how DNA sequence specifies gene expression changes in a fungus under stress. There are two datasets: the count data, which contains the RNA counts (rna_count), and the metadata, which describes the samples. The metadata consists of temperature (30 and 37), media (YPD, egta, cfw, cfw-egta), and strain (KN99 and flc1). The objectives are to explore the data, find genes that are differentially expressed across the relevant conditions, and find sequence motifs associated with those differentially expressed genes. My questions:

1. The rna_count data is un-normalised. When doing EDA, should I normalise the data or not? If we use the original data, we can't compare across samples or across genes, right?
2. If we have to normalise the data, there are many methods, such as log transformation, TPM, DESeq2, edgeR and others. What is the justification for choosing one method over another? Is there an evaluation metric to identify the best method in my case?
3. If we want to account for the variance between replicates, can we use DESeq2 or another package?
4. If we use DESeq2, many experimental designs are possible. How to decide the best design to use, and what is the evaluation metric?
5. How do I find sequence motifs?
6. Please provide the workflow from beginning to end for my case, because I am not a biologist and I have never used a biology package before.

Thanks

rnaseqGene • 853 views

updated 6 months ago by Michael Love 43k • written 6 months ago by Ferdinand David • 0

score 0 · Answer 1 · 2025-05-30

Check the rnaseqGene workflow, which addresses your Q1:

https://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#exploratory-analysis-and-visualization

The justification for our variance stabilizing transformation (VST) is in the paper and in the workflow.
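To make the idea of normalising before EDA a bit more concrete, here is a minimal sketch in plain Python. It is not the VST from the workflow: it computes median-of-ratios size factors (the approach DESeq2 uses for sequencing depth) and then applies a simple log2(x + 1), which is a cruder stand-in for the VST. The function names and the toy counts are made up for illustration only.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        # Only genes with a non-zero count in every sample have a
        # finite log geometric mean, so skip the others.
        if all(c > 0 for c in row):
            log_geo_mean = sum(math.log(c) for c in row) / n_samples
            for j, c in enumerate(row):
                log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


def log2_normalised(counts):
    """Divide by size factors, then log2(x + 1) (a simple stand-in for the VST)."""
    sf = size_factors(counts)
    return {
        gene: [math.log2(c / s + 1) for c, s in zip(row, sf)]
        for gene, row in counts.items()
    }


# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
toy = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 3],
}
sf = size_factors(toy)
norm = log2_normalised(toy)
print(sf)
print(norm["geneA"])
```

In the toy data the second sample gets a size factor twice that of the first, so after scaling geneA has the same value in both samples. The log2(x + 1) step still leaves more noise at low counts than the VST would, which is one reason the workflow recommends the VST for EDA.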

"How to decide the best design to use"

The choice of design is really motivated by the biology and by technical factors. Which factors affect expression? Those should be included. In some cases the experimental design means certain factors cannot be estimated, e.g. if condition is confounded with batch. But aside from confounded designs, you typically include the factors that are known to affect the measurements.
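As a rough illustration of the confounding point, the sketch below builds a treatment-coded design matrix from sample metadata and checks whether it has full column rank. If it does not, at least one factor cannot be estimated separately from the others; as far as I know, DESeq2 reports this situation as a model matrix that is not full rank. The `batch` column and the helper names are hypothetical and are not part of your metadata or of any package.

```python
from fractions import Fraction


def design_matrix(samples, factors):
    """Intercept plus dummy columns; the first level (sorted) is the reference."""
    columns = [[1] * len(samples)]
    for f in factors:
        levels = sorted({s[f] for s in samples})
        for level in levels[1:]:
            columns.append([1 if s[f] == level else 0 for s in samples])
    # Transpose so that each row corresponds to one sample.
    return [list(row) for row in zip(*columns)]


def rank(matrix):
    """Matrix rank by Gaussian elimination with exact fractions."""
    rows = [[Fraction(x) for x in r] for r in matrix]
    r = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def is_estimable(samples, factors):
    """True if every coefficient of the additive design can be estimated."""
    m = design_matrix(samples, factors)
    return rank(m) == len(m[0])


# Every strain appears at every temperature: both factors are estimable.
balanced = [
    {"strain": s, "temperature": t}
    for s in ("KN99", "flc1")
    for t in ("30", "37")
]
# Hypothetical batch that coincides exactly with strain: confounded.
confounded = [
    dict(s, batch="A" if s["strain"] == "KN99" else "B") for s in balanced
]

print(is_estimable(balanced, ["strain", "temperature"]))
print(is_estimable(confounded, ["strain", "temperature", "batch"]))
```

The check only tells you whether a design can be fitted at all. Which of the estimable factors to include is still a question of biology and of how the experiment was run.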