Hello,
This is my first time I am using the Ballgown package. Although the tutorial and paper provide the great explanation and support, I am getting the results that we do not expect. Therefore, I would like to clarify if my code is correct, as I can't find any mistake or incorrect value annotation. I load the data by providing a path; the pData, also looks good (pls see below). However, when it comes to the log-fold change we get an opposite result. Instead of seeing a decrease, we observe an increase. Another important thing to note is that I have been using an alternate differential expression analysis workflow (http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual), as we are not interested in novel isoforms (stringtie -merge steps was not used). I thought that the problem might be in the code design, but I can't find what is wrong.
Some additional question I would like to clarify:
- I am doing case-control comparison (KO vs WT each has four biological replicates), do I understand correctly that stattest() use KO/WT for log-fold change calculation based on provided phenotype data (pData) and the order of loaded files (KO, WT or WT, KO) does not matter, as phenotype data basically describes the order?
- When the number of biological replicates is not equal for example, two KOs and six WTs, would you advice to use limma ?
- I would like to extract raw counts to build bedGraph files to see base-by-base coverage. Is there a way to do so, by using the ballgown package directly in R, or prepDE.py is the only option?
Thank you so much in advance.
# 1. Load data
bg = ballgown (samples= c ('path_to_folder/55219_KO/',
'path_to_folder/55221_KO/',
'path_to_folder/55223_KO/',
'path_to_folder/55233_KO/',
'path_to_folder/55225_WT/',
'path_to_folder/55227_WT/',
'path_to_folder/55229_WT/',
'path_to_folder/55231_WT/'), meas='all')
bg
ballgown instance with 110379 transcripts and 8 samples
sampleNames(bg) # "55219_KO" "55221_KO" "55223_KO" "55233_KO" "55225_WT" "55227_WT" "55229_WT" "55231_WT"
# 2. pData - phenotype (four KOs vs four WTs (biolog.replicates)):
pData(bg) = data.frame(id=sampleNames(bg), group=rep(c('KO', 'WT'), each=4))
> pData(bg)
id group
1 55219_KO KO
2 55221_KO KO
3 55223_KO KO
4 55233_KO KO
5 55225_WT WT
6 55227_WT WT
7 55229_WT WT
8 55231_WT WT
# 3. Remove law-abundance
bg_filtered = subset( bg, "rowVars(texpr(bg)) >1",genomesubset = TRUE )
whole_tx_table1=texpr(bg, meas='all')
gene_IDs_names=unique(whole_tx_table1[,c(9,10)])
# 4.2 ADD gene names two-group (e.g. case/control) comparisons
results_genes = stattest(bg_filtered, feature="gene", covariate="group", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, gene_IDs_names, by.x=c("id"),by.y=c("gene_id"))
results_genes2=cbind(results_genes[,6],results_genes[,c(1,2,3,4,5)])
colnames(results_genes2)=c('gene_name','id','feature','fc','pval','qval')
# 4.3 SORT results from smallest p-value to the largest
results_genes_sorted=arrange(results_genes2, pval)
# 4.5 Q value < 0.05
results_genes1_q_val=subset(results_genes_sorted, results_genes_sorted$qval<0.05)

Thank you so much, Alyssa!