This is my first time I am using the Ballgown package. Although the tutorial and paper provide the great explanation and support, I am getting the results that we do not expect. Therefore, I would like to clarify if my code is correct, as I can't find any mistake or incorrect value annotation. I load the data by providing a path; the pData, also looks good (pls see below). However, when it comes to the log-fold change we get an opposite result. Instead of seeing a decrease, we observe an increase. Another important thing to note is that I have been using an alternate differential expression analysis workflow (http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual), as we are not interested in novel isoforms (stringtie -merge steps was not used). I thought that the problem might be in the code design, but I can't find what is wrong.
Some additional question I would like to clarify:
- I am doing case-control comparison (KO vs WT each has four biological replicates), do I understand correctly that stattest() use KO/WT for log-fold change calculation based on provided phenotype data (pData) and the order of loaded files (KO, WT or WT, KO) does not matter, as phenotype data basically describes the order?
- When the number of biological replicates is not equal for example, two KOs and six WTs, would you advice to use limma ?
- I would like to extract raw counts to build bedGraph files to see base-by-base coverage. Is there a way to do so, by using the ballgown package directly in R, or prepDE.py is the only option?
Thank you so much in advance.
# 1. Load data bg = ballgown (samples= c ('path_to_folder/55219_KO/', 'path_to_folder/55221_KO/', 'path_to_folder/55223_KO/', 'path_to_folder/55233_KO/', 'path_to_folder/55225_WT/', 'path_to_folder/55227_WT/', 'path_to_folder/55229_WT/', 'path_to_folder/55231_WT/'), meas='all') bg ballgown instance with 110379 transcripts and 8 samples
sampleNames(bg) # "55219_KO" "55221_KO" "55223_KO" "55233_KO" "55225_WT" "55227_WT" "55229_WT" "55231_WT" # 2. pData - phenotype (four KOs vs four WTs (biolog.replicates)): pData(bg) = data.frame(id=sampleNames(bg), group=rep(c('KO', 'WT'), each=4)) > pData(bg) id group 1 55219_KO KO 2 55221_KO KO 3 55223_KO KO 4 55233_KO KO 5 55225_WT WT 6 55227_WT WT 7 55229_WT WT 8 55231_WT WT # 3. Remove law-abundance bg_filtered = subset( bg, "rowVars(texpr(bg)) >1",genomesubset = TRUE ) whole_tx_table1=texpr(bg, meas='all') gene_IDs_names=unique(whole_tx_table1[,c(9,10)]) # 4.2 ADD gene names two-group (e.g. case/control) comparisons results_genes = stattest(bg_filtered, feature="gene", covariate="group", getFC=TRUE, meas="FPKM") results_genes = merge(results_genes, gene_IDs_names, by.x=c("id"),by.y=c("gene_id")) results_genes2=cbind(results_genes[,6],results_genes[,c(1,2,3,4,5)]) colnames(results_genes2)=c('gene_name','id','feature','fc','pval','qval') # 4.3 SORT results from smallest p-value to the largest results_genes_sorted=arrange(results_genes2, pval) # 4.5 Q value < 0.05 results_genes1_q_val=subset(results_genes_sorted, results_genes_sorted$qval<0.05)