Question

HISAT-StringTie-Ballgown, order of biological replicates in script design

0

Entering edit mode

Ir • 0

@ir-11113

Last seen 8.5 years ago

Hello,

This is my first time I am using the Ballgown package. Although the tutorial and paper provide the great explanation and support, I am getting the results that we do not expect. Therefore, I would like to clarify if my code is correct, as I can't find any mistake or incorrect value annotation. I load the data by providing a path; the pData, also looks good (pls see below). However, when it comes to the log-fold change we get an opposite result. Instead of seeing a decrease, we observe an increase. Another important thing to note is that I have been using an alternate differential expression analysis workflow (http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual), as we are not interested in novel isoforms (stringtie -merge steps was not used). I thought that the problem might be in the code design, but I can't find what is wrong.

Some additional question I would like to clarify:

I am doing case-control comparison (KO vs WT each has four biological replicates), do I understand correctly that stattest() use KO/WT for log-fold change calculation based on provided phenotype data (pData) and the order of loaded files (KO, WT or WT, KO) does not matter, as phenotype data basically describes the order?
When the number of biological replicates is not equal for example, two KOs and six WTs, would you advice to use limma ?
I would like to extract raw counts to build bedGraph files to see base-by-base coverage. Is there a way to do so, by using the ballgown package directly in R, or prepDE.py is the only option?

Thank you so much in advance.

# 1. Load data
bg = ballgown (samples= c ('path_to_folder/55219_KO/',
                           'path_to_folder/55221_KO/',
                           'path_to_folder/55223_KO/',
                           'path_to_folder/55233_KO/',
                           'path_to_folder/55225_WT/',
                           'path_to_folder/55227_WT/',
                           'path_to_folder/55229_WT/',
                           'path_to_folder/55231_WT/'), meas='all')
bg
ballgown instance with 110379 transcripts and 8 samples

sampleNames(bg)   # "55219_KO" "55221_KO" "55223_KO" "55233_KO" "55225_WT" "55227_WT" "55229_WT" "55231_WT"

# 2.  pData - phenotype (four KOs vs four WTs (biolog.replicates)):
pData(bg) = data.frame(id=sampleNames(bg), group=rep(c('KO', 'WT'), each=4))
> pData(bg)
id group
1 55219_KO    KO
2 55221_KO    KO
3 55223_KO    KO
4 55233_KO    KO
5 55225_WT    WT
6 55227_WT    WT
7 55229_WT    WT
8 55231_WT    WT

# 3. Remove law-abundance 
bg_filtered = subset( bg, "rowVars(texpr(bg)) >1",genomesubset = TRUE ) 

whole_tx_table1=texpr(bg, meas='all')
gene_IDs_names=unique(whole_tx_table1[,c(9,10)])

# 4.2 ADD gene names  two-group (e.g. case/control) comparisons
results_genes = stattest(bg_filtered, feature="gene", covariate="group", getFC=TRUE, meas="FPKM")
results_genes = merge(results_genes, gene_IDs_names, by.x=c("id"),by.y=c("gene_id"))
results_genes2=cbind(results_genes[,6],results_genes[,c(1,2,3,4,5)])
colnames(results_genes2)=c('gene_name','id','feature','fc','pval','qval')

# 4.3 SORT results from smallest p-value to the largest
results_genes_sorted=arrange(results_genes2, pval)

# 4.5 Q value < 0.05
results_genes1_q_val=subset(results_genes_sorted, results_genes_sorted$qval<0.05)

ballgown rna-seq • 2.7k views

ADD COMMENT • link updated 8.5 years ago by Alyssa Frazee ▴ 210 • written 8.5 years ago by Ir • 0

score 0 · Answer 1 · 2017-01-12

Hi there,

I hope we can provide some clarity here!

* "when it comes to the log-fold change we get an opposite result. Instead of seeing a decrease, we observe an increase" --> perhaps my earlier answer at How fold change is calculated in Ballgown? will clarify this difference? The differences reported are WT / KO. You can also verify the direction of the fold changes using the raw data (using the data in texpr(bg), you can graph the expression in both groups and see which group has a higher average expression).

* "stattest() use KO/WT for log-fold change calculation based on provided phenotype data" --> again please see How fold change is calculated in Ballgown?, the fold change is WT / KO.

* "When the number of biological replicates is not equal for example, two KOs and six WTs, would you advice to use limma ?" --> having unequal numbers of replicates isn't really a problem, but having a very small number of replicates (2) may be an issue. limma's empirical Bayes approach might be a good way to go there (since the information-sharing across genes might help stabilize the variance estimates), though 2 is a very small number for variance estimation even for limma.

* "I would like to extract raw counts to build bedGraph files to see base-by-base coverage. Is there a way to do so, by using the ballgown package directly in R, or prepDE.py is the only option?" --> hopefully https://github.com/alyssafrazee/ballgown#ballgown-readable-expression-output will help a bit: you can use `texpr(bg, 'cov') to get a matrix of transcript x average per-base coverage for that transcript (see the description of "cov"). That is as raw as the counting gets in ballgown, for transcripts. There are also a number of different count measurements for introns and exons (please see "rcount", "mcount", and "ucount". Depending on what you'd like to do, these may be useful to you. If you'd actually like per-base coverage from your assembly (structure + .sam/.bam files), ballgown is likely not the right tool for that (there are several other pieces of software designed to extract coverage from an assembly -- maybe samtools?)

Hope this helps!