Hi all,
I have a time-course experiment with two conditions, control and treatment, with three biological replicates for each time point except time 0 (control), which has two biological replicates. I'm using the HISAT, StringTie, Ballgown pipeline for data analysis, and I have a problem with Ballgown.
The ballgown directory contains 11 folders corresponding to the 11 samples (from the StringTie output) and the data_form.csv file. The names and order of the folders are the same as in the .csv file. There are 5 files ending in .ctab in each folder.
    setwd("E:/sequencing/ballgown")
    data_form = read.csv("data_form.csv")
    data_form

    ids       treatment  time
    sample1   control    0
    sample2   control    0
    sample3   drug       2
    sample4   drug       2
    sample5   drug       2
    sample6   drug       12
    sample7   drug       12
    sample8   drug       12
    sample9   drug       24
    sample10  drug       24
    sample11  drug       24

    sample = c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6", "sample7", "sample8", "sample9", "sample10", "sample11")

But with the command below, I get an error:

    bg = ballgown(dataDir = "ballgown", samplePattern = "sample", pData = data_form)
    Error in file(file, "rt") : invalid 'description' argument
I searched online and found similar errors, but the answers did not help me. Could you please help me solve this problem? I would also be very grateful if you could tell me how to define two biological replicates for the control (time 0) and three replicates for the other time points in Ballgown.
Thank you in advance.
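One thing that may be worth checking, sketched here under assumptions: `dataDir` is resolved relative to the current working directory, so after `setwd("E:/sequencing/ballgown")` the value `dataDir = "ballgown"` would point at `E:/sequencing/ballgown/ballgown`. If no sample folder is found there, there is nothing to read, which would be consistent with an error like the one above. The helper name `find_sample_dirs` is hypothetical and uses only base R; the demo folders are created in a temporary directory.

```r
# Hypothetical helper (not part of ballgown): list the sample folders that a
# given dataDir / samplePattern pair would match, and fail early if none exist.
find_sample_dirs <- function(dataDir, samplePattern) {
  if (!dir.exists(dataDir)) {
    stop("dataDir does not exist relative to getwd(): ", dataDir)
  }
  dirs <- list.files(dataDir, pattern = samplePattern, full.names = TRUE)
  dirs <- dirs[dir.exists(dirs)]  # keep folders only, drop plain files
  if (length(dirs) == 0) {
    stop("no folders matching '", samplePattern, "' in ", dataDir)
  }
  dirs
}

# Small self-contained demo in a temporary directory (illustrative layout).
root <- file.path(tempdir(), "sequencing")
for (s in paste0("sample", 1:3)) {
  dir.create(file.path(root, "ballgown", s), recursive = TRUE, showWarnings = FALSE)
}

# Correct path: the three sample folders are found.
found <- find_sample_dirs(file.path(root, "ballgown"), "sample")
basename(found)

# Doubled path (ballgown/ballgown): the helper stops with a clear message.
doubled <- try(find_sample_dirs(file.path(root, "ballgown", "ballgown"), "sample"),
               silent = TRUE)
```

Running the same check on the real folder before calling `ballgown()` shows whether the path is the issue.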
Thank you Jeff, yes, that was the problem. Now another error has appeared, even though the names and order of the sub-folders in the ballgown folder are the same as in the data_form.csv file and there are 5 files ending in .ctab in each sub-folder. The error is:
"Error in ballgown(dataDir = "ballgown", samplePattern = "sample", pData = data_form) :
first column of pData does not match the names of the folders containing the ballgown data.
In addition: Warning message:
In ballgown(dataDir = "ballgown", samplePattern = "sample", pData = data_form) :
Rows of pData did not seem to be in the same order as the columns of the expression data. Attempting to rearrange pData..."
When I ran the command below, it returned FALSE (not TRUE).
To solve the problem, I simply removed the "pData" part, which seemed to work. But,
It showed that the samples are not ordered. However, I'm concerned about the correct way to make the data_form.csv file. As I mentioned in my post, I have 2 biological replicates for the control (time 0, before applying the drug) and 3 biological replicates for 2, 12, and 24 hours after applying the drug. Could you please help me define the experiment in pData? I found
But I'm confused about how to use it for my experiment.
Thank you so much for your help.
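One likely cause of the ordering warning, sketched in base R: `list.files()` returns names in alphabetical order, so `sample10` and `sample11` sort before `sample2`, while a hand-written .csv usually lists them numerically. The `data_form` values below are copied from the first post; `folder_order` is only a stand-in for what `list.files("ballgown")` would return, and the reordering with `match()` is one possible fix, not necessarily what ballgown does internally.

```r
ids <- paste0("sample", 1:11)
data_form <- data.frame(
  ids = ids,
  treatment = c(rep("control", 2), rep("drug", 9)),
  time = c(0, 0, rep(c(2, 12, 24), each = 3)),
  stringsAsFactors = FALSE
)

# Stand-in for list.files("ballgown"): names sorted as text, not as numbers.
folder_order <- sort(ids, method = "radix")

# Reorder the rows so that data_form$ids lines up with the folder names.
data_form <- data_form[match(folder_order, data_form$ids), ]
rownames(data_form) <- NULL

# Check that the first column now matches the folder order.
ids_match <- all(data_form$ids == folder_order)
```

With the real data, `folder_order` would be replaced by the actual folder listing, and the reordered `data_form` passed as `pData`.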
Thank you for your quick response, Alyssa. In response to what you kindly suggested:
As I mentioned in the post, the sub-folders are ordered in the ballgown folder as sample1, sample2, sample3, ..., sample11 (that is, numerically, the same as in the data_form.csv file), so I don't understand why ballgown says they are not ordered. In the end, I changed the order of the samples in the data_form.csv file to sample1 sample10 sample11 sample2 sample3 sample4 ... sample9, which is the order that list.files("ballgown") showed, and this finally solved the problem. Now data_form is as shown below, which totally confused me.
I want to do differential expression analysis between control and drug at each time point, and also between the drug samples at the different time points (2, 12, and 24). Sorry, I have read the manual several times, but I still have a problem. Could you please help me correctly label the different conditions and biological replicates for statistical analysis?
According to the tutorial, the command for time course analysis is:
But I don't know the correct pData(bg) command for my experiment; please help me with this part, too. Regarding the second command, it is OK for my experiment except for "timecourse", which I should change from "TRUE" to "FALSE" since I have 4 time points (fewer than 5). Is that right? Also, can groups be defined for ballgown, or should I define a custom model? Unfortunately, I don't know how to do the latter.
Thanks a lot for your help in advance.
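For the pairwise comparisons asked about above, here is a minimal base R sketch that picks the samples for one contrast from the phenotype table. Note that in this design all controls are at time 0 and all drug samples are at later times, so control vs. drug at 2 h is the same contrast as time 0 vs. time 2. The helper name `pick_times` is hypothetical, and the commented `subset()` line assumes ballgown's `subset` method with a condition string; it has not been run against this data.

```r
data_form <- data.frame(
  ids = paste0("sample", 1:11),
  treatment = c(rep("control", 2), rep("drug", 9)),
  time = c(0, 0, rep(c(2, 12, 24), each = 3)),
  stringsAsFactors = FALSE
)

# Cross-tabulate to see which treatment/time combinations actually exist.
design <- table(data_form$treatment, data_form$time)

# Hypothetical helper: keep only the samples at the requested time points.
pick_times <- function(pheno, times) pheno[pheno$time %in% times, ]

ctrl_vs_2h <- pick_times(data_form, c(0, 2))       # control vs. drug at 2 h
drug_2h_vs_24h <- pick_times(data_form, c(2, 24))  # drug at 2 h vs. drug at 24 h

# With a ballgown object `bg` whose pData is data_form, the matching subset
# might look like this (assumption, untested here):
# bg_0_2 <- subset(bg, "time %in% c(0, 2)", genomesubset = FALSE)
```

Each two-group subset can then be tested separately, with `time` (or `treatment`) as the covariate.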
Seems like the ordering issue has been solved -- great!
"how to correctly label different conditions and biological replicates for statistical analysis" -- in this case, it looks to me like each sample you have now is a biological replicate (i.e., that you have 11 biological replicates). If this is not true, you will need to add two columns to your phenotype data frame, one listing the biological replicate ID and one listing the technical replicate ID, and you likely will need to build a more complex model (or average/summarize across technical replicates so you have one row per biological replicate). I assume the "condition" column is correct for treatments as well.
As for how to do the timecourse experiment, I suggest reading more about the "timecourse" option in ?stattest, as it will give you more details on what the timecourse option actually does. You don't need 5 timepoints, necessarily (I don't believe it says that anywhere in the tutorial). If you want to assess whether expression changes over time, you can use timecourse=TRUE.
"groups may be defined for ballgown or I should define custom model, which, unfortunately, I don't know how" -- I'm not totally sure what this means. You need to define your own comparison groups (ballgown cannot infer what you want to compare and what you want to adjust for). If you don't know how to define custom models, I'd recommend not doing so (that option is recommended for users that already know what they want their models to be).
Thank you, Alyssa for your response!
Actually, I have 2 biological replicates for control (a1 and a2) and three biological replicates for each treatment sample, labeled b1-b3, c1-c3, and d1-d3. I don't have any technical replicates. I added a "rep" column as below; please let me know if this is what you meant.
My main issue is how to define pData(bg) here. I found the command below, but it is not right for my case:
Could you please help me out on this issue?
Regarding time course experiment, manual says "The timecourse option assumes that "time" in your study is truly continuous, i.e., that it takes several values along a time scale. If you have very few timepoints (e.g., fewer than 5), we recommend treating time as a categorical variable, since having very few values does not give much granularity for fitting a smooth curve using splines. You can do this by setting covariate equal to 'time' (or whatever your time variable is named) and simply leaving timecourse as FALSE, its default. If you don't have more timepoints than degrees of freedom in the spline model, a warning will be printed and time will be coerced to categorical"
Here I have 4 time points (fewer than 5), so I am unsure whether timecourse=TRUE or timecourse=FALSE is right, based on the above explanation in the manual.
Thanks
(1) the "rep" column should probably just be the a, b, c, or d.
(2) the data frame you have listed (the one with "ids," "treatment," "time," and "rep") should be how you set pData. You can assign pData with "pData(bg) = data_form" (or whatever you've called that data frame).
(3) Ah, I see the explanation you've found in the manual! Thanks. As it says there, if you have fewer than 5 timepoints it's probably better to treat time as categorical. (timecourse=FALSE, which is the default). I believe it says this directly in the manual passage you provided.
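Putting points (1) to (3) together, a minimal sketch: the phenotype table is built in base R, and the ballgown calls are guarded so the block also runs where ballgown is not installed or no `bg` object exists. It assumes that `b`, `c`, and `d` correspond to the 2, 12, and 24 hour groups, that a ballgown object named `bg` has already been created, and that transcript-level FPKM is the measurement of interest (the `feature` and `meas` values are illustrative choices, not from this thread).

```r
data_form <- data.frame(
  ids = paste0("sample", 1:11),
  treatment = c(rep("control", 2), rep("drug", 9)),
  time = c(0, 0, rep(c(2, 12, 24), each = 3)),
  rep = c("a", "a", rep(c("b", "c", "d"), each = 3)),  # assumed mapping, see above
  stringsAsFactors = FALSE
)

# Replicate counts per time point: 2 at time 0, 3 at each later time.
reps_per_time <- table(data_form$time)

if (requireNamespace("ballgown", quietly = TRUE) && exists("bg")) {
  library(ballgown)
  # Put the rows in the same order as the samples stored in bg.
  data_form <- data_form[match(sampleNames(bg), data_form$ids), ]
  pData(bg) = data_form  # as suggested in (2)
  # Time treated as categorical, following the manual passage quoted above:
  # covariate = "time", timecourse left at its default of FALSE.
  results <- stattest(bg, feature = "transcript", meas = "FPKM",
                      covariate = "time", timecourse = FALSE)
}
```

The unequal replicate numbers (2 vs. 3) need no special handling in the table; each row is simply one biological replicate.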
Thank you, Alyssa. Another issue has now appeared, for which I created a new post.