Question

The common error with ballgown has not been solved!

Sara ▴ 10

@sara-9865

Last seen 17 months ago

Germany

Hi all,

I have a time course experiment with two conditions, control and treatment, with three biological replicates for each time point except time 0 (the control condition), which has two biological replicates. I'm using the HISAT, StringTie, ballgown pipeline for data analysis, and I have a problem with ballgown.

The ballgown directory contains 11 folders corresponding to the 11 samples (from the StringTie output) and the data_form.csv file. The names and order of the folders are the same as in the .csv file. There are 5 files ending in .ctab in each folder.

setwd ("E:/sequencing/ballgown")
data_form = read.csv ("data_form.csv")
data_form

ids    treatment    time
sample1    control    0
sample2    control    0
sample3    drug    2
sample4    drug    2
sample5    drug    2
sample6    drug    12
sample7    drug    12
sample8    drug    12
sample9    drug    24
sample10    drug    24
sample11    drug    24

sample=c("sample1","sample2","sample3","sample4","sample5","sample6","sample7","sample8","sample9","sample10","sample11")

But with the below command, I get the error:
bg = ballgown(dataDir = "ballgown", samplePattern = "sample", pData = data_form)
Error in file(file, "rt") : invalid 'description' argument

I searched the net and found similar errors, but the answers were not helpful for me. Could you please help me solve this problem? I would also be very grateful if you could tell me how to define two biological replicates for control (time 0) and three replicates for the other time points in ballgown.

Thank you in advance.

ballgown software error • 3.7k views

ADD COMMENT • link updated 7.2 years ago by Jeff Leek ▴ 650 • written 7.2 years ago by Sara ▴ 10

score 0 · Answer 1 · 2017-02-10

Jeff Leek ▴ 650

@jeff-leek-5015

Last seen 3.2 years ago

United States

Sara,

I think that part of the problem might be this line:

setwd("E:/sequencing/ballgown")

You need to set the directory to:

setwd("E:/sequencing")

because ballgown is looking for the "ballgown" directory inside "E:/sequencing/ballgown" when you run the code:

bg = ballgown(dataDir = "ballgown", samplePattern = "sample", pData = data_form)

But there is no "ballgown" directory inside of "E:/sequencing/ballgown", right?

Jeff
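
A minimal base-R sketch of how one might check this before calling ballgown(). The helper name `check_data_dir` is made up for illustration and is not part of ballgown; the paths in the usage comments are the ones from this thread.

```r
# Hypothetical helper (not part of ballgown): report whether a relative
# dataDir can be found from the current working directory.
check_data_dir <- function(data_dir, pattern = "sample") {
  full <- file.path(getwd(), data_dir)
  if (!dir.exists(data_dir)) {
    message("Not found: ", full)
    return(character(0))
  }
  # Entries whose names contain the pattern (the thread uses
  # samplePattern = "sample")
  samples <- list.files(data_dir, pattern = pattern)
  message(length(samples), " sample folder(s) under ", full)
  samples
}

# Usage with the layout described in this thread (paths assumed from the posts):
# setwd("E:/sequencing")
# check_data_dir("ballgown")
# data_form <- read.csv(file.path("ballgown", "data_form.csv"))
```

Note that once the working directory is moved up one level, the CSV has to be read with its folder prefix, as in the last commented line.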

ADD COMMENT • link 7.2 years ago Jeff Leek ▴ 650

Thank you Jeff, yes, that was the problem. Now another error has appeared, even though the names and order of the sub-folders in the ballgown folder are the same as in the data_form.csv file and there are 5 files ending in .ctab in each sub-folder. I get this error:

"Error in ballgown(dataDir = "ballgown", samplePattern = "sample", pData = data_form) :
first column of pData does not match the names of the folders containing the ballgown data.
In addition: Warning message:
In ballgown(dataDir = "ballgown", samplePattern = "sample", pData = data_form) :
Rows of pData did not seem to be in the same order as the columns of the expression data. Attempting to rearrange pData..."

and when I used the command below, it returned FALSE (not TRUE):

all(data_form$ids == list.files("ballgown"))

To work around the problem, I simply removed the "pData" part, which seems OK. But,

head(gene_expression)

FPKM.sample1 "FPKM.sample10" "FPKM.sample11" "FPKM.sample2" "FPKM.sample3" "FPKM.sample4" "FPKM.sample5" "FPKM.sample6" "FPKM.sample7" "FPKM.sample8" "FPKM.sample9"
EPlHVUG00000000002 0 0 0 0 0 0 0 0 0 0 0
EPlHVUG00000000003 0 0 0 0 0 0 0 0 0 0 0

As this shows, the samples are not in numerical order. However, I'm concerned about the correct way to make the data_form.csv file. As I mentioned in my post, I have 2 biological replicates for control (time 0, before applying the drug) and 3 biological replicates for 2, 12, and 24 hours after applying the drug. Could you please help me define the experiment in pData? I found

pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))

But I'm confused about how to use it for my experiment.

Thank you so much for your help.

ADD REPLY • link 7.2 years ago Sara ▴ 10

I'd recommend checking the difference between `data_form$ids` and `list.files("ballgown")` -- it's telling you they're not all equal, so you should check to see what the differences are. A guess I have is that the files in your directory are ordered alphabetically, e.g.: "sample1", "sample10", "sample11", "sample2",... while your CSV has them ordered not strictly alphabetically, but numerically by sample number. (But I can't be sure, you'll need to check yourself.)

Biological replicates do not need to be placed next to each other in the data; labeling them correctly is sufficient for correct statistical analysis.
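
One way to run that check and line the two up, sketched in base R. The folder names and the phenotype table are typed in from this thread so the example is self-contained; in practice `folders` would come from `list.files("ballgown")` and `data_form` from `read.csv()`.

```r
# Folder names in the alphabetical order that list.files() returns
folders <- c("sample1", "sample10", "sample11", "sample2", "sample3", "sample4",
             "sample5", "sample6", "sample7", "sample8", "sample9")

# Phenotype table in numeric sample order, as in the original data_form.csv
data_form <- data.frame(
  ids       = paste0("sample", 1:11),
  treatment = c(rep("control", 2), rep("drug", 9)),
  time      = c(0, 0, rep(c(2, 12, 24), each = 3)),
  stringsAsFactors = FALSE
)

# Are the same names present on both sides? Both results should be empty.
setdiff(data_form$ids, folders)
setdiff(folders, data_form$ids)

# Positions at which the two orderings disagree
which(data_form$ids != folders)

# Reorder the rows of data_form to follow the folder order,
# keeping each sample's treatment and time attached to it
data_form <- data_form[match(folders, data_form$ids), ]
rownames(data_form) <- NULL
all(data_form$ids == folders)
```

Reordering with `match()` moves whole rows, so the labels stay with their samples and the CSV file itself does not have to be edited by hand.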

ADD REPLY • link 7.2 years ago Alyssa Frazee ▴ 210

Thank you for your quick response, Alyssa. In response to what you kindly suggested:

> data_form$ids
[1] sample1  sample10 sample11 sample2  sample3  sample4  sample5  sample6
[9] sample7  sample8  sample9
11 Levels: sample1 sample10 sample11 sample2 sample3 sample4 ... sample9

and

> list.files("ballgown")
 [1] "sample1"  "sample10" "sample11" "sample2"  "sample3"  "sample4" 
 [7] "sample5"  "sample6"  "sample7"  "sample8"  "sample9"

As I mentioned in the post, the sub-folders in the ballgown folder appear in numerical order (sample1, sample2, sample3, ..., sample11), the same as in the data_form.csv file, so I don't understand why ballgown says they are not in the same order. In the end, I changed the order of the samples in the data_form.csv file to sample1, sample10, sample11, sample2, sample3, sample4, ..., sample9, matching what list.files("ballgown") shows, and this problem was solved. Now data_form looks like the table below, which confuses me.

        ids treatment time
1   sample1   control    0
2  sample10      drug   24
3  sample11      drug   24
4   sample2   control    0
5   sample3      drug    2
6   sample4      drug    2
7   sample5      drug    2
8   sample6      drug   12
9   sample7      drug   12
10  sample8      drug   12
11  sample9      drug   24

I want to do differential expression analysis between control and drug at each time point, and also between the drug samples at the different time points (2, 12, and 24 hours). Sorry, I have read the manual several times, but I still have problems. Could you please help me correctly label the different conditions and biological replicates for statistical analysis?

As I found in the tutorial, the commands for time course analysis are:

pData(bg) = data.frame(pData(bg), time=rep(1:10, 2)) 
timecourse_results = stattest(bg, feature='transcript', meas='FPKM', covariate='time', timecourse=TRUE, adjustvars='group')

But I don't know the correct pData(bg) command for my experiment; please help me with this part, too. Regarding the second command, I think it is fine for my experiment except for "timecourse", which I should change from TRUE to FALSE because I have 4 time points (fewer than 5). Is that right? Also, can the groups be defined for ballgown, or should I define a custom model? Unfortunately, I don't know how to do that.

Thanks a lot for your help in advance.

ADD REPLY • link 7.2 years ago Sara ▴ 10

Seems like the ordering issue has been solved -- great!

"how to correctly label different conditions and biological replicates for statistical analysis" -- in this case, it looks to me like each sample you have now is a biological replicate (i.e., that you have 11 biological replicates). If this is not true, you will need to add two columns to your phenotype data frame, one listing the biological replicate ID and one listing the technical replicate ID, and you likely will need to build a more complex model (or average/summarize across technical replicates so you have one row per biological replicate). I assume the "condition" column is correct for treatments as well.

As for how to do the timecourse experiment, I suggest reading more about the "timecourse" option in ?stattest, as it will give you more details on what the timecourse option actually does. You don't need 5 timepoints, necessarily (I don't believe it says that anywhere in the tutorial). If you want to assess whether expression changes over time, you can use timecourse=TRUE.

"groups may be defined for ballgown or I should define custom model, which, unfortunately, I don't know how" -- I'm not totally sure what this means. You need to define your own comparison groups (ballgown cannot infer what you want to compare and what you want to adjust for). If you don't know how to define custom models, I'd recommend not doing so (that option is recommended for users who already know what they want their models to be).
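
Before choosing comparison groups, it can help to tabulate how many samples fall into each treatment/time combination. A small base-R sketch, with the phenotype table typed in from this thread:

```r
# Phenotype table as described in the question
data_form <- data.frame(
  ids       = paste0("sample", 1:11),
  treatment = c(rep("control", 2), rep("drug", 9)),
  time      = c(0, 0, rep(c(2, 12, 24), each = 3))
)

# Number of samples (biological replicates) per treatment/time combination
design <- table(treatment = data_form$treatment, time = data_form$time)
design

# In this design "control" occurs only at time 0 and "drug" only at
# times 2, 12 and 24, so treatment is fully determined by time.
```

Because treatment and time are not crossed here, a single four-level time variable already separates the control samples from the drug samples.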

ADD REPLY • link 7.2 years ago Alyssa Frazee ▴ 210

Thank you, Alyssa for your response!

Actually, I have 2 biological replicates for control (a1 and a2) and three biological replicates for each treatment time point, labeled b1-b3, c1-c3, and d1-d3. I don't have any technical replicates. I added a "rep" column as below; could you please let me know if this is what you meant?

       ids treatment time rep
1   sample1   control    0  a1
2  sample10      drug   24  d2
3  sample11      drug   24  d3
4   sample2   control    0  a2
5   sample3      drug    2  b1
6   sample4      drug    2  b2
7   sample5      drug    2  b3
8   sample6      drug   12  c1
9   sample7      drug   12  c2
10  sample8      drug   12  c3
11  sample9      drug   24  d1

My main issue is how to define pData(bg) here. I found the command below, but it is not right for my case:

pData(bg) = data.frame(pData(bg), time=rep(1:10, 2))

Could you please help me out on this issue?

Regarding time course experiment, manual says "The timecourse option assumes that "time" in your study is truly continuous, i.e., that it takes several values along a time scale. If you have very few timepoints (e.g., fewer than 5), we recommend treating time as a categorical variable, since having very few values does not give much granularity for fitting a smooth curve using splines. You can do this by setting covariate equal to 'time' (or whatever your time variable is named) and simply leaving timecourse as FALSE, its default. If you don't have more timepoints than degrees of freedom in the spline model, a warning will be printed and time will be coerced to categorical"

Here I have 4 time points (fewer than 5), so I am unsure whether timecourse=TRUE or timecourse=FALSE is right, based on the above explanation in the manual.

Thanks

ADD REPLY • link 7.2 years ago Sara ▴ 10

(1) the "rep" column should probably just be the a, b, c, or d.

(2) the data frame you have listed (the one with "ids," "treatment," "time," and "rep") should be how you set pData. You can assign pData with "pData(bg) = data_form" (or whatever you've called that data frame).

(3) Ah, I see the explanation you've found in the manual! Thanks. As it says there, if you have fewer than 5 timepoints it's probably better to treat time as categorical (timecourse=FALSE, which is the default). I believe it says this directly in the manual passage you provided.
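
Putting points (1) to (3) together, a rough sketch of what this could look like. The data frame is typed in from this thread, in the folder order settled on earlier. The ballgown calls assume the package is installed and that a ballgown object named `bg` already exists; they are skipped otherwise and have not been run against the real data. Storing time as a factor is an extra step added here to make the categorical treatment explicit.

```r
# Phenotype table in folder (alphabetical) order, with "rep" reduced to a-d
data_form <- data.frame(
  ids       = c("sample1", "sample10", "sample11", "sample2", "sample3",
                "sample4", "sample5", "sample6", "sample7", "sample8",
                "sample9"),
  treatment = c("control", "drug", "drug", "control", rep("drug", 7)),
  time      = c(0, 24, 24, 0, 2, 2, 2, 12, 12, 12, 24),
  rep       = c("a", "d", "d", "a", "b", "b", "b", "c", "c", "c", "d")
)

# Store time as a factor so it is clearly categorical, not continuous
data_form$time <- factor(data_form$time, levels = c(0, 2, 12, 24))

# Assumption: `bg` was created earlier with ballgown(dataDir = "ballgown", ...)
if (requireNamespace("ballgown", quietly = TRUE) && exists("bg")) {
  library(ballgown)
  pData(bg) <- data_form
  # Categorical time; timecourse is left at its default (FALSE)
  results <- stattest(bg, feature = "transcript", meas = "FPKM",
                      covariate = "time")
  head(results)
}
```

With more than two time levels this tests for any difference across the time groups. For a specific pair of time points, one option is to subset the object first, for example `subset(bg, "time %in% c(0, 2)", genomesubset = FALSE)`, and run stattest on the subset.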