Question

processing data with 2 rep or one replicate by ballgown

0

Entering edit mode

boyashuang • 0

@boyashuang-11654

Last seen 8.5 years ago

Thanks Dr Leek, Salzberg and their collegues and for sharing the fascinating HISAT-StringTie-Ballgown pipeline, presented in nature protocol and NatureBiotech.

Now we are reprocessing the RNA-seq data using the pipeline. However, we have encounter one bottleneck in using Ballgown, maybe because of the insufficient replicates of SRA data. The phenotype data as following, but we obtained 0 differentially expressed genes and <10 differentially expressed transcripts. Furthermore, we can't figure out what do the differentially expressed transcripts mean? That is to say, do they present the difference between the control and treat?

ids time treat
Cd0h-1 0h control0
Cd0h-2 0h control0
Cd1h-1 1h treat1
Cd1h-2 1h treat1
CdCK-1 1h control1
CdCK-2 1h control1
Cdck24 24h control24 (without any replicate)

CdCd24 24h treat24 (without any replicate)

In fact, we want to obtain the differentially expressed transcripts between treat VS. control both at 1 hour, and the DE transcripts between treat VS. control both at 24 hour (without any replicate).

All right, we did some test by analyzing treat VS. control (4 samples "Cd1h-1,Cd1h-2,CdCK-1,CdCK-2"), both at 1 hour with 2 replicates. However, we obtained 0 differentially expressed genes or transcripts.

After change the code, R tell us as following:

Wrapping up the results
> bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE)
> results_transcripts = stattest(bg_filt,feature="transcript",group=rep(c(1,0), each=2), getFC=TRUE, meas="FPKM")
Error in stattest(bg_filt, feature = "transcript", group = rep(c(1, 0), :
unused argument (group = rep(c(1, 0), each = 2))

Then we try to tell R the replicate information, also returned error.

> bg = ballgown(dataDir="ballgown", samples=rep(c(1,0), each=2), meas='all')
Error in strsplit(x, split = "/") : non-character argument

Another try also get the error warning,

bg = ballgown(dataDir="ballgown", samples="Cd1h-1,Cd1h-2,CdCK-1,CdCK-2", meas='all')
Sun Oct 9 20:43:21 2016
Error in ballgown(dataDir = "ballgown", samples = "Cd1h-1,Cd1h-2,CdCK-1,CdCK-2", :
Something is wrong: are you missing .ctab files? Do extra files/folders (other than tablemaker output folders) match your samples/dataDir/samplePattern argument(s)?

Thus we are confused about how to process RNA-seq data without sufficient replicates, such as Cd1h VS. CK1h (2 replicates each), Cd24 VS.ck24 (one sample each).

Thanks for your concern.

ballgown differential gene expression • 4.4k views

ADD COMMENT • link updated 8.4 years ago by Alyssa Frazee ▴ 210 • written 8.5 years ago by boyashuang • 0

score 1 · Answer 1 · 2016-10-11

Addressing the issues here in order:

* "we did some test by analyzing treat VS. control (4 samples "Cd1h-1,Cd1h-2,CdCK-1,CdCK-2"), both at 1 hour with 2 replicates. However, we obtained 0 differentially expressed genes or transcripts." --> this isn't too surprising. It's likely going to be difficult to estimate statistical significance with two replicates per group -- similar to doing a 2 vs. 2 t-test, there's just not much we can do re: estimating the variance in a sample of 2. One suggestion I have is to use limma in order to borrow information across genes for the 2 vs. 2 comparison. You can use the expression matrices from ballgown directly in limma.

---

* " bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE)
> results_transcripts = stattest(bg_filt,feature="transcript",group=rep(c(1,0), each=2), getFC=TRUE, meas="FPKM")" gives the error

Error in stattest(bg_filt, feature = "transcript", group = rep(c(1, 0), :
unused argument (group = rep(c(1, 0), each = 2))

per the documentation for stattest (see `help(stattest)` or `?stattest` in R), the argument you want is `covariate`, and it requires a string. You've given a numeric vector called "group", which is causing the error. You need to pass a string that's the name of the column dividing your samples in to groups. I would recommend creating a ballgown object for 1 hour only:

bg1 = subset(bg, 'ids %in% c("Cd1h-1", "Cd1h-2", "CdCK-1", "CdCK-2")', genomesubset=FALSE)

pData(bg1) gives:

ids time treat
Cd1h-1 1h treat1
Cd1h-2 1h treat1
CdCK-1 1h control1
CdCK-2 1h control1

(I based this on the question above assuming the data set was the output of pData, though that wasn't stated).

Then you can do:

results_transcripts = stattest(bg1, feature="transcript", covariate='treat', getFC=TRUE, meas="FPKM")

(again keeping in mind that 2 replicates is a difficult situation to be in for statistical significance, and again I recommend limma).

---

"bg = ballgown(dataDir="ballgown", samples=rep(c(1,0), each=2), meas='all')" --> "Error in strsplit(x, split = "/") : non-character argument"

I'd suggest taking a look at the documentation for `ballgown` (help(ballgown)) -- "samples" is a vector of file paths (which are strings). It is not supposed to be used to identify replicates. All identification of phenotype data is supposed to be done in the `pData` component.

---

"bg = ballgown(dataDir="ballgown", samples="Cd1h-1,Cd1h-2,CdCK-1,CdCK-2", meas='all')
Sun Oct 9 20:43:21 2016
Error in ballgown(dataDir = "ballgown", samples = "Cd1h-1,Cd1h-2,CdCK-1,CdCK-2", :
Something is wrong: are you missing .ctab files? Do extra files/folders (other than tablemaker output folders) match your samples/dataDir/samplePattern argument(s)?"

Again, the documentation for the "ballgown" function should help here. Samples needs to be a vector of file paths. (This is essentially the same problem as the previous error).

score 1 · Answer 2 · 2016-10-24

"Which difference does the obtained FC (fold change) mean?" --> it is the difference specified by `covariate`. The reference category is the "first" category (e.g., numerically or alphabetically) of "sex". I'm not sure what the baseline category is for the Nature Protocols paper, but the reference category is whichever one is coded lower. You can verify this by plotting the expressions by group for a couple differentially expressed genes with positive and negative fold change and seeing which group (male or female) has higher expression. (If the FC is positive and the males have higher expression, the ratio is male expression : female expression, for example).

"If we gonna to analyze the difference between the Female-GBR and Male-GBR, what shall we set the parameters?" --> one strategy is to subset the ballgown object to just GBR samples, and set covariate to "sex".

"If we gonna to analyze the difference between the Female-GBR and Male-YRI, what shall we set the parameters of covariate and adjustvars?" --> you could subset to only the samples of interest (f/gbr and m/yri), label the samples accordingly with a new variable like "group" and then use covariate = "group" (that's just one strategy, another would be to fit the linear model using limma and the full data set, extract the coefficients from the model, and calculate the differences accordingly using standard linear regression math.)

"how to hook out the data of TPM" --> TPM is only available if you use RSEM and the ballgownrsem function to read in the data. Stringtie and Cufflinks do not calculate TPM, so ballgown can't read TPM from their outputs. (you can see this in help(Ballgown) which gives help for the full S4 class: https://github.com/alyssafrazee/ballgown/blob/cfdb05e6943eb7f4a83ff12b8d2006e223dd6b89/R/AllClass.R#L17).

score 0 · Answer 3 · 2016-10-21

Thanks for your guidance!

However, we can't figure out which difference do the differentialy expressed transcripts indicate?

Take "Table 2 | Population and sex associated with each sample used in the protocol" in your NatureProtocol for example,

results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars =c("population"), getFC=TRUE, meas="FPKM")

Which difference does the obtained FC (fold change) mean?

If we gonna to analyze the difference between the Female-GBR and Male-GBR, what shall we set the parameters?

If we gonna to analyze the difference between the Female-GBR and Male-YRI, what shall we set the parameters of covariate and adjustvars?

Another point confused us is how to hook out the data of TPM.

Since ballgown code mentioned TPM in

meas = match.arg(meas, c('TPM', 'FPKM', 'all'))

and the output of Stringtie only held t_data.ctab with cov and FPKM (Cufflinks-estimated FPKM for the transcript), but we can't obtain the TPM datamatrix using the parameter "all". After all, the GTF file (Cd1h-1.gtf) output by Stringtie has one column for "TPM" value, but lost in the file t_data.ctab.

transcript_data_frame = texpr(bg, 'all')

which output only contained cov and FPKM.

So we post our confusions here.

Thanks for your concern.