Question

Ballgown stattest: Why do p-values change when changing filtration?

katrinegraversen • 0

Hello,

I am not very experienced in data analysis. I am analysing data from a small RNA-seq experiment (two conditions, 5 samples in each), and found great help in following the guidelines by Pertea M et al. (Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016 Sep;11(9):1650-67.)

I have reached the final step in the analysis and am puzzled by the output I get from Ballgown's stattest, since my p-values change when I change my filtering settings. I am completely aware that q-values should change, but I thought that p-values would be unaffected. Can anyone explain this or find the mistake in my procedure?

Thank you very much in advance! Kind regards Katrine

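#Load the packages used below (rowVars comes from genefilter)
library(ballgown)
library(genefilter)

#Read in samples overview
pheno_data = read.csv("samples_overview.csv")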

#Read in the expression data from StringTie
bg_data = ballgown(dataDir = "ballgown", samplePattern = "sample_", pData=pheno_data)

Different filtering options - only one applied at a time :)

#1 Remove all transcripts with a variance across samples less than one.
bg_data_filt = subset(bg_data,"rowVars(texpr(bg_data)) >1",genomesubset=TRUE)

#2 Remove all transcripts whose FPKM summed across all samples is below 10 (texpr returns FPKM by default, not read counts)
bg_data_filt = subset(bg_data,"rowSums(texpr(bg_data)) >= 10",genomesubset=TRUE)

#3 Do not filter
bg_data_filt = bg_data

Each filtering option followed by

#Identify transcripts that show statistically significant differences between groups
group_transcripts = stattest(bg_data_filt, feature="transcript", covariate="group", meas="FPKM", getFC = TRUE)

And then let's just look at Il10 as an example:

group_transcripts[5591,"pval"]

Depending on the filtering option, this command returns: #1: 0.9136172; #2: 0.874399; #3: 0.1992573.
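A side check that may be worth making, assuming the filtered results contain fewer rows than the unfiltered one: a fixed row position such as 5591 need not point at the same transcript in every result, because dropping rows shifts the positions of the rows after them. The sketch below uses small made-up data frames with the same kind of columns that stattest() returns (feature, id, pval, qval); the ids and values are hypothetical and only illustrate looking a transcript up by its id instead of by row position.

```r
# Hypothetical stand-in for a stattest() result on unfiltered data (toy values).
res_all <- data.frame(feature = "transcript",
                      id = c("10", "11", "12", "13"),
                      pval = c(0.50, 0.20, 0.91, 0.03),
                      qval = NA)

# Pretend a filter removed transcript "11": every later row moves up by one.
res_filt <- res_all[res_all$id != "11", ]
rownames(res_filt) <- NULL

target_id <- "12"  # hypothetical transcript id of interest

# Lookup by row position: may hit different transcripts in the two results.
by_position <- c(all = res_all[3, "pval"], filt = res_filt[3, "pval"])

# Lookup by id: always refers to the same transcript.
by_id <- c(all = res_all$pval[res_all$id == target_id],
           filt = res_filt$pval[res_filt$id == target_id])

print(by_position)
print(by_id)
```

With real results, the same pattern would be `group_transcripts[group_transcripts$id == some_id, "pval"]`, where `some_id` is the transcript id of interest.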

ballgown • 1.2k views

updated 4.4 years ago by Leonardo Collado Torres ★ 1.1k • written 4.4 years ago by katrinegraversen • 0

score 0 · Answer 1 · 2020-10-21

Hi @katrinegraversen,

My understanding from looking at the source code is that you are getting different p-values because libadjust = NULL is the default in stattest(). In that case a library-size adjustment is derived automatically from the expression data using each sample's 75th percentile (as I read the documentation, it is the sum of the sample's log expression values below that percentile) and is included in the fitted models. Since the input data is different in all 3 cases, the 75th percentile and therefore the adjustment will also be different, which changes the model fit in each of your 3 use cases and thus the p-values.

Try stattest(libadjust = FALSE) to test whether this explains it.

Details

https://github.com/alyssafrazee/ballgown/blob/master/R/stattest.R#L75-L79 explains the library size adjustment
Here's how the expression data is adjusted by default https://github.com/alyssafrazee/ballgown/blob/master/R/stattest.R#L194-L199
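To make the explanation above concrete, here is a minimal base-R sketch on simulated data. It is not ballgown's code: it assumes the adjustment is the per-sample sum of log2(x + 1) values below that sample's 75th percentile, and that the p-value comes from an F-test comparing nested linear models with and without the group term. The helper names `lib_adjust` and `pval_for` are made up for this illustration.

```r
# Toy data: 200 transcripts x 10 samples (5 per group); values stand in for FPKM.
set.seed(1)
n_tx <- 200
group <- factor(rep(c("A", "B"), each = 5))
expr <- matrix(rexp(n_tx * 10, rate = 1 / 5), nrow = n_tx,
               dimnames = list(paste0("tx", seq_len(n_tx)),
                               paste0("sample_", 1:10)))

# Assumed form of the default adjustment: per sample, the sum of the
# log2(x + 1) values that lie below that sample's 75th percentile.
lib_adjust <- function(m) {
  apply(m, 2, function(x) {
    lx <- log2(x + 1)
    sum(lx[lx < quantile(lx, 0.75)])
  })
}

# p-value for one transcript from an F-test on nested linear models
# (with vs. without the group term), optionally including the covariate.
pval_for <- function(m, tx, adjust = TRUE) {
  y <- log2(m[tx, ] + 1)
  if (adjust) {
    adj <- lib_adjust(m)
    anova(lm(y ~ adj), lm(y ~ adj + group))[2, "Pr(>F)"]
  } else {
    anova(lm(y ~ 1), lm(y ~ group))[2, "Pr(>F)"]
  }
}

# Filter out low-expression rows, then test a transcript present in both sets.
filt <- expr[rowSums(expr) >= 40, ]
tx <- rownames(filt)[1]

p_adj_all  <- pval_for(expr, tx)                  # covariate from all rows
p_adj_filt <- pval_for(filt, tx)                  # covariate from kept rows
p_raw_all  <- pval_for(expr, tx, adjust = FALSE)  # no covariate
p_raw_filt <- pval_for(filt, tx, adjust = FALSE)  # no covariate

print(c(adjusted_all = p_adj_all, adjusted_filtered = p_adj_filt))
print(c(unadjusted_all = p_raw_all, unadjusted_filtered = p_raw_filt))
```

In this sketch the transcript's own values are identical before and after filtering, so the unadjusted p-values match, while the adjusted ones differ because the covariate is recomputed from whichever rows remain.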

Best, Leo

PS This was an example for my team on learning how to help others.