Hello,
I am not very experienced in data analysis. I am analysing data from a small RNA-seq experiment (two conditions, five samples in each) and have found great help in following the guidelines by Pertea M et al. (Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat Protoc. 2016 Sep;11(9):1650-67.)
I have reached the final step in the analysis and am puzzled by the output I get from Ballgown's stattest, since my p-values change when I change my filtering settings. I am aware that the q-values should change, but I thought the p-values would be unaffected. Can anyone explain this or find the mistake in my procedure?
Thank you very much in advance! Kind regards Katrine
#Load the packages (genefilter provides rowVars)
library(ballgown)
library(genefilter)
#Read in samples overview
pheno_data = read.csv("samples_overview.csv")
#Read in the expression data from StringTie
bg_data = ballgown(dataDir = "ballgown", samplePattern = "sample_", pData=pheno_data)
Different filtering options (only one applied at a time):
#1 Remove all transcripts with a variance across samples less than one.
bg_data_filt = subset(bg_data,"rowVars(texpr(bg_data)) >1",genomesubset=TRUE)
#2 Remove all transcripts with a summed FPKM of less than 10 across all samples (texpr returns FPKM by default, not read counts)
bg_data_filt = subset(bg_data,"rowSums(texpr(bg_data)) >= 10",genomesubset=TRUE)
#3 Do not filter
bg_data_filt = bg_data
Each filtering option is followed by:
#Identify transcripts that show statistically significant differences between groups
group_transcripts = stattest(bg_data_filt, feature="transcript", covariate="group", meas="FPKM", getFC=TRUE)
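One thing that may be worth checking here: as far as the ballgown documentation describes it, stattest adds a per-sample library-size adjustment (the libadjust argument) to the model by default, and that adjustment is computed from the expression values of the transcripts present in the object. If that is right, removing transcripts changes the covariate and could shift the p-values of the transcripts that remain. The following is only a toy sketch in base R, not ballgown's own code: the simulated matrix, the helpers lib_size and pval_for, and the exact form of the adjustment are assumptions made for illustration.

```r
# Toy illustration in base R (NOT ballgown itself): a per-sample covariate
# computed from the retained rows changes when rows are filtered out.
set.seed(1)
n_tx  <- 200
group <- factor(rep(c("A", "B"), each = 5))

# Hypothetical FPKM-like matrix: 200 transcripts x 10 samples
expr <- matrix(rexp(n_tx * 10, rate = 0.2), nrow = n_tx)

# Assumed form of the adjustment: per sample, sum the log2(x + 1) values
# that lie at or below that sample's 75th percentile.
lib_size <- function(m) {
  apply(log2(m + 1), 2, function(x) sum(x[x <= quantile(x, 0.75)]))
}

# p-value for the group effect on one transcript, adjusting for `lib`
# (F-test comparing the model with and without the group term)
pval_for <- function(y, lib) {
  full <- lm(y ~ group + lib)
  null <- lm(y ~ lib)
  anova(null, full)[2, "Pr(>F)"]
}

keep <- rowSums(expr) >= 50              # arbitrary toy filter
y    <- log2(expr[which(keep)[1], ] + 1) # the same transcript in both runs

p_unfiltered <- pval_for(y, lib_size(expr))
p_filtered   <- pval_for(y, lib_size(expr[keep, ]))
c(unfiltered = p_unfiltered, filtered = p_filtered)
```

The expression values of the tested transcript are identical in both runs; only the adjustment covariate differs. If this is indeed the cause, rerunning stattest with libadjust=FALSE on each filtered object should show whether the p-values then agree.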
And then let's just look at Il10 as an example:
group_transcripts[5591,"pval"]
Depending on the filtering option, this command returns: #1: 0.9136172; #2: 0.874399; #3: 0.1992573.
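A separate point that may matter: a fixed row number such as 5591 only refers to the same transcript if no rows were removed before it, so after filtering it can point at a different transcript. The toy sketch below uses a made-up results table (hypothetical ids and p-values, standing in for the stattest output, which to my understanding carries an id column) to show the difference between looking up by position and looking up by identifier.

```r
# Toy results table standing in for stattest() output (hypothetical values)
res_all <- data.frame(id   = 1:6,
                      pval = c(0.90, 0.03, 0.50, 0.20, 0.75, 0.01))

# Toy filter that drops transcripts 2 and 3
keep     <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)
res_filt <- res_all[keep, ]
rownames(res_filt) <- NULL   # rows are now numbered 1..4

# Same row position, different transcript
res_all[4, "id"]    # transcript 4
res_filt[4, "id"]   # transcript 6

# Looking up by identifier returns the same transcript in both tables
res_all[res_all$id == 4, "pval"]
res_filt[res_filt$id == 4, "pval"]
```

With the real data, the equivalent check would be to compare the id value in row 5591 of each result, or to select the row by transcript id or gene name instead of by position.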