Question

edgeR: Unusual BCV plot

0

Entering edit mode

dsperley • 0

@dsperley-7315

Last seen 6.6 years ago

United States

Hi,

I'm doing an RNA-Seq analysis of blood samples from 28 lyme infected patients and 13 control samples. The reads were aligned with TopHat2 and read were summarized using featureCounts. When estimating the dispersion, I get a value of 3.29, and the following BCV plot:

This looks like something went horribly wrong.

The MDS plot does show a lack of separation between the control and infected group. However, I don't think that explains the BCV plot - or could it?.

Code:

##Load gene lengths

exonic.gene.sizes.df<-read.table("hg19_gene_Lengths.txt",header=TRUE)

##Load read counts from csv

counts_csv<-read.csv("counts.csv", header=TRUE)

##Make into matrix

##Function to create a matrix and set the right names

matrix.please<-function(x) {
    m<-as.matrix(x[,-1])
    rownames(m)<-x[,1]
    m
}

##Run on counts

counts <- matrix.please(counts_csv)

##assign groups

group<-c(rep("patient", times=28),rep("control", times=13))

##set up DGEList
RNA_data<-DGEList(counts=counts,group=group)

#add in Gene Length
RNA_data$genes<-exonic.gene.sizes.df

# filter low expressing genes
#Keeping Genes with greater than 8 count per million (cpm) in at least 13 sample

keep<-rowSums(cpm(RNA_data)>8)>=13
RNA_data_filtered<-RNA_data[keep,,keep.lib.sizes=F]

# normalization
RNA_data_filtered<- calcNormFactors(RNA_data_filtered)

RNA_data_filtered<-estimateDisp(RNA_data_filtered,design=design)

sqrt(RNA_data_filtered$common.disp)
###[1] 3.285567 

##plot BCV

plotBCV(RNA_data_filtered)

sessionInfo:

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0            
 [4] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1             
 [7] S4Vectors_0.10.3           BiocGenerics_0.18.0        edgeR_3.14.0              
[10] limma_3.28.21             

loaded via a namespace (and not attached):
 [1] genefilter_1.54.2       locfit_1.5-9.1          splines_3.3.2           lattice_0.20-34        
 [5] colorspace_1.3-0        htmltools_0.3.5         rtracklayer_1.32.2      GenomicFeatures_1.24.5 
 [9] chron_2.3-47            survival_2.40-1         XML_3.98-1.5            foreign_0.8-67         
[13] DBI_0.5-1               BiocParallel_1.6.6      RColorBrewer_1.1-2      plyr_1.8.4             
[17] stringr_1.1.0           zlibbioc_1.18.0         Biostrings_2.40.2       munsell_0.4.3          
[21] gtable_0.2.0            labeling_0.3            latticeExtra_0.6-28     knitr_1.15             
[25] geneplotter_1.50.0      biomaRt_2.28.0          AnnotationDbi_1.34.4    htmlTable_1.7          
[29] Rcpp_0.12.7             acepack_1.4.1           xtable_1.8-2            scales_0.4.1           
[33] Hmisc_4.0-0             annotate_1.50.1         XVector_0.12.1          Rsamtools_1.24.0       
[37] gridExtra_2.2.1         ggplot2_2.2.0           digest_0.6.10           stringi_1.1.2          
[41] grid_3.3.2              tools_3.3.2             bitops_1.0-6            magrittr_1.5           
[45] lazyeval_0.2.0          RCurl_1.95-4.8          tibble_1.2              RSQLite_1.0.0          
[49] Formula_1.2-1           cluster_2.0.5           Matrix_1.2-7.1          data.table_1.9.6       
[53] assertthat_0.1          rpart_4.1-10            GenomicAlignments_1.8.4 nnet_7.3-12

edger • 2.5k views

ADD COMMENT • link updated 7.4 years ago by Gordon Smyth 50k • written 7.4 years ago by dsperley • 0

0

Entering edit mode

The links to your plots seem to be inaccessible.

ADD REPLY • link 7.4 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Fixed the links.

ADD REPLY • link 7.4 years ago dsperley • 0

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 15 hours ago

The city by the bay

I daresay there's some systematic structure in the patients that you haven't accounted for, given the presence of a clear cluster of samples separated on the first dimension. This is most likely driving the increasing trend in the NB dispersion, because the presence of some systematic difference requires a larger dispersion to model it at high abundances. Consider including some possible factors of variation in your design matrix, either known from the patient information (e.g., sex, location, age) or unknown ones discovered using methods like RUVSeq or sva.

ADD COMMENT • link 7.4 years ago Aaron Lun ★ 28k

score 2 · Accepted Answer · 2016-12-13

The MDS plot does show a lack of separation between the control and infected group. However, I don't think that explains the BCV plot - or could it?

Yes, it absolutely does. Your samples absolutely do not separate into two groups, and the dispersion estimates are simply a reflection of the massive lack of fit of the model. The slight increase in the dispersion trend is the least of the problems. You should be more worried by the size of the dispersion, which is absolutely massive, and the unexpected patterns seen in the MDS plot.

As Aaron says, you need to trouble-shoot your data. Why do your samples seem to pair up in unexpected ways? What are the most variable genes in your experiment? Etc. As you say, something is horribly wrong, and you need to figure out what it is. The problem is with the data itself.