edgeR: Unusual BCV plot
2
0
Entering edit mode
dsperley • 0
@dsperley-7315
Last seen 6.6 years ago
United States

Hi, 

I'm doing an RNA-Seq analysis of blood samples from 28 lyme infected patients  and 13 control samples. The reads were aligned with TopHat2 and read were summarized using featureCounts. When estimating the dispersion, I get a value of 3.29, and the following BCV plot:

This looks like something went horribly wrong.

The MDS plot does show a lack of separation between the control and infected group. However, I don't think that explains the BCV plot - or could it?.

Code:

##Load gene lengths

exonic.gene.sizes.df<-read.table("hg19_gene_Lengths.txt",header=TRUE)

##Load read counts from csv

counts_csv<-read.csv("counts.csv", header=TRUE)

##Make into matrix

##Function to create a matrix and set the right names

matrix.please<-function(x) {
    m<-as.matrix(x[,-1])
    rownames(m)<-x[,1]
    m
}

##Run on counts

counts <- matrix.please(counts_csv)

##assign groups

group<-c(rep("patient", times=28),rep("control", times=13))

##set up DGEList
RNA_data<-DGEList(counts=counts,group=group)

#add in Gene Length
RNA_data$genes<-exonic.gene.sizes.df

# filter low expressing genes
#Keeping Genes with greater than 8 count per million (cpm) in at least 13 sample

keep<-rowSums(cpm(RNA_data)>8)>=13
RNA_data_filtered<-RNA_data[keep,,keep.lib.sizes=F]

# normalization
RNA_data_filtered<- calcNormFactors(RNA_data_filtered)

RNA_data_filtered<-estimateDisp(RNA_data_filtered,design=design)

sqrt(RNA_data_filtered$common.disp)
###[1] 3.285567 

##plot BCV

plotBCV(RNA_data_filtered)

 

sessionInfo:

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.12.4              SummarizedExperiment_1.2.3 Biobase_2.32.0            
 [4] GenomicRanges_1.24.3       GenomeInfoDb_1.8.7         IRanges_2.6.1             
 [7] S4Vectors_0.10.3           BiocGenerics_0.18.0        edgeR_3.14.0              
[10] limma_3.28.21             

loaded via a namespace (and not attached):
 [1] genefilter_1.54.2       locfit_1.5-9.1          splines_3.3.2           lattice_0.20-34        
 [5] colorspace_1.3-0        htmltools_0.3.5         rtracklayer_1.32.2      GenomicFeatures_1.24.5 
 [9] chron_2.3-47            survival_2.40-1         XML_3.98-1.5            foreign_0.8-67         
[13] DBI_0.5-1               BiocParallel_1.6.6      RColorBrewer_1.1-2      plyr_1.8.4             
[17] stringr_1.1.0           zlibbioc_1.18.0         Biostrings_2.40.2       munsell_0.4.3          
[21] gtable_0.2.0            labeling_0.3            latticeExtra_0.6-28     knitr_1.15             
[25] geneplotter_1.50.0      biomaRt_2.28.0          AnnotationDbi_1.34.4    htmlTable_1.7          
[29] Rcpp_0.12.7             acepack_1.4.1           xtable_1.8-2            scales_0.4.1           
[33] Hmisc_4.0-0             annotate_1.50.1         XVector_0.12.1          Rsamtools_1.24.0       
[37] gridExtra_2.2.1         ggplot2_2.2.0           digest_0.6.10           stringi_1.1.2          
[41] grid_3.3.2              tools_3.3.2             bitops_1.0-6            magrittr_1.5           
[45] lazyeval_0.2.0          RCurl_1.95-4.8          tibble_1.2              RSQLite_1.0.0          
[49] Formula_1.2-1           cluster_2.0.5           Matrix_1.2-7.1          data.table_1.9.6       
[53] assertthat_0.1          rpart_4.1-10            GenomicAlignments_1.8.4 nnet_7.3-12       

 

 

edger • 2.5k views
ADD COMMENT
0
Entering edit mode

The links to your plots seem to be inaccessible.

ADD REPLY
0
Entering edit mode

Fixed the links.

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen 7 hours ago
WEHI, Melbourne, Australia

The MDS plot does show a lack of separation between the control and infected group. However, I don't think that explains the BCV plot - or could it?

Yes, it absolutely does. Your samples absolutely do not separate into two groups, and the dispersion estimates are simply a reflection of the massive lack of fit of the model. The slight increase in the dispersion trend is the least of the problems. You should be more worried by the size of the dispersion, which is absolutely massive, and the unexpected patterns seen in the MDS plot.

As Aaron says, you need to trouble-shoot your data. Why do your samples seem to pair up in unexpected ways? What are the most variable genes in your experiment? Etc. As you say, something is horribly wrong, and you need to figure out what it is. The problem is with the data itself.

ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 15 hours ago
The city by the bay

I daresay there's some systematic structure in the patients that you haven't accounted for, given the presence of a clear cluster of samples separated on the first dimension. This is most likely driving the increasing trend in the NB dispersion, because the presence of some systematic difference requires a larger dispersion to model it at high abundances. Consider including some possible factors of variation in your design matrix, either known from the patient information (e.g., sex, location, age) or unknown ones discovered using methods like RUVSeq or sva.

ADD COMMENT

Login before adding your answer.

Traffic: 738 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6