DESeq2 handling of technical replicates
1
1
Entering edit mode
@adeolu-adewoye-4377
Last seen 7.9 years ago
European Union

Dear All,

I am having trouble dealing with technical replicates in my RNAseq data analysis using Deseq2. I have generated a SummarizedExperiment object for my dataset but couldn't pass the collapseReplicates() stage. Please see below for more information.

> head(sampleInfo)
  sampleName                    fileName copyNumber population
1    NA06984   NA06984_accepted_hits.bam    control        CEU
2    NA06985   NA06985_accepted_hits.bam    control        CEU
3    NA06989   NA06989_accepted_hits.bam    control        CEU
4    NA07000   NA07000_accepted_hits.bam    control        CEU
5    NA07051  NA070512_accepted_hits.bam   control        CEU
6    NA07051   NA07051_accepted_hits.bam    control        CEU


> (colData(GenesSE)<- DataFrame(sampleInfo))
DataFrame with 486 rows and 4 columns
    sampleName                    fileName copyNumber population
      <factor>                    <factor>   <factor>   <factor>
1      NA06984   NA06984_accepted_hits.bam    control        CEU
2      NA06985   NA06985_accepted_hits.bam    control        CEU
3      NA06989   NA06989_accepted_hits.bam    control        CEU
4      NA07000   NA07000_accepted_hits.bam    control        CEU
5      NA07051  NA070512_accepted_hits.bam   control        CEU
...        ...                         ...        ...        ...
482    NA20540   NA20540_accepted_hits.bam       null        TSI
483    NA20589   NA20589_accepted_hits.bam       null        TSI
484    NA20754  NA207542_accepted_hits.bam      null        TSI
485    NA20754   NA20754_accepted_hits.bam       null        TSI
486    NA20778   NA20778_accepted_hits.bam       null        TSI
> colnames(GenesSE) <- sampleInfo$sampleName
Error in `rownames<-`(`*tmp*`, value = c(180L, 181L, 183L, 185L, 188L,  :
  duplicate rownames not allowed


## Oblivously the sampleName column contains duplicated names which was to be used for grouping with collapseReplicates() but below is what I got

techreplictes <-as.data.frame(colData( ddsFull )[ ,c("sampleName","fileName","copyNumber","population") ] )

> head(techreplictes)
  sampleName                    fileName copyNumber population
1    NA06984   NA06984_accepted_hits.bam    control        CEU
2    NA06985   NA06985_accepted_hits.bam    control        CEU
3    NA06989   NA06989_accepted_hits.bam    control        CEU
4    NA07000   NA07000_accepted_hits.bam    control        CEU
5    NA07051  NA070512_accepted_hits.bam    control        CEU
6    NA07051   NA07051_accepted_hits.bam    control        CEU


> ddsCollapsed <- collapseReplicates(ddsFull,groupby= ddsFull$sampleName, run= ddsFull$fileName)


Error: sum(assay(object)) == sum(assay(collapsed)) is not TRUE
In addition: Warning messages:
1: In sum(assay(object)) : integer overflow - use sum(as.numeric(.))
2: In sum(assay(collapsed)) : integer overflow - use sum(as.numeric(.))

> traceback()
3: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
       ch), call. = FALSE, domain = NA)
2: stopifnot(sum(assay(object)) == sum(assay(collapsed)))
1: collapseReplicates(ddsFull, groupby = ddsFull$sampleName, run = ddsFull$fileName)

 

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.0.0
 [2] GenomicAlignments_1.2.1                
 [3] DESeq2_1.6.3                           
 [4] RcppArmadillo_0.4.400.0                
 [5] Rcpp_0.11.4                            
 [6] Rsamtools_1.18.2                       
 [7] Biostrings_2.34.1                      
 [8] XVector_0.6.0                          
 [9] GenomicFeatures_1.18.3                 
[10] AnnotationDbi_1.28.1                   
[11] Biobase_2.26.0                         
[12] GenomicRanges_1.18.4                   
[13] GenomeInfoDb_1.2.4                     
[14] IRanges_2.0.1                          
[15] S4Vectors_0.4.0                        
[16] BiocGenerics_0.12.1                    

loaded via a namespace (and not attached):
 [1] annotate_1.42.1     BatchJobs_1.5       BBmisc_1.8         
 [4] BiocParallel_1.0.2  biomaRt_2.22.0      bitops_1.0-6       
 [7] brew_1.0-6          checkmate_1.5.1     cluster_2.0.1      
[10] codetools_0.2-10    colorspace_1.2-4    DBI_0.3.1          
[13] digest_0.6.8        fail_1.2            foreach_1.4.2      
[16] Formula_1.2-0       genefilter_1.48.1   geneplotter_1.44.0
[19] ggplot2_1.0.0       grid_3.1.1          gtable_0.1.2       
[22] Hmisc_3.14-4        iterators_1.0.7     lattice_0.20-29    
[25] latticeExtra_0.6-26 locfit_1.5-9.1      MASS_7.3-34        
[28] munsell_0.4.2       plyr_1.8.1          proto_0.3-10       
[31] RColorBrewer_1.0-5  RCurl_1.95-4.3      reshape2_1.4       
[34] RSQLite_1.0.0       rtracklayer_1.26.2  scales_0.2.4       
[37] sendmailR_1.1-2     splines_3.1.1       stringr_0.6.2      
[40] survival_2.37-7     tools_3.1.1         XML_3.98-1.1       
[43] xtable_1.7-3        zlibbioc_1.10.0    
>

I would appreciate some suggestions on how to proceed with my analysis. Ultimately, I am interested in differential expression and being able to aggregate reads from technical samples improves the analysis.

Many thanks

Adeolu

 

deseq2 • 1.9k views
ADD COMMENT
0
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

hi Adeolu,

I hadn't seen that error about summing a matrix and integer overflow. Can you try the following modified function and report back if this works?

collapseReplicatesBugFix <- function(object, groupby, run, renameCols=TRUE) {
  if (!is.factor(groupby)) groupby <- factor(groupby)
  groupby <- droplevels(groupby)
  stopifnot(length(groupby) == ncol(object))
  sp <- split(seq(along=groupby), groupby)
  countdata <- sapply(sp, function(i) rowSums(assay(object)[,i,drop=FALSE]))
  mode(countdata) <- "integer"
  colsToKeep <- sapply(sp, `[`, 1)
  collapsed <- object[,colsToKeep]
  assay(collapsed) <- countdata
  if (!missing(run)) {
    stopifnot(length(groupby) == length(run))
    colData(collapsed)$runsCollapsed <- sapply(sp, function(i) paste(run[i],collapse=","))
  }
  if (renameCols) {
    colnames(collapsed) <- levels(groupby)
  }
  stopifnot(sum(as.numeric(assay(object))) == sum(as.numeric(assay(collapsed))))
  collapsed
}
ADD COMMENT
0
Entering edit mode

Hi Michael, Thank you for your response.

The function works and no more error message when I collapsed the replicates.

Many thanks,

Adeolu

ADD REPLY
0
Entering edit mode

Great, I will fix this bug in the devel branch.

ADD REPLY

Login before adding your answer.

Traffic: 332 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6