Question

Cannot generate a DEseq2DataSet from my data - DESeqDataSetFromMatrix returns error

0

Entering edit mode

mschmidt ▴ 10

@mschmidt-18923

Last seen 2.4 years ago

Poznan

I want to explore RNAseq data from Expression Atlas (EMBL-EBI database). The dataset downloaded from the database was in TPM, so I changed NAs into '0's and multiplied the values by 1E+6 (to get integer values) and added 1 for later log-transformation

data4PCA[is.na(data4PCA)] = 0
data4PCA = data4PCA * 1000000
data4PCAint = as.matrix.data.frame(data4PCA + 1)

I want to do rlogTransformation the data (rlog) but got stuck at creation of a DESeqDataSet:

dds = DESeqDataSetFromMatrix(countData = data4PCAint,
                             colData = design4pca,
                             design = ~ System)

which returns an error:

converting counts to integer mode Error in validObject(.Object) :
invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix In addition: Warning message: In mde(x) : NAs introduced by coercion to integer range

head(data4PCAint)

returns:

                bone.marrow    colon duodenum esophagus    liver lymph.node
ENSG00000000003      600001 65000001 30000001  61000001 62000001    6000001
ENSG00000000005           1  1000001   400001    800001        1     100001
ENSG00000000419    87000001 83000001 66000001  98000001 46000001  132000001
ENSG00000000457     3000001 11000001  9000001  10000001  5000001   17000001
ENSG00000000460     6000001  3000001  3000001   3000001  2000001    7000001
ENSG00000000938   308000001  6000001  6000001  12000001  6000001   63000001
                  rectum saliva.secreting.gland small.intestine    spleen
ENSG00000000003 67000001               53000001        22000001  17000001
ENSG00000000005  1000001                 400001          400001    500001
ENSG00000000419 83000001               35000001        78000001  97000001
ENSG00000000457 13000001                6000001         9000001  14000001
ENSG00000000460  5000001                1000001         3000001   5000001
ENSG00000000938  7000001                3000001        10000001 180000001
                 stomach    tonsil vermiform.appendix
ENSG00000000003 22000001  14000001           11000001
ENSG00000000005        1    300001            2000001
ENSG00000000419 63000001 114000001          100000001
ENSG00000000457 10000001  14000001           13000001
ENSG00000000460  2000001  10000001            8000001
ENSG00000000938  6000001  36000001          172000001

so it looks like the data is OK?!

sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] gplots_3.0.1.1              RColorBrewer_1.1-2         
 [3] reshape2_1.4.3              pheatmap_1.0.12            
 [5] forcats_0.4.0               stringr_1.4.0              
 [7] dplyr_0.8.3                 purrr_0.3.2                
 [9] readr_1.3.1                 tidyr_0.8.3                
[11] tibble_2.1.3                ggplot2_3.2.0              
[13] tidyverse_1.2.1             DESeq2_1.24.0              
[15] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
[17] BiocParallel_1.18.0         matrixStats_0.54.0         
[19] Biobase_2.44.0              GenomicRanges_1.36.0       
[21] GenomeInfoDb_1.20.0         IRanges_2.18.0             
[23] S4Vectors_0.22.0            BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
 [1] nlme_3.1-140           bitops_1.0-6           lubridate_1.7.4       
 [4] bit64_0.9-7            httr_1.4.0             tools_3.6.0           
 [7] backports_1.1.4        R6_2.4.0               KernSmooth_2.23-15    
[10] rpart_4.1-15           Hmisc_4.2-0            DBI_1.0.0             
[13] lazyeval_0.2.2         colorspace_1.4-1       nnet_7.3-12           
[16] withr_2.1.2            tidyselect_0.2.5       gridExtra_2.3         
[19] bit_1.1-14             compiler_3.6.0         cli_1.1.0             
[22] rvest_0.3.4            htmlTable_1.13.1       xml2_1.2.0            
[25] labeling_0.3           caTools_1.17.1.2       scales_1.0.0          
[28] checkmate_1.9.4        genefilter_1.66.0      digest_0.6.20         
[31] foreign_0.8-71         XVector_0.24.0         base64enc_0.1-3       
[34] pkgconfig_2.0.2        htmltools_0.3.6        htmlwidgets_1.3       
[37] rlang_0.4.0            readxl_1.3.1           rstudioapi_0.10       
[40] RSQLite_2.1.1          generics_0.0.2         jsonlite_1.6          
[43] gtools_3.8.1           acepack_1.4.1          RCurl_1.95-4.12       
[46] magrittr_1.5           GenomeInfoDbData_1.2.1 Formula_1.2-3         
[49] Matrix_1.2-17          Rcpp_1.0.1             munsell_0.5.0         
[52] stringi_1.4.3          zlibbioc_1.30.0        plyr_1.8.4            
[55] grid_3.6.0             blob_1.2.0             gdata_2.18.0          
[58] crayon_1.3.4           lattice_0.20-38        haven_2.1.1           
[61] splines_3.6.0          annotate_1.62.0        hms_0.5.0             
[64] locfit_1.5-9.1         zeallot_0.1.0          knitr_1.23            
[67] pillar_1.4.2           geneplotter_1.62.0     XML_3.98-1.20         
[70] glue_1.3.1             latticeExtra_0.6-28    data.table_1.12.2     
[73] modelr_0.1.4           vctrs_0.2.0            cellranger_1.1.0      
[76] gtable_0.3.0           assertthat_0.2.1       xfun_0.8              
[79] xtable_1.8-4           broom_0.5.2            survival_2.44-1.1     
[82] AnnotationDbi_1.46.0   memoise_1.1.0          cluster_2.1.0

deseq2 rlog • 2.0k views

ADD COMMENT • link updated 6.5 years ago by Michael Love 43k • written 6.5 years ago by mschmidt ▴ 10

0

Entering edit mode

I found this example to test with:

df=data.frame("treat"=sample(c(80:100),6, replace=FALSE),
              "treat1"=sample(c(90:103),6, replace=FALSE),
              "treat2"=sample(c(80:100),6, replace=FALSE),
              "ctrl"=sample(c(60:90),6, replace=FALSE),
              "ctrl1"=sample(c(60:90),6, replace=FALSE),
              "ctrl2"=sample(c(60:90),6, replace=FALSE))
conds<-as.factor(c("Treat","Treat","Treat","Control","Control","Control"))
coldata <- data.frame(row.names=colnames(df), conds)
dds=DESeqDataSetFromMatrix(countData=df,colData=coldata,design=~conds)

...and it works fine?! but my data cannot get through.

ADD REPLY • link 6.5 years ago mschmidt ▴ 10

score 1 · Answer 1 · 2019-07-25

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

You’ve multiplied by a much too large number and made counts that are not anywhere close to estimated counts.

Why don’t you use tximport instead which was designed to do this transfer?

ADD COMMENT • link 6.5 years ago Michael Love 43k

0

Entering edit mode

Thanks for fast reply, I was testing several setups of the tximport for last several minutes. I have single file with 15 columns (Tx, Gene.Name, and 13 columns named by tissues with TPMs) I used:

txi <- tximport(file = "data4pcatxi.tsv", type="none", txOut = TRUE, 
+                 txIdCol = "Tx", geneIdCol = "Gene.Name",
+                 importer = read_tsv)

and got:

> 1 Parsed with column specification: cols(   Tx = col_character(),  
> Gene.Name = col_character(),   bone.marrow = col_double(),   colon =
> col_double(),   duodenum = col_double(),   esophagus = col_double(),  
> liver = col_double(),   lymph.node = col_double(),   rectum =
> col_double(),   saliva.secreting.gland = col_double(),  
> small.intestine = col_double(),   spleen = col_double(),   stomach =
> col_double(),   tonsil = col_double(),   vermiform.appendix =
> col_double() ) Error in c(abundanceCol, countsCol, lengthCol) %in%
> names(raw) :    argument "abundanceCol" is missing, with no default

Can you help me with that?

ADD REPLY • link 6.5 years ago mschmidt ▴ 10

0

Entering edit mode

The tximport function is for reading in per-sample transcript quantification files from various sources.

So all you have is a TPM matrix? Do you know the number of mapped reads per sample? Our approach requires at least those numbers in addition to the TPM matrix, and then the scaledTPM approach is to scale up the TPM to the range of counts by multiplying each sample (so each column) by (# mapped reads for this sample)/1e6.

ADD REPLY • link 6.5 years ago Michael Love 43k

0

Entering edit mode

Yes, I have a TPM matrix. It has 45946 rows but with NAs (empty cells replaced later by NAs). the data columns have different number of NAs. I can split the matrix into separate files but I do not know the number of mapped reads per sample. I have Summary of the expression results for this experiment ready to view in R (E-MTAB-2836-atlasExperimentSummary.Rdata) however do not know how to access the data after downloading to R with

expdata <- get(load('E-MTAB-2836-atlasExperimentSummary.Rdata'))

There is also some information that might be useful:

Analysis Methods Pipeline version: iRAP 0.5.1p1 Analyzed Libraries: Paired-end only Filtering Step 1: Discard reads below minimum quality threshold Filtering Step 2: Check of bacterial contamination; discard offending reads Filtering Step 3: Discard reads with common uncalled characters (e.g. N) Filtering Step 4: Remove reads from pair-end libraries that were orphaned by filtering steps 1-3 Read Mapping: Against genome reference (Ensembl release: 79) tophat2 version: 2.0.12 Quantification: htseq2 version: 0.6.1p1 Normalized Counts per Gene: (FPKMs) are calculated from the raw counts by iRAP. These are averaged for each set of technical replicates, and then quantile normalized within each set of biological replicates using limma. Finally, they are averaged for all biological replicates (if any)

It is possible to access more data (even fastq files), it looks like the TPM matrix contain averaged data in each column from several experiments (>8, sometimes even 16 Illumina runs/replicates) and each run has different number of Read Count and Base Count.

ADD REPLY • link 6.5 years ago mschmidt ▴ 10

0

Entering edit mode

So you can skip the TPM matrix and just use the counts prepared for that dataset.

I downloaded that file, and then you can take a look:

> names(experimentSummary)
[1] "rnaseq"

So there is a thing called "rnaseq" that is in this experimentSummary object.

Next step would be to take a look inside:

> experimentSummary$rnaseq
class: RangedSummarizedExperiment
dim: 65217 200
metadata(4): pipeline filtering mapping quantification
assays(1): counts
rownames(65217): ENSG00000000003 ENSG00000000005 ... ENSG00000281921 ENSG00000281922
rowData names(0):
colnames(200): ERR315325 ERR315326 ... ERR579153 ERR579155
colData names(6): AtlasAssayGroup organism_part ... developmental_stage
  technical_replicate_group

If you look through this output, you can see that there is an assay called "counts".

Since it's the first assay, you can just call assay() after loading the SummarizedExperiment package.

Looks like counts to me:

> assay(experimentSummary$rnaseq)[1:5,1:5]
                ERR315325 ERR315326 ERR315327 ERR315328 ERR315329
ENSG00000000003       763       107       302        62        77
ENSG00000000005         3         0         0         8         2
ENSG00000000419       185        85        72       294       371
ENSG00000000457       174       101        63        59       437
ENSG00000000460        38        24         6        11       137

ADD REPLY • link 6.5 years ago Michael Love 43k

0

Entering edit mode

Dear Michael, that's much easier way! Thanks! It looks that I was unable to generate dds because I multiplied TPM matrix by 1e+6 what produced too much counts. Best regards, Marcin

ADD REPLY • link 6.5 years ago mschmidt ▴ 10