Question

DESeq2 - input data questions

0

Entering edit mode

Ming ▴ 380

@ming-yi-6363

Last seen 3.4 years ago

United States

Hi, Mike: I aksed you some questions about DESeq2 input data about a week or so as shown below. Then based on some other's experience and suggestions, I did use the rounded RSEM raw counts to run DESeq2, due to the dataset has matched tumors vs normals, I did try to use either or paired test to derive the DEGs. But I noticed a strange observation: For 7 genes of our interest, if I used paired test, I got the following: entrezID symbols gene_id baseMean log2FoldChange lfcSE 6440 6440 SFTPC SFTPC|6440 560695.51884 -4.309999 0.4811095 7476 7476 WNT7A WNT7A|7476 210.61484 -2.609732 0.3639806 8328 8328 GFI1B GFI1B|8328 11.25229 -2.445345 0.3040805 27255 27255 CNTN6 CNTN6|27255 297.51966 -5.049929 0.3276390 89780 89780 WNT3A WNT3A|89780 295.59498 -5.037872 0.2976513 256815 256815 C10orf67 C10orf67|256815 84.88613 -4.841887 0.2890447 388939 388939 C2orf71 C2orf71|388939 51.81894 -4.567162 0.3601929 pvalue padj 6440 3.292485e-19 5.363882e-18 7476 7.501115e-13 6.306398e-12 8328 8.854943e-16 1.023742e-14 27255 1.336693e-53 4.232359e-51 89780 2.922717e-64 2.206764e-61 256815 5.536370e-63 3.881589e-60 388939 7.655804e-37 6.679604e-35 However, if I used pooled test, I got the following: entrezID symbols gene_id baseMean log2FoldChange lfcSE 6440 6440 SFTPC SFTPC|6440 560695.51884 -3.654053 0.4720463 7476 7476 WNT7A WNT7A|7476 210.61484 -1.826788 0.3769339 8328 8328 GFI1B GFI1B|8328 11.25229 -1.822568 0.3423880 27255 27255 CNTN6 CNTN6|27255 297.51966 -4.103604 0.3509631 89780 89780 WNT3A WNT3A|89780 295.59498 -4.424951 0.3262553 256815 256815 C10orf67 C10orf67|256815 84.88613 -4.293802 0.3183049 388939 388939 C2orf71 C2orf71|388939 51.81894 -4.068668 0.3828477 pvalue padj 6440 NA NA 7476 NA NA 8328 NA NA 27255 NA NA 89780 NA NA 256815 NA NA 388939 NA NA Noticed that when I used paired test, the p-values of these genes look quite good, whereas if used pooled test, the pvalues are all NAs (most of other genes in dataset seem OK with valid p-values at least not NA values). almost all commands are the same for these two settings, except for the modeling step: Pooled test: dds <- DESeqDataSetFromMatrix(countData = countD,colData = colD,design = ~Type); Paired test: dds <- DESeqDataSetFromMatrix(countData = countD,colData = colD,design = ~Subject+Type); The Subject holds the sample names with matched tumor and normal samples. I did pull out the counts data for these genes, which seem not uncommon (i posted below FYI). T is tumors, N is normals,T4626_01A mactched with N4626_11A for the sample patient as 4626, and so on so forth. any idea why caused this, any issue with my data or bug(s) in the package? Here is the count data for these genes: entrezID T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A N6145_11A 6440 6440 938904 1271864 963 1642706 10488 1032959 7476 7476 117 252 53 177 8 260 8328 8328 4 30 1 19 2 9 27255 27255 399 425 6 461 8 473 89780 89780 276 500 23 310 10 547 256815 256815 53 175 3 98 4 141 388939 388939 7 227 3 202 3 116 T6146_01A N6146_11A T6776_01A N6776_11A T6777_01A N6777_11A T6744_01A 6440 46915 610409 39544 855546 5535 658155 52865 7476 65 207 2 253 2 191 81 8328 3 9 2 13 2 11 3 27255 81 329 3 722 27 351 47 89780 34 330 8 555 3 465 31 256815 7 70 1 199 2 53 5 388939 1 22 1 73 18 116 9 N6744_11A T5932_01A N5932_11A T5933_01A N5933_11A T5936_01A N5936_11A 6440 667035 21596 690368 11 880241 184 858491 7476 235 29 258 1 240 55 323 8328 7 1 5 0 10 0 6 27255 501 4 545 0 676 9 674 89780 361 9 457 0 409 1 565 256815 85 2 97 0 175 6 173 388939 98 4 84 0 31 0 106 T6828_01A N6828_11A T6836_01A N6836_11A T4632_01A N4632_11A T2662_01A 6440 44239 737756 68 529280 2859 485130 13 7476 6 257 126 192 2 94 3 8328 0 14 1 9 0 10 2 27255 12 626 0 413 4 176 2 89780 11 587 8 214 0 14 3 256815 4 217 3 143 1 16 3 388939 6 94 1 57 2 0 0 N2662_11A T6147_01A N6147_11A T6743_01A N6743_11A T6829_01A N6829_11A 6440 1526175 16149 892008 9150 1138397 55 824547 7476 794 4 281 3 317 20 174 8328 19 5 16 2 5 5 8 27255 591 4 288 7 313 232 327 89780 1173 50 710 12 402 1 320 256815 254 68 330 0 114 18 80 388939 50 4 130 1 77 0 326 T6831_01A N6831_11A T4627_01A N4627_11A T4490_01A N4490_11A T6595_01A 6440 294 306899 8082 708377 115 1293737 81 7476 0 179 239 240 209 211 511 8328 1 12 6 37 3 5 2 27255 3 418 6 392 2 430 2 89780 1 333 67 488 7 298 8 256815 3 122 0 89 0 174 8 388939 0 10 0 63 1 31 0 N6595_11A T6835_01A N6835_11A T2657_01A N2657_11A T4676_01A N4676_11A 6440 1145375 41647 367263 334072 1082719 42168 959756 7476 728 283 196 10 244 18 282 8328 84 7 7 18 23 2 21 27255 975 88 348 13 448 19 606 89780 1479 31 185 145 631 23 383 256815 310 12 124 1 77 6 146 388939 82 1 6 1 108 6 31 T2661_01A N2661_11A T4625_01A N4625_11A T6745_01A N6745_11A T2665_01A 6440 413649 844142 202136 2070763 76113 1441306 82 7476 34 510 137 387 323 360 247 8328 72 32 4 44 3 8 6 27255 24 718 11 509 19 484 5 89780 15 1021 18 531 14 398 8 256815 17 121 1 152 8 192 1 388939 4 90 2 267 12 51 13 N2665_11A T2655_01A N2655_11A 6440 2200556 278151 1687880 7476 572 117 1096 8328 12 2 51 27255 895 89 921 89780 998 65 1490 256815 167 18 220 388939 123 88 142 Thanks in advance! Ming From: michaelisaiahlove@gmail.com Date: Wed, 29 Jan 2014 13:01:40 -0500 Subject: Re: DESeq2 - input data questions To: yi02@hotmail.com CC: bioconductor@r-project.org hi Ming, We do not recommend using rounded estimated values of read counts with DESeq2 (although I found an email from myself to the list one year ago contradicting this in the case that someone had no access to the raw data)â. Counts of reads which are proportionally assigned to genes and then rounded can be a bad fit for distributions like the Negative Binomial (and Poisson). For example, this procedure could generate values arising from a distribution that has variance less than the mean. As a rule, and to help avoid erroneous results, users should produce a matrix containing integer counts of reads uniquely aligned to features. Mike On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote: Hi, Dear Michael: Sorry to directly jump in a quick question, since I saw you respond to some DESeq questions yesterday and I know you are the author of DESeq2 package. I run into some RNAseq data issue with edgeR package (see my post in BioC yesterday), since the data I used are not integer counts of genes, but RSEM processed raw_count from TCGA RNA-seq data, in fact, are not integers, although TCGA named it as raw_count (I showed some piece of such a file from TCGA as below). file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem. genes.results" its content looks below: gene_id raw_count scaled_estimate transcript_id 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2 4 ?|10357 218.79 1.933490e-05 uc010zzl.1 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1 6 ?|136542 0.00 0.000000e+00 uc011krn.1 I used the raw_count of this file for edgeR analysis and run into some warning messages which indicating potential issues, and had been pointed out by Dr. Smyth as you may have seen in my post yesterday. Reading the user guide from both DESeq and DESeq2, both packages mentioned the input was expected to be count data in a form of a matrix of integer values, not even (rounded) normalized counts, which would lead to nonsensical results. The TCGA data above apparently are not true raw counts but some kind of RSEM processed raw_counts. Although both RSEM authors and TCGA data staff all suggested the raw_count can be used for edgeR and DESeq (or DESeq2) analysis, because of the issues I run into in edgeR. I wonder what would be your opinion on this data for analysis in DESeq2? Any advice would be appreciated very much! Thanks and best Ming Yi NIH Maryland,USA On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote: Hi, Dear Michael: Sorry to directly jump in a quick question, since I saw you respond to some DESeq questions yesterday and I know you are the author of DESeq2 package. I run into some RNAseq data issue with edgeR package (see my post in BioC yesterday), since the data I used are not integer counts of genes, but RSEM processed raw_count from TCGA RNA-seq data, in fact, are not integers, although TCGA named it as raw_count (I showed some piece of such a file from TCGA as below). file name: "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem. genes.results" its content looks below: gene_id raw_count scaled_estimate transcript_id 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2 4 ?|10357 218.79 1.933490e-05 uc010zzl.1 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1 6 ?|136542 0.00 0.000000e+00 uc011krn.1 I used the raw_count of this file for edgeR analysis and run into some warning messages which indicating potential issues, and had been pointed out by Dr. Smyth as you may have seen in my post yesterday. Reading the user guide from both DESeq and DESeq2, both packages mentioned the input was expected to be count data in a form of a matrix of integer values, not even (rounded) normalized counts, which would lead to nonsensical results. The TCGA data above apparently are not true raw counts but some kind of RSEM processed raw_counts. Although both RSEM authors and TCGA data staff all suggested the raw_count can be used for edgeR and DESeq (or DESeq2) analysis, because of the issues I run into in edgeR. I wonder what would be your opinion on this data for analysis in DESeq2? Any advice would be appreciated very much! Thanks and best Ming Yi NIH Maryland,USA [[alternative HTML version deleted]]

RNASeq edgeR DESeq DESeq2 RNASeq edgeR DESeq DESeq2 • 5.6k views

ADD COMMENT • link updated 12.0 years ago by Michael Love 43k • written 12.0 years ago by Ming ▴ 380

score 0 · Answer 1 · 2014-02-10

hi Ming, >From the DESeq2 vignette, we have: "The results for particular genes can be set to NA, for either one of the > following reasons: > ... > 2. If a row contains a sample with an extreme count then the p-value and > adjusted p-value are set to NA. These outlier counts are detected by Cook’s > distance. Customization of this outlier filtering is described in Section > 3.3, along with a method for replacing outlier counts and refitting." Since the p-value and adjusted p-value are set to NA, and the base mean is not 0, you know that these genes were filtered based on the Cook's distances for the samples for these genes. See the practical and theoretical sections of the vignette as well as the manual page for ?results for more background. Cook's distance for a sample is the distance the model parameters would move if the sample were removed. From first glance, this makes sense too, in that you have shown genes with counts that have high variance: "...963 1642706 10488 1032959..." If the Cook's filtering is undesirable for your analysis you can disable it with: res <- results(dds, cooksCutoff=FALSE) Then these p-values will no longer be set to NA. So why did the genes get filtered by Cook's distance in the pooled analysis but not the paired? Cook's distance uses the residuals from the observed counts to the fitted values. When you include the subject variable, the counts have smaller residuals because more of the variance can be explained by the subject differences. In addition, DESeq2 requires 3 replicates per "cell", i.e., unique combination of covariates, in order to determine a count as an outlier based on Cook's distance. Mike On Mon, Feb 10, 2014 at 10:31 AM, Ming Yi <yi02@hotmail.com> wrote: > Hi, Mike: > > I aksed you some questions about DESeq2 input data about a week or so as > shown below. Then based on some other's experience and suggestions, I did > use the rounded RSEM raw counts to run DESeq2, due to the dataset has > matched tumors vs normals, I did try to use either or paired test to > derive the DEGs. But I noticed a strange observation: > > For 7 genes of our interest, if I used paired test, I got the following: > entrezID symbols gene_id baseMean log2FoldChange > lfcSE > 6440 6440 SFTPC SFTPC|6440 560695.51884 -4.309999 > 0.4811095 > 7476 7476 WNT7A WNT7A|7476 210.61484 -2.609732 > 0.3639806 > 8328 8328 GFI1B GFI1B|8328 11.25229 -2.445345 > 0.3040805 > 27255 27255 CNTN6 CNTN6|27255 297.51966 -5.049929 > 0.3276390 > 89780 89780 WNT3A WNT3A|89780 295.59498 -5.037872 > 0.2976513 > 256815 256815 C10orf67 C10orf67|256815 84.88613 -4.841887 > 0.2890447 > 388939 388939 C2orf71 C2orf71|388939 51.81894 -4.567162 > 0.3601929 > pvalue padj > 6440 3.292485e-19 5.363882e-18 > 7476 7.501115e-13 6.306398e-12 > 8328 8.854943e-16 1.023742e-14 > 27255 1.336693e-53 4.232359e-51 > 89780 2.922717e-64 2.206764e-61 > 256815 5.536370e-63 3.881589e-60 > 388939 7.655804e-37 6.679604e-35 > > However, if I used pooled test, I got the following: > entrezID symbols gene_id baseMean log2FoldChange > lfcSE > 6440 6440 SFTPC SFTPC|6440 560695.51884 -3.654053 > 0.4720463 > 7476 7476 WNT7A WNT7A|7476 210.61484 -1.826788 > 0.3769339 > 8328 8328 GFI1B GFI1B|8328 11.25229 -1.822568 > 0.3423880 > 27255 27255 CNTN6 CNTN6|27255 297.51966 -4.103604 > 0.3509631 > 89780 89780 WNT3A WNT3A|89780 295.59498 -4.424951 > 0.3262553 > 256815 256815 C10orf67 C10orf67|256815 84.88613 -4.293802 > 0.3183049 > 388939 388939 C2orf71 C2orf71|388939 51.81894 -4.068668 > 0.3828477 > pvalue padj > 6440 NA NA > 7476 NA NA > 8328 NA NA > 27255 NA NA > 89780 NA NA > 256815 NA NA > 388939 NA NA > > Noticed that when I used paired test, the p-values of these genes look > quite good, whereas if used pooled test, the pvalues are all NAs (most of > other genes in dataset seem OK with valid p-values at least not NA > values). almost all commands are the same for these two settings, except > for the modeling step: > > Pooled test: > dds <- DESeqDataSetFromMatrix(countData = countD,colData = colD,design = > ~Type); > > Paired test: > dds <- DESeqDataSetFromMatrix(countData = countD,colData = colD,design = > ~Subject+Type); > > The Subject holds the sample names with matched tumor and normal samples. > > I did pull out the counts data for these genes, which seem not uncommon (i > posted below FYI). T is tumors, N is normals,T4626_01A mactched with > N4626_11A for the sample patient as 4626, and so on so forth. any idea why > caused this, any issue with my data or bug(s) in the package? > > Here is the count data for these genes: > > entrezID T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A > N6145_11A > 6440 6440 938904 1271864 963 1642706 10488 1032959 > 7476 7476 117 252 53 177 8 260 > 8328 8328 4 30 1 19 2 9 > 27255 27255 399 425 6 461 8 473 > 89780 89780 276 500 23 310 10 547 > 256815 256815 53 175 3 98 4 141 > 388939 388939 7 227 3 202 3 116 > T6146_01A N6146_11A T6776_01A N6776_11A T6777_01A N6777_11A > T6744_01A > 6440 46915 610409 39544 855546 5535 658155 > 52865 > 7476 65 207 2 253 2 191 > 81 > 8328 3 9 2 13 2 11 > 3 > 27255 81 329 3 722 27 351 > 47 > 89780 34 330 8 555 3 465 > 31 > 256815 7 70 1 199 2 53 > 5 > 388939 1 22 1 73 18 116 > 9 > N6744_11A T5932_01A N5932_11A T5933_01A N5933_11A T5936_01A > N5936_11A > 6440 667035 21596 690368 11 880241 184 > 858491 > 7476 235 29 258 1 240 55 > 323 > 8328 7 1 5 0 10 0 > 6 > 27255 501 4 545 0 676 9 > 674 > 89780 361 9 457 0 409 1 > 565 > 256815 85 2 97 0 175 6 > 173 > 388939 98 4 84 0 31 0 > 106 > T6828_01A N6828_11A T6836_01A N6836_11A T4632_01A N4632_11A > T2662_01A > 6440 44239 737756 68 529280 2859 485130 > 13 > 7476 6 257 126 192 2 94 > 3 > 8328 0 14 1 9 0 10 > 2 > 27255 12 626 0 413 4 176 > 2 > 89780 11 587 8 214 0 14 > 3 > 256815 4 217 3 143 1 16 > 3 > 388939 6 94 1 57 2 0 > 0 > N2662_11A T6147_01A N6147_11A T6743_01A N6743_11A T6829_01A > N6829_11A > 6440 1526175 16149 892008 9150 1138397 55 > 824547 > 7476 794 4 281 3 317 20 > 174 > 8328 19 5 16 2 5 5 > 8 > 27255 591 4 288 7 313 232 > 327 > 89780 1173 50 710 12 402 1 > 320 > 256815 254 68 330 0 114 18 > 80 > 388939 50 4 130 1 77 0 > 326 > T6831_01A N6831_11A T4627_01A N4627_11A T4490_01A N4490_11A > T6595_01A > 6440 294 306899 8082 708377 115 1293737 > 81 > 7476 0 179 239 240 209 211 > 511 > 8328 1 12 6 37 3 5 > 2 > 27255 3 418 6 392 2 430 > 2 > 89780 1 333 67 488 7 298 > 8 > 256815 3 122 0 89 0 174 > 8 > 388939 0 10 0 63 1 31 > 0 > N6595_11A T6835_01A N6835_11A T2657_01A N2657_11A T4676_01A > N4676_11A > 6440 1145375 41647 367263 334072 1082719 42168 > 959756 > 7476 728 283 196 10 244 18 > 282 > 8328 84 7 7 18 23 2 > 21 > 27255 975 88 348 13 448 19 > 606 > 89780 1479 31 185 145 631 23 > 383 > 256815 310 12 124 1 77 6 > 146 > 388939 82 1 6 1 108 6 > 31 > T2661_01A N2661_11A T4625_01A N4625_11A T6745_01A N6745_11A > T2665_01A > 6440 413649 844142 202136 2070763 76113 1441306 > 82 > 7476 34 510 137 387 323 360 > 247 > 8328 72 32 4 44 3 8 > 6 > 27255 24 718 11 509 19 484 > 5 > 89780 15 1021 18 531 14 398 > 8 > 256815 17 121 1 152 8 192 > 1 > 388939 4 90 2 267 12 51 > 13 > N2665_11A T2655_01A N2655_11A > 6440 2200556 278151 1687880 > 7476 572 117 1096 > 8328 12 2 51 > 27255 895 89 921 > 89780 998 65 1490 > 256815 167 18 220 > 388939 123 88 142 > > > > Thanks in advance! > > Ming > > > > > ------------------------------ > From: michaelisaiahlove@gmail.com > Date: Wed, 29 Jan 2014 13:01:40 -0500 > Subject: Re: DESeq2 - input data questions > To: yi02@hotmail.com > CC: bioconductor@r-project.org > > hi Ming, > > We do not recommend using rounded estimated values of read counts with > DESeq2 (although I found an email from myself to the list one year ago > contradicting this in the case that someone had no access to the raw > data). Counts of reads which are proportionally assigned to genes and then > rounded can be a bad fit for distributions like the Negative Binomial (and > Poisson). For example, this procedure could generate values arising from a > distribution that has variance less than the mean. As a rule, and to help > avoid erroneous results, users should produce a matrix containing integer > counts of reads uniquely aligned to features. > > Mike > > > > > On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote: > > > Hi, Dear Michael: > > > > Sorry to directly jump in a quick question, since I saw you respond to > some DESeq questions yesterday and I know you are the author of DESeq2 > package. > > > > I run into some RNAseq data issue with edgeR package (see my post in BioC > yesterday), since the data I used are not integer counts of genes, but RSEM > processed raw_count from TCGA RNA-seq data, in fact, are not integers, > although TCGA named it as raw_count (I showed some piece of such a file > from TCGA as below). > > > > file name: > "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.genes.res ults" > > its content looks below: > > gene_id raw_count scaled_estimate transcript_id > > 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1 > > 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1 > > 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2 > > 4 ?|10357 218.79 1.933490e-05 uc010zzl.1 > > 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1 > > 6 ?|136542 0.00 0.000000e+00 uc011krn.1 > > > > I used the raw_count of this file for edgeR analysis and run into some > warning messages which indicating potential issues, and had been pointed > out by Dr. Smyth as you may have seen in my post yesterday. > > > > Reading the user guide from both DESeq and DESeq2, both packages mentioned > the input was expected to be count data in a form of a matrix of integer > values, not even (rounded) normalized counts, which would lead to > nonsensical results. > > > > The TCGA data above apparently are not true raw counts but some kind of > RSEM processed raw_counts. Although both RSEM authors and TCGA data staff > all suggested the raw_count can be used for edgeR and DESeq (or DESeq2) > analysis, because of the issues I run into in edgeR. I wonder what would be > your opinion on this data for analysis in DESeq2? > > > > Any advice would be appreciated very much! > > > > Thanks and best > > > > Ming Yi > > NIH > > Maryland,USA > > > > > On Wed, Jan 29, 2014 at 11:16 AM, Ming Yi <yi02@hotmail.com> wrote: > > > Hi, Dear Michael: > > > > Sorry to directly jump in a quick question, since I saw you respond to > some DESeq questions yesterday and I know you are the author of DESeq2 > package. > > > > I run into some RNAseq data issue with edgeR package (see my post in BioC > yesterday), since the data I used are not integer counts of genes, but RSEM > processed raw_count from TCGA RNA-seq data, in fact, are not integers, > although TCGA named it as raw_count (I showed some piece of such a file > from TCGA as below). > > > > file name: > "unc.edu.1b4bb160-191b-4796-8759-339b11fe386d.1096727.rsem.genes.res ults" > > its content looks below: > > gene_id raw_count scaled_estimate transcript_id > > 1 ?|100130426 0.00 0.000000e+00 uc011lsn.1 > > 2 ?|100133144 4.67 1.794813e-07 uc010unu.1,uc010uoa.1 > > 3 ?|100134869 15.33 4.271899e-07 uc002bgz.2,uc002bic.2 > > 4 ?|10357 218.79 1.933490e-05 uc010zzl.1 > > 5 ?|10431 1255.00 5.033411e-05 uc001jiu.2,uc010qhg.1 > > 6 ?|136542 0.00 0.000000e+00 uc011krn.1 > > > > I used the raw_count of this file for edgeR analysis and run into some > warning messages which indicating potential issues, and had been pointed > out by Dr. Smyth as you may have seen in my post yesterday. > > > > Reading the user guide from both DESeq and DESeq2, both packages mentioned > the input was expected to be count data in a form of a matrix of integer > values, not even (rounded) normalized counts, which would lead to > nonsensical results. > > > > The TCGA data above apparently are not true raw counts but some kind of > RSEM processed raw_counts. Although both RSEM authors and TCGA data staff > all suggested the raw_count can be used for edgeR and DESeq (or DESeq2) > analysis, because of the issues I run into in edgeR. I wonder what would be > your opinion on this data for analysis in DESeq2? > > > > Any advice would be appreciated very much! > > > > Thanks and best > > > > Ming Yi > > NIH > > Maryland,USA > > > [[alternative HTML version deleted]]