Question on RNA-Seq single sample data analysis in EdgeR
1
0
Entering edit mode
@mohammedtoufiq91-17679
Last seen 4 days ago
Qatar

Hi,

I have a question, we have generated pilot gene expression data with one sample per condition per subject. I am using edgeR package in R by following manual (2.12 What to do if you have no replicates). I have a control (baseline) sample followed by 5 different samples. I see that the below code tries to compare only two samples at a time (R Code: Two samples at a time - for instance; compares only Control vs Sample_1), and does not compare others; Control vs Sample_2, Control vs Sample_3, Control vs Sample_4, Control vs Sample_5). For this purpose, I am sub-setting the data accordingly to set only two samples at a time (both raw counts and sample metadata data.frame) then use the below code by repeating the same three times for each comparison and exporting the differential analysis results. I felt that is tedious process to do one at a time, hence, I tried to loop to consider and export data table for all comparisons, however, here the problems seems like the logcpm values are exported same values for all comparisons (which is not correct), but logFC and p-value columns are fine.

Basically, I would like to export the consolidated csv file from the object et$table or topTags for each comparison. For instance, Control vs Sample_1, Control vs Sample_2 till Control vs Sample_5. Is there a better way to do this? Also, is it necessary to run the calcNormFactors before performing the exactTest? print(Counts_Test) Control Sample_1 Sample_2 Sample_3 Sample_4 Sample_5 Gene1 0 0.0 0.0 0 0 0.0 Gene2 184 140.5 169.0 107 221 120.5 Gene3 60 64.0 45.0 67 44 45.0 Gene4 0 0.0 1.0 0 0 1.0 Gene5 7 4.0 3.0 5 1 1.0 Gene6 0 0.0 0.0 0 0 0.0 Gene7 87 83.0 122.0 99 139 100.0 Gene8 0 0.0 0.0 0 0 0.0 Gene9 0 1.0 0.0 0 0 0.0 Gene10 21 51.5 36.5 63 48 44.5 Gene11 193 199.0 179.0 178 222 202.0 Gene12 29 25.0 20.0 34 24 39.0 Gene13 0 0.0 0.0 0 1 1.0 Gene14 0 0.0 0.0 0 0 0.0 Gene15 3 5.0 1.0 2 5 3.0 Gene16 50 62.0 58.0 60 67 76.0 Gene17 0 0.0 0.0 0 0 0.0 Gene18 325 525.0 494.0 467 612 719.0 Gene19 442 407.0 570.0 283 451 681.0 print(Sample_Grouping) SampleID ID Control xx-xx-1551 Control Sample_1 xx-xx-1548 Sample_1 Sample_2 xx-xx-1549 Sample_2 Sample_3 xx-xx-1550 Sample_3 Sample_4 xx-xx-1552 Sample_4 Sample_5 xx-xx-0093 Sample_5 # R Code: Two samples at a time library(edgeR) bcv <- 0.4 y <- DGEList(counts=Counts_Test, group=Sample_Grouping$ID)
et <- exactTest(y, dispersion=bcv^2)
View(et$table) write.csv(et$table, file="./table_Control_vs_Sample_1", sep = ",")

# R Code: All samples comparisons at a time

sample_names <- c("Sample_1", "Sample_2", "Sample_3", "Sample_4", "Sample_5")
library(edgeR)
bcv <- 0.4
y <- DGEList(counts=Counts_Test, group=Sample_Grouping$ID) for(cur_name in sample_names){ et <- exactTest(y, pair=c("Control", cur_name), dispersion=bcv^2) if(cur_name=="Sample_1"){ # In the first iteration, capture the order geneOrder <- row.names(et$table)
}else{
# In the subsequent iterations, enforce the order
et$table <- et$table[geneOrder,]
}
# Now, you can write
write.csv(et$table, file=paste0("./table_Control_vs_",cur_name, ".csv")) } # The if/else statement will ensure the order is the same for all R edgeR limma RNASeq Normalization • 165 views ADD COMMENT 2 Entering edit mode @gordon-smyth Last seen 1 hour ago WEHI, Melbourne, Australia Yes, it is necessary to run calcNormFactors, as it always is for any DE analysis. It would also be sensible to filter non-expressed genes (again, that is done for any DE analysis), although in this case the only thing that will change after filtering is the FDR values. edgeR computes logCPM from all the samples. Computing logCPM in that way seems to me to be by far the simplest and most useful thing to do. This behaviour is clearly documented and it is correct. You have decided that you would rather have logCPM computed from two samples at a time. I do not know why that would be useful to you, but you know how to get it if that is what you want. You should use topTags rather than et$table because it is critically important to adjust for multiple testing, and et$table does not contain adjusted p-values. In other words, use a standard edgeR workflow. You do not need to "enforce the order" because edgeR preserves order for you. Just use topTags with n=Inf and sort="none". Overall, I don't quite see what is causing you problems. Are you trying to create a consolidated data.frame containing the results from all the comparison? edgeR won't do that for you automatically, but combining the individual topTags data.frames into one is one line of R code. For example: keep <- filterByExpr(y) y <- y[keep,,keep.lib.sizes=FALSE] y <- calcNormFactors(y) et.Sample1 <- exactTest(y,pair=c("Control","Sample_1"),dispersion=bcv^2) tt.Sample1 <- topTags(et.Sample1, n=Inf, sort="none") et.Sample2 <- exactTest(y,pair=c("Control","Sample_2"),dispersion=bcv^2) tt.Sample2 <- topTags(et.Sample2, n=Inf, sort="none") et.Sample3 <- exactTest(y,pair=c("Control","Sample_3"),dispersion=bcv^2) tt.Sample3 <- topTags(et.Sample3, n=Inf, sort="none") et.Sample4 <- exactTest(y,pair=c("Control","Sample_4"),dispersion=bcv^2) tt.Sample4 <- topTags(et.Sample4, n=Inf, sort="none") et.Sample5 <- exactTest(y,pair=c("Control","Sample_5"),dispersion=bcv^2) tt.Sample5 <- topTags(et.Sample5, n=Inf, sort="none") tt.Combined <- data.frame( Sample1 = tt.Sample1, Sample2 = tt.Sample2, Sample3 = tt.Sample3, Sample4 = tt.Sample4, Sample5 = tt.Sample5) ADD COMMENT 0 Entering edit mode Gordon Smyth Hi, thank you very much for the prompt response. Sure, I will create a DGEList object from a table of counts and samples > filter non-expressed genes > perform calcNormFactors > exactTest > topTags. Overall, I don't quite see what is causing you problems. Are you trying to create a consolidated data.frame containing the results from all the comparison? edgeR won't do that for you automatically, but combining the individual topTags data.frames into one is one line of R code. Yes, this was the expect consolidated data table I was looking. I have two related questions; 1. I have printed the consolidated the below. The logcpm column for all comparisons remains same. Is this known issue for single sample analysis? 2. I have currently used bcv <- 0.4 value to account for dispersion? Does this value vary from dataset to dataset w.r.t human samples? I tried to calculate dispersion using section "2.10.1 Estimating dispersions". It gives me below message: Using classic mode Warning message In estimateDisp.default(y = y$counts, design = design, group = group,: There is no replication, setting dispersion to NA

print(tt.Combined)
Sample1.logFC Sample1.logCPM Sample1.PValue Sample1.FDR Sample2.logFC Sample2.logCPM Sample2.PValue
Gene2    -0.38989465       16.57484      0.6503638   0.9652894 -0.0146515075       16.57484      0.9922805
Gene3     0.09186102       15.08978      0.9363024   0.9652894 -0.3062792049       15.08978      0.7545737
Gene7    -0.06886580       16.00445      0.9529300   0.9652894  0.5950878314       16.00445      0.4985985
Gene10    1.28805992       14.79434      0.1811577   0.9652894  0.9013943875       14.79434      0.3480947
Gene11    0.04307706       16.88583      0.9652894   0.9652894 -0.0006202095       16.88583      1.0000000
Gene12   -0.21419258       14.18460      0.8622259   0.9652894 -0.4258156601       14.18460      0.7041110
Gene16    0.30858296       15.25766      0.7463078   0.9652894  0.3213983917       15.25766      0.7376838
Gene18    0.69060367       18.28096      0.4140759   0.9652894  0.7118677774       18.28096      0.4003493
Gene19   -0.12004524       18.13221      0.8903737   0.9652894  0.4748174857       18.13221      0.5741195
Sample2.FDR Sample3.logFC Sample3.logCPM Sample3.PValue Sample3.FDR Sample4.logFC Sample4.logCPM
Gene2    1.0000000   -0.62251499       16.57484     0.47344787   0.8283621    0.08268399       16.57484
Gene3    0.9701662    0.31759408       15.08978     0.73632182   0.8283621   -0.62752875       15.08978
Gene7    0.9701662    0.34497002       16.00445     0.69208790   0.8283621    0.49382917       16.00445
Gene10   0.9701662    1.73765486       14.79434     0.06985061   0.6286555    1.00700196       14.79434
Gene11   1.0000000    0.04226987       16.88583     0.97068303   0.9706830    0.02034112       16.88583
Gene12   0.9701662    0.38696109       14.18460     0.68675158   0.8283621   -0.45247552       14.18460
Gene16   0.9701662    0.42109338       15.25766     0.65293957   0.8283621    0.24010942       15.25766
Gene18   0.9701662    0.68178431       18.28096     0.42337141   0.8283621    0.73127973       18.28096
Gene19   0.9701662   -0.48405076       18.13221     0.56916218   0.8283621   -0.15248176       18.13221
Sample4.PValue Sample4.FDR Sample5.logFC Sample5.logCPM Sample5.PValue Sample5.FDR
Gene2       0.9259435   0.9934196   -0.94134742       16.57484      0.2754197   0.9626476
Gene3       0.4881252   0.9934196   -0.74471178       15.08978      0.4192484   0.9626476
Gene7       0.5749516   0.9934196   -0.13038750       16.00445      0.8858180   0.9626476
Gene10      0.2847089   0.9934196    0.74886585       14.79434      0.4522755   0.9626476
Gene11      0.9934196   0.9934196   -0.26555215       16.88583      0.7625339   0.9626476
Gene12      0.6846765   0.9934196    0.09559433       14.18460      0.9626476   0.9626476
Gene16      0.7981067   0.9934196    0.27204863       15.25766      0.7691211   0.9626476
Gene18      0.3867184   0.9934196    0.81386635       18.28096      0.3371784   0.9626476
Gene19      0.8563209   0.9934196    0.29206836       18.13221      0.7304474   0.9626476
1
Entering edit mode

You questions are answered by the edgeR documentation.

I thought from your question that you already understood that you have no replication and therefore can't estimate the dispersion.