edgeR: topTags weird behavior - outputting NAs
1
0
Entering edit mode
venura • 0
@venura-22066
Last seen 7 weeks ago
United States

Hi,

I am using topTages to output DEs into a text file. I was using two datasets and one set with the same function was working fine while the other is throwing NAs for all non-significant accessions without filtering. Please see the two scenarios below;

Dataset 1

> summary(de <- decideTestsDGE(et, adjust.method="BH", p=0.05))
TuMV-Mock
Down          90
NotSig     18696
Up            87

> out <- topTags(et, n="Inf", adjust.method = "BH", sort.by = "PValue", p.value = 0.05)
> dim(out)
[1] 177   4


Dataset 2

> summary(de_Aphid <- decideTestsDGE(et_Aphid, adjust.method="BH", p=.05))
TuMV_Aphid-Mock_Aphid
Down                    2573
NotSig                 13432
Up                      2400
>
> out_Aphid <- topTags(et_Aphid, n="Inf", adjust.method = "BH", sort.by = "PValue", p.value = 0.05)
> dim(out_Aphid)
[1] 18405     4


When I export it to a file, it has the following at the end (for notsig 13432) ;

AT5G51140   -0.508549995231026  5.13973842760081    0.0134929828738446  0.0499373315489865
NA  NA  NA  NA  NA
NA.1    NA  NA  NA  NA
NA.2    NA  NA  NA  NA
NA.3    NA  NA  NA  NA


What am I doing wrong here? :| Can someone help?

edgeR • 298 views
0
Entering edit mode
@gordon-smyth
Last seen 22 minutes ago
WEHI, Melbourne, Australia

My best guess would be that you've introduced NAs by subsetting the top table inappropriately. Impossible to tell from the limited code you show though. edgeR itself does not allow NAs nor does it produce them in a top table.

0
Entering edit mode

It's soo strange! :( This is the full code.

#######################Aphid Study###################################

# make class labels
class_Aphid <- c( rep("Mock_Aphid",3), rep("TuMV_Aphid",3) )

# Check dimensions
dim(rawdata_Aphid)

# Require at least 1/6 of samples to have expressed count >= 10 #no need to run since its already ran above
sample_cutoff <- (1/6)
count_cutoff <- 10

#Define a function to calculate the fraction of values expressed above the count cutoff  #no need to run since its already ran above
getFE <- function(data,count_cutoff){
FE <- (sum(data >= count_cutoff)/length(data))
return(FE)
}

#Apply the function to all genes, and filter out genes not meeting the sample cutoff
fraction_expressed <- apply(rawdata_Aphid, 1, getFE, count_cutoff)
keep_aphid <- which(fraction_expressed >= sample_cutoff)
rawdata_Aphid <- rawdata_Aphid[keep_aphid,]

# Check dimensions again to see effect of filtering
dim(rawdata_Aphid)

#Running edgerR #no need to run since its already ran above
library(edgeR)

###extracting groups from target file

y_Aphid <- DGEList(counts = rawdata_Aphid, group = class_Aphid)

nrow(y_Aphid)

#TMM normalization

y_Aphid <- calcNormFactors(y_Aphid)
y_Aphid\$samples

# Estimate dispersion

y_Aphid <- estimateCommonDisp(y_Aphid, verbose=TRUE)
y_Aphid <- estimateTagwiseDisp(y_Aphid)

# Differential expressed genes test
et_Aphid <- exactTest(y_Aphid, pair = c("Mock_Aphid", "TuMV_Aphid"))

# Print top genes
topTags(et_Aphid)

# Print number of up/down significant genes at FDR = 0.05  significance level

out_Aphid <- topTags(et_Aphid, n="Inf", adjust.method = "BH", sort.by = "PValue", p.value = 0.05)
dim(out_Aphid)

write.table(out_Aphid, file = "TuMV_APhid_DE.txt", quote = FALSE, row.names = TRUE, sep = "\t")

1
Entering edit mode

I don't see any problems with your code. Your code is very old-fashioned, like edgeR code from >10 years ago, but not wrong.

I do not see any way that the code you give could produce the output that you show in your question. The values you show even have "NA" for row names, and edgeR does not allow that in any circumstances.

The output you show does occur if you attempt to access rows of the top table that don't actually exist. That is an R phenomenon and is external to edgeR. For example:

> library(edgeR)
> example(exactTest)
> tab <- topTags(de,n=Inf)
> tab[16:24,]
Comparison of groups:  2-1
logFC   logCPM    PValue FDR
19    0.17778712 13.28195 0.9138714   1
17   -0.14260440 13.54499 0.9282175   1
8    -0.09541957 14.04749 0.9491892   1
1    -0.06634249 13.63245 1.0000000   1
4    -0.06342538 13.68795 1.0000000   1
NA            NA       NA        NA  NA
NA.1          NA       NA        NA  NA
NA.2          NA       NA        NA  NA
NA.3          NA       NA        NA  NA


The NAs are not part of the edgeR object, but are shown by R because I have attempted to access rows of the top table that don't exist. The top table itself has only 20 rows but I have attempted to access rows 21:24, so R adds in NAs to represent the non-existent rows.

The dim(out_Aphid) output that you show in your question also seems impossible. Since you set p=0.05 in the topTable call, then out_Aphid should only have only 4973 rows, not 18405 as shown in your output.

I have to ask what version of R and what version of edgeR you are using. The latest version of edgeR is 3.38.0.

0
Entering edit mode

Thank you for the reply, Gordon. I adopted code from Griffith Lab (https://rnabio.org/) given I am still very new to R. Any advice (or learning material) to improve it will be greatly appreciated. I'm away from the office for the weekend. On Monday morning, I'll post the R and edgeR version information.

1
Entering edit mode

edgeR comes with a 120 page pdf User's Guide, which would be the natural place to start. If you have installed edgeR, then you already have the User's Guide on your own computer. Just type edgeRUsersGuide().

0
Entering edit mode

Thank you. It is very comprehensive. I already started using it. Your F1000 methods article "A guide to creating design matrices for gene expression experiments" is also extremely useful.

0
Entering edit mode

Hi Gordon Smyth , I was using edgeR ver 3.36.0 and R version 4.1.3.