please also include the results of running the following in an R session

Question

No counts, just fold change data

0

Entering edit mode

Manal • 0

@2cec7878

Last seen 3.5 years ago

United Kingdom

Enter the body of text here Hello,

I am trying to do secondary research on published data that lists only the fold change in genes between 2 stages for 2:3 different tissue types without any further analysis. I am trying to apply the protocol here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4934518/pdf/f1000research-5-9996.pdf

in order to connect the differentially expressed genes to pathways, filter what is relevant and further analyse the network.

I appreciate any advice on how to estimate the dispersion and continue the analysis, starting from the log change and not gene counts. I understand that edgeR from documentation specify that quantification of change is not important, and the ratio of change is what the analysis is based on. I doubt a silly work around of starting with arbitrary count for each gene and multiply for each group by the fold change would be any good.

Thank you in advance.

Code should be placed in three backticks as shown below


# include your problematic code here with any corresponding output 

xpressStage2<-data.frame(in[2:(in[["Nr_samples"]]+1)], row.names=in$"GeneName")
xpressStage3<-data.frame(in[(in[["Nr_samples"]]+2):length(in)], row.names=in$"GeneName")
names(xpressStage2) = c( "Whole_larvae", "Brain", "Muscle")
names(xpressStage3) = c( "Whole_larvae", "Brain", "Muscle")

#data frame with raw read counts from normal and tumor cells, here is the difference, this code is for read #counts, and I am trying to apply it for fold change per gene per group
xpress<- data.frame(xpressStage3, xpressStage2)
# DGEList object containing read counts, information about the samples and genes
dge <- DGEList(counts=xpress)
dge <- calcNormFactors(dge)

Tissue <- factor(c(colnames(xpressStage3), colnames(xpressStage2)))
Stage <- factor(c(rep("_3",length(xpressStage3)), rep("_2",length(xpressStage2))))

design <- model.matrix(~Tissue+Stage)

rownames(design) <- colnames(dge)

# estimate dispersion parameter for generalized linear model (glm), the code run fine, but make the majority # of genes as insignificant
dge <- estimateDisp(dge, design, robust=TRUE)

# fit glm
fit <- glmFit(dge, design)
# apply likelihoodratio test 
lrt <- glmLRT(fit)
cat ("lrt  coefficients \n\n")
head(lrt$coefficients)
# extract depth-normalized read counts (counts-per-million)
cpm <- cpm(dge)
normReads <- data.frame(cpm)
diffExpression <- as.data.frame(topTags(lrt, n=length(knime.in$"GeneName"))$table)
diffExpression <- diffExpression[sort(rownames(diffExpression)),]
out <- data.frame(cbind(normReads, diffExpression), row.names=rownames(diffExpression))

#        The output for Stage_3 , Down        33, NotSig    3080, Up          36
cat ("summary \n\n")
summary(decideTests(lrt))

please also include the results of running the following in an R session

sessionInfo( )

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_3.34.0 limma_3.48.0

loaded via a namespace (and not attached):
[1] compiler_4.1.0  Rcpp_1.0.6      splines_4.1.0   grid_4.1.0     
[5] locfit_1.5-9.4  statmod_1.4.36  lattice_0.20-44

EdgeR Estimation Dispersion • 1.5k views

ADD COMMENT • link 3.5 years ago Manal • 0

score 2 · Answer 1 · 2021-06-15

Assuming you've got logFC's for the (mostly) the same genes among your comparisons, and also that these stats are from all of the genes under test (ie. they aren't preselected to be just the significant ones) you can use a gene set enrichment analysis that is based on ranks – in this case, you would rank your genes based on their logFC.

You could then use a package like fgsea, or the limma::cameraPR() function, too (check the help in ?limma::cameraPR)

score 1 · Answer 2 · 2021-06-14

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 3 days ago

United States

If all you have are the logFC values from a publication, then you won't be able to do anything. Did they not submit the data to a public repository (GEO or similar)?

ADD COMMENT • link 3.5 years ago James W. MacDonald 67k

0

Entering edit mode

No, not submitted in any database, and not sure if it is LogFC, they only mentioned fold change above the threshold 1.5. The highest value is 269. They put the downregulated in tables, and the upregulated in other tables. I added negative sign to the downregulated to merge.

ADD REPLY • link 3.5 years ago Manal • 0

score 1 · Answer 3 · 2021-06-14

1

Entering edit mode

swbarnes2 ★ 1.4k

@swbarnes2-14086

Last seen 3 days ago

San Diego

in order to connect the differentially expressed genes to pathways, filter what is relevant and further analyse the network.

I think you can do this, but that is downstream of edgeR. EdgeR would get you fold changes, but since you have them, you probably don't need EdgeR.

You want pathway analysis tools.

ADD COMMENT • link 3.5 years ago swbarnes2 ★ 1.4k

0

Entering edit mode

Thank you for that. I did now. I am just hesitant if the values logFC logCPM LR PValue FDR coming out from EdgeR are the valuable, but the pathways linked seem relevant.

ADD REPLY • link 3.5 years ago Manal • 0