Question

Do I have to do this step when creating a DGEList?

0

jscl1n22 • 0

@441fab0f

United Kingdom

Hi guys,

At the moment I have a data frame in which the gene IDs are in one column and the remaining columns are samples, with sample IDs as column names and expression values as observations.

Can I simply create a DGEList from this data frame? Or do I have to turn the gene IDs in that column into row names first?

If so, this is the code I ran:

wide_vgsID <- column_to_rownames(vgsID , "GENE.SYMBOL")

#then I got 

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ........

I used the duplicated() function to check, but the console returned FALSE.

Does anyone have any idea how to solve this? Thanks!

DifferentialExpression • 1.3k views

ADD COMMENT • link updated 14 months ago by Gordon Smyth 50k • written 14 months ago by jscl1n22 • 0

score 2 · Accepted Answer · 2023-02-25

2

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

No, there's no need for any steps like that.

All you need to do is to tell edgeR which columns of your data.frame are annotation and which columns hold read counts. You need to separate the counts from the annotation. Let's say the first two columns of the data.frame are annotation and the rest are counts. Then you would use:

j <- 1:2
Genes <- MyDataFrame[,j,drop=FALSE]
Counts <- MyDataFrame[,-j,drop=FALSE]
y <- DGEList(counts=Counts, genes=Genes)

If only the first column contains gene IDs then just set j <- 1. You can set j appropriately for any data.frame and run the above code.
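As a minimal sketch of the single-annotation-column case, the toy data.frame below uses made-up values and a hypothetical repeated symbol ("GeneB"). Because the symbols stay in an annotation column rather than becoming row names, repeated symbols do not trigger the "duplicate 'row.names'" error. The DGEList() call is guarded so the rest of the example runs even where edgeR is not installed.

```r
# Toy data.frame (made-up values): one annotation column, three count columns.
MyDataFrame <- data.frame(
  GENE.SYMBOL = c("GeneA", "GeneB", "GeneB", "GeneC"),  # "GeneB" is repeated on purpose
  S1 = c(10L, 0L, 5L, 200L),
  S2 = c(12L, 3L, 7L, 180L),
  S3 = c(8L, 1L, 6L, 220L)
)

j <- 1                                      # only the first column is annotation
Genes  <- MyDataFrame[, j, drop = FALSE]    # annotation stays a data.frame
Counts <- MyDataFrame[, -j, drop = FALSE]   # everything else is counts

# duplicated() returns one logical value per element, so summarise it
# with sum() or any() to see whether any symbol is repeated.
n_dup <- sum(duplicated(Genes$GENE.SYMBOL))

# Build the DGEList only if edgeR is available.
if (requireNamespace("edgeR", quietly = TRUE)) {
  y <- edgeR::DGEList(counts = Counts, genes = Genes)
  print(dim(y))
}
```

Here n_dup counts the repeated symbols, which is a quick way to confirm whether a column could be used as row names at all.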

I have long thought about trying to automatically guess which columns of the data.frame are supposed to be annotation, but people do input all sorts of things to DGEList() and I have been worried about guessing incorrectly. For example, the data.frame could contain a column of Entrez Gene IDs, which might be read in as integers and would therefore look like read counts.
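To illustrate why guessing is risky, here is a small sketch with a made-up data.frame (the column names EntrezID, Symbol, S1 and S2 are hypothetical). A naive "numeric columns are counts" rule would wrongly pick up the ID column.

```r
# Made-up example: Entrez-style IDs stored as integers next to real counts.
df <- data.frame(
  EntrezID = c(101L, 202L, 303L),            # annotation, but numeric
  Symbol   = c("GeneA", "GeneB", "GeneC"),   # annotation, character
  S1 = c(5L, 0L, 17L),                       # counts
  S2 = c(9L, 2L, 11L)                        # counts
)

# A naive guess: treat every numeric column as a count column.
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
print(numeric_cols)   # includes "EntrezID" as well as the sample columns
```

Setting j explicitly, as in the code above, avoids this ambiguity.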

ADD COMMENT • link 14 months ago Gordon Smyth 50k

0

thank you Gordon!

ADD REPLY • link 14 months ago jscl1n22 • 0

0

Hi Gordon,

May I have a follow up question?

My gene expression counts need transformation for sure. Should I do it before generating the list, i.e. log-transform the expression values and then generate the DGEList? Or can I do it after?

I also have blank spaces in the df. I assume I need to replace them with zero when I create the list?

Thanks

ADD REPLY • link 14 months ago jscl1n22 • 0

1

Sequence read counts are never transformed, either before or after creating the DGEList.

edgeR always works on actual counts, not on transformed quantities. Please see the edgeR User's Guide and associated documentation.
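As a rough sanity check before building a DGEList, one could test whether a table still looks like untransformed counts. The helper below, looks_like_counts(), is a hypothetical name and a heuristic of this sketch, not part of edgeR. It assumes raw counts are non-negative whole numbers with no missing values.

```r
# Hypothetical helper: TRUE if x contains only non-negative whole numbers
# and no missing values. This is a heuristic, not an edgeR requirement.
looks_like_counts <- function(x) {
  x <- as.matrix(x)
  is.numeric(x) && !anyNA(x) && all(x >= 0) && all(x == round(x))
}

raw    <- matrix(c(0, 12, 7, 150), nrow = 2)   # made-up raw counts
logged <- log2(raw + 1)                        # the same values after a log transform

print(looks_like_counts(raw))
print(looks_like_counts(logged))
```

A log-transformed table fails the check because the transformed values are no longer whole numbers, which is a hint that the input is not what DGEList() expects.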

Sorry, but I have no idea what you mean by "blank spaces". There are no data generation processes that I know of that produce blank spaces where data points should be. Please explain what your data actually represents because, on the face of it, it seems you might be trying to do an analysis that is not appropriate for your data.

ADD REPLY • link 14 months ago Gordon Smyth 50k

0

Thank you for clarifying the transformation part.

The df I'm working on was originally a long-format table like the one below:

	VALUE	Gene_Symbol	Sample_ID
	12253	BRCA	P1
	42356	CAMP	P2

Then, to generate the DGEList, I decided to convert it to wide format, which produced the table below:

	P1	P2	P3	P4
null	2423	46456	74564	523424
CAMP		42356
BRCA	12253		453	658665

Because some samples may not express a certain gene, those cells are left blank when I pivot to wide format. When I view() the df, they show as blank, but when I run summary() they show as NULL in the console.

Right now I am trying to use apply() to replace the blanks with 0, but with no luck: all values turned into 0.

I am still very new to this and would appreciate any advice.

ADD REPLY • link 14 months ago jscl1n22 • 0

0

Is this proteomics data?

Your data is not actually in standard wide format. It is rather a type of matrix-market format where the unobserved genes are omitted. IMO it is dangerous to use general-purpose pivot or reshape functions to convert this to wide format because those functions don't know what to do with the missing values. On the other hand, converting to wide format using base R is straightforward. First, read your long format table into R as a data.frame. I will call it LongForm. Then

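# Pull the three columns out of the long-format table
Value <- as.numeric(LongForm$VALUE)
Symbol <- as.character(LongForm$Gene_Symbol)
Sample <- as.character(LongForm$Sample_ID)
# One row per unique gene symbol, one column per unique sample
Symbols <- sort(unique(Symbol))
Samples <- sort(unique(Sample))
# Start from all zeros so unobserved gene/sample pairs stay at 0
Counts <- matrix(0,length(Symbols),length(Samples))
rownames(Counts) <- Symbols
colnames(Counts) <- Samples
# Fill in each observed value by row and column name
for (i in seq_along(Value)) Counts[Symbol[i],Sample[i]] <- Value[i]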
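
For a self-contained illustration, the sketch below rebuilds a small LongForm from three of the values quoted earlier in this thread and fills the matrix in one step. It swaps the loop for base R indexing by a two-column character matrix, which matches on row and column names. This is an alternative to the loop above, not the code from the answer itself, and it assumes each gene/sample pair appears at most once.

```r
# Small long-format table using values quoted in the thread.
LongForm <- data.frame(
  VALUE       = c(12253, 42356, 453),
  Gene_Symbol = c("BRCA", "CAMP", "BRCA"),
  Sample_ID   = c("P1", "P2", "P3")
)

Value  <- as.numeric(LongForm$VALUE)
Symbol <- as.character(LongForm$Gene_Symbol)
Sample <- as.character(LongForm$Sample_ID)
Symbols <- sort(unique(Symbol))
Samples <- sort(unique(Sample))

# All-zero matrix with gene symbols as row names and samples as column names.
Counts <- matrix(0, length(Symbols), length(Samples),
                 dimnames = list(Symbols, Samples))

# Each row of the two-column character matrix names one (row, column) cell.
Counts[cbind(Symbol, Sample)] <- Value

print(Counts)
```

Gene/sample pairs that never appear in LongForm keep their initial value of 0, so no blanks or NULLs arise.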

ADD REPLY • link 14 months ago Gordon Smyth 50k