Do I have to do this step when creating a DGEList?
jscl1n22 • 0
@441fab0f
United Kingdom

Hi guys,

At the moment I have a data frame in which the gene IDs are in one column; the other columns are named by sample ID and hold the expression values.

Can I simply create a DGEList from this data frame? Or do I have to turn the gene IDs in that column into row names first?

If so, I tried the code below:

wide_vgsID <- column_to_rownames(vgsID , "GENE.SYMBOL")

#then I got 

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ........

I have used the duplicated() function to check, but the console returns FALSE.

Does anyone have any idea how to solve this? Thanks!

DifferentialExpression • 1.9k views
@gordon-smyth
WEHI, Melbourne, Australia

No, there's no need for any steps like that.

All you need to do is to tell edgeR which columns of your data.frame are annotation and which columns hold read counts. You need to separate the counts from the annotation. Let's say the first two columns of the data.frame are annotation and the rest are counts. Then you would use:

j <- 1:2
Genes <- MyDataFrame[,j,drop=FALSE]
Counts <- MyDataFrame[,-j,drop=FALSE]
y <- DGEList(counts=Counts, genes=Genes)

If only the first column contains gene IDs then just set j <- 1. You can set j appropriately for any data.frame and run the above code.
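As a small illustration of the split described above, here is a sketch on a made-up data.frame (the name `toy` and all of its values are hypothetical, not taken from the original post). The split itself uses only base R, and the `DGEList()` call runs only if edgeR is installed. Because the gene symbols stay in an ordinary annotation column rather than becoming row names, repeated symbols should not trigger the duplicate 'row.names' error.

```r
# Hypothetical toy data: one annotation column followed by count columns.
toy <- data.frame(
  GENE.SYMBOL = c("BRCA", "CAMP", "CAMP"),  # note the repeated symbol
  P1 = c(12253, 0, 7),
  P2 = c(453, 42356, 0),
  stringsAsFactors = FALSE
)

j <- 1                              # only the first column is annotation
Genes  <- toy[, j, drop = FALSE]    # stays a data.frame of annotation
Counts <- toy[, -j, drop = FALSE]   # remaining columns hold the counts

# Sanity check: every count column should be numeric.
stopifnot(all(vapply(Counts, is.numeric, logical(1))))

# Build the DGEList only when edgeR is available.
if (requireNamespace("edgeR", quietly = TRUE)) {
  y <- edgeR::DGEList(counts = Counts, genes = Genes)
  print(dim(y))
}
```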

I have long thought about trying to automatically guess which columns of the data.frame are supposed to be annotation, but people do input all sorts of things to DGEList() and I have been worried about guessing incorrectly. For example, the data.frame could contain a column of Entrez Gene IDs, which might be read in as integers and would therefore look like read counts.
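To make that concern concrete, here is a hypothetical sketch (the data.frame `df`, the gene labels, and the ID values are all made up for illustration). It shows that a naive rule such as "numeric columns are counts" would also pick up an integer ID column, which is why stating the annotation columns explicitly is the safer route.

```r
# Hypothetical data: an integer ID column, a symbol column, two count columns.
df <- data.frame(
  EntrezID = c(101L, 202L, 303L),          # made-up integer IDs
  Symbol   = c("geneA", "geneB", "geneC"),
  P1       = c(12253L, 0L, 310L),
  P2       = c(453L, 42356L, 95L),
  stringsAsFactors = FALSE
)

# A naive guess: treat every numeric column as read counts.
looks_like_counts <- vapply(df, is.numeric, logical(1))
print(looks_like_counts)   # EntrezID is flagged as numeric too

# Stating the annotation columns explicitly avoids the ambiguity.
j <- 1:2
Genes  <- df[, j, drop = FALSE]
Counts <- df[, -j, drop = FALSE]
```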


Thank you, Gordon!


Hi Gordon,

May I ask a follow-up question?

My gene expression counts need transformation for sure. Should I do it before generating the list, i.e. log-transform the expression values and then generate the DGEList? Or can I do it after?

Also, I have blank spaces in the df. I assume I need to replace them with zeros when I create the list?

Thanks


Sequence read counts are never transformed, either before or after creating the DGEList.

edgeR always works on actual counts, not on transformed quantities. Please see the edgeR User's Guide and associated documentation.

Sorry, but I have no idea what you mean by "blank spaces". There are no data generation processes that I know of that produce blank spaces where data points should be. Please explain what your data actually represents because, on the face of it, it seems you might be trying to do an analysis that is not appropriate for your data.


Thank you for clarifying the transformation part.

The df I'm working on is originally a long-format table like the one below:

VALUE Gene_Symbol Sample_ID
12253 BRCA P1
42356 CAMP P2

Then, to generate the DGEList, I decided to convert it into wide format, which gave the table below:

P1 P2 P3 P4
null 2423 46456 74564 523424
CAMP 42356
BRCA 12253 453 658665

Because some samples may not express a certain gene, those cells are left blank when I pivot the table to wide format. When I view() the df, they show as blank, but when I run summary() they show as NULL in the console.

Right now I am trying to use apply() to replace the blanks with 0, but with no luck: all values turned into 0.

I am still very new to this and would appreciate any advice.


Is this proteomics data?

Your data is not actually in standard wide format. It is rather a type of matrix-market format where the unobserved genes are omitted. IMO it is dangerous to use general-purpose pivot or reshape functions to convert this to wide format because those functions don't know what to do with the missing values. On the other hand, converting to wide format using base R is straightforward. First, read your long format table into R as a data.frame. I will call it LongForm. Then

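# Pull the three long-format columns out as plain vectors
Value <- as.numeric(LongForm$VALUE)
Symbol <- as.character(LongForm$Gene_Symbol)
Sample <- as.character(LongForm$Sample_ID)
# One row per gene symbol, one column per sample, initialised to zero
Symbols <- sort(unique(Symbol))
Samples <- sort(unique(Sample))
Counts <- matrix(0,length(Symbols),length(Samples))
rownames(Counts) <- Symbols
colnames(Counts) <- Samples
# Fill in the observed values; unobserved gene/sample pairs stay 0
for (i in seq_along(Value)) Counts[Symbol[i],Sample[i]] <- Value[i]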
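The same conversion can also be sketched without an explicit loop. The example below is only an illustration on made-up values (the contents of `LongForm` here are hypothetical). It relies on base R's ability to index a matrix with a two-column character matrix of row and column names, and it first checks that no gene/sample pair is repeated, since repeated pairs would silently overwrite each other.

```r
# Hypothetical long-format table in the same layout as above.
LongForm <- data.frame(
  VALUE       = c(12253, 42356, 453),
  Gene_Symbol = c("BRCA", "CAMP", "BRCA"),
  Sample_ID   = c("P1", "P2", "P3"),
  stringsAsFactors = FALSE
)

Value  <- as.numeric(LongForm$VALUE)
Symbol <- as.character(LongForm$Gene_Symbol)
Sample <- as.character(LongForm$Sample_ID)

# Each gene/sample pair should occur only once.
stopifnot(!any(duplicated(data.frame(Symbol, Sample))))

Symbols <- sort(unique(Symbol))
Samples <- sort(unique(Sample))
Counts <- matrix(0, length(Symbols), length(Samples),
                 dimnames = list(Symbols, Samples))

# Fill all observed cells at once: each row of the character matrix is
# matched against the row names and column names of Counts.
Counts[cbind(Symbol, Sample)] <- Value
print(Counts)
```

The resulting `Counts` matrix has gene symbols as row names and zeros wherever a gene was not reported for a sample, so it can be passed to `DGEList()` as the `counts` argument.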