Question

Problem converting first column to rownames

A ▴ 60

@a-14337

Last seen 2.1 years ago

United Kingdom

Hi all,

I am having a really frustrating problem reading in a CSV file (I hope this is an appropriate forum, as the data is for use directly with DESeq2).

I am reading in a CSV count matrix with gene names in the first column. I am trying to turn that first column into row names so that DESeqDataSetFromMatrix sees the first sample as the first column. The code I am executing is as follows:

countdata<-read.csv(file = "non-norm_counts.csv")

countdata1<-countdata[,-1]
rownames(countdata1)<-countdata1[, 1]

I get the following error: duplicate 'row.names' are not allowed

I have tried many ways of getting this to work, including solutions that other people have posted, but to no avail. Of course, when I now pass the data to DESeq2, the check ncol(countData) == nrow(colData) is not true.

I have read in the CSV previously and used the following code to simply remove the gene data: countdata[1:ncol(Deannacountdata)]

Ultimately, in the DESeq2 results, I would like the gene names next to their corresponding statistics (log2 fold change, p-value, adjusted p-value, etc.).

As it stands, after results(dds) and further downstream analysis, I have no idea which genes I am working with apart from the corresponding row number. It gets worse: I am also trying to use the DEGreport package (degPatterns function) to find clusters of gene expression over time. Because the gene name column is missing, the significant DE genes after the LRT do not map back to any gene names, so the package throws its own errors: it can't identify any significantly expressed genes in my list.

Someone save me! If any further information is needed, I will be happy to provide it.

Many thanks!

deseq2 countdata • 3.0k views

ADD COMMENT • link updated 7.9 years ago by Michael Love 43k • written 7.9 years ago by A ▴ 60

score 0 · Answer 1 · 2018-02-05

Michael Love 43k

@mikelove

Last seen 3 days ago

United States

Duplicate row.names are an issue because the gene names are used as identifiers. If two rows of countData have the same name, how can you pull out the correct row later with character indexing?

A quick fix is make.unique():

> make.unique(c("x","x","y","z"))
[1] "x"   "x.1" "y"   "z"

But you may also want to investigate the genes that are duplicated:

countdata[ duplicated(countdata[,1]), 1]
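As a minimal sketch of how these two steps might fit together, assuming a toy data frame in place of the real CSV (the `gene`, `sample1` and `sample2` columns and all values are made up for illustration): the names are taken from the first column before that column is dropped, which is where the code in the question goes wrong.

```r
# Toy stand-in for the CSV in the question; column names and counts are hypothetical.
countdata <- data.frame(
  gene    = c("x", "x", "y", "z"),
  sample1 = c(10, 3, 0, 7),
  sample2 = c(12, 5, 1, 9)
)

# as.character() guards against the first column having been read in as a factor.
genes <- as.character(countdata[, 1])

# Which gene names occur more than once?
dup_names <- unique(genes[duplicated(genes)])

# Drop the name column, keep only the counts, and attach unique row names.
counts <- as.matrix(countdata[, -1])
rownames(counts) <- make.unique(genes)
```

Here `counts` has one column per sample, so the number of columns can line up with the rows of the sample table, and make.unique() renames every duplicate in one call without any manual editing.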

ADD COMMENT • link 7.9 years ago Michael Love 43k

Just a quick note on the entries that are duplicated. They look as follows:

<NA> <NA> <NA> <NA> <NA> 37316 <NA> <NA>
[28] <NA> <NA> <NA> <NA> <NA> <NA> Fam205a2 <NA> Crybg3
[37] <NA> <NA> <NA> <NA> <NA> Pcdha11 Ccl27a <NA> <NA>
[46] <NA> <NA> Il11ra2 Il11ra2 Ccl27a <NA> <NA> <NA> Gm16701
[55] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
[64] <NA> <NA> <NA> <NA> <NA> Gm4430 <NA> <NA> <NA>
[73] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>

Two questions. First, is there a reason that the gene names appear in columns that should contain count data? Second, make.unique seems like a very laborious task given the number of duplicates present (my paste is just a small segment). Is there a way of assigning unique identifiers to all duplicates automatically, without changing them manually?

Many thanks!

ADD REPLY • link 7.9 years ago A ▴ 60

You should figure out what to do with the NAs. How are you going to report results for a gene with an ID of NA if it is differentially expressed according to DESeq2? One choice would be to remove these; another would be to figure out the problem upstream.
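For the first choice, a rough sketch of removing the unnamed rows might look like the following. The data frame is a made-up stand-in for the real table (only the gene symbols are borrowed from the output pasted above), and whether dropping these rows is appropriate depends on why the names are missing.

```r
# Hypothetical toy table; the counts are invented for illustration.
countdata <- data.frame(
  gene    = c("Il11ra2", NA, "Ccl27a", NA, "Il11ra2"),
  sample1 = c(5, 0, 8, 2, 6),
  sample2 = c(4, 1, 9, 3, 7),
  stringsAsFactors = FALSE
)

# Keep only rows whose gene name is neither NA nor an empty string.
has_name <- !is.na(countdata$gene) & countdata$gene != ""
named <- countdata[has_name, ]

# How many rows were dropped? Worth checking before going further.
n_dropped <- sum(!has_name)

# Build the count matrix; remaining duplicates get a numeric suffix.
counts <- as.matrix(named[, -1])
rownames(counts) <- make.unique(named$gene)
```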

Play around with make.unique() in your R session. And read the help ?make.unique.

I use this forum to point people down the right path, but you'll learn more by experimenting and reading documentation.