Hi all,
I am having a really frustrating problem reading in a CSV file (I hope this is an appropriate forum, as it is for use directly with DESeq2!).
I am reading in a CSV count matrix with gene names in the first column. I am trying to turn that first column into row names so that the DESeqDataSetFromMatrix function will work straight from the first sample. The code I am executing is as follows:
countdata<-read.csv(file = "non-norm_counts.csv")
countdata1<-countdata[,-1]
rownames(countdata1)<-countdata1[, 1]
I get the following error: duplicate 'row.names' are not allowed
I have tried many ways of getting this to work, including solutions other people have posted, but to no avail. Of course, in the DESeq2 function, ncol(countData) == nrow(colData) is now not true.
I have read in the CSV previously and used the following code to simply remove the gene data: countdata[1:ncol(Deannacountdata)]
Ultimately, in the results from dds, I would like the gene names next to their corresponding statistics: log2 fold change, p-value, adjusted p-value, etc.
As it stands, after results(dds) and further downstream analysis, I have no idea which genes I am working with apart from the corresponding row number. It gets worse: I am trying to use the DEGreport package (degPatterns function) to find clusters of gene expression over time, but the significant DE genes after the LRT do not map back to any gene names because that column is missing. The package then throws its own errors, as it cannot identify any significantly expressed genes in my list.
Someone save me!!... If there is any further information, I will be happy to provide!
Many thanks!
Just a quick note on the data that are duplicated. It looks as follows:
<NA> <NA> <NA> <NA> <NA> 37316 <NA> <NA>
[28] <NA> <NA> <NA> <NA> <NA> <NA> Fam205a2 <NA> Crybg3
[37] <NA> <NA> <NA> <NA> <NA> Pcdha11 Ccl27a <NA> <NA>
[46] <NA> <NA> Il11ra2 Il11ra2 Ccl27a <NA> <NA> <NA> Gm16701
[55] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
[64] <NA> <NA> <NA> <NA> <NA> Gm4430 <NA> <NA> <NA>
[73] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
Two questions. First, is there a reason that the gene names appear in columns that should contain count data? Second, make.unique seems like a very laborious task given the number of duplicates present (my paste is just a small segment). Is there a way of assigning unique identifiers to all duplicates automatically, without changing them manually?
Many thanks!
You should figure out what to do with the NAs. How are you going to report results for a gene with an ID of NA if it is differentially expressed according to DESeq2? One choice would be to remove these; another would be to figure out the problem upstream.
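As a rough illustration of the first option, here is a minimal sketch of counting and dropping rows with a missing gene ID. The data frame and its column names (gene, sample1, sample2) are made up for the example and are not taken from the actual file.

# Toy stand-in for the count table; column names are hypothetical.
countdata <- data.frame(
  gene    = c("Pcdha11", NA, "Ccl27a", NA),
  sample1 = c(10, 0, 5, 3),
  sample2 = c(12, 1, 7, 2),
  stringsAsFactors = FALSE
)

# How many rows have no gene ID?
n_missing <- sum(is.na(countdata$gene))

# Keep only the rows with a usable ID.
countdata_clean <- countdata[!is.na(countdata$gene), ]

Checking n_missing first is worthwhile: if a large share of rows have no ID, that points to a problem upstream rather than a few stray rows.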
Play around with make.unique() in your R session. And read the help ?make.unique.
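For instance, a small experiment along these lines shows that make.unique() handles every duplicate in one call, so nothing has to be renamed by hand. The vector below is a toy example built from a few of the names in the paste above. The as.character() call is a precaution in case the column was read in as a factor.

# Toy vector with repeated gene names.
genes <- c("Il11ra2", "Il11ra2", "Ccl27a", "Ccl27a", "Gm16701")

# Repeats get a numeric suffix; the first occurrence is left unchanged.
unique_genes <- make.unique(as.character(genes))
print(unique_genes)
# [1] "Il11ra2"   "Il11ra2.1" "Ccl27a"    "Ccl27a.1"  "Gm16701"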
I use this forum to point people down the right path, but you'll learn more by experimenting and reading documentation.
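As a starting point for that experimenting, here is a sketch of the whole step under stated assumptions. One thing to note about the code in the question: after countdata[,-1] drops the gene column, countdata1[, 1] is the first sample's counts, not the gene names, so the row names need to come from the original first column. The inline CSV text, the sample names and the coldata table below are invented for the example.

# Inline text stands in for the real CSV file; all names here are made up.
csv_text <- "gene,s1,s2,s3
Il11ra2,4,0,9
Il11ra2,1,2,3
Ccl27a,7,7,1"

countdata <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Take the IDs from the ORIGINAL first column, then drop that column.
gene_ids <- make.unique(as.character(countdata[, 1]))
countmat <- as.matrix(countdata[, -1])
rownames(countmat) <- gene_ids

# Hypothetical sample table with one row per count column.
coldata <- data.frame(condition = c("a", "a", "b"),
                      row.names = colnames(countmat))

# The same condition DESeq2 complains about when it is not met.
stopifnot(ncol(countmat) == nrow(coldata))

With the gene column removed from the matrix and kept only as row names, the number of count columns matches the number of samples, and the row names should carry through to the results table.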
Many thanks! I will play around and see if I can resolve the issue.