Question

DESeq2 colData creation from featureCounts and row.names length error

0


NGS_enthusiast • 0

@997c6d9a

Last seen 2.2 years ago

France

Hi all, I have two issues with DESeq2. The first is related to its use in Galaxy (I did not get an answer on the Galaxy forum, so I thought I would ask here). The second is about how to provide the colData information alongside the count matrix before running DESeq2 on featureCounts data.

1) I am using Galaxy with two factors (two batches, i.e. two distinct studies from the literature), each with three levels that are not shared between the factors. In the first batch I have non-treated, treated 1h, and a selected population treated 1h. In the second batch I have three selected populations that I think could contribute to the 1h treatment in the first batch/study. The first study has duplicates and the second has triplicates. I end up with the following error: "Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length Calls: rownames<- ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-". I tried changing the names of the factors and of the factor levels, and using duplicates everywhere, but it did not work. However, with only one factor and everything entered as factor levels, it works, although my batch effect is then not taken into account. Could you tell me where the error might lie, or at least what this row.names length error refers to?

2) I then tried to retrieve my featureCounts datasets from Galaxy so that I can run DESeq2 myself in R (I'm a beginner in R). I will merge my featureCounts files with join in the terminal, so that the gene names are in the first column and the counts for each replicate are in one column each, then import the table into R and convert it to a matrix. I read the Bioconductor documentation for DESeq2, but I'm not sure I understand how to create the colData that describes the factors. After some searching, I propose:

(condition <- factor(c(rep("cond1", 2), rep("cond2", 2), rep("cond3", 2), rep("cond4", 3), rep("cond5", 3), rep("cond6", 3))))
(batch <- factor(c(rep("batch1", 6), rep("batch2", 9))))
(coldata <- data.frame(row.names=colnames(countdata), condition, batch))
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition, batch)
dds <- DESeq(dds)

and then I can go on. Could you tell me if this is correct? Where can I find more explanation about how colData is attached to the count matrix? And if I have only one factor with its factor levels, what should I do: keep only the "condition" line?

Thanks for any help you can provide, and let me know if you need any more information. Best regards

DESeq2 • 2.0k views

ADD COMMENT • link updated 3.4 years ago by Michael Love 41k • written 3.4 years ago by NGS_enthusiast • 0

score 2 · Accepted Answer · 2020-12-08

2


Michael Love 41k

@mikelove

Last seen 1 day ago

United States

I'm not sure about the error in (1) without the underlying R code. Could you find a way to provide that? It looks like the wrong number of samples is being provided across the inputs.
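As an illustration of what the message itself refers to, here is a minimal base R sketch, assuming a mismatch between the number of rows in a sample table and the number of names assigned to it. The table, the sample names, and the sizes (4 rows, 13 names) are made up for the example; this is not the code of the Galaxy wrapper.

```r
# Hypothetical sample table with 4 samples (values are made up).
samples <- data.frame(condition = c("WT", "WT", "inj", "inj"))

# Hypothetical vector of 13 labels, more than there are rows.
labels <- paste0("sample", 1:13)

# Assigning row names of the wrong length raises the error seen in the thread;
# tryCatch() captures the message instead of stopping the script.
msg <- tryCatch(
  {
    rownames(samples) <- labels
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

If the sketch applies, the thing to check is whether every input describes the same set of samples, in the same number.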

For (2), yes, you can create a data.frame like so. Note that you can use e.g. rep(c("cond1","cond2"), c(2,2)), where you give each repeated value and then the number of repeats.
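A minimal sketch of that construction, using the replicate layout described in the question (2, 2, 2, 3, 3, 3 samples per condition). The count matrix, gene names, and sample names are simulated placeholders standing in for the joined featureCounts table. The last lines are an extra base R check, not part of the answer: in this layout each condition occurs in only one batch, so a design with both terms may not be estimable.

```r
# Each value is repeated the number of times given in the second argument.
condition <- factor(rep(paste0("cond", 1:6), c(2, 2, 2, 3, 3, 3)))
batch <- factor(rep(c("batch1", "batch2"), c(6, 9)))

# Hypothetical count matrix: 5 genes x 15 samples with made-up names.
set.seed(1)
countdata <- matrix(
  rpois(5 * 15, lambda = 50),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("sample", 1:15))
)

# One row of coldata per column of countdata, in the same order.
coldata <- data.frame(row.names = colnames(countdata), condition, batch)
stopifnot(identical(rownames(coldata), colnames(countdata)))

# Compare the number of coefficients with the rank of the design matrix.
design_matrix <- model.matrix(~ batch + condition, data = coldata)
ncol(design_matrix)
qr(design_matrix)$rank
```

The design argument of DESeqDataSetFromMatrix takes a single formula, so several variables are combined with `+` inside it rather than separated by commas. When the rank is smaller than the number of columns, as in this nested layout, DESeq2 is expected to report that the model matrix is not full rank.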

Or you can write a CSV file using a text editor and read it into R as a data.frame with read.csv.
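One possible layout for such a CSV, as a sketch: a header line, then one line per sample with the sample ID first and one column per factor. The column names and values below are assumptions for illustration. The file is written to a temporary path here only so that the example is self-contained.

```r
# Write a small, made-up sample sheet; in practice this file would be
# typed in a text editor or exported from a spreadsheet.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sample,condition,batch",
  "sample1,cond1,batch1",
  "sample2,cond1,batch1",
  "sample3,cond4,batch2",
  "sample4,cond4,batch2"
), csv_path)

# row.names = 1 uses the first column (sample IDs) as row names;
# stringsAsFactors = TRUE converts the remaining text columns to factors.
coldata <- read.csv(csv_path, row.names = 1, stringsAsFactors = TRUE)
str(coldata)
```

The row names of the resulting data.frame should match the column names of the count matrix, in the same order.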

ADD COMMENT • link 3.4 years ago Michael Love 41k

0


Hi, thanks a lot for your answer.

For 2), could I ask about the format of the CSV? I guess first the ID, then a second column with the first factor, then the second factor, etc.

For 1), I can tell you the structure. In Galaxy I created 2 "factors":

factorname1 => 3 levels (3 conditions of the first paper): FactorLevel1_WT, FactorLevel2_injured, FactorLevel3_celltype1_injured, with 2 replicates in each factor level

factorname2 => 3 levels (3 sorted cell types from another tissue that may contribute to cells in the injured condition): FactorLevel4_celltype2, FactorLevel5_celltype3, FactorLevel6_celltype4, with 3 replicates each

Putting everything under one unique factor works fine. (This example does not have the 3rd level in factor one, but I also tried with it, and also tried using duplicates only for factor 2.) I also tried a different structure, where factor one is "condition" and factor two is "batch", but then I get a duplicates error because the same featureCounts files (all of them) appear in both factors.

I can attach a screenshot if that helps. The bug report tells me this:

Rscript '/cvmfs/main.galaxyproject.org/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/deseq2/71bacea10eee/deseq2/deseq2.R' --cores ${GALAXY_SLOTS:-1} -o '/galaxy-repl/main/files/048/404/dataset_48404416.dat' -p '/galaxy-repl/main/files/048/404/dataset_48404417.dat'                                     -H  -f '[["FactorName1", [{"FactorLevel2_inj": ["/galaxy-repl/main/files/047/664/dataset_47664221.dat", "/galaxy-repl/main/files/047/664/dataset_47664223.dat"]}, {"FactorLevel1_WT": ["/galaxy-repl/main/files/047/553/dataset_47553086.dat", "/galaxy-repl/main/files/047/553/dataset_47553088.dat"]}]], ["FactorName2", [{"FactorLevel6_CD142": ["/galaxy-repl/main/files/048/247/dataset_48247834.dat", "/galaxy-repl/main/files/048/247/dataset_48247845.dat", "/galaxy-repl/main/files/048/247/dataset_48247890.dat"]}, {"FactorLevel5_ICAM1": ["/galaxy-repl/main/files/048/243/dataset_48243217.dat", "/galaxy-repl/main/files/048/247/dataset_48247822.dat", "/galaxy-repl/main/files/048/247/dataset_48247840.dat"]}, {"FactorLevel4_DPP4": ["/galaxy-repl/main/files/048/247/dataset_48247894.dat", "/galaxy-repl/main/files/048/247/dataset_48247901.dat", "/galaxy-repl/main/files/048/247/dataset_48247903.dat"]}]]]' -l '{"dataset_47553086.dat": "Counts_WT_Malecova_Rep1", "dataset_47553088.dat": "Counts_WT_Malecova_rep2", "dataset_47664221.dat": "Counts_Inj_d1_rep1", "dataset_47664223.dat": "Counts_Inj_d1", "dataset_48247894.dat": "featureCounts on data 234 and data 515: Counts", "dataset_48247901.dat": "featureCounts on data 234 and data 516: Counts", "dataset_48247903.dat": "featureCounts on data 234 and data 517: Counts", "dataset_48243217.dat": "featureCounts on data 234 and data 456: Counts", "dataset_48247822.dat": "featureCounts on data 234 and data 497: Counts", "dataset_48247840.dat": "featureCounts on data 234 and data 499: Counts", "dataset_48247834.dat": "featureCounts on data 234 and data 498: Counts", "dataset_48247845.dat": "featureCounts on data 234 and data 500: Counts", 
"dataset_48247890.dat": "featureCounts on data 234 and data 514: Counts"}' -t 1

stderr

Error in `.rowNamesDF<-`(x, value = value) : invalid 'row.names' length
Calls: rownames<- ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-

Thank you again for your help.

ADD REPLY • link 3.4 years ago NGS_enthusiast • 0

0


Re: the format of the CSV, this is basic R input; I'd look through the online R guides on how to read CSV data into R. Also, if you feel more comfortable doing this with data.frame and factor, go ahead.

I won't be able to debug the Galaxy bit, sorry, due to time pressure. It just may not be possible to do all types of analyses within the Galaxy plugin.

ADD REPLY • link 3.4 years ago Michael Love 41k