My initial strategy was to keep only the protein-coding genes and filter out low-count genes before creating the dds object:
GD = read.delim("mart_export (1).txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
counts = counts %>% inner_join(unique(GD), by = c("Gene" = "symbol"))
counts_protein = counts %>% subset(rowSums(counts) > 10)
dds <- DESeqDataSetFromMatrix(counts_protein, colData, formula(~ sample_type))
dds <- estimateSizeFactors(dds)
I'm not sure whether I should estimate the size factors before or after removing the non-coding genes and filtering out low counts. Many genes have a rowSums of 0 (the majority of these are non-coding). Should these genes be included in the estimation, i.e. can they affect the estimated size factors?
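One way to probe this question empirically is a small base R sketch of the median-of-ratios idea that DESeq2's default size factor estimation is based on. This is a hand-written illustration, not DESeq2's own code: the function name `median_of_ratios` and the toy count matrix are hypothetical, and the sketch assumes that genes with a zero in any sample are dropped from the median because their geometric mean is zero.

```r
# Hypothetical toy data: 6 genes x 3 samples (values invented for illustration).
# The last two genes have zero counts in every sample.
counts <- matrix(
  c( 10,  20,  30,
    100, 210, 290,
      5,   9,  16,
     50,  95, 160,
      0,   0,   0,
      0,   0,   0),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), paste0("s", 1:3))
)

# Median-of-ratios size factors written out in base R (illustrative helper,
# not a DESeq2 function).
median_of_ratios <- function(m) {
  # Log geometric mean per gene; -Inf if the gene has a zero in any sample.
  log_geo_means <- rowMeans(log(m))
  apply(m, 2, function(col) {
    # Use only genes with a finite log geometric mean and a positive count.
    keep <- is.finite(log_geo_means) & col > 0
    exp(median((log(col) - log_geo_means)[keep]))
  })
}

sf_all     <- median_of_ratios(counts)                        # all-zero rows included
sf_nonzero <- median_of_ratios(counts[rowSums(counts) > 0, ]) # all-zero rows removed

# Compare the two sets of size factors.
all.equal(sf_all, sf_nonzero)
```

Under these assumptions the all-zero rows never enter the median, so the two sets of size factors agree in this toy case. Removing genes that do have non-zero counts in every sample (for example, expressed non-coding genes or genes just under a count threshold) changes the set of ratios the median is taken over, so in that case the size factors may shift, and it would be worth comparing them before and after filtering on the real data.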