Hello all! Novice edgeR user here. I am analyzing RNA-seq data using individual files for each sample (output from RNA STAR; tabular files with two columns, GeneID and counts) and was wondering if there is a better way to read my individual count data so that the path name does not show up as the sample name. It is starting to get annoying, and I am afraid it will interfere with downstream analysis.
Below is what I have. (*Results edited to shorten the very long and specific path names and groups.)
setwd("pathname")
f_files <- list.files("pathname", pattern = "Counts.tabular", full.names = TRUE)
raw_counts <- readDGE(f_files)
ge_group <- c("1", "1", "1", "2", "2", "2")
dds <- DGEList(counts = raw_counts, group = ge_group)
dds$samples
                                  group lib.size norm.factors
"pathname/filenameSample1"_Counts     1 33721289            1
"pathname/filenameSample2"_Counts     1 56484335            1
"pathname/filenameSample3"_Counts     1 54540104            1
"pathname/filenameSample4"_Counts     2 50344281            1
"pathname/filenameSample5"_Counts     2 44695681            1
"pathname/filenameSample6"_Counts     2 47381894            1
sessionInfo( )
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)
Thanks!!
Also see ?gsub
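As a minimal sketch of the gsub route (the paths below are made up for illustration and are not the real file names), basename() drops the directory part and gsub() strips the fixed text that is left:

```r
# Hypothetical paths standing in for the real ones
f_files <- c("some/long/path/Sample1_Counts.tabular",
             "some/long/path/Sample2_Counts.tabular")

# Drop the directory part of each path
short <- basename(f_files)

# Remove the common "_Counts.tabular" ending
short <- gsub("_Counts\\.tabular$", "", short)
short
```

The cleaned vector could then be assigned to the sample names of the count object, assuming the order of files has not changed.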
Thanks! I was a bit apprehensive about writing out all the file names since I have 16 count files, which is why I tried to create a vector. After reading a bit more about the list.files command, I found that setting full.names = FALSE takes care of my path problem. But how can I further simplify the sample names? The current file names have the format "featureCounts_on_SampleName_Counts", and I would like to rename each with just the sample name. FYI, I did not create these count file names haha.
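For what it is worth, readDGE itself takes `path` and `labels` arguments, so the directory and the display names can be kept separate from the file names. The sketch below is only an illustration: the temporary directory, the two toy count files and the regular expression are assumptions made up for the example, not the real data.

```r
library(edgeR)

# Hypothetical demo data: two tiny count files in a temporary directory
count_dir <- file.path(tempdir(), "counts_demo")
dir.create(count_dir, showWarnings = FALSE)
for (s in c("SampleA", "SampleB")) {
  write.table(
    data.frame(GeneID = c("g1", "g2", "g3"), Counts = c(10, 0, 5)),
    file = file.path(count_dir, paste0("featureCounts_on_", s, "_Counts.tabular")),
    sep = "\t", quote = FALSE, row.names = FALSE
  )
}

# Bare file names only; the directory is passed to readDGE via `path`
f_files <- list.files(count_dir, pattern = "Counts\\.tabular$", full.names = FALSE)

# Keep only the part between "featureCounts_on_" and "_Counts.tabular"
sample_names <- sub("^featureCounts_on_(.*)_Counts\\.tabular$", "\\1", f_files)

# `labels` sets the sample names, `group` the experimental groups
y <- readDGE(f_files, path = count_dir, labels = sample_names, group = c("1", "2"))
y$samples
```

Since readDGE already returns a DGEList, supplying `group` here may make a separate DGEList() call unnecessary.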
My go-to for that sort of thing is
Which just splits the file name on the underscores and then returns the third thing (which in your case is 'SampleName').
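A one-liner consistent with that description might look like the following. This is a sketch under assumptions, not necessarily the exact code from the answer, and the file names are hypothetical:

```r
# Hypothetical file names in the format described in the question
f_files <- c("featureCounts_on_SampleA_Counts.tabular",
             "featureCounts_on_SampleB_Counts.tabular")

# Split each name on "_" and keep the third piece
sample_names <- sapply(strsplit(f_files, "_"), `[`, 3)
sample_names
```

Note that this relies on the sample name always being the third piece, so it would truncate a sample name that itself contains an underscore.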
James' strsplit code provides a one-line solution in this case. But it is worth pointing out the limma/edgeR function removeExt, which is provided to simplify file names that have common extensions or suffixes. removeExt has the advantage of working even when the separator occurs an irregular number of times in different filenames. removeExt also checks that the suffixes are the same for every filename, otherwise it does not remove them.
To remove the suffix _Counts, use
To remove the prefix featureCounts_, you could use
You may need to run removeExt several times to remove several suffixes and prefixes.
This works beautifully for the extensions, but I can't get it to remove the prefixes :(
My apologies, I had used rev where I should have used stri_reverse. I've now edited my previous comment to be correct.
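To make the removeExt approach concrete, here is a hedged sketch rather than the exact code from the comment above. It assumes the limma and stringi packages are installed; the file names and the helper `strip_prefix` are hypothetical. Because removeExt only strips from the right-hand end, the prefix is handled by reversing each string, stripping, and reversing back:

```r
library(limma)    # provides removeExt()
library(stringi)  # provides stri_reverse()

# Hypothetical file names; note the second sample name contains an underscore
x <- c("featureCounts_on_SampleA_Counts", "featureCounts_on_Sample_B_Counts")

# Remove the common suffix "_Counts"
x <- removeExt(x, sep = "_")

# Hypothetical helper: reverse, strip the (now trailing) prefix, reverse back
strip_prefix <- function(s) stri_reverse(removeExt(stri_reverse(s), sep = "_"))

x <- strip_prefix(x)  # drops "featureCounts_"
x <- strip_prefix(x)  # drops "on_"
x
```

Base R's rev reverses the order of the elements in a vector rather than the characters within each string, which is why stri_reverse is the one needed here.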