Question

readDGE in edgeR giving me path names as sample names

0

aa.machado001 • 0

@391c91c2

Last seen 11 weeks ago

United States

Hello all! Novice edgeR user here. I am analyzing RNA-seq data using an individual file for each sample (output from RNA STAR; tabular files with two columns, GeneID and counts) and was wondering if there is a better way to read my individual count files so that the path name does not show up as the sample name. It is starting to get annoying, and I am afraid it will interfere with downstream analysis.

Below is what I have. (I edited the results to shorten the very long and specific path names and groups.)


setwd("pathname")
f_files<- list.files("pathname", pattern = "Counts.tabular", full.names = T)
raw_counts<- readDGE(f_files)
ge_group <- c("1", "1", "1", "2", "2", "2")
dds <-DGEList(counts = raw_counts, group = ge_group)
dds$samples


                                  group lib.size norm.factors
"pathname/filenameSample1"_Counts     1 33721289            1
"pathname/filenameSample2"_Counts     1 56484335            1
"pathname/filenameSample3"_Counts     1 54540104            1
"pathname/filenameSample4"_Counts     2 50344281            1
"pathname/filenameSample5"_Counts     2 44695681            1
"pathname/filenameSample6"_Counts     2 47381894            1

sessionInfo( )
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Thanks!!

edgeR • 554 views

ADD COMMENT • link updated 3 months ago by Gordon Smyth 50k • written 3 months ago by aa.machado001 • 0

score 1 · Answer 1 · 2024-01-26

1

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 day ago

United States

The help page for readDGE says this:

Description:

     Reads and merges a set of text files containing gene expression
     counts.

Usage:

     readDGE(files, path=NULL, columns=c(1,2), group=NULL, labels=NULL, ...)

Arguments:

   files: character vector of filenames, or a data.frame of sample
          information containing a column called 'files'.

    path: character string giving the directory containing the files.
          Defaults to the current working directory.

You are passing the full path of each file in the files argument, rather than passing the bare file names in files and the directory in path. If you do the latter, the path won't appear in the sample names.
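A minimal sketch of that suggestion, under stated assumptions: the directory `count_dir`, the two sample files, and their counts are made up here only so the example runs on its own, and the `readDGE` call is skipped when edgeR is not installed. With real data you would point `count_dir` at your own folder and skip the file-creation step.

```r
# Build a throwaway directory with two tiny count files (hypothetical data,
# present only so the example is self-contained).
count_dir <- file.path(tempdir(), "counts_demo")
dir.create(count_dir, showWarnings = FALSE)
for (s in c("Sample1", "Sample2")) {
  tab <- data.frame(GeneID = c("geneA", "geneB", "geneC"),
                    Counts = c(10L, 0L, 25L))
  write.table(tab, file.path(count_dir, paste0(s, "_Counts.tabular")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}

# full.names = FALSE returns bare file names, without the directory.
f_files <- list.files(count_dir, pattern = "Counts\\.tabular$",
                      full.names = FALSE)

# Pass the directory separately through `path`, so it never becomes
# part of the sample names.
if (requireNamespace("edgeR", quietly = TRUE)) {
  dge <- edgeR::readDGE(f_files, path = count_dir, group = c("1", "2"))
  print(dge$samples)
}
```

The `labels` argument shown in the usage line above can also be used to set the sample names explicitly.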

ADD COMMENT • link 3 months ago James W. MacDonald 65k

1

Also see ?gsub
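As an illustration of the gsub route, here is a small sketch that strips the directory from names that already contain it. The paths are hypothetical stand-ins modeled on the shortened names in the question.

```r
# Hypothetical full paths, standing in for list.files(..., full.names = TRUE).
full_paths <- c("pathname/filenameSample1_Counts.tabular",
                "pathname/filenameSample2_Counts.tabular")

# Delete everything up to and including the last "/".
no_path <- gsub("^.*/", "", full_paths)
print(no_path)

# basename() gives the same result without a regular expression.
stopifnot(identical(no_path, basename(full_paths)))
```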

ADD REPLY • link 3 months ago James W. MacDonald 65k

0

Thanks! I was a bit apprehensive about typing in all the file names since I have 16 count files, which is why I tried to create a vector. After reading a bit more about list.files, I found that setting full.names = FALSE takes care of my path problem. But how can I further simplify the sample names? The current file names have the format "featureCounts_on_SampleName_Counts", and I would like to rename them with just the sample name. FYI, I did not create these count file names haha.

ADD REPLY • link 3 months ago aa.machado001 • 0

2

My go-to for that sort of thing is

fixedname <- sapply(strsplit(<bustednamesgohere>, split = "_"), "[", 3)

Which just splits the file name on the underscores and then returns the third thing (which in your case is 'SampleName').
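Spelled out step by step, with two hypothetical file names in the format described in the question (`busted` and `pieces` are illustrative names, not part of the original answer):

```r
# Hypothetical file names in the "featureCounts_on_SampleName_Counts" format.
busted <- c("featureCounts_on_SampleA_Counts",
            "featureCounts_on_SampleB_Counts")

# strsplit() returns a list with one character vector of pieces per name.
pieces <- strsplit(busted, split = "_", fixed = TRUE)

# "[" with index 3 picks the third piece out of each vector.
fixedname <- sapply(pieces, "[", 3)
print(fixedname)
```

This relies on the sample name always being the third piece, so it would truncate sample names that contain underscores themselves.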

ADD REPLY • link 3 months ago James W. MacDonald 65k

1

James' strsplit code provides a one-line solution in this case. But it is worth pointing out the limma/edgeR function removeExt, which is provided to simplify file names that have common extensions or suffixes. removeExt has the advantage of working even when the separator occurs a different number of times in different file names. removeExt also checks that the suffix is the same for every file name; if it is not, nothing is removed.

To remove the suffix _Counts, use

filename2 <- removeExt(filename, sep="_")

To remove the prefix featureCounts_, you could use

library(stringi)
filename2 <- stri_reverse(removeExt(stri_reverse(filename), sep="_"))

You may need to run removeExt several times to remove several suffixes and prefixes.
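A sketch of what repeated application might look like for names in the "featureCounts_on_SampleName_Counts" format. The file names are hypothetical, `remove_prefix` is an illustrative helper rather than a limma function, and the block only runs when both limma and stringi are installed. It assumes a limma version in which removeExt accepts the sep argument.

```r
# Hypothetical file names in the format described in the question.
filename <- c("featureCounts_on_SampleA_Counts",
              "featureCounts_on_SampleB_Counts")

if (requireNamespace("limma", quietly = TRUE) &&
    requireNamespace("stringi", quietly = TRUE)) {
  # Illustrative helper: reverse each name, strip the (now trailing)
  # common prefix with removeExt, then reverse back.
  remove_prefix <- function(x, sep = "_") {
    stringi::stri_reverse(
      limma::removeExt(stringi::stri_reverse(x), sep = sep)
    )
  }

  step1 <- limma::removeExt(filename, sep = "_")  # strip the common suffix
  step2 <- remove_prefix(step1)                   # strip the first prefix piece
  step3 <- remove_prefix(step2)                   # strip the second prefix piece
  print(step3)
}
```

Each call removes at most one separator-delimited piece, which is why the prefix takes two passes here.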

ADD REPLY • link 3 months ago Gordon Smyth 50k

0

This works beautifully for the extensions, but I can't get it to remove the prefixes :(

ADD REPLY • link 3 months ago aa.machado001 • 0

1

My apologies, I had used rev where I should have used stri_reverse. I've now edited my previous comment to be correct.

ADD REPLY • link 3 months ago Gordon Smyth 50k