differential expression- edgeR

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

I am new to R and edgeR. I am trying to follow to example 9. Case Study: deep-sequenced short tags from the edgeR manual. I download the data from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105 by clicking the full table and then saving webpage as .txt file containing First 4 rows as below folled by fifth which contains the tag seq and count. #SEQUENCE = #COUNT = #TPM = tags per million SEQUENCE COUNT TPM CATCGCCAGCGGGCACC 1 0.37 Now we I follow the steps of 9.3 reading data and creating the DGElist: After running < d<- readDGE(targets, skip = 5, comment.char = "#"), I get Error in taglist[[i]] : subscript out of bounds Can anyone please help, how I can solve this issue. Best regards, Ashutosh -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.4.2 limma_3.18.7 -- Sent via the guest posting facility at bioconductor.org.

edgeR edgeR • 2.0k views

ADD COMMENT • link updated 10.3 years ago by Mark Robinson ▴ 880 • written 10.3 years ago by Guest User ★ 13k

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 5.5 years ago

Dear Ashutosh, I wasn't able to (nicely) get exactly the same files from GEO, but here is an alternative: See: http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-71/samples/ f <- "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTA B/E-MTAB-71/E-MTAB-71.raw.1.zip" bf <- basename(f) download.file(f, bf) # download ZIP from ArrayExpress unzip(bf) # should put the 8 TXT files in current dir tg <- dir(".","^DLCK|WT") library("edgeR") counts <- readDGE(tg)$counts counts[is.na(counts)] <- 0 grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1]) d <- DGEList(counts=counts, group=grp) This should return: > d An object of class "DGEList" $counts DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3 WT_4 WT_6 AAAAAAAAAAAAAAAAA 22653 3059 1366 6574 7782 35096 6623 9633 AAAAAAAAAAAAAAAAC 82 51 55 93 412 134 335 519 AAAAAAAAAAAAAAAAG 2 3 7 9 59 5 45 84 AAAAAAAAAAAAAAAAT 118 471 359 717 1842 94 2465 3311 AAAAAAAAAAAAAAACA 67 4 4 12 17 108 12 21 844311 more rows ... $samples group lib.size norm.factors DLCK.TG_1 DLCK.TG 651172 1 DLCK.TG_2 DLCK.TG 2685418 1 DLCK.TG_3 DLCK.TG 3202246 1 DLCK.TG_4 DLCK.TG 2460753 1 WT_1 WT 3142262 1 WT_3 WT 294909 1 WT_4 WT 3517977 1 WT_6 WT 3558260 1 [? proceed from here ?] Alternatively, from your original code, I think you want 'skip=1' and is it possible that you have an extra unwanted file in your 'targets' list ? Anyways, hope that helps. Best, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://ow.ly/riRea On 23.12.2013, at 16:50, Ashutosh [guest] <guest at="" bioconductor.org=""> wrote: > > I am new to R and edgeR. I am trying to follow to example > 9. Case Study: deep-sequenced short tags from the edgeR manual. > > I download the data from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105 > by clicking the full table and then saving webpage as .txt file containing > > First 4 rows as below folled by fifth which contains the tag seq and count. > > #SEQUENCE = > #COUNT = > #TPM = tags per million > SEQUENCE COUNT TPM > CATCGCCAGCGGGCACC 1 0.37 > > Now we I follow the steps of 9.3 reading data and creating the DGElist: After running > < d<- readDGE(targets, skip = 5, comment.char = "#"), I get > Error in taglist[[i]] : subscript out of bounds > > Can anyone please help, how I can solve this issue. > > Best regards, > Ashutosh > > > -- output of sessionInfo(): > > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.4.2 limma_3.18.7 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 10.3 years ago Mark Robinson ▴ 880

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 5.5 years ago

Hi Ashutosh, Please keep these exchanges on the BioC mailing list. I've cc'd the list here. > But when I exactly follow the script from edgeR manual on the data and on making DGE list, I get library size as NA. This depends on how the data is stored. readDGE() expects read counts where absence of a tag id in a single sample means a 0 count for that sample, while ArrayExpress appears to store 0s as NAs (with aligned ids across the dataset). So, I believe this is why your library sizes come out as NA (summing over columns with NAs present). As you see below, that's why I added: counts[is.na(counts)] <- 0 before I created the 'DGEList'. > Also can you please explain or give reference to function "grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the moment, I just copied into the terminal but I want to know what it signifies? It's just a way to convert this vector: > colnames(counts) [1] "DLCK.TG_1" "DLCK.TG_2" "DLCK.TG_3" "DLCK.TG_4" "WT_1" "WT_3" [7] "WT_4" "WT_6" into this: > grp DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3 WT_4 WT_6 "DLCK.TG" "DLCK.TG" "DLCK.TG" "DLCK.TG" "WT" "WT" "WT" "WT" There are many other ways to do this, including: grp <- gsub("_[0-9]","",colnames(counts)) but, these are typically customized to the dataset (and filename structure) on hand. Best, Mark On 26.12.2013, at 11:02, "Dhingra, Ashutosh" <a.dhingra at="" vumc.nl=""> wrote: > Dear Mark, > > Thanks a lot for your kind help! It works with your script. I was also able to find a error in the target list because of which I was getting out of bound error. > > > But when I exactly follow the script from edgeR manual on the data and on making DGE list, I get library size as NA. > > Also can you please explain or give reference to function "grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the moment, I just copied into the terminal but I want to know what it signifies? > > > > Best regards, > Ashutosh > ________________________________________ > From: Mark Robinson [mark.robinson at imls.uzh.ch] > Sent: 23 December 2013 17:35 > To: Ashutosh [guest] > Cc: bioconductor at r-project.org; Dhingra, Ashutosh > Subject: Re: [BioC] differential expression- edgeR > > Dear Ashutosh, > > I wasn't able to (nicely) get exactly the same files from GEO, but here is an alternative: > > See: > http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-71/samples/ > > f <- "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/M TAB/E-MTAB-71/E-MTAB-71.raw.1.zip" > bf <- basename(f) > download.file(f, bf) # download ZIP from ArrayExpress > > unzip(bf) # should put the 8 TXT files in current dir > > tg <- dir(".","^DLCK|WT") > > library("edgeR") > counts <- readDGE(tg)$counts > counts[is.na(counts)] <- 0 > > grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1]) > d <- DGEList(counts=counts, group=grp) > > This should return: > >> d > An object of class "DGEList" > $counts > DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3 WT_4 WT_6 > AAAAAAAAAAAAAAAAA 22653 3059 1366 6574 7782 35096 6623 9633 > AAAAAAAAAAAAAAAAC 82 51 55 93 412 134 335 519 > AAAAAAAAAAAAAAAAG 2 3 7 9 59 5 45 84 > AAAAAAAAAAAAAAAAT 118 471 359 717 1842 94 2465 3311 > AAAAAAAAAAAAAAACA 67 4 4 12 17 108 12 21 > 844311 more rows ... > > $samples > group lib.size norm.factors > DLCK.TG_1 DLCK.TG 651172 1 > DLCK.TG_2 DLCK.TG 2685418 1 > DLCK.TG_3 DLCK.TG 3202246 1 > DLCK.TG_4 DLCK.TG 2460753 1 > WT_1 WT 3142262 1 > WT_3 WT 294909 1 > WT_4 WT 3517977 1 > WT_6 WT 3558260 1 > > [? proceed from here ?] > > Alternatively, from your original code, I think you want 'skip=1' and is it possible that you have an extra unwanted file in your 'targets' list ? > > Anyways, hope that helps. > > Best, Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics, Institute of Molecular Life Sciences > University of Zurich > http://ow.ly/riRea > > > > > > > On 23.12.2013, at 16:50, Ashutosh [guest] <guest at="" bioconductor.org=""> wrote: > >> >> I am new to R and edgeR. I am trying to follow to example >> 9. Case Study: deep-sequenced short tags from the edgeR manual. >> >> I download the data from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105 >> by clicking the full table and then saving webpage as .txt file containing >> >> First 4 rows as below folled by fifth which contains the tag seq and count. >> >> #SEQUENCE = >> #COUNT = >> #TPM = tags per million >> SEQUENCE COUNT TPM >> CATCGCCAGCGGGCACC 1 0.37 >> >> Now we I follow the steps of 9.3 reading data and creating the DGElist: After running >> < d<- readDGE(targets, skip = 5, comment.char = "#"), I get >> Error in taglist[[i]] : subscript out of bounds >> >> Can anyone please help, how I can solve this issue. >> >> Best regards, >> Ashutosh >> >> >> -- output of sessionInfo(): >> >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] edgeR_3.4.2 limma_3.18.7 >> >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 10.3 years ago Mark Robinson ▴ 880

Login before adding your answer.