differential expression- edgeR
2
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 10.3 years ago
I am new to R and edgeR. I am trying to follow to example 9. Case Study: deep-sequenced short tags from the edgeR manual. I download the data from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105 by clicking the full table and then saving webpage as .txt file containing First 4 rows as below folled by fifth which contains the tag seq and count. #SEQUENCE = #COUNT = #TPM = tags per million SEQUENCE COUNT TPM CATCGCCAGCGGGCACC 1 0.37 Now we I follow the steps of 9.3 reading data and creating the DGElist: After running < d<- readDGE(targets, skip = 5, comment.char = "#"), I get Error in taglist[[i]] : subscript out of bounds Can anyone please help, how I can solve this issue. Best regards, Ashutosh -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.4.2 limma_3.18.7 -- Sent via the guest posting facility at bioconductor.org.
edgeR edgeR • 2.2k views
ADD COMMENT
0
Entering edit mode
Mark Robinson ▴ 880
@mark-robinson-4908
Last seen 6.2 years ago
Dear Ashutosh, I wasn't able to (nicely) get exactly the same files from GEO, but here is an alternative: See: http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-71/samples/ f <- "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/MTA B/E-MTAB-71/E-MTAB-71.raw.1.zip" bf <- basename(f) download.file(f, bf) # download ZIP from ArrayExpress unzip(bf) # should put the 8 TXT files in current dir tg <- dir(".","^DLCK|WT") library("edgeR") counts <- readDGE(tg)$counts counts[is.na(counts)] <- 0 grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1]) d <- DGEList(counts=counts, group=grp) This should return: > d An object of class "DGEList" $counts DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3 WT_4 WT_6 AAAAAAAAAAAAAAAAA 22653 3059 1366 6574 7782 35096 6623 9633 AAAAAAAAAAAAAAAAC 82 51 55 93 412 134 335 519 AAAAAAAAAAAAAAAAG 2 3 7 9 59 5 45 84 AAAAAAAAAAAAAAAAT 118 471 359 717 1842 94 2465 3311 AAAAAAAAAAAAAAACA 67 4 4 12 17 108 12 21 844311 more rows ... $samples group lib.size norm.factors DLCK.TG_1 DLCK.TG 651172 1 DLCK.TG_2 DLCK.TG 2685418 1 DLCK.TG_3 DLCK.TG 3202246 1 DLCK.TG_4 DLCK.TG 2460753 1 WT_1 WT 3142262 1 WT_3 WT 294909 1 WT_4 WT 3517977 1 WT_6 WT 3558260 1 [? proceed from here ?] Alternatively, from your original code, I think you want 'skip=1' and is it possible that you have an extra unwanted file in your 'targets' list ? Anyways, hope that helps. Best, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics, Institute of Molecular Life Sciences University of Zurich http://ow.ly/riRea On 23.12.2013, at 16:50, Ashutosh [guest] <guest at="" bioconductor.org=""> wrote: > > I am new to R and edgeR. I am trying to follow to example > 9. Case Study: deep-sequenced short tags from the edgeR manual. > > I download the data from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105 > by clicking the full table and then saving webpage as .txt file containing > > First 4 rows as below folled by fifth which contains the tag seq and count. > > #SEQUENCE = > #COUNT = > #TPM = tags per million > SEQUENCE COUNT TPM > CATCGCCAGCGGGCACC 1 0.37 > > Now we I follow the steps of 9.3 reading data and creating the DGElist: After running > < d<- readDGE(targets, skip = 5, comment.char = "#"), I get > Error in taglist[[i]] : subscript out of bounds > > Can anyone please help, how I can solve this issue. > > Best regards, > Ashutosh > > > -- output of sessionInfo(): > > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.4.2 limma_3.18.7 > > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Mark Robinson ▴ 880
@mark-robinson-4908
Last seen 6.2 years ago
Hi Ashutosh, Please keep these exchanges on the BioC mailing list. I've cc'd the list here. > But when I exactly follow the script from edgeR manual on the data and on making DGE list, I get library size as NA. This depends on how the data is stored. readDGE() expects read counts where absence of a tag id in a single sample means a 0 count for that sample, while ArrayExpress appears to store 0s as NAs (with aligned ids across the dataset). So, I believe this is why your library sizes come out as NA (summing over columns with NAs present). As you see below, that's why I added: counts[is.na(counts)] <- 0 before I created the 'DGEList'. > Also can you please explain or give reference to function "grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the moment, I just copied into the terminal but I want to know what it signifies? It's just a way to convert this vector: > colnames(counts) [1] "DLCK.TG_1" "DLCK.TG_2" "DLCK.TG_3" "DLCK.TG_4" "WT_1" "WT_3" [7] "WT_4" "WT_6" into this: > grp DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3 WT_4 WT_6 "DLCK.TG" "DLCK.TG" "DLCK.TG" "DLCK.TG" "WT" "WT" "WT" "WT" There are many other ways to do this, including: grp <- gsub("_[0-9]","",colnames(counts)) but, these are typically customized to the dataset (and filename structure) on hand. Best, Mark On 26.12.2013, at 11:02, "Dhingra, Ashutosh" <a.dhingra at="" vumc.nl=""> wrote: > Dear Mark, > > Thanks a lot for your kind help! It works with your script. I was also able to find a error in the target list because of which I was getting out of bound error. > > > But when I exactly follow the script from edgeR manual on the data and on making DGE list, I get library size as NA. > > Also can you please explain or give reference to function "grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1])". At the moment, I just copied into the terminal but I want to know what it signifies? > > > > Best regards, > Ashutosh > ________________________________________ > From: Mark Robinson [mark.robinson at imls.uzh.ch] > Sent: 23 December 2013 17:35 > To: Ashutosh [guest] > Cc: bioconductor at r-project.org; Dhingra, Ashutosh > Subject: Re: [BioC] differential expression- edgeR > > Dear Ashutosh, > > I wasn't able to (nicely) get exactly the same files from GEO, but here is an alternative: > > See: > http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-71/samples/ > > f <- "ftp://ftp.ebi.ac.uk/pub/databases/microarray/data/experiment/M TAB/E-MTAB-71/E-MTAB-71.raw.1.zip" > bf <- basename(f) > download.file(f, bf) # download ZIP from ArrayExpress > > unzip(bf) # should put the 8 TXT files in current dir > > tg <- dir(".","^DLCK|WT") > > library("edgeR") > counts <- readDGE(tg)$counts > counts[is.na(counts)] <- 0 > > grp <- sapply(colnames(counts),function(u) strsplit(u,"_")[[1]][1]) > d <- DGEList(counts=counts, group=grp) > > This should return: > >> d > An object of class "DGEList" > $counts > DLCK.TG_1 DLCK.TG_2 DLCK.TG_3 DLCK.TG_4 WT_1 WT_3 WT_4 WT_6 > AAAAAAAAAAAAAAAAA 22653 3059 1366 6574 7782 35096 6623 9633 > AAAAAAAAAAAAAAAAC 82 51 55 93 412 134 335 519 > AAAAAAAAAAAAAAAAG 2 3 7 9 59 5 45 84 > AAAAAAAAAAAAAAAAT 118 471 359 717 1842 94 2465 3311 > AAAAAAAAAAAAAAACA 67 4 4 12 17 108 12 21 > 844311 more rows ... > > $samples > group lib.size norm.factors > DLCK.TG_1 DLCK.TG 651172 1 > DLCK.TG_2 DLCK.TG 2685418 1 > DLCK.TG_3 DLCK.TG 3202246 1 > DLCK.TG_4 DLCK.TG 2460753 1 > WT_1 WT 3142262 1 > WT_3 WT 294909 1 > WT_4 WT 3517977 1 > WT_6 WT 3558260 1 > > [? proceed from here ?] > > Alternatively, from your original code, I think you want 'skip=1' and is it possible that you have an extra unwanted file in your 'targets' list ? > > Anyways, hope that helps. > > Best, Mark > > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics, Institute of Molecular Life Sciences > University of Zurich > http://ow.ly/riRea > > > > > > > On 23.12.2013, at 16:50, Ashutosh [guest] <guest at="" bioconductor.org=""> wrote: > >> >> I am new to R and edgeR. I am trying to follow to example >> 9. Case Study: deep-sequenced short tags from the edgeR manual. >> >> I download the data from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM272105 >> by clicking the full table and then saving webpage as .txt file containing >> >> First 4 rows as below folled by fifth which contains the tag seq and count. >> >> #SEQUENCE = >> #COUNT = >> #TPM = tags per million >> SEQUENCE COUNT TPM >> CATCGCCAGCGGGCACC 1 0.37 >> >> Now we I follow the steps of 9.3 reading data and creating the DGElist: After running >> < d<- readDGE(targets, skip = 5, comment.char = "#"), I get >> Error in taglist[[i]] : subscript out of bounds >> >> Can anyone please help, how I can solve this issue. >> >> Best regards, >> Ashutosh >> >> >> -- output of sessionInfo(): >> >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] edgeR_3.4.2 limma_3.18.7 >> >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT

Login before adding your answer.

Traffic: 575 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6