DexSeq extracols option

0

Entering edit mode

Ioannis Vlachos ▴ 40

@ioannis-vlachos-5634

Last seen 9.6 years ago

Hello, A (hopefully) simple question. As I saw in ??DEXSeqHTML, there is an option to add attributes in the HTML results (extracols). I downloaded from biomaRt a set aof attributes to test this (gene name, biotype and description), and created a data frame having as a first column the exons of my ecs (retaining exon order), geneID and the new attributes. The first two columns were taken by accessing fData from the ecs. The resulting data frame naturally had an equal number of rows with the exons present in my ecs object (355320). I tried to set the extracols option in a different number of ways (putting as row.names the geneID or exonID, having only one column etc), but I always got this error: Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 355320, 396 Where 355320 is the correct number of rows of both ecs exons and of my extra attributes data.frame. (I haven't figured out what has 396 rows yet). I checked the online reference manual and there is no mention of the extracols option. I also checked the code of the relevant function and I don't see it get used somewhere. Any ideas? Is there another way to add gene or exon attributes in the HTML? Thank you all, Best regards, Ioannis -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of bioconductor-request at r-project.org Sent: Sunday, July 28, 2013 1:00 PM To: bioconductor at r-project.org Subject: Bioconductor Digest, Vol 125, Issue 28 Send Bioconductor mailing list submissions to bioconductor at r-project.org To subscribe or unsubscribe via the World Wide Web, visit https://stat.ethz.ch/mailman/listinfo/bioconductor or, via email, send a message with subject or body 'help' to bioconductor-request at r-project.org You can reach the person managing the list at bioconductor-owner at r-project.org When replying, please edit your Subject line so it is more specific than "Re: Contents of Bioconductor digest..." Today's Topics: 1. problem with installation package useR2013 (Bartek [guest]) 2. Re: Quickest way to convert IDs in a data frame? (Enrico Ferrero) 3. issue on removing all NAs (UNANNOTATED) rows (Guest [guest]) 4. Re: problem with installation package useR2013 (Dan Tenenbaum) 5. Re: limma for homemade microarray - question on NAs and multiple probes for one gene (Wang Peter) 6. Re: weird model design for DE analysis (Gordon K Smyth) ---------------------------------------------------------------------- Message: 1 Date: Sat, 27 Jul 2013 04:04:43 -0700 (PDT) From: "Bartek [guest]" <guest@bioconductor.org> To: bioconductor at r-project.org, bartek.taciak at gmail.com Subject: [BioC] problem with installation package useR2013 Message-ID: <20130727110443.B4CAA142E45 at mamba.fhcrc.org> Content-Type: text/plain; charset=iso-8859-1 Hello, I cannot install package "useR2013", I try all possible commands from bioconductor web site and nothing. All the time I get the same message from R. Can You give any tips? -- output of sessionInfo(): install.packages("useR2013", repos=NULL, type="source") Warning in install.packages : package ???useR2013??? is not available (for R version 3.0.1) Installing package into ???C:/Users/Bartek/Documents/R/win-library/3.0??? (as ???lib??? is unspecified) Ostrze??enie: b????dny pakiet 'useR2013' B????D: B????D: nie okre??lono pakiet??w Warning in install.packages : running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l "C:\Users\Bartek\Documents\R\win-library\3.0" "useR2013"' had status 1 Warning in install.packages : installation of package ???useR2013??? had non-zero exit status OR > source("http://bioconductor.org/biocLite.R") > install.packages("useR2013") Warning in install.packages : package ???useR2013??? is not available (for R version 3.0.1) Installing package into ???C:/Users/Bartek/Documents/R/win-library/3.0??? (as ???lib??? is unspecified) Warning in install.packages : package ???useR2013??? is not available (for R version 3.0.1) -- Sent via the guest posting facility at bioconductor.org. ------------------------------ Message: 2 Date: Sat, 27 Jul 2013 14:40:03 +0100 From: Enrico Ferrero <enricoferrero86@gmail.com> To: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Cc: James Macdonald <jmacdon at="" uw.edu=""> Subject: Re: [BioC] Quickest way to convert IDs in a data frame? Message-ID: <cao22hxdt-twxhubzbpkuxc-2+v7_rk_+tgrb_ojifypu1dnq4a at="" mail.gmail.com=""> Content-Type: text/plain; charset=ISO-8859-1 Hi everybody, Marc, thanks for clarifying things. The behaviour of the select() function is absolutely sensible. Maybe it should be made explicit somewhere in the documentation that, when working with data frames, the user is expected to use the merge() function in conjunction with it. I also agree with Herv? that having options to tweak and customize the output would be an extremely positive thing and a step in the right direction. In addition to a "select" argument, one can also think of a "remove.na.rows" that evaluates to either TRUE or FALSE. But then again, using merge() after select() already deals with these issues quite well. What I think should be investigated more closely at the moment is the unexpected behaviour select() exhibits when one SYMBOL or ALIAS (and potentially other types of ID, I don't know) maps to more than one ENTREZID. As exemplified by James' code below, this causes the output to be truncated, and I highly doubt this is intentional: > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS") ALIAS ENTREZID 1 ADORA2A 135 2 ADORA2A 135 3 ADORA2A 135 4 ADORA2A 135 > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS") ALIAS ENTREZID 1 AGT 183 2 AGT 189 Warning message: In .generateExtraRows(tab, keys, jointype) : 'select' and duplicate query keys resulted in 1:many mapping between keys and return rows It would be great to have your views on this. Best, On 26 July 2013 21:46, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > Hi Marc, > > On 07/26/2013 12:57 PM, Marc Carlson wrote: > ... > >> Hello everyone, >> >> Sorry that I saw this thread so late. Basically, select() does *try* >> to keep your initial keys and map them each to an equivalent number >> of unique values. We did actually anticipate that people would >> *want* to >> cbind() their results. >> >> But as you discovered there are many circumstances where the data >> make this kind of behavior impossible. >> >> So passing in NAs as keys for example can't ever find anything >> meaningful. Those will simply have to be removed before we can >> proceed. And, it is also impossible to maintain a 1:1 mapping if you >> retrieve fields that have many to one relationships with your initial >> keys (also seen here). >> >> For convenience, when this kind of 1:1 output is already impossible >> (as it is for most of your examples), select will also try to >> simplify the output by removing rows that are identical all the way across etc.. >> >> My aim was that select should try to do the most reasonable thing >> possible based on the data we have in each case. The rationale is >> that in the case where there are 1:many mappings, you should not be >> planning to bind those directly onto any other data.frames anyways >> (as this circumstance would require you to call merge() instead). So >> in that case, non-destructive simplification seems beneficial. > > > Other tools in our infrastructure use an extra argument to pick-up 1 > thing in case of multiple mapping e.g. findOverlaps() has the 'select' > argument with possible values "all", "first", "last", and "arbitrary". > Also nearest() and family have this argument and it accepts similar > values. > > Couldn't select() use a similar approach? The default should be "all" > so the current behavior is preserved but if it's something else then > the returned data.frame should align with the input. > > Thanks, > > H. > > >> >> I hope this clarifies things, >> >> >> Marc >> >> >> >>> >>>> As I >>>> mentioned in my first post, the for loop function works, but it's >>>> highly inefficient. >>>> >>>> Any help is greatly appreciated, thank you. >>>> >>>> Best, >>>> >>>> >>>> >>>> On 25 July 2013 23:18, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: >>>>> >>>>> Hi James, >>>>> >>>>> You're right. >>>>> >>>>> It's actually both: NAs *and* duplicated keys that are mapped to >>>>> more than 1 row are removed from the input. I don't think this is >>>>> documented. >>>>> >>>>> I wonder if select() behavior couldn't be a little bit simpler by >>>>> either preserving or removing all duplicated keys, and not just >>>>> some of them (on a somewhat arbitrary criteria). >>>>> >>>>> Thanks, >>>>> H. >>>>> >>>>> >>>>> >>>>> On 07/25/2013 02:57 PM, James W. MacDonald wrote: >>>>>> >>>>>> >>>>>> Hi Enrico and Herve, >>>>>> >>>>>> This has to do with duplicate entries, but only when the >>>>>> duplicate entry maps to many ENTREZID: >>>>>> >>>>>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS") >>>>>> ALIAS ENTREZID >>>>>> 1 ADORA2A 135 >>>>>> 2 ADORA2A 135 >>>>>> 3 ADORA2A 135 >>>>>> 4 ADORA2A 135 >>>>>> >>>>>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS") >>>>>> ALIAS ENTREZID >>>>>> 1 AGT 183 >>>>>> 2 AGT 189 >>>>>> Warning message: >>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>> 'select' and duplicate query keys resulted in 1:many mapping >>>>>> between keys and return rows >>>>>> >>>>>> > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS") >>>>>> ALIAS ENTREZID >>>>>> 1 AGT 183 >>>>>> 2 AGT 189 >>>>>> Warning message: >>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>> 'select' resulted in 1:many mapping between keys and return >>>>>> rows >>>>>> >>>>>> >>>>>> So in the instances where a gene symbol maps to more than one >>>>>> ENTREZID, the output gets truncated, whereas if it is a >>>>>> one-to-one mapping, it does not. >>>>>> >>>>>> Best, >>>>>> >>>>>> Jim >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 7/25/2013 5:06 PM, Enrico Ferrero wrote: >>>>>>> >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Herv?, that's exactly what I'm trying to say. >>>>>>> >>>>>>> Attached to this email is a tab delimited file with two columns >>>>>>> of GeneSymbols (or Aliases), and here is some simple code to >>>>>>> reproduce the unexpected behaviour: >>>>>>> >>>>>>> library(org.Hs.eg.db) >>>>>>> mydf<- read.table("testdata.txt", sep="\t", header=TRUE, >>>>>>> as.is=TRUE) >>>>>>> mytest<- select(org.Hs.eg.db, key=mydf$GeneSymbol1, >>>>>>> keytype="ALIAS", >>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL")) >>>>>>> # check that mytest has less rows than mydf >>>>>>> nrow(mydf) >>>>>>> nrow(mytest) >>>>>>> # pick a random row: they don't match mydf[250,] mytest[250,] >>>>>>> >>>>>>> Ideally, mytest should have the same number and position of rows >>>>>>> of mydf so that I can then cbind them. >>>>>>> If mytest has more rows because of multiple mappings that's also >>>>>>> fine: >>>>>>> I can always use merge(mydf, mytest), right? >>>>>>> >>>>>>> Thanks a lot to both for your help, it's very appreciated. >>>>>>> Best, >>>>>>> >>>>>>> >>>>>>> On 25 July 2013 21:32, Hervé Pagès<hpages at="" fhcrc.org=""> wrote: >>>>>>>> >>>>>>>> >>>>>>>> Hi Enrico, >>>>>>>> >>>>>>>> >>>>>>>> On 07/25/2013 01:20 PM, James W. MacDonald wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> Hi Enrico, >>>>>>>>> >>>>>>>>> Please don't take things off-list (e.g., use reply-all). >>>>>>>>> >>>>>>>>> >>>>>>>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Hi James, >>>>>>>>>> >>>>>>>>>> Thanks very much for your help. >>>>>>>>>> There is an issue that needs to be solved before thinking >>>>>>>>>> about what's the best approach in my opinion. >>>>>>>>>> >>>>>>>>>> I don't understand why, but the object created with the call >>>>>>>>>> to select (test in my example, first.two in yours) has a >>>>>>>>>> different number of rows from the original object (df in my >>>>>>>>>> example). Specifically it has >>>>>>>>>> *less* rows. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I'm surprised it has less rows. It can definitely have more, >>>>>>>> when some of the keys passed to select() are mapped to more >>>>>>>> than 1 row, but my understanding was that select() would >>>>>>>> propagate unmapped keys to the output by placing them in rows >>>>>>>> stuffed with NAs. So maybe I misunderstood how select() works, >>>>>>>> or its behavior was changed, or there is a bug somewhere. Could >>>>>>>> you please send the code that allows us to reproduce this? >>>>>>>> Thanks. >>>>>>>> >>>>>>>> H. >>>>>>>> >>>>>>>> >>>>>>>>> If all symbols were converted to all possible Entrez IDs, >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I would expect it to have more rows, not less. To me, it >>>>>>>>>> looks like not all rows are looked up and returned. >>>>>>>>>> >>>>>>>>>> Do you see what I mean? >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Sure. You could be using outdated gene symbols. Or perhaps you >>>>>>>>> are using a mixture of symbols and aliases. Which is even >>>>>>>>> cooler than just all >>>>>>>>> symbols: >>>>>>>>> >>>>>>>>> > symb<- c(Rkeys(org.Hs.egSYMBOL)[1:10], >>>>>>>>> Rkeys(org.Hs.egALIAS2EG)[31:45]) >>>>>>>>> > symb >>>>>>>>> [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2" "AACP" >>>>>>>>> [7] "SERPINA3" "AADAC" "AAMP" "AANAT" "AAMP" "AANAT" >>>>>>>>> [13] "DSPS" "SNAT" "AARS" "CMT2N" "AAV" "AAVS1" >>>>>>>>> [19] "ABAT" "GABA-AT" "GABAT" "NPD009" "ABC-1" "ABC1" >>>>>>>>> [25] "ABCA1" >>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL") >>>>>>>>> SYMBOL ENTREZID >>>>>>>>> 1 A1BG 1 >>>>>>>>> 2 A2M 2 >>>>>>>>> 3 A2MP1 3 >>>>>>>>> 4 NAT1 9 >>>>>>>>> 5 NAT2 10 >>>>>>>>> 6 AACP 11 >>>>>>>>> 7 SERPINA3 12 >>>>>>>>> 8 AADAC 13 >>>>>>>>> 9 AAMP 14 >>>>>>>>> 10 AANAT 15 >>>>>>>>> 11 AAMP 14 >>>>>>>>> 12 AANAT 15 >>>>>>>>> 13 DSPS<na> >>>>>>>>> 14 SNAT<na> >>>>>>>>> 15 AARS 16 >>>>>>>>> 16 CMT2N<na> >>>>>>>>> 17 AAV<na> >>>>>>>>> 18 AAVS1 17 >>>>>>>>> 19 ABAT 18 >>>>>>>>> 20 GABA-AT<na> >>>>>>>>> 21 GABAT<na> >>>>>>>>> 22 NPD009<na> >>>>>>>>> 23 ABC-1<na> >>>>>>>>> 24 ABC1<na> >>>>>>>>> 25 ABCA1 19 >>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS") >>>>>>>>> ALIAS ENTREZID >>>>>>>>> 1 A1BG 1 >>>>>>>>> 2 A2M 2 >>>>>>>>> 3 A2MP1 3 >>>>>>>>> 4 NAT1 9 >>>>>>>>> 5 NAT1 1982 >>>>>>>>> 6 NAT1 6530 >>>>>>>>> 7 NAT1 10991 >>>>>>>>> 8 NAT2 10 >>>>>>>>> 9 NAT2 81539 >>>>>>>>> 10 AACP 11 >>>>>>>>> 11 SERPINA3 12 >>>>>>>>> 12 AADAC 13 >>>>>>>>> 13 AAMP 14 >>>>>>>>> 14 AANAT 15 >>>>>>>>> 15 DSPS 15 >>>>>>>>> 16 SNAT 15 >>>>>>>>> 17 AARS 16 >>>>>>>>> 18 CMT2N 16 >>>>>>>>> 19 AAV 17 >>>>>>>>> 20 AAVS1 17 >>>>>>>>> 21 ABAT 18 >>>>>>>>> 22 GABA-AT 18 >>>>>>>>> 23 GABAT 18 >>>>>>>>> 24 NPD009 18 >>>>>>>>> 25 ABC-1 19 >>>>>>>>> 26 ABC1 19 >>>>>>>>> 27 ABC1 63897 >>>>>>>>> 28 ABCA1 19 >>>>>>>>> Warning message: >>>>>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>>>>> 'select' and duplicate query keys resulted in 1:many >>>>>>>>> mapping between keys and return rows >>>>>>>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME) >>>>>>>>> $`1982` [1] "eukaryotic translation initiation factor 4 gamma, >>>>>>>>> 2" >>>>>>>>> >>>>>>>>> $`6530` >>>>>>>>> [1] "solute carrier family 6 (neurotransmitter transporter, >>>>>>>>> noradrenalin), member 2" >>>>>>>>> >>>>>>>>> $`10991` >>>>>>>>> [1] "solute carrier family 38, member 3" >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> >>>>>>>>> Jim >>>>>>>>> >>>>>>>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi Enrico, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Dear James, >>>>>>>>>>>> >>>>>>>>>>>> Thanks very much for your prompt reply. >>>>>>>>>>>> I knew the problem was the for loop and the select function >>>>>>>>>>>> is indeed a lot faster than that and works perfectly with >>>>>>>>>>>> toy data. >>>>>>>>>>>> >>>>>>>>>>>> However, this is what happens when I try to use it with >>>>>>>>>>>> real >>>>>>>>>>>> data: >>>>>>>>>>>> >>>>>>>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol, >>>>>>>>>>>>> keytype="ALIAS", >>>>>>>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL")) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Warning message: >>>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) : >>>>>>>>>>>> 'select' and duplicate query keys resulted in 1:many >>>>>>>>>>>> mapping between keys and return rows >>>>>>>>>>>> >>>>>>>>>>>> which is probably the warning you mentioned. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> That's not the warning I mentioned, but it does point out >>>>>>>>>>> the same issue, which is that there is a one to many mapping >>>>>>>>>>> between symbol and entrez gene ID. >>>>>>>>>>> >>>>>>>>>>> So now you have to decide if you want to be naive (or >>>>>>>>>>> stupid, depending on your perspective) or not. You could >>>>>>>>>>> just cover your eyes and do this: >>>>>>>>>>> >>>>>>>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),] >>>>>>>>>>> >>>>>>>>>>> which will choose for you the first symbol -> gene ID >>>>>>>>>>> mapping and nuke the rest. That's nice and quick, but you >>>>>>>>>>> are making huge assumptions. >>>>>>>>>>> >>>>>>>>>>> Or you could decide to be a bit more sophisticated and do >>>>>>>>>>> something like >>>>>>>>>>> >>>>>>>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL, >>>>>>>>>>> function(x) >>>>>>>>>>> first.two[x,]) >>>>>>>>>>> >>>>>>>>>>> At this point you can take a look at e.g., thelst[1:10] to >>>>>>>>>>> see what we just did >>>>>>>>>>> >>>>>>>>>>> thelst<- do.call("rbind", lapply(thelst, function(x) >>>>>>>>>>> c(x[1,1], paste(x[,2], collapse = "|"))) >>>>>>>>>>> >>>>>>>>>>> and here you can look at head(thelst). >>>>>>>>>>> >>>>>>>>>>> Then you can check to ensure that the first column of thelst >>>>>>>>>>> is identical to the first column of df, and proceed as >>>>>>>>>>> before. >>>>>>>>>>> >>>>>>>>>>> But there is still the problem of the multiple mappings. As >>>>>>>>>>> an >>>>>>>>>>> example: >>>>>>>>>>> >>>>>>>>>>>> thelst[1:5] >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> $HBD >>>>>>>>>>> SYMBOL ENTREZID >>>>>>>>>>> 2535 HBD 3045 >>>>>>>>>>> 2536 HBD 100187828 >>>>>>>>>>> >>>>>>>>>>> $KIR3DL3 >>>>>>>>>>> SYMBOL ENTREZID >>>>>>>>>>> 17513 KIR3DL3 115653 >>>>>>>>>>> 17514 KIR3DL3 100133046 >>>>>>>>>>> >>>>>>>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> $`3045` >>>>>>>>>>> [1] "hemoglobin, delta" >>>>>>>>>>> >>>>>>>>>>> $`100187828` >>>>>>>>>>> [1] "hypophosphatemic bone disease" >>>>>>>>>>> >>>>>>>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> $`115653` >>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor, three >>>>>>>>>>> domains, long cytoplasmic tail, 3" >>>>>>>>>>> >>>>>>>>>>> $`100133046` >>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor three domains >>>>>>>>>>> long cytoplasmic tail 3" >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> So HBD is the gene symbol for two different genes! If this >>>>>>>>>>> gene symbol is in your data, you will now have attributed >>>>>>>>>>> your data to two genes that apparently are not remotely >>>>>>>>>>> similar. if KIR3DL3 is in your data, then it worked out OK >>>>>>>>>>> for that gene. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> >>>>>>>>>>> Jim >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> The real problem is that the number of rows is now >>>>>>>>>>>> different for the 2 >>>>>>>>>>>> objects: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> nrow(df); nrow(test) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> [1] 573 >>>>>>>>>>>> [1] 201 >>>>>>>>>>>> >>>>>>>>>>>> So I obviously can't put the new data into the original df. >>>>>>>>>>>> My impression is that when the 1 to many mapping arises, >>>>>>>>>>>> the select functions exits, with that warning message. As a >>>>>>>>>>>> result, my test object is incomplete. >>>>>>>>>>>> >>>>>>>>>>>> On top of that, and I can't really explain this, the row >>>>>>>>>>>> positions are messed up, e.g. >>>>>>>>>>>> >>>>>>>>>>>>> all.equal(df[100,],test[100,]) >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> returns FALSE. >>>>>>>>>>>> >>>>>>>>>>>> How can I work around this? >>>>>>>>>>>> >>>>>>>>>>>> Thanks a lot! >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> >>>>>>>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at="" uw.edu=""> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Enrico, >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>> >>>>>>>>>>>>>> I often have data frames where I need to perform ID >>>>>>>>>>>>>> conversions on one or more of the columns while >>>>>>>>>>>>>> preserving the order of the rows, >>>>>>>>>>>>>> e.g.: >>>>>>>>>>>>>> >>>>>>>>>>>>>> GeneSymbol Value1 Value2 >>>>>>>>>>>>>> GS1 2.5 0.1 >>>>>>>>>>>>>> GS2 3 0.2 >>>>>>>>>>>>>> .. >>>>>>>>>>>>>> >>>>>>>>>>>>>> And I want to obtain: >>>>>>>>>>>>>> >>>>>>>>>>>>>> GeneSymbol EntrezGeneID Value1 Value2 >>>>>>>>>>>>>> GS1 EG1 2.5 0.1 >>>>>>>>>>>>>> GS2 EG2 3 0.2 >>>>>>>>>>>>>> .. >>>>>>>>>>>>>> >>>>>>>>>>>>>> What I've done so far was to create a function that uses >>>>>>>>>>>>>> org.Hs.eg.db to loop over the rows of the column and does >>>>>>>>>>>>>> the conversion: >>>>>>>>>>>>>> >>>>>>>>>>>>>> library(org.Hs.eg.db) >>>>>>>>>>>>>> alias2EG<- function(x) { >>>>>>>>>>>>>> for (i in 1:length(x)) { >>>>>>>>>>>>>> if (!is.na(x[i])) { >>>>>>>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1] if (!is.null(repl)) { >>>>>>>>>>>>>> x[i]<- repl >>>>>>>>>>>>>> } >>>>>>>>>>>>>> else { >>>>>>>>>>>>>> x[i]<- NA >>>>>>>>>>>>>> } >>>>>>>>>>>>>> } >>>>>>>>>>>>>> } >>>>>>>>>>>>>> return(x) >>>>>>>>>>>>>> } >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I should first note that gene symbols are not unique, so >>>>>>>>>>>>> you are taking a chance on your mappings. Is there no >>>>>>>>>>>>> other annotation for your data? >>>>>>>>>>>>> >>>>>>>>>>>>> In addition, you should note that it is almost always >>>>>>>>>>>>> better to think of objects as vectors and matrices in R, >>>>>>>>>>>>> rather than as things that need to be looped over (e.g., R >>>>>>>>>>>>> isn't Perl or C). >>>>>>>>>>>>> >>>>>>>>>>>>> first.two<- select(org.Hs.eg.db, >>>>>>>>>>>>> as.character(df$GeneSymbol), "ENTREZID", >>>>>>>>>>>>> "SYMBOL") >>>>>>>>>>>>> >>>>>>>>>>>>> Note that there used to be a warning or an error (don't >>>>>>>>>>>>> remember >>>>>>>>>>>>> which) >>>>>>>>>>>>> when >>>>>>>>>>>>> you did something like this, stating that gene symbols are >>>>>>>>>>>>> not unique, and that you shouldn't do this sort of thing. >>>>>>>>>>>>> Apparently this warning has been removed, but the issue >>>>>>>>>>>>> remains valid. >>>>>>>>>>>>> >>>>>>>>>>>>> ## check yourself >>>>>>>>>>>>> >>>>>>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL) >>>>>>>>>>>>> >>>>>>>>>>>>> ## if true, proceed >>>>>>>>>>>>> >>>>>>>>>>>>> df<- data.frame(first.two, df[,-1]) >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> >>>>>>>>>>>>> Jim >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> and then call the function like this: >>>>>>>>>>>>>> >>>>>>>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol) >>>>>>>>>>>>>> >>>>>>>>>>>>>> This works well, but gets very slow when I need to do >>>>>>>>>>>>>> multiple conversions on large datasets. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Is there any way I can achieve the same result but in a >>>>>>>>>>>>>> quicker, more efficient way? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thank you. >>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> James W. MacDonald, M.S. >>>>>>>>>>>>> Biostatistician >>>>>>>>>>>>> University of Washington >>>>>>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>>>>>> 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 >>>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> James W. MacDonald, M.S. >>>>>>>>>>> Biostatistician >>>>>>>>>>> University of Washington >>>>>>>>>>> Environmental and Occupational Health Sciences >>>>>>>>>>> 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099 >>>>>>>>>>> >>>>>>>>>> >>>>>>>> -- >>>>>>>> Hervé Pagès >>>>>>>> >>>>>>>> Program in Computational Biology Division of Public Health >>>>>>>> Sciences Fred Hutchinson Cancer Research Center >>>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>>> P.O. Box 19024 >>>>>>>> Seattle, WA 98109-1024 >>>>>>>> >>>>>>>> E-mail: hpages at fhcrc.org >>>>>>>> Phone: (206) 667-5791 >>>>>>>> Fax: (206) 667-1319 >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Hervé Pagès >>>>> >>>>> Program in Computational Biology >>>>> Division of Public Health Sciences Fred Hutchinson Cancer Research >>>>> Center >>>>> 1100 Fairview Ave. N, M1-B514 >>>>> P.O. Box 19024 >>>>> Seattle, WA 98109-1024 >>>>> >>>>> E-mail: hpages at fhcrc.org >>>>> Phone: (206) 667-5791 >>>>> Fax: (206) 667-1319 >>>> >>>> >>>> >>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Enrico Ferrero PhD Student Steve Russell Lab - Department of Genetics FlyChip - Cambridge Systems Biology Centre University of Cambridge e.ferrero at gen.cam.ac.uk http://flypress.gen.cam.ac.uk/ ------------------------------ Message: 3 Date: Sat, 27 Jul 2013 06:51:31 -0700 (PDT) From: "Guest [guest]" <guest@bioconductor.org> To: bioconductor at r-project.org, jial2 at mail.nih.gov Subject: [BioC] issue on removing all NAs (UNANNOTATED) rows Message-ID: <20130727135131.22869142E5F at mamba.fhcrc.org> Hi, I am using limma package on HuGene2.0st array. When I tried to remove the unannotated rows (NAs) after mapping probeID to gene names. Some significant gene lists I got have the differentially expressed genes, but some of gene lists have all of the unsignificantly expressed genes. The code I used as follows: cont.matrix <- makeContrasts( interaction=(T.Ko-T.wt)-(ctrl.Ko-ctrl.wt), levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2) sig.interaction = topTable(fit2, coef = "interaction", number=nrow(fit2), p.value=0.05, lfc=1) interaction.ids=sig.interaction[["ID"]] ## map probe ids to gene names... interaction.sig.SYMBOL=mget(interaction.ids, hugene20sttranscriptclusterSYMBOL, ifnotfound=NA) #REMOVE ALL NAs (UNANNOTATED) ROWS and unlist them for easy formatting later interaction.sig.ann=unlist(interaction.sig.SYMBOL[!is.na(interaction.s ig.SYM BOL)]) Does anyone know what's wrong with the code? No error message. Any input is appreciated. Thanks, -- output of sessionInfo(): > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base -- Sent via the guest posting facility at bioconductor.org. ------------------------------ Message: 4 Date: Sat, 27 Jul 2013 07:39:59 -0700 From: Dan Tenenbaum <dtenenba@fhcrc.org> To: "Bartek [guest]" <guest at="" bioconductor.org=""> Cc: bartek.taciak at gmail.com, "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> Subject: Re: [BioC] problem with installation package useR2013 Message-ID: <caf42j238ctzetfhfyfsp8hc6elpv8puy975io9+sg1q8rszpew at="" mail.gmail.com=""> Content-Type: text/plain; charset=UTF-8 On Sat, Jul 27, 2013 at 4:04 AM, Bartek [guest] <guest at="" bioconductor.org=""> wrote: > > Hello, > I cannot install package "useR2013", I try all possible commands from bioconductor web site and nothing. All the time I get the same message from R. Can You give any tips? > > -- output of sessionInfo(): > > install.packages("useR2013", repos=NULL, type="source") Warning in > install.packages : > package ???useR2013??? is not available (for R version 3.0.1) > Installing package into ???C:/Users/Bartek/Documents/R/win- library/3.0??? > (as ???lib??? is unspecified) > Ostrze??enie: b????dny pakiet 'useR2013' > B? ??D: B? ??D: nie okre??lono pakiet??w Warning in install.packages : > running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l > "C:\Users\Bartek\Documents\R\win-library\3.0" "useR2013"' had status 1 Warning in install.packages : > installation of package ???useR2013??? had non-zero exit status > > OR > >> source("http://bioconductor.org/biocLite.R") >> install.packages("useR2013") > Warning in install.packages : > package ???useR2013??? is not available (for R version 3.0.1) > Installing package into ???C:/Users/Bartek/Documents/R/win- library/3.0??? > (as ???lib??? is unspecified) > Warning in install.packages : > package ???useR2013??? is not available (for R version 3.0.1) > Follow the instructions at: http://bioconductor.org/help/course-materials/2013/useR2013/ Dan > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Message: 5 Date: Sun, 28 Jul 2013 11:02:01 +0800 From: Wang Peter <wng.peter@gmail.com> To: Gordon K Smyth <smyth at="" wehi.edu.au=""> Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> Subject: Re: [BioC] limma for homemade microarray - question on NAs and multiple probes for one gene Message-ID: <cakh7rfadl2qp6mp-r=ptuqs+5lhtq04s_z3gfyobetpw+uzseg at="" mail.gmail.com=""> Content-Type: text/plain sorry, i have a question. why did you do log(M) did you only extract matched probe signals? shan On Thu, Jul 25, 2013 at 7:21 AM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Zhengyu, > > > On Mon, 22 Jul 2013, zhengyu jiang wrote: > > Hi Gordon, >> >> Sorry to bother you again. I have two questions. >> >> (1) I started the "eset<-as.matrix(log(M)) #### take log intensities" >> with a matrix and I have a separate annotation file which contains >> columns of "ID", "Sample Name", "Chr", "Coordinate", ""GeneSymbol". >> The following code to top list is below. How can I add the >> annotation to the final toplist output ? >> > > The first and best way to answer questions is to read the documentation. > If you type ?topTable, you will see that there is an argument for a > data.frame containing gene annotation. So you read the annotation > into a data.frame, then use > > topTable(fit, ..., genelist=yourdatannotation) > > > eset <- normalizeBetweenArrays(eset, method="quantile") ## quantile >> normalization for all arrays >> design<-model.matrix(~Pair+**Treat) >> targets<-readTargets("targets.**txt",row.names=1) ## or row name =1 >> Pair<-factor(targets$Pair) >> Treat<-factor(targets$**Treatment,levels=c("N","T")) >> fit_pair<-lmFit(eset,design) >> plotSA(fit_pair) >> fit= eBayes(fit_pair, trend=TRUE) >> R=topTable(fit, coef="TreatT", adjust="BH",number=30 >> >> (2) I have 6 sample data (3 control and 3 treatment) in one file. >> > > Is this an Illumina BeadChip. You don't say but, if not, using > read.ilmn is not appropriate. > > > It contains columns of "ID", "Sample Name", "Chr", "Coordinate", >> ""GeneSymbol", "XRaw" (expression raw data). I don't have control probes. >> If I want to use the Spot quality weights function "wt.fun" to read >> in these data as described in the limma manual to define all XRaw >> values below >> 1000 as 0 weight, >> > > And yet I have already asked you please not to do this. > > > is that possible to modify the code? I can change all column names if >> needed. But I don't know what columns are required for read.illmn? >> > > Again, please read the documentation. Type ?read.ilmn and you will > see that there is no argument called wt.fun. You can only use > read.ilmn for Illumina BeadChips. This type of array does not have > spots, so trying to set spot weights would obviously make no sense. > > The Biconductor posting guide says > > "Read the relevant R documentation ... If you are having trouble with > function somefunc, try ?somefunc." > > The limma User's Guide says > > "Mailing list etiquette requires that you read the relevant help page > carefully before posting a problem to the list." > > Best wishes > Gordon > > > myfun <- function(x) as.numeric(x$XRaw <1000) >> read.ilmn("062813_data.txt",**probeid="ID",annotation=c("** >> Chr","Coordinate","GeneSymbol"**,expr="XRaw"), >> wt.fun=myfun) >> Thanks, >> Zhengyu >> >> >> On Tue, Jul 9, 2013 at 7:34 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: >> >> Dear Zhengyu, >>> >>> Date: Mon, 8 Jul 2013 20:40:14 -0400 >>> >>>> From: zhengyu jiang <zhyjiang2006 at="" gmail.com=""> >>>> To: bioconductor at r-project.org >>>> Subject: [BioC] limma for homemade microarray - question on NAs and >>>> multiple probes for one gene >>>> >>>> Dear Bioconductor experts, >>>> >>>> We have data from a homemade one-channel microarray that I tried to >>>> apply limma for differential expression analysis between matched >>>> paired Normal >>>> (N) and Tumor (Tumor) samples - 8 biological replicates (one tech >>>> replicate has been averaged after normalization). All samples are >>>> formatted in one matrix (M). >>>> >>>> >>> This is a standard problem, well covered in the limma User's Guide. >>> >>> Signals have been quantile normalized between each paired normal >>> and >>> >>>> tumor. >>>> >>>> >>> I don't know what you mean by this. I recommend ordinary quantile >>> normalization of all the arrays together, without regard to which >>> sample is paired with which. >>> >>> Signal values below 5 (log scale) have been replaced by "NA" since >>> they >>> >>>> are >>>> potentially noises. So there are many NAs in M. >>>> >>>> >>> Please don't do this. limma can deal with low intensities perfectly >>> weel, and can figure out whether they are noise or not. You are >>> simply removing valid data. >>> >>> If you are concerned about high variability of low intensity probes, >>> then look at >>> >>> plotSA(fit_pair) >>> >>> to examine the mean-variance trend, and use >>> >>> eBayes(fit_pair, trend=TRUE) >>> >>> if there is a strong trend. >>> >>> I followed the user manual and made the codes below. >>> >>>> >>>> I think the code is correct? >>>> >>>> >>> Looks ok. >>> >>> My questions are (1) how to deal with NAs - as I did a search but >>> no >>> >>>> clear idea >>>> >>>> >>> Don't create them in the first place. This has been said many times >>> before! >>> >>> (2) how do people do the statistics at the gene level for one gene >>> having >>> >>>> multiple probes - averaging or taking median? >>>> >>>> >>> Usually we do not find it necessary to aggregate the multiple >>> probes. The multiple probes might represent different transcripts or >>> variants, and some of the probes might not work. Why do you need to >>> take any special action? >>> >>> However, if you feel that you must, the best method to aggregate the >>> multiple probes is probably to retain the probe for each gene with >>> highest fit_pair$Amean value. We have used this strategy in a >>> number of published papers. The rationale for this is to choose the >>> probe corresponding to the most highly expressed transcript for the >>> gene. Alternatively, you could average the probes using the >>> avereps() function in limma. >>> >>> Best wishes >>> Gordon >>> >>> Thanks, >>> >>>> Zhengyu >>>> >>>> >>>> head(M) >>>>> >>>> N1 N2 N3 N4 N5 N6 N7 >>>> N8 T1 T2 T3 >>>> 2 8.622724 7.423568 NA NA 7.487174 NA 8.516293 >>>> NA >>>> 7.876259 7.856707 NA >>>> T4 T5 T6 T7 T8 >>>> 2 NA 7.720018 NA 7.752550 NA >>>> >>>> eset<-as.matrix(M) >>>> >>>>> Pair=factor(targets$Pair) >>>>> Treat=factor(targets$****Treatment,levels=c("N","T")) # >>>>> compared matched >>>>> >>>>> normal to tumors >>>> >>>> design<-model.matrix(~Pair+****Treat) >>>>> targets >>>>> >>>>> FilenName Pair Treatment >>>> 1 N1 1 N >>>> 2 N2 2 N >>>> 3 N3 3 N >>>> 4 N4 4 N >>>> 5 N5 5 N >>>> 6 N6 6 N >>>> 7 N7 7 N >>>> 8 N8 8 N >>>> 9 T1 1 T >>>> 10 T2 2 T >>>> 11 T3 3 T >>>> 12 T4 4 T >>>> 13 T5 5 T >>>> 14 T6 6 T >>>> 15 T7 7 T >>>> 16 T8 8 T >>>> fit_pair<-lmFit(eset,design) >>>> fit_pair<-eBayes(fit_pair) >>>> >>>> R=topTable(fit_pair, coef="TreatT", adjust="BH",number=30) # >>>> display top >>>> 30 >>>> >>> > ______________________________**______________________________**______ > ____ The information in this email is confidential and > inte...{{dropped:26}} ------------------------------ Message: 6 Date: Sun, 28 Jul 2013 13:05:17 +1000 (AUS Eastern Standard Time) From: Gordon K Smyth <smyth@wehi.edu.au> To: Lorena Pantano <lorena.pantano at="" gmail.com=""> Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> Subject: Re: [BioC] weird model design for DE analysis Message-ID: <pine.wnt.4.64.1307281300370.6028 at="" pc975.wehi.edu.au=""> Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Dear Lorena, You are right to recognise that this design needs special treatment. You can analyse all the data together, but you will need a statistical procedure that can account for the fact that the control samples in the two batches are biologically paired, whereas all the other samples are biologically independent. This requires a method that can fit random effects for the individuals or can estimate within individual correlations. You cannot do this in edgeR, DESeq, DESeq2 or in any of the other RNA- seq packages based on the negative binomial distribution. These packages assume that all samples are statistically independent. The only valid approach that I know of would be to use voom and the duplicate correlation approach of the limma package. First, read in your data (for both batches) and normalize your data as in the edgeR package. If y is your DGEList object, then y <- calcNormFactors(y) Then setup you design matrix as you would do for limma or edgeR. This includes effects for the batch and treatment, but not effects for patients. Something like design <- model.matrix(~batch+condition) where batch takes two levels (A and B) and condition takes three levels (control, preclinical, clinical). Then transform to logCPM using the voom function. v <- voom(y, design, plot=TRUE) Then estimate the correlation between the repeat samples for the same individual: dupcor <- duplicateCorrelation(v, design, block=individual) Here 'individual' is a vector or factor indicating which individual each sample comes from. It has length equal to the number of samples and takes 21 levels, one for each individual in your data (7 controls, 7 preclinic patients, 7 clinical patients). Then conduct a differential expression analysis: fit <- lmFit(v, design, block=individual, correlation=dupcor$consensus) fit <- eBayes(fit) Be sure to check that the value of dupcor$conensus is positive. This analysis keeps track of the fact that the two samples from each control individual are correlated. Then you can use the topTable() function in the limma package to extract genes that follow any pattern of differential expression that you are interested in. Best wishes Gordon Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. http://www.statsci.org/smyth -------------- original message --------------- [BioC] weird model design for DE analysis Lorena Pantano lorena.pantano at gmail.com Fri Jul 26 14:53:36 CEST 2013 Hi, thanks for the quick reply! The control group are 14 samples, 7 batch A, 7 batch B. But those 7 are the same individuals, same RNA. So sample c1 in batch A, is coming from same individual than sample c1 in batch B. Hope that is cleat. And, yes, I also treated as separate level factor, but I want to check also this scenario, the increase or decrease of gene expression from control -> pre-case -> case. thanks again! Lo On Fri, Jul 26, 2013 at 2:29 PM, Michael Love <michaelisaiahlove at="" gmail.com="">wrote: >hi Lorena > >Can you clarify on a few points below: > >On Fri, Jul 26, 2013 at 1:29 PM, Lorena Pantano ><lorena.pantano at="" gmail.com=""> wrote: >> Hi, >> >> I have some doubts about my data. I would like to do DE analysis of >> RNAseq data. >> >> I have two set of experiments done in two different time points, so >> there is probably batch effect: let's say batch A and B. >> >> For A, I have 7 controls and 7 cases (clinical identified) For B, I >> have the same 7 controls and other 7 pre-cases (preclinical >> identified) >> >> I wanted to know if I can analyze all together, but I have questions >> like: >> >> 1-7 cases from A are only in batch A, and the others 7 are only in >> batch B, is it correct to setup a blocked variable indicating the >> batch effect although I only have one entire group for one batch, and >> another entire group for another batch? > > You can still use blocking. Because the control group spans batch A > and batch B, you can still estimate the batch effect (the model matrix > should still be full rank). > >> 2-and 7 control for A and B are same people, so it is not quite right >> to merge all together because it would be more like technical >> replicates. > > I don't understand this last sentence. Do you mean that some, but not > all, of individuals in the control group from batch A are also in > batch B? > > I wouldn't say they are technical replicates if the samples are drawn > from the same individual. RNA-Seq experiments of the same individual > will contain biological variability. > >> My goal is: >> >> 1-get DE genes that follow additive effect. Something like increase a >> little in pre-cases, and increase more in cases. > > It sounds to me like you might want to treat pre-cases and cases as > separate levels of a factor variable, rather than assume that the > trend will necessarily be: control < pre-case < case. But to receive > more specific advice, you should probably explain more about the share > of individuals across the 28 samples. > > Mike ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:15}}

RNASeq Microarray Genetics Annotation Normalization Infrastructure Cancer probe limma • 1.8k views

ADD COMMENT • link updated 10.7 years ago by Alejandro Reyes ★ 1.9k • written 10.7 years ago by Ioannis Vlachos ▴ 40

0

Entering edit mode

Alejandro Reyes ★ 1.9k

@alejandro-reyes-5124

Last seen 4 days ago

Novartis Institutes for BioMedical Rese…

Dear Ioannis, Thanks for your message, it is usually nice and easier to track the problem if you add the output of your sessionInfo and reproducible code, but I am guessing that you are using a old version of DEXSeq. The table that the HTML reporter produces a table with information for each gene, not for each exon. I would be almost sure that the 396 are the number of genes that have significant exons. As the documentation indicates, the parameter extraCols accepts a data frame in which the rownames are genes of your ExonCountSet object. Let me know if this is the problem, Best regards, Alejandro > Hello, > > A (hopefully) simple question. > > As I saw in ??DEXSeqHTML, there is an option to add attributes in the HTML > results (extracols). > > I downloaded from biomaRt a set aof attributes to test this (gene name, > biotype and description), and created a data frame having as a first column > the exons of my ecs (retaining exon order), geneID and the new attributes. > > The first two columns were taken by accessing fData from the ecs. > > The resulting data frame naturally had an equal number of rows with the > exons present in my ecs object (355320). > > I tried to set the extracols option in a different number of ways (putting > as row.names the geneID or exonID, having only one column etc), but I always > got this error: > > Error in data.frame(..., check.names = FALSE) : > arguments imply differing number of rows: 355320, 396 > > Where 355320 is the correct number of rows of both ecs exons and of my extra > attributes data.frame. (I haven't figured out what has 396 rows yet). > > I checked the online reference manual and there is no mention of the > extracols option. I also checked the code of the relevant function and I > don't see it get used somewhere. > > Any ideas? > > Is there another way to add gene or exon attributes in the HTML? > > Thank you all, > > Best regards, > > Ioannis >

ADD COMMENT • link 10.7 years ago Alejandro Reyes ★ 1.9k

0

Entering edit mode

Hello Alejandro and thanks for the prompt response, I am using DEXSeq 1.6, so that was not the problem. The second part of your advice was the solution to my torment. I will elaborate further, just in case anyone faces a similar situation. I was using all the geneIDs present in my ecs object. However, the HTML output used only the ones with significant exons (396, as you pointed out). So, I took all the genes from ecs, removed those with padjust less than what I've selected for the HTML formation, kept unique geneIDs and then downloaded the relevant info from biomaRt. I used also as rownames the geneIDs just to play safe but I'm not sure if it was necessary. The derived data.frame was my input to extraCols. It worked very nice and it embedded the extra info on the index page of the html. Quite handy. Thank you again for your help, Best regards, Ioannis -----Original Message----- From: Alejandro Reyes [mailto:alejandro.reyes@embl.de] Sent: Monday, July 29, 2013 10:42 AM To: Ioannis Vlachos Cc: bioconductor at r-project.org Subject: Re: [BioC] DexSeq extracols option Dear Ioannis, Thanks for your message, it is usually nice and easier to track the problem if you add the output of your sessionInfo and reproducible code, but I am guessing that you are using a old version of DEXSeq. The table that the HTML reporter produces a table with information for each gene, not for each exon. I would be almost sure that the 396 are the number of genes that have significant exons. As the documentation indicates, the parameter extraCols accepts a data frame in which the rownames are genes of your ExonCountSet object. Let me know if this is the problem, Best regards, Alejandro > Hello, > > A (hopefully) simple question. > > As I saw in ??DEXSeqHTML, there is an option to add attributes in the > HTML results (extracols). > > I downloaded from biomaRt a set aof attributes to test this (gene > name, biotype and description), and created a data frame having as a > first column the exons of my ecs (retaining exon order), geneID and the new attributes. > > The first two columns were taken by accessing fData from the ecs. > > The resulting data frame naturally had an equal number of rows with > the exons present in my ecs object (355320). > > I tried to set the extracols option in a different number of ways > (putting as row.names the geneID or exonID, having only one column > etc), but I always got this error: > > Error in data.frame(..., check.names = FALSE) : > arguments imply differing number of rows: 355320, 396 > > Where 355320 is the correct number of rows of both ecs exons and of my > extra attributes data.frame. (I haven't figured out what has 396 rows yet). > > I checked the online reference manual and there is no mention of the > extracols option. I also checked the code of the relevant function > and I don't see it get used somewhere. > > Any ideas? > > Is there another way to add gene or exon attributes in the HTML? > > Thank you all, > > Best regards, > > Ioannis >

ADD REPLY • link 10.7 years ago Ioannis Vlachos ▴ 40

0

Entering edit mode

Hello, One small tip that I found out the hard way. In DEXSeqHTML the results table [ results <- DEUresultTable(ecs)] is rounded to the third digit prior to padj filtering. Therefore, this might create discrepancies in nrows between the extraCols data.frame and the result table. So the recipe for a successful extraCols data.frame has been changed to: Get the results table-> round to the third digit-> filter with desired p-value (or q value)-> get the required info from biomaRt or wherever-> keep unique lines only and sort accordingly -> use as ecols dataframe Cheers. IV -----Original Message----- From: Alejandro Reyes [mailto:alejandro.reyes@embl.de] Sent: Monday, July 29, 2013 10:42 AM To: Ioannis Vlachos Cc: bioconductor at r-project.org Subject: Re: [BioC] DexSeq extracols option Dear Ioannis, Thanks for your message, it is usually nice and easier to track the problem if you add the output of your sessionInfo and reproducible code, but I am guessing that you are using a old version of DEXSeq. The table that the HTML reporter produces a table with information for each gene, not for each exon. I would be almost sure that the 396 are the number of genes that have significant exons. As the documentation indicates, the parameter extraCols accepts a data frame in which the rownames are genes of your ExonCountSet object. Let me know if this is the problem, Best regards, Alejandro > Hello, > > A (hopefully) simple question. > > As I saw in ??DEXSeqHTML, there is an option to add attributes in the > HTML results (extracols). > > I downloaded from biomaRt a set aof attributes to test this (gene > name, biotype and description), and created a data frame having as a > first column the exons of my ecs (retaining exon order), geneID and the new attributes. > > The first two columns were taken by accessing fData from the ecs. > > The resulting data frame naturally had an equal number of rows with > the exons present in my ecs object (355320). > > I tried to set the extracols option in a different number of ways > (putting as row.names the geneID or exonID, having only one column > etc), but I always got this error: > > Error in data.frame(..., check.names = FALSE) : > arguments imply differing number of rows: 355320, 396 > > Where 355320 is the correct number of rows of both ecs exons and of my > extra attributes data.frame. (I haven't figured out what has 396 rows yet). > > I checked the online reference manual and there is no mention of the > extracols option. I also checked the code of the relevant function > and I don't see it get used somewhere. > > Any ideas? > > Is there another way to add gene or exon attributes in the HTML? > > Thank you all, > > Best regards, > > Ioannis >

ADD REPLY • link 10.7 years ago Ioannis Vlachos ▴ 40

0

Entering edit mode

Hi Ioannis,

It might sound stupid but does the data.frame that is used with extraCols need to have the exonBaseMean, dispersion, counts and all the other details that the DEXSeqHTML object has? I followed the steps that you mentioned and could successfully incorporate the genenames and symbols in the html doc but DEXSeqHTML somehow doesn't generate the svg files for splicing/counts/expression when I use the extraCols. My data frame for extraCols has geneIDs as row names and two more columns with symbols and gene names.