Entering edit mode
Ioannis Vlachos
▴
40
@ioannis-vlachos-5634
Last seen 10.3 years ago
Hello,
A (hopefully) simple question.
As I saw in ??DEXSeqHTML, there is an option to add attributes in the
HTML
results (extracols).
I downloaded from biomaRt a set aof attributes to test this (gene
name,
biotype and description), and created a data frame having as a first
column
the exons of my ecs (retaining exon order), geneID and the new
attributes.
The first two columns were taken by accessing fData from the ecs.
The resulting data frame naturally had an equal number of rows with
the
exons present in my ecs object (355320).
I tried to set the extracols option in a different number of ways
(putting
as row.names the geneID or exonID, having only one column etc), but I
always
got this error:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 355320, 396
Where 355320 is the correct number of rows of both ecs exons and of my
extra
attributes data.frame. (I haven't figured out what has 396 rows yet).
I checked the online reference manual and there is no mention of the
extracols option. I also checked the code of the relevant function
and I
don't see it get used somewhere.
Any ideas?
Is there another way to add gene or exon attributes in the HTML?
Thank you all,
Best regards,
Ioannis
-----Original Message-----
From: bioconductor-bounces@r-project.org
[mailto:bioconductor-bounces at r-project.org] On Behalf Of
bioconductor-request at r-project.org
Sent: Sunday, July 28, 2013 1:00 PM
To: bioconductor at r-project.org
Subject: Bioconductor Digest, Vol 125, Issue 28
Send Bioconductor mailing list submissions to
bioconductor at r-project.org
To subscribe or unsubscribe via the World Wide Web, visit
https://stat.ethz.ch/mailman/listinfo/bioconductor
or, via email, send a message with subject or body 'help' to
bioconductor-request at r-project.org
You can reach the person managing the list at
bioconductor-owner at r-project.org
When replying, please edit your Subject line so it is more specific
than
"Re: Contents of Bioconductor digest..."
Today's Topics:
1. problem with installation package useR2013 (Bartek [guest])
2. Re: Quickest way to convert IDs in a data frame? (Enrico
Ferrero)
3. issue on removing all NAs (UNANNOTATED) rows (Guest [guest])
4. Re: problem with installation package useR2013 (Dan Tenenbaum)
5. Re: limma for homemade microarray - question on NAs and
multiple probes for one gene (Wang Peter)
6. Re: weird model design for DE analysis (Gordon K Smyth)
----------------------------------------------------------------------
Message: 1
Date: Sat, 27 Jul 2013 04:04:43 -0700 (PDT)
From: "Bartek [guest]" <guest@bioconductor.org>
To: bioconductor at r-project.org, bartek.taciak at gmail.com
Subject: [BioC] problem with installation package useR2013
Message-ID: <20130727110443.B4CAA142E45 at mamba.fhcrc.org>
Content-Type: text/plain; charset=iso-8859-1
Hello,
I cannot install package "useR2013", I try all possible commands from
bioconductor web site and nothing. All the time I get the same message
from
R. Can You give any tips?
-- output of sessionInfo():
install.packages("useR2013", repos=NULL, type="source") Warning in
install.packages :
package ???useR2013??? is not available (for R version 3.0.1)
Installing
package into ???C:/Users/Bartek/Documents/R/win-library/3.0???
(as ???lib??? is unspecified)
Ostrze??enie: b????dny pakiet 'useR2013'
B????D: B????D: nie okre??lono pakiet??w Warning in install.packages :
running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l
"C:\Users\Bartek\Documents\R\win-library\3.0" "useR2013"' had status 1
Warning in install.packages :
installation of package ???useR2013??? had non-zero exit status
OR
> source("http://bioconductor.org/biocLite.R")
> install.packages("useR2013")
Warning in install.packages :
package ???useR2013??? is not available (for R version 3.0.1)
Installing
package into ???C:/Users/Bartek/Documents/R/win-library/3.0???
(as ???lib??? is unspecified)
Warning in install.packages :
package ???useR2013??? is not available (for R version 3.0.1)
--
Sent via the guest posting facility at bioconductor.org.
------------------------------
Message: 2
Date: Sat, 27 Jul 2013 14:40:03 +0100
From: Enrico Ferrero <enricoferrero86@gmail.com>
To: "bioconductor at r-project.org" <bioconductor at="" r-project.org="">
Cc: James Macdonald <jmacdon at="" uw.edu="">
Subject: Re: [BioC] Quickest way to convert IDs in a data frame?
Message-ID:
<cao22hxdt-twxhubzbpkuxc-2+v7_rk_+tgrb_ojifypu1dnq4a at="" mail.gmail.com="">
Content-Type: text/plain; charset=ISO-8859-1
Hi everybody,
Marc, thanks for clarifying things. The behaviour of the select()
function
is absolutely sensible. Maybe it should be made explicit somewhere in
the
documentation that, when working with data frames, the user is
expected to
use the merge() function in conjunction with it. I also agree with
Herv?
that having options to tweak and customize the output would be an
extremely
positive thing and a step in the right direction. In addition to a
"select"
argument, one can also think of a "remove.na.rows" that evaluates to
either
TRUE or FALSE.
But then again, using merge() after select() already deals with these
issues
quite well.
What I think should be investigated more closely at the moment is the
unexpected behaviour select() exhibits when one SYMBOL or ALIAS (and
potentially other types of ID, I don't know) maps to more than one
ENTREZID.
As exemplified by James' code below, this causes the output to be
truncated,
and I highly doubt this is intentional:
> select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID", "ALIAS")
ALIAS ENTREZID
1 ADORA2A 135
2 ADORA2A 135
3 ADORA2A 135
4 ADORA2A 135
> select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
ALIAS ENTREZID
1 AGT 183
2 AGT 189
Warning message:
In .generateExtraRows(tab, keys, jointype) :
'select' and duplicate query keys resulted in 1:many mapping between
keys
and return rows
It would be great to have your views on this.
Best,
On 26 July 2013 21:46, Hervé Pagès <hpages at="" fhcrc.org=""> wrote:
> Hi Marc,
>
> On 07/26/2013 12:57 PM, Marc Carlson wrote:
> ...
>
>> Hello everyone,
>>
>> Sorry that I saw this thread so late. Basically, select() does
*try*
>> to keep your initial keys and map them each to an equivalent number
>> of unique values. We did actually anticipate that people would
>> *want* to
>> cbind() their results.
>>
>> But as you discovered there are many circumstances where the data
>> make this kind of behavior impossible.
>>
>> So passing in NAs as keys for example can't ever find anything
>> meaningful. Those will simply have to be removed before we can
>> proceed. And, it is also impossible to maintain a 1:1 mapping if
you
>> retrieve fields that have many to one relationships with your
initial
>> keys (also seen here).
>>
>> For convenience, when this kind of 1:1 output is already impossible
>> (as it is for most of your examples), select will also try to
>> simplify the output by removing rows that are identical all the way
across etc..
>>
>> My aim was that select should try to do the most reasonable thing
>> possible based on the data we have in each case. The rationale is
>> that in the case where there are 1:many mappings, you should not be
>> planning to bind those directly onto any other data.frames anyways
>> (as this circumstance would require you to call merge() instead).
So
>> in that case, non-destructive simplification seems beneficial.
>
>
> Other tools in our infrastructure use an extra argument to pick-up 1
> thing in case of multiple mapping e.g. findOverlaps() has the
'select'
> argument with possible values "all", "first", "last", and
"arbitrary".
> Also nearest() and family have this argument and it accepts similar
> values.
>
> Couldn't select() use a similar approach? The default should be
"all"
> so the current behavior is preserved but if it's something else then
> the returned data.frame should align with the input.
>
> Thanks,
>
> H.
>
>
>>
>> I hope this clarifies things,
>>
>>
>> Marc
>>
>>
>>
>>>
>>>> As I
>>>> mentioned in my first post, the for loop function works, but it's
>>>> highly inefficient.
>>>>
>>>> Any help is greatly appreciated, thank you.
>>>>
>>>> Best,
>>>>
>>>>
>>>>
>>>> On 25 July 2013 23:18, Hervé Pagès <hpages at="" fhcrc.org=""> wrote:
>>>>>
>>>>> Hi James,
>>>>>
>>>>> You're right.
>>>>>
>>>>> It's actually both: NAs *and* duplicated keys that are mapped to
>>>>> more than 1 row are removed from the input. I don't think this
is
>>>>> documented.
>>>>>
>>>>> I wonder if select() behavior couldn't be a little bit simpler
by
>>>>> either preserving or removing all duplicated keys, and not just
>>>>> some of them (on a somewhat arbitrary criteria).
>>>>>
>>>>> Thanks,
>>>>> H.
>>>>>
>>>>>
>>>>>
>>>>> On 07/25/2013 02:57 PM, James W. MacDonald wrote:
>>>>>>
>>>>>>
>>>>>> Hi Enrico and Herve,
>>>>>>
>>>>>> This has to do with duplicate entries, but only when the
>>>>>> duplicate entry maps to many ENTREZID:
>>>>>>
>>>>>> > select(org.Hs.eg.db, rep("ADORA2A", 4), "ENTREZID",
"ALIAS")
>>>>>> ALIAS ENTREZID
>>>>>> 1 ADORA2A 135
>>>>>> 2 ADORA2A 135
>>>>>> 3 ADORA2A 135
>>>>>> 4 ADORA2A 135
>>>>>>
>>>>>> > select(org.Hs.eg.db, rep("AGT", 4), "ENTREZID", "ALIAS")
>>>>>> ALIAS ENTREZID
>>>>>> 1 AGT 183
>>>>>> 2 AGT 189
>>>>>> Warning message:
>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>> 'select' and duplicate query keys resulted in 1:many
mapping
>>>>>> between keys and return rows
>>>>>>
>>>>>> > select(org.Hs.eg.db, "AGT", "ENTREZID", "ALIAS")
>>>>>> ALIAS ENTREZID
>>>>>> 1 AGT 183
>>>>>> 2 AGT 189
>>>>>> Warning message:
>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>> 'select' resulted in 1:many mapping between keys and return
>>>>>> rows
>>>>>>
>>>>>>
>>>>>> So in the instances where a gene symbol maps to more than one
>>>>>> ENTREZID, the output gets truncated, whereas if it is a
>>>>>> one-to-one mapping, it does not.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 7/25/2013 5:06 PM, Enrico Ferrero wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Herv?, that's exactly what I'm trying to say.
>>>>>>>
>>>>>>> Attached to this email is a tab delimited file with two
columns
>>>>>>> of GeneSymbols (or Aliases), and here is some simple code to
>>>>>>> reproduce the unexpected behaviour:
>>>>>>>
>>>>>>> library(org.Hs.eg.db)
>>>>>>> mydf<- read.table("testdata.txt", sep="\t", header=TRUE,
>>>>>>> as.is=TRUE)
>>>>>>> mytest<- select(org.Hs.eg.db, key=mydf$GeneSymbol1,
>>>>>>> keytype="ALIAS",
>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>> # check that mytest has less rows than mydf
>>>>>>> nrow(mydf)
>>>>>>> nrow(mytest)
>>>>>>> # pick a random row: they don't match mydf[250,] mytest[250,]
>>>>>>>
>>>>>>> Ideally, mytest should have the same number and position of
rows
>>>>>>> of mydf so that I can then cbind them.
>>>>>>> If mytest has more rows because of multiple mappings that's
also
>>>>>>> fine:
>>>>>>> I can always use merge(mydf, mytest), right?
>>>>>>>
>>>>>>> Thanks a lot to both for your help, it's very appreciated.
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>> On 25 July 2013 21:32, Hervé Pagès<hpages at="" fhcrc.org="">
wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Enrico,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 07/25/2013 01:20 PM, James W. MacDonald wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Enrico,
>>>>>>>>>
>>>>>>>>> Please don't take things off-list (e.g., use reply-all).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 7/25/2013 2:17 PM, Enrico Ferrero wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi James,
>>>>>>>>>>
>>>>>>>>>> Thanks very much for your help.
>>>>>>>>>> There is an issue that needs to be solved before thinking
>>>>>>>>>> about what's the best approach in my opinion.
>>>>>>>>>>
>>>>>>>>>> I don't understand why, but the object created with the
call
>>>>>>>>>> to select (test in my example, first.two in yours) has a
>>>>>>>>>> different number of rows from the original object (df in my
>>>>>>>>>> example). Specifically it has
>>>>>>>>>> *less* rows.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm surprised it has less rows. It can definitely have more,
>>>>>>>> when some of the keys passed to select() are mapped to more
>>>>>>>> than 1 row, but my understanding was that select() would
>>>>>>>> propagate unmapped keys to the output by placing them in rows
>>>>>>>> stuffed with NAs. So maybe I misunderstood how select()
works,
>>>>>>>> or its behavior was changed, or there is a bug somewhere.
Could
>>>>>>>> you please send the code that allows us to reproduce this?
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> H.
>>>>>>>>
>>>>>>>>
>>>>>>>>> If all symbols were converted to all possible Entrez IDs,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I would expect it to have more rows, not less. To me, it
>>>>>>>>>> looks like not all rows are looked up and returned.
>>>>>>>>>>
>>>>>>>>>> Do you see what I mean?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sure. You could be using outdated gene symbols. Or perhaps
you
>>>>>>>>> are using a mixture of symbols and aliases. Which is even
>>>>>>>>> cooler than just all
>>>>>>>>> symbols:
>>>>>>>>>
>>>>>>>>> > symb<- c(Rkeys(org.Hs.egSYMBOL)[1:10],
>>>>>>>>> Rkeys(org.Hs.egALIAS2EG)[31:45])
>>>>>>>>> > symb
>>>>>>>>> [1] "A1BG" "A2M" "A2MP1" "NAT1" "NAT2"
"AACP"
>>>>>>>>> [7] "SERPINA3" "AADAC" "AAMP" "AANAT" "AAMP"
"AANAT"
>>>>>>>>> [13] "DSPS" "SNAT" "AARS" "CMT2N" "AAV"
"AAVS1"
>>>>>>>>> [19] "ABAT" "GABA-AT" "GABAT" "NPD009" "ABC-1"
"ABC1"
>>>>>>>>> [25] "ABCA1"
>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","SYMBOL")
>>>>>>>>> SYMBOL ENTREZID
>>>>>>>>> 1 A1BG 1
>>>>>>>>> 2 A2M 2
>>>>>>>>> 3 A2MP1 3
>>>>>>>>> 4 NAT1 9
>>>>>>>>> 5 NAT2 10
>>>>>>>>> 6 AACP 11
>>>>>>>>> 7 SERPINA3 12
>>>>>>>>> 8 AADAC 13
>>>>>>>>> 9 AAMP 14
>>>>>>>>> 10 AANAT 15
>>>>>>>>> 11 AAMP 14
>>>>>>>>> 12 AANAT 15
>>>>>>>>> 13 DSPS<na>
>>>>>>>>> 14 SNAT<na>
>>>>>>>>> 15 AARS 16
>>>>>>>>> 16 CMT2N<na>
>>>>>>>>> 17 AAV<na>
>>>>>>>>> 18 AAVS1 17
>>>>>>>>> 19 ABAT 18
>>>>>>>>> 20 GABA-AT<na>
>>>>>>>>> 21 GABAT<na>
>>>>>>>>> 22 NPD009<na>
>>>>>>>>> 23 ABC-1<na>
>>>>>>>>> 24 ABC1<na>
>>>>>>>>> 25 ABCA1 19
>>>>>>>>> > select(org.Hs.eg.db, symb, "ENTREZID","ALIAS")
>>>>>>>>> ALIAS ENTREZID
>>>>>>>>> 1 A1BG 1
>>>>>>>>> 2 A2M 2
>>>>>>>>> 3 A2MP1 3
>>>>>>>>> 4 NAT1 9
>>>>>>>>> 5 NAT1 1982
>>>>>>>>> 6 NAT1 6530
>>>>>>>>> 7 NAT1 10991
>>>>>>>>> 8 NAT2 10
>>>>>>>>> 9 NAT2 81539
>>>>>>>>> 10 AACP 11
>>>>>>>>> 11 SERPINA3 12
>>>>>>>>> 12 AADAC 13
>>>>>>>>> 13 AAMP 14
>>>>>>>>> 14 AANAT 15
>>>>>>>>> 15 DSPS 15
>>>>>>>>> 16 SNAT 15
>>>>>>>>> 17 AARS 16
>>>>>>>>> 18 CMT2N 16
>>>>>>>>> 19 AAV 17
>>>>>>>>> 20 AAVS1 17
>>>>>>>>> 21 ABAT 18
>>>>>>>>> 22 GABA-AT 18
>>>>>>>>> 23 GABAT 18
>>>>>>>>> 24 NPD009 18
>>>>>>>>> 25 ABC-1 19
>>>>>>>>> 26 ABC1 19
>>>>>>>>> 27 ABC1 63897
>>>>>>>>> 28 ABCA1 19
>>>>>>>>> Warning message:
>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>> 'select' and duplicate query keys resulted in 1:many
>>>>>>>>> mapping between keys and return rows
>>>>>>>>> > mget(c("1982","6530","10991"), org.Hs.egGENENAME)
>>>>>>>>> $`1982` [1] "eukaryotic translation initiation factor 4
gamma,
>>>>>>>>> 2"
>>>>>>>>>
>>>>>>>>> $`6530`
>>>>>>>>> [1] "solute carrier family 6 (neurotransmitter transporter,
>>>>>>>>> noradrenalin), member 2"
>>>>>>>>>
>>>>>>>>> $`10991`
>>>>>>>>> [1] "solute carrier family 38, member 3"
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jim
>>>>>>>>>
>>>>>>>>>> On 25 July 2013 18:17, James W. MacDonald<jmacdon at="" uw.edu="">
wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 7/25/2013 12:56 PM, Enrico Ferrero wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Dear James,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks very much for your prompt reply.
>>>>>>>>>>>> I knew the problem was the for loop and the select
function
>>>>>>>>>>>> is indeed a lot faster than that and works perfectly with
>>>>>>>>>>>> toy data.
>>>>>>>>>>>>
>>>>>>>>>>>> However, this is what happens when I try to use it with
>>>>>>>>>>>> real
>>>>>>>>>>>> data:
>>>>>>>>>>>>
>>>>>>>>>>>>> test<- select(org.Hs.eg.db, keys=df$GeneSymbol,
>>>>>>>>>>>>> keytype="ALIAS",
>>>>>>>>>>>>> cols=c("SYMBOL","ENTREZID","ENSEMBL"))
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Warning message:
>>>>>>>>>>>> In .generateExtraRows(tab, keys, jointype) :
>>>>>>>>>>>> 'select' and duplicate query keys resulted in
1:many
>>>>>>>>>>>> mapping between keys and return rows
>>>>>>>>>>>>
>>>>>>>>>>>> which is probably the warning you mentioned.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> That's not the warning I mentioned, but it does point out
>>>>>>>>>>> the same issue, which is that there is a one to many
mapping
>>>>>>>>>>> between symbol and entrez gene ID.
>>>>>>>>>>>
>>>>>>>>>>> So now you have to decide if you want to be naive (or
>>>>>>>>>>> stupid, depending on your perspective) or not. You could
>>>>>>>>>>> just cover your eyes and do this:
>>>>>>>>>>>
>>>>>>>>>>> first.two<- first.two[!duplicated(first.two$SYMBOL),]
>>>>>>>>>>>
>>>>>>>>>>> which will choose for you the first symbol -> gene ID
>>>>>>>>>>> mapping and nuke the rest. That's nice and quick, but you
>>>>>>>>>>> are making huge assumptions.
>>>>>>>>>>>
>>>>>>>>>>> Or you could decide to be a bit more sophisticated and do
>>>>>>>>>>> something like
>>>>>>>>>>>
>>>>>>>>>>> thelst<- tapply(1:nrow(first.two), first.two$SYMBOL,
>>>>>>>>>>> function(x)
>>>>>>>>>>> first.two[x,])
>>>>>>>>>>>
>>>>>>>>>>> At this point you can take a look at e.g., thelst[1:10] to
>>>>>>>>>>> see what we just did
>>>>>>>>>>>
>>>>>>>>>>> thelst<- do.call("rbind", lapply(thelst, function(x)
>>>>>>>>>>> c(x[1,1], paste(x[,2], collapse = "|")))
>>>>>>>>>>>
>>>>>>>>>>> and here you can look at head(thelst).
>>>>>>>>>>>
>>>>>>>>>>> Then you can check to ensure that the first column of
thelst
>>>>>>>>>>> is identical to the first column of df, and proceed as
>>>>>>>>>>> before.
>>>>>>>>>>>
>>>>>>>>>>> But there is still the problem of the multiple mappings.
As
>>>>>>>>>>> an
>>>>>>>>>>> example:
>>>>>>>>>>>
>>>>>>>>>>>> thelst[1:5]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> $HBD
>>>>>>>>>>> SYMBOL ENTREZID
>>>>>>>>>>> 2535 HBD 3045
>>>>>>>>>>> 2536 HBD 100187828
>>>>>>>>>>>
>>>>>>>>>>> $KIR3DL3
>>>>>>>>>>> SYMBOL ENTREZID
>>>>>>>>>>> 17513 KIR3DL3 115653
>>>>>>>>>>> 17514 KIR3DL3 100133046
>>>>>>>>>>>
>>>>>>>>>>>> mget(as.character(thelst[[1]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> $`3045`
>>>>>>>>>>> [1] "hemoglobin, delta"
>>>>>>>>>>>
>>>>>>>>>>> $`100187828`
>>>>>>>>>>> [1] "hypophosphatemic bone disease"
>>>>>>>>>>>
>>>>>>>>>>>> mget(as.character(thelst[[2]][,2]), org.Hs.egGENENAME)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> $`115653`
>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor, three
>>>>>>>>>>> domains, long cytoplasmic tail, 3"
>>>>>>>>>>>
>>>>>>>>>>> $`100133046`
>>>>>>>>>>> [1] "killer cell immunoglobulin-like receptor three
domains
>>>>>>>>>>> long cytoplasmic tail 3"
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So HBD is the gene symbol for two different genes! If this
>>>>>>>>>>> gene symbol is in your data, you will now have attributed
>>>>>>>>>>> your data to two genes that apparently are not remotely
>>>>>>>>>>> similar. if KIR3DL3 is in your data, then it worked out OK
>>>>>>>>>>> for that gene.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Jim
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> The real problem is that the number of rows is now
>>>>>>>>>>>> different for the 2
>>>>>>>>>>>> objects:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> nrow(df); nrow(test)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [1] 573
>>>>>>>>>>>> [1] 201
>>>>>>>>>>>>
>>>>>>>>>>>> So I obviously can't put the new data into the original
df.
>>>>>>>>>>>> My impression is that when the 1 to many mapping arises,
>>>>>>>>>>>> the select functions exits, with that warning message. As
a
>>>>>>>>>>>> result, my test object is incomplete.
>>>>>>>>>>>>
>>>>>>>>>>>> On top of that, and I can't really explain this, the row
>>>>>>>>>>>> positions are messed up, e.g.
>>>>>>>>>>>>
>>>>>>>>>>>>> all.equal(df[100,],test[100,])
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> returns FALSE.
>>>>>>>>>>>>
>>>>>>>>>>>> How can I work around this?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot!
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>>
>>>>>>>>>>>> On 25 July 2013 16:58, James W. MacDonald<jmacdon at="" uw.edu="">
wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Enrico,
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/25/2013 11:35 AM, Enrico Ferrero wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I often have data frames where I need to perform ID
>>>>>>>>>>>>>> conversions on one or more of the columns while
>>>>>>>>>>>>>> preserving the order of the rows,
>>>>>>>>>>>>>> e.g.:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> GeneSymbol Value1 Value2
>>>>>>>>>>>>>> GS1 2.5 0.1
>>>>>>>>>>>>>> GS2 3 0.2
>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And I want to obtain:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> GeneSymbol EntrezGeneID Value1 Value2
>>>>>>>>>>>>>> GS1 EG1 2.5 0.1
>>>>>>>>>>>>>> GS2 EG2 3 0.2
>>>>>>>>>>>>>> ..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What I've done so far was to create a function that
uses
>>>>>>>>>>>>>> org.Hs.eg.db to loop over the rows of the column and
does
>>>>>>>>>>>>>> the conversion:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> library(org.Hs.eg.db)
>>>>>>>>>>>>>> alias2EG<- function(x) {
>>>>>>>>>>>>>> for (i in 1:length(x)) {
>>>>>>>>>>>>>> if (!is.na(x[i])) {
>>>>>>>>>>>>>> repl<- org.Hs.egALIAS2EG[[x[i]]][1] if (!is.null(repl))
{
>>>>>>>>>>>>>> x[i]<- repl
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> else {
>>>>>>>>>>>>>> x[i]<- NA
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> return(x)
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I should first note that gene symbols are not unique, so
>>>>>>>>>>>>> you are taking a chance on your mappings. Is there no
>>>>>>>>>>>>> other annotation for your data?
>>>>>>>>>>>>>
>>>>>>>>>>>>> In addition, you should note that it is almost always
>>>>>>>>>>>>> better to think of objects as vectors and matrices in R,
>>>>>>>>>>>>> rather than as things that need to be looped over (e.g.,
R
>>>>>>>>>>>>> isn't Perl or C).
>>>>>>>>>>>>>
>>>>>>>>>>>>> first.two<- select(org.Hs.eg.db,
>>>>>>>>>>>>> as.character(df$GeneSymbol), "ENTREZID",
>>>>>>>>>>>>> "SYMBOL")
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that there used to be a warning or an error (don't
>>>>>>>>>>>>> remember
>>>>>>>>>>>>> which)
>>>>>>>>>>>>> when
>>>>>>>>>>>>> you did something like this, stating that gene symbols
are
>>>>>>>>>>>>> not unique, and that you shouldn't do this sort of
thing.
>>>>>>>>>>>>> Apparently this warning has been removed, but the issue
>>>>>>>>>>>>> remains valid.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ## check yourself
>>>>>>>>>>>>>
>>>>>>>>>>>>> all.equal(df$GeneSymbol, first.two$SYMBOL)
>>>>>>>>>>>>>
>>>>>>>>>>>>> ## if true, proceed
>>>>>>>>>>>>>
>>>>>>>>>>>>> df<- data.frame(first.two, df[,-1])
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jim
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> and then call the function like this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> df$EntrezGeneID<- alias2GS(df$GeneSymbol)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This works well, but gets very slow when I need to do
>>>>>>>>>>>>>> multiple conversions on large datasets.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there any way I can achieve the same result but in a
>>>>>>>>>>>>>> quicker, more efficient way?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> James W. MacDonald, M.S.
>>>>>>>>>>>>> Biostatistician
>>>>>>>>>>>>> University of Washington
>>>>>>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>>>>>>> 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
>>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> James W. MacDonald, M.S.
>>>>>>>>>>> Biostatistician
>>>>>>>>>>> University of Washington
>>>>>>>>>>> Environmental and Occupational Health Sciences
>>>>>>>>>>> 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> --
>>>>>>>> Hervé Pagès
>>>>>>>>
>>>>>>>> Program in Computational Biology Division of Public Health
>>>>>>>> Sciences Fred Hutchinson Cancer Research Center
>>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>>> P.O. Box 19024
>>>>>>>> Seattle, WA 98109-1024
>>>>>>>>
>>>>>>>> E-mail: hpages at fhcrc.org
>>>>>>>> Phone: (206) 667-5791
>>>>>>>> Fax: (206) 667-1319
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences Fred Hutchinson Cancer
Research
>>>>> Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone: (206) 667-5791
>>>>> Fax: (206) 667-1319
>>>>
>>>>
>>>>
>>>>
>>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Enrico Ferrero
PhD Student
Steve Russell Lab - Department of Genetics FlyChip - Cambridge Systems
Biology Centre University of Cambridge
e.ferrero at gen.cam.ac.uk
http://flypress.gen.cam.ac.uk/
------------------------------
Message: 3
Date: Sat, 27 Jul 2013 06:51:31 -0700 (PDT)
From: "Guest [guest]" <guest@bioconductor.org>
To: bioconductor at r-project.org, jial2 at mail.nih.gov
Subject: [BioC] issue on removing all NAs (UNANNOTATED) rows
Message-ID: <20130727135131.22869142E5F at mamba.fhcrc.org>
Hi,
I am using limma package on HuGene2.0st array. When I tried to remove
the
unannotated rows (NAs) after mapping probeID to gene names. Some
significant
gene lists I got have the differentially expressed genes, but some of
gene
lists have all of the unsignificantly expressed genes.
The code I used as follows:
cont.matrix <- makeContrasts(
interaction=(T.Ko-T.wt)-(ctrl.Ko-ctrl.wt),
levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
sig.interaction = topTable(fit2, coef = "interaction",
number=nrow(fit2),
p.value=0.05, lfc=1) interaction.ids=sig.interaction[["ID"]]
## map probe ids to gene names...
interaction.sig.SYMBOL=mget(interaction.ids,
hugene20sttranscriptclusterSYMBOL, ifnotfound=NA) #REMOVE ALL NAs
(UNANNOTATED) ROWS and unlist them for easy formatting later
interaction.sig.ann=unlist(interaction.sig.SYMBOL[!is.na(interaction.s
ig.SYM
BOL)])
Does anyone know what's wrong with the code? No error message. Any
input is
appreciated.
Thanks,
-- output of sessionInfo():
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
--
Sent via the guest posting facility at bioconductor.org.
------------------------------
Message: 4
Date: Sat, 27 Jul 2013 07:39:59 -0700
From: Dan Tenenbaum <dtenenba@fhcrc.org>
To: "Bartek [guest]" <guest at="" bioconductor.org="">
Cc: bartek.taciak at gmail.com, "bioconductor at r-project.org"
<bioconductor at="" r-project.org="">
Subject: Re: [BioC] problem with installation package useR2013
Message-ID:
<caf42j238ctzetfhfyfsp8hc6elpv8puy975io9+sg1q8rszpew at="" mail.gmail.com="">
Content-Type: text/plain; charset=UTF-8
On Sat, Jul 27, 2013 at 4:04 AM, Bartek [guest] <guest at="" bioconductor.org="">
wrote:
>
> Hello,
> I cannot install package "useR2013", I try all possible commands
from
bioconductor web site and nothing. All the time I get the same message
from
R. Can You give any tips?
>
> -- output of sessionInfo():
>
> install.packages("useR2013", repos=NULL, type="source") Warning in
> install.packages :
> package ???useR2013??? is not available (for R version 3.0.1)
> Installing package into ???C:/Users/Bartek/Documents/R/win-
library/3.0???
> (as ???lib??? is unspecified)
> Ostrze??enie: b????dny pakiet 'useR2013'
> B? ??D: B? ??D: nie okre??lono pakiet??w Warning in install.packages
:
> running command '"C:/PROGRA~1/R/R-30~1.1/bin/x64/R" CMD INSTALL -l
> "C:\Users\Bartek\Documents\R\win-library\3.0" "useR2013"' had status
1
Warning in install.packages :
> installation of package ???useR2013??? had non-zero exit status
>
> OR
>
>> source("http://bioconductor.org/biocLite.R")
>> install.packages("useR2013")
> Warning in install.packages :
> package ???useR2013??? is not available (for R version 3.0.1)
> Installing package into ???C:/Users/Bartek/Documents/R/win-
library/3.0???
> (as ???lib??? is unspecified)
> Warning in install.packages :
> package ???useR2013??? is not available (for R version 3.0.1)
>
Follow the instructions at:
http://bioconductor.org/help/course-materials/2013/useR2013/
Dan
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
------------------------------
Message: 5
Date: Sun, 28 Jul 2013 11:02:01 +0800
From: Wang Peter <wng.peter@gmail.com>
To: Gordon K Smyth <smyth at="" wehi.edu.au="">
Cc: Bioconductor mailing list <bioconductor at="" r-project.org="">
Subject: Re: [BioC] limma for homemade microarray - question on NAs
and multiple probes for one gene
Message-ID:
<cakh7rfadl2qp6mp-r=ptuqs+5lhtq04s_z3gfyobetpw+uzseg at="" mail.gmail.com="">
Content-Type: text/plain
sorry, i have a question.
why did you do log(M)
did you only extract matched probe signals?
shan
On Thu, Jul 25, 2013 at 7:21 AM, Gordon K Smyth <smyth at="" wehi.edu.au="">
wrote:
> Dear Zhengyu,
>
>
> On Mon, 22 Jul 2013, zhengyu jiang wrote:
>
> Hi Gordon,
>>
>> Sorry to bother you again. I have two questions.
>>
>> (1) I started the "eset<-as.matrix(log(M)) #### take log
intensities"
>> with a matrix and I have a separate annotation file which contains
>> columns of "ID", "Sample Name", "Chr", "Coordinate", ""GeneSymbol".
>> The following code to top list is below. How can I add the
>> annotation to the final toplist output ?
>>
>
> The first and best way to answer questions is to read the
documentation.
> If you type ?topTable, you will see that there is an argument for a
> data.frame containing gene annotation. So you read the annotation
> into a data.frame, then use
>
> topTable(fit, ..., genelist=yourdatannotation)
>
>
> eset <- normalizeBetweenArrays(eset, method="quantile") ## quantile
>> normalization for all arrays
>> design<-model.matrix(~Pair+**Treat)
>> targets<-readTargets("targets.**txt",row.names=1) ## or row name
=1
>> Pair<-factor(targets$Pair)
>> Treat<-factor(targets$**Treatment,levels=c("N","T"))
>> fit_pair<-lmFit(eset,design)
>> plotSA(fit_pair)
>> fit= eBayes(fit_pair, trend=TRUE)
>> R=topTable(fit, coef="TreatT", adjust="BH",number=30
>>
>> (2) I have 6 sample data (3 control and 3 treatment) in one file.
>>
>
> Is this an Illumina BeadChip. You don't say but, if not, using
> read.ilmn is not appropriate.
>
>
> It contains columns of "ID", "Sample Name", "Chr", "Coordinate",
>> ""GeneSymbol", "XRaw" (expression raw data). I don't have control
probes.
>> If I want to use the Spot quality weights function "wt.fun" to read
>> in these data as described in the limma manual to define all XRaw
>> values below
>> 1000 as 0 weight,
>>
>
> And yet I have already asked you please not to do this.
>
>
> is that possible to modify the code? I can change all column names
if
>> needed. But I don't know what columns are required for read.illmn?
>>
>
> Again, please read the documentation. Type ?read.ilmn and you will
> see that there is no argument called wt.fun. You can only use
> read.ilmn for Illumina BeadChips. This type of array does not have
> spots, so trying to set spot weights would obviously make no sense.
>
> The Biconductor posting guide says
>
> "Read the relevant R documentation ... If you are having trouble
with
> function somefunc, try ?somefunc."
>
> The limma User's Guide says
>
> "Mailing list etiquette requires that you read the relevant help
page
> carefully before posting a problem to the list."
>
> Best wishes
> Gordon
>
>
> myfun <- function(x) as.numeric(x$XRaw <1000)
>> read.ilmn("062813_data.txt",**probeid="ID",annotation=c("**
>> Chr","Coordinate","GeneSymbol"**,expr="XRaw"),
>> wt.fun=myfun)
>> Thanks,
>> Zhengyu
>>
>>
>> On Tue, Jul 9, 2013 at 7:34 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote:
>>
>> Dear Zhengyu,
>>>
>>> Date: Mon, 8 Jul 2013 20:40:14 -0400
>>>
>>>> From: zhengyu jiang <zhyjiang2006 at="" gmail.com="">
>>>> To: bioconductor at r-project.org
>>>> Subject: [BioC] limma for homemade microarray - question on NAs
and
>>>> multiple probes for one gene
>>>>
>>>> Dear Bioconductor experts,
>>>>
>>>> We have data from a homemade one-channel microarray that I tried
to
>>>> apply limma for differential expression analysis between matched
>>>> paired Normal
>>>> (N) and Tumor (Tumor) samples - 8 biological replicates (one tech
>>>> replicate has been averaged after normalization). All samples are
>>>> formatted in one matrix (M).
>>>>
>>>>
>>> This is a standard problem, well covered in the limma User's
Guide.
>>>
>>> Signals have been quantile normalized between each paired normal
>>> and
>>>
>>>> tumor.
>>>>
>>>>
>>> I don't know what you mean by this. I recommend ordinary quantile
>>> normalization of all the arrays together, without regard to which
>>> sample is paired with which.
>>>
>>> Signal values below 5 (log scale) have been replaced by "NA"
since
>>> they
>>>
>>>> are
>>>> potentially noises. So there are many NAs in M.
>>>>
>>>>
>>> Please don't do this. limma can deal with low intensities
perfectly
>>> weel, and can figure out whether they are noise or not. You are
>>> simply removing valid data.
>>>
>>> If you are concerned about high variability of low intensity
probes,
>>> then look at
>>>
>>> plotSA(fit_pair)
>>>
>>> to examine the mean-variance trend, and use
>>>
>>> eBayes(fit_pair, trend=TRUE)
>>>
>>> if there is a strong trend.
>>>
>>> I followed the user manual and made the codes below.
>>>
>>>>
>>>> I think the code is correct?
>>>>
>>>>
>>> Looks ok.
>>>
>>> My questions are (1) how to deal with NAs - as I did a search but
>>> no
>>>
>>>> clear idea
>>>>
>>>>
>>> Don't create them in the first place. This has been said many
times
>>> before!
>>>
>>> (2) how do people do the statistics at the gene level for one
gene
>>> having
>>>
>>>> multiple probes - averaging or taking median?
>>>>
>>>>
>>> Usually we do not find it necessary to aggregate the multiple
>>> probes. The multiple probes might represent different transcripts
or
>>> variants, and some of the probes might not work. Why do you need
to
>>> take any special action?
>>>
>>> However, if you feel that you must, the best method to aggregate
the
>>> multiple probes is probably to retain the probe for each gene with
>>> highest fit_pair$Amean value. We have used this strategy in a
>>> number of published papers. The rationale for this is to choose
the
>>> probe corresponding to the most highly expressed transcript for
the
>>> gene. Alternatively, you could average the probes using the
>>> avereps() function in limma.
>>>
>>> Best wishes
>>> Gordon
>>>
>>> Thanks,
>>>
>>>> Zhengyu
>>>>
>>>>
>>>> head(M)
>>>>>
>>>> N1 N2 N3 N4 N5 N6 N7
>>>> N8 T1 T2 T3
>>>> 2 8.622724 7.423568 NA NA 7.487174 NA 8.516293
>>>> NA
>>>> 7.876259 7.856707 NA
>>>> T4 T5 T6 T7 T8
>>>> 2 NA 7.720018 NA 7.752550 NA
>>>>
>>>> eset<-as.matrix(M)
>>>>
>>>>> Pair=factor(targets$Pair)
>>>>> Treat=factor(targets$****Treatment,levels=c("N","T")) #
>>>>> compared matched
>>>>>
>>>>> normal to tumors
>>>>
>>>> design<-model.matrix(~Pair+****Treat)
>>>>> targets
>>>>>
>>>>> FilenName Pair Treatment
>>>> 1 N1 1 N
>>>> 2 N2 2 N
>>>> 3 N3 3 N
>>>> 4 N4 4 N
>>>> 5 N5 5 N
>>>> 6 N6 6 N
>>>> 7 N7 7 N
>>>> 8 N8 8 N
>>>> 9 T1 1 T
>>>> 10 T2 2 T
>>>> 11 T3 3 T
>>>> 12 T4 4 T
>>>> 13 T5 5 T
>>>> 14 T6 6 T
>>>> 15 T7 7 T
>>>> 16 T8 8 T
>>>> fit_pair<-lmFit(eset,design)
>>>> fit_pair<-eBayes(fit_pair)
>>>>
>>>> R=topTable(fit_pair, coef="TreatT", adjust="BH",number=30) #
>>>> display top
>>>> 30
>>>>
>>>
>
______________________________**______________________________**______
> ____ The information in this email is confidential and
> inte...{{dropped:26}}
------------------------------
Message: 6
Date: Sun, 28 Jul 2013 13:05:17 +1000 (AUS Eastern Standard Time)
From: Gordon K Smyth <smyth@wehi.edu.au>
To: Lorena Pantano <lorena.pantano at="" gmail.com="">
Cc: Bioconductor mailing list <bioconductor at="" r-project.org="">
Subject: Re: [BioC] weird model design for DE analysis
Message-ID: <pine.wnt.4.64.1307281300370.6028 at="" pc975.wehi.edu.au="">
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Dear Lorena,
You are right to recognise that this design needs special treatment.
You can analyse all the data together, but you will need a statistical
procedure that can account for the fact that the control samples in
the two
batches are biologically paired, whereas all the other samples are
biologically independent. This requires a method that can fit random
effects for the individuals or can estimate within individual
correlations.
You cannot do this in edgeR, DESeq, DESeq2 or in any of the other RNA-
seq
packages based on the negative binomial distribution. These packages
assume
that all samples are statistically independent.
The only valid approach that I know of would be to use voom and the
duplicate correlation approach of the limma package. First, read in
your
data (for both batches) and normalize your data as in the edgeR
package.
If y is your DGEList object, then
y <- calcNormFactors(y)
Then setup you design matrix as you would do for limma or edgeR. This
includes effects for the batch and treatment, but not effects for
patients.
Something like
design <- model.matrix(~batch+condition)
where batch takes two levels (A and B) and condition takes three
levels
(control, preclinical, clinical).
Then transform to logCPM using the voom function.
v <- voom(y, design, plot=TRUE)
Then estimate the correlation between the repeat samples for the same
individual:
dupcor <- duplicateCorrelation(v, design, block=individual)
Here 'individual' is a vector or factor indicating which individual
each
sample comes from. It has length equal to the number of samples and
takes
21 levels, one for each individual in your data (7 controls, 7
preclinic
patients, 7 clinical patients).
Then conduct a differential expression analysis:
fit <- lmFit(v, design, block=individual,
correlation=dupcor$consensus)
fit <- eBayes(fit)
Be sure to check that the value of dupcor$conensus is positive. This
analysis keeps track of the fact that the two samples from each
control
individual are correlated.
Then you can use the topTable() function in the limma package to
extract
genes that follow any pattern of differential expression that you are
interested in.
Best wishes
Gordon
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade,
Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth
-------------- original message --------------- [BioC] weird model
design
for DE analysis Lorena Pantano lorena.pantano at gmail.com Fri Jul 26
14:53:36 CEST 2013
Hi, thanks for the quick reply!
The control group are 14 samples, 7 batch A, 7 batch B. But those 7
are the
same individuals, same RNA. So sample c1 in batch A, is coming from
same
individual than sample c1 in batch B. Hope that is cleat.
And, yes, I also treated as separate level factor, but I want to check
also
this scenario, the increase or decrease of gene expression from
control ->
pre-case -> case.
thanks again!
Lo
On Fri, Jul 26, 2013 at 2:29 PM, Michael Love
<michaelisaiahlove at="" gmail.com="">wrote:
>hi Lorena
>
>Can you clarify on a few points below:
>
>On Fri, Jul 26, 2013 at 1:29 PM, Lorena Pantano
><lorena.pantano at="" gmail.com=""> wrote:
>> Hi,
>>
>> I have some doubts about my data. I would like to do DE analysis of
>> RNAseq data.
>>
>> I have two set of experiments done in two different time points, so
>> there is probably batch effect: let's say batch A and B.
>>
>> For A, I have 7 controls and 7 cases (clinical identified) For B, I
>> have the same 7 controls and other 7 pre-cases (preclinical
>> identified)
>>
>> I wanted to know if I can analyze all together, but I have
questions
>> like:
>>
>> 1-7 cases from A are only in batch A, and the others 7 are only in
>> batch B, is it correct to setup a blocked variable indicating the
>> batch effect although I only have one entire group for one batch,
and
>> another entire group for another batch?
>
> You can still use blocking. Because the control group spans batch A
> and batch B, you can still estimate the batch effect (the model
matrix
> should still be full rank).
>
>> 2-and 7 control for A and B are same people, so it is not quite
right
>> to merge all together because it would be more like technical
>> replicates.
>
> I don't understand this last sentence. Do you mean that some, but
not
> all, of individuals in the control group from batch A are also in
> batch B?
>
> I wouldn't say they are technical replicates if the samples are
drawn
> from the same individual. RNA-Seq experiments of the same individual
> will contain biological variability.
>
>> My goal is:
>>
>> 1-get DE genes that follow additive effect. Something like increase
a
>> little in pre-cases, and increase more in cases.
>
> It sounds to me like you might want to treat pre-cases and cases as
> separate levels of a factor variable, rather than assume that the
> trend will necessarily be: control < pre-case < case. But to receive
> more specific advice, you should probably explain more about the
share
> of individuals across the 28 samples.
>
> Mike
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:15}}
Hi Ioannis,
It might sound stupid but does the data.frame that is used with extraCols need to have the exonBaseMean, dispersion, counts and all the other details that the DEXSeqHTML object has? I followed the steps that you mentioned and could successfully incorporate the genenames and symbols in the html doc but DEXSeqHTML somehow doesn't generate the svg files for splicing/counts/expression when I use the extraCols. My data frame for extraCols has geneIDs as row names and two more columns with symbols and gene names.
Could you suggest something?
Thanks a lot,
Aditi