Gavin Koh
I am summarising everything so that it is archived on the newsgroup. This is the code I finally used.
The summarised data are from ArrayExpress (accession number E-GEOD-22098).
There is no bead-level data available.
Each array is in a separate file, and the first 5 lines of the first file look like this:
Probe_ID Signal Detection
ILMN_1809034 58.80201 0.003952569
ILMN_1660305 236.4589 0
ILMN_1792173 202.6858 0
ILMN_1762337 -4.230737 0.7285903
ILMN_2055271 7.409712 0.07641634
...
targets.txt looks like this:
name
GSM549324_4325540010_E_Raw.txt
GSM549325_4325540026_A_Raw.txt
GSM549326_4325540026_B_Raw.txt
GSM549327_4335991057_D_Raw.txt
GSM549328_4335991058_A_Raw.txt
...
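For anyone repeating this, here is a minimal sketch of how a targets file like the one above can be read and turned into short array labels. It is an illustration only: it writes a two-line copy of the listing to a temporary file, and the names tf and sample_ids are made up for the example. The first nine characters of each file name are the GSM accession.

```r
# Write a small stand-in for targets.txt to a temporary file (illustration only).
tf <- tempfile(fileext = ".txt")
writeLines(c("name",
             "GSM549324_4325540010_E_Raw.txt",
             "GSM549325_4325540026_A_Raw.txt"), tf)

# Read it as a tab-delimited table with a single column called "name".
targets <- read.delim(tf, stringsAsFactors = FALSE)

# The first nine characters of each file name give the GSM accession.
sample_ids <- substr(targets$name, 1, 9)
print(sample_ids)
```

With the real targets.txt in the working directory, the temporary file would simply be replaced by that file name.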
The code I used was:
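library(limma)

# 'targets' is assumed to have been read from targets.txt beforehand (not shown in the original post).
# Batch 1: arrays 1-5 (48804 probes; ILMN_2038777 appears twice).
TB1 <- read.ilmn(
  files=as.character(targets$name)[1:5],
  probeid="Probe_ID",
  expr="Signal", sep="\t",
  other.columns="Detection"
)
# Label the arrays with the first nine characters of the file name (the GSM accession).
colnames(TB1$E) <- substr(targets$name[1:5],1,9)
colnames(TB1$other$Detection) <- substr(targets$name[1:5],1,9)
TB1$genes <- as.data.frame(TB1$genes) # read.ilmn reads the probe IDs in as a vector.

# Batch 2: arrays 6-21 (48803 probes, in a different order from batch 1).
TB2 <- read.ilmn(
  files=as.character(targets$name)[6:21],
  probeid="Probe_ID",
  expr="Signal", sep="\t",
  other.columns="Detection"
)
colnames(TB2$E) <- substr(targets$name[6:21],1,9)
colnames(TB2$other$Detection) <- substr(targets$name[6:21],1,9)
TB2$genes <- as.data.frame(TB2$genes)

# Reorder TB2 to follow the probe order of TB1, then combine the two batches.
TB1.TB2 <- match(TB1$genes[[1]], TB2$genes[[1]])
TB <- cbind(TB1, TB2[TB1.TB2,])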
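For the archive, here is a small sketch of the two steps discussed in the thread below: finding the repeated probe, and lining up two batches by probe ID before cbind. It uses plain toy matrices in place of the EListRaw objects, and all names (ids1, ids2, E1, E2, merged) and values are made up for illustration. Unlike the code above, it drops the second copy of the repeated probe before merging, which is the variant suggested in the replies.

```r
# Toy stand-ins for TB1$E and TB2$E: rows are probes, columns are arrays.
ids1 <- c("ILMN_A", "ILMN_B", "ILMN_C", "ILMN_B")  # "ILMN_B" appears twice
ids2 <- c("ILMN_C", "ILMN_A", "ILMN_B")            # same probes, different order
E1 <- matrix(c(1, 2, 3, 2.5), ncol = 1, dimnames = list(ids1, "array1"))
E2 <- matrix(c(30, 10, 20), ncol = 1, dimnames = list(ids2, "array2"))

# duplicated() flags the second and later occurrences of each probe ID.
dup <- duplicated(ids1)
repeated_ids <- unique(ids1[dup])

# Keep only the first occurrence of each probe in batch 1.
E1u <- E1[!dup, , drop = FALSE]
ids1u <- ids1[!dup]

# match() gives, for each probe in batch 1, its row position in batch 2.
m <- match(ids1u, ids2)
stopifnot(!anyNA(m))  # every batch 1 probe must exist in batch 2

# Reorder batch 2 to follow batch 1, then bind the columns together.
merged <- cbind(E1u, E2[m, , drop = FALSE])
print(repeated_ids)
print(merged)
```

The same match-then-cbind pattern is what the last two lines of the code above do on the limma objects, where subsetting needs two subscripts, as in TB2[TB1.TB2,].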
On , Gavin Koh <gavin.koh@gmail.com> wrote:
> Dear Wei,
> I think that's worked!
> Thank you! Gavin.
> On 16 April 2011 13:25, Wei Shi <shi@wehi.edu.au> wrote:
> > Hi Gavin:
> >
> > I think the problem is that your TB1$genes (and TB2$genes) is a vector
> > rather than a data frame. This made cbind fail to combine them. I guess
> > the data you downloaded from the public repository is not the original
> > GenomeStudio/BeadStudio output. But you can fix this using the following
> > code:
> >
> > m
> > TB1$genes
> > TB2$genes
> > TB
> >
> > I tried this code on my computer and it worked. Hope that will work for
> > you.
> >
> > Cheers,
> > Wei
> >
> > On Apr 16, 2011, at 7:34 PM, Gavin Koh wrote:
> >
> >> Dear Wei,
> >>
> >> I am afraid it still doesn't work. I think this is because TB1 is a list and
> >> not a data frame and I cannot coerce it to become a data frame.
> >>> TB
> >> Error in object$genes[i, , drop = FALSE] : incorrect number of dimensions
> >>> names(TB1)
> >> [1] "source" "E" "genes" "targets" "other"
> >>> class(TB1)
> >> [1] "EListRaw"
> >> attr(,"package")
> >> [1] "limma"
> >>
> >> I checked EListRaw and it inherits directly from list and not from
> >> data frame.
> >> So sorry,
> >>
> >> Gavin.
> >>
> >> On 16 April 2011 08:38, Wei Shi <shi@wehi.edu.au> wrote:
> >>> Hi Gavin:
> >>>
> >>> Sorry, TB1[common.probes] should be changed to TB1[common.probes, ].
> >>>
> >>> Hope it works now.
> >>>
> >>> Cheers,
> >>> Wei
> >>>
> >>>
> >>> On Apr 16, 2011, at 4:32 PM, Gavin Koh wrote:
> >>>
> >>>> Dear Wei,
> >>>>
> >>>> I am afraid this data is from a public repository, so I have no
> >>>> control over what data is published or the format :-(
> >>>> I am afraid cbind still does not appear to work with this subscripting.
> >>>>
> >>>>> common.probes
> >>>>> TB
> >>>> Error: Two subscripts required
> >>>>
> >>>> Please help?
> >>>>
> >>>> Gavin
> >>>>
> >>>> On 16 April 2011 00:33, Wei Shi <shi@wehi.edu.au> wrote:
> >>>>> Dear Gavin:
> >>>>>
> >>>>> OK, so you did not input the control data. That is the reason why my
> >>>>> code did not work. You should really include the control data in your
> >>>>> analysis because they are very useful for the normalization. But you can
> >>>>> use the following code to merge the data you have now:
> >>>>>
> >>>>> m
> >>>>> merged
> >>>>>
> >>>>> This will remove the second ILMN_2038777 probe from TB1 and combine
> >>>>> probes from TB1 and TB2 in the right order.
> >>>>>
> >>>>> Cheers,
> >>>>> Wei
> >>>>>
> >>>>> On Apr 16, 2011, at 1:58 AM, Gavin Koh wrote:
> >>>>>
> >>>>>> Dear Wei
> >>>>>>
> >>>>>> I am very sorry, but this still does not work.
> >>>>>>
> >>>>>> ILMN_2038777 is not missing in TB1, but duplicated. The batches with
> >>>>>> 48804 probes contain two copies of ILMN_2038777. The batches with
> >>>>>> 48803 probes contain only one copy of ILMN_2038777. The order of
> >>>>>> probes also seems to be different from batch to batch.
> >>>>>>
> >>>>>> TB1 was generated using:
> >>>>>>
> >>>>>> TB1 <- read.ilmn(
> >>>>>> files=as.character(targets$name)[1:5],
> >>>>>> probeid="Probe_ID",
> >>>>>> expr="Signal", sep="\t",
> >>>>>> other.columns="Detection"
> >>>>>> )
> >>>>>>
> >>>>>> The reason for this is that the summarized data for each array is
> >>>>>> in a separate file. There is no bead-level data available. There is no
> >>>>>> xxx_profile.txt file.
> >>>>>>
> >>>>>> I tried removing ILMN_2038777, but I cannot. Am I right in saying that
> >>>>>> this method of subsetting is only applicable to data frames?
> >>>>>>> TB1
> >>>>>> Error in object$genes[i, , drop = FALSE] : incorrect number of dimensions
> >>>>>>> TB1
> >>>>>> Error in object$genes[i, , drop = FALSE] : incorrect number of dimensions
> >>>>>>
> >>>>>> Just so you can see the structure of the file that read.ilmn() has
> >>>>>> produced:
> >>>>>>
> >>>>>> --begin screen dump--
> >>>>>>
> >>>>>>> TB1
> >>>>>> An object of class "EListRaw"
> >>>>>> $source
> >>>>>> [1] "illumina"
> >>>>>>
> >>>>>> $E
> >>>>>>                    [,1]       [,2]       [,3]      [,4]       [,5]
> >>>>>> ILMN_1809034  58.802010  24.907950  13.905010  10.07729   7.044668
> >>>>>> ILMN_1660305 236.458900 113.218000 193.581800 282.36350 127.023400
> >>>>>> ILMN_1792173 202.685800 120.449500 208.370600 242.63090 130.447200
> >>>>>> ILMN_1762337  -4.230737  -3.899888  -3.654122  -3.30873  -5.115820
> >>>>>> ILMN_2055271   7.409712   8.776000   9.394149  12.66054   1.250353
> >>>>>> 48799 more rows ...
> >>>>>>
> >>>>>> $genes
> >>>>>> [1] "ILMN_1809034" "ILMN_1660305" "ILMN_1792173" "ILMN_1762337" "ILMN_2055271"
> >>>>>> 48799 more elements ...
> >>>>>>
> >>>>>> $targets
> >>>>>> [1] SampleNames
> >>>>>> (or 0-length row.names)
> >>>>>>
> >>>>>> $other
> >>>>>> $Detection
> >>>>>>                     [,1]       [,2]       [,3]       [,4]        [,5]
> >>>>>> ILMN_1809034 0.003952569 0.01844532 0.03952569 0.08432148 0.111989500
> >>>>>> ILMN_1660305 0.000000000 0.00000000 0.00000000 0.00000000 0.001317523
> >>>>>> ILMN_1792173 0.000000000 0.00000000 0.00000000 0.00000000 0.001317523
> >>>>>> ILMN_1762337 0.728590300 0.75230570 0.68247690 0.57444010 0.708827400
> >>>>>> ILMN_2055271 0.076416340 0.05138340 0.05665349 0.06719368 0.283267500
> >>>>>> 48799 more rows ...
> >>>>>>
> >>>>>> --end screen dump--
> >>>>>>
> >>>>>> Gavin
> >>>>>>
> >>>>>> On 15 April 2011 12:24, Wei Shi <shi@wehi.edu.au> wrote:
> >>>>>>> Dear Gavin:
> >>>>>>>
> >>>>>>> Thanks for the further information. The probe "ILMN_2038777" is not
> >>>>>>> only a gene probe but also a positive control probe (control type:
> >>>>>>> housekeeping). You can find more information about this probe in the HT12
> >>>>>>> manifest file. But I do not know why it was absent in your TB2 dataset.
> >>>>>>> Anyway, it will be quite safe to remove the housekeeping "ILMN_2038777"
> >>>>>>> from your TB1 dataset. Then you can combine these two datasets together.
> >>>>>>> Below is the code to do this:
> >>>>>>>
> >>>>>>> x1
> >>>>>>> x2
> >>>>>>> x1
> >>>>>>> m
> >>>>>>> x.merged
> >>>>>>>
> >>>>>>> This will combine TB1 with TB2. For the other four datasets, you can
> >>>>>>> merge them to x.merged using the same procedure (removing housekeeping
> >>>>>>> "ILMN_2038777" from the dataset first if it is present, then using the
> >>>>>>> match and cbind commands to merge them).
> >>>>>>>
> >>>>>>> Hope this will work for you. Please let me know if it doesn't.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Wei
> >>>>>>>
> >>>>>>>
> >>>>>>> On Apr 15, 2011, at 9:16 PM, Gavin Koh wrote:
> >>>>>>>
> >>>>>>>> Dear Wei,
> >>>>>>>>
> >>>>>>>> Thank you for replying so quickly. There appear to be 6 batches in
> >>>>>>>> this dataset (TB1 to 6)
> >>>>>>>>
> >>>>>>>>> TB1$genes[1:10]
> >>>>>>>> [1] "ILMN_1809034" "ILMN_1660305" "ILMN_1792173" "ILMN_1762337" "ILMN_2055271" "ILMN_1736007" "ILMN_1814316"
> >>>>>>>> [8] "ILMN_2359168" "ILMN_1731507" "ILMN_1787689"
> >>>>>>>>> TB2$genes[1:10]
> >>>>>>>> [1] "ILMN_1762337" "ILMN_2055271" "ILMN_1736007" "ILMN_2383229" "ILMN_1806310" "ILMN_1779670" "ILMN_2321282"
> >>>>>>>> [8] "ILMN_1671474" "ILMN_1772582" "ILMN_1735698"
> >>>>>>>>> TB3$genes[1:10]
> >>>>>>>> [1] "ILMN_1809034" "ILMN_1660305" "ILMN_1792173" "ILMN_1762337" "ILMN_2055271" "ILMN_1736007" "ILMN_1814316"
> >>>>>>>> [8] "ILMN_2359168" "ILMN_1731507" "ILMN_1787689"
> >>>>>>>>> TB4$genes[1:10]
> >>>>>>>> [1] "ILMN_1762337" "ILMN_2055271" "ILMN_1736007" "ILMN_2383229" "ILMN_1806310" "ILMN_1779670" "ILMN_2321282"
> >>>>>>>> [8] "ILMN_1671474" "ILMN_1772582" "ILMN_1735698"
> >>>>>>>>> TB5$genes[1:10]
> >>>>>>>> [1] "ILMN_1809034" "ILMN_1660305" "ILMN_1792173" "ILMN_1762337" "ILMN_2055271" "ILMN_1736007" "ILMN_1814316"
> >>>>>>>> [8] "ILMN_2359168" "ILMN_1731507" "ILMN_1787689"
> >>>>>>>>> TB6$genes[1:10]
> >>>>>>>> [1] "ILMN_1762337" "ILMN_2055271" "ILMN_1736007" "ILMN_2383229" "ILMN_1806310" "ILMN_1779670" "ILMN_2321282"
> >>>>>>>> [8] "ILMN_1671474" "ILMN_1772582" "ILMN_1735698"
> >>>>>>>>
> >>>>>>>> Thank you very much for your help!
> >>>>>>>>
> >>>>>>>> Gavin
> >>>>>>>>
> >>>>>>>> On 15 April 2011 11:45, Wei Shi <shi@wehi.edu.au> wrote:
> >>>>>>>>> Hi Gavin:
> >>>>>>>>>
> >>>>>>>>> It would be best if you can match the two batches using the probe
> >>>>>>>>> identifiers because they are much less likely to have duplicates.
> >>>>>>>>> Would it be possible to show the first several probes in each dataset so
> >>>>>>>>> that I can write some code to help you do this?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Wei
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Apr 15, 2011, at 7:54 PM, Gavin Koh wrote:
> >>>>>>>>>
> >>>>>>>>>> Dear Wei,
> >>>>>>>>>>
> >>>>>>>>>> A little more information: the difference seems to be a single
> >>>>>>>>>> duplicated probe.
> >>>>>>>>>> Just comparing two batches (TB1 and TB2) with different probe numbers:
> >>>>>>>>>>> length(TB1$genes)
> >>>>>>>>>> [1] 48804
> >>>>>>>>>>> length(TB2$genes)
> >>>>>>>>>> [1] 48803
> >>>>>>>>>>> length(unique(TB2$genes))
> >>>>>>>>>> [1] 48803
> >>>>>>>>>>> length(unique(TB1$genes))
> >>>>>>>>>> [1] 48803
> >>>>>>>>>>> setdiff(TB1$genes,TB2$genes)
> >>>>>>>>>> character(0)
> >>>>>>>>>>> setequal(TB1$genes,TB2$genes)
> >>>>>>>>>> [1] TRUE
> >>>>>>>>>>
> >>>>>>>>>> That still leaves me the problem that I don't know how to identify the
> >>>>>>>>>> repeated probe or how to cbind TB1 and TB2... :-(
> >>>>>>>>>>
> >>>>>>>>>> Gavin
> >>>>>>>>>>
> >>>>>>>>>> On 15 April 2011 02:38, Wei Shi <shi@wehi.edu.au> wrote:
> >>>>>>>>>>> Hi Gavin:
> >>>>>>>>>>>
> >>>>>>>>>>> The number of probes which were present in one batch but not in others
> >>>>>>>>>>> should be very small. So you can use the probes which are common in all
> >>>>>>>>>>> batches for your analysis.
> >>>>>>>>>>>
> >>>>>>>>>>> Hope this helps.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Wei
> >>>>>>>>>>>
> >>>>>>>>>>> On Apr 15, 2011, at 1:20 AM, Gavin Koh wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> I am trying to analyse data from ArrayExpress E-GEOD-22098 (published
> >>>>>>>>>>>> Dec last year).
> >>>>>>>>>>>> According to the study methods, the data are Illumina HumanHT-12 v3
> >>>>>>>>>>>> Expression BeadChips, but the hybridisation seems to have been done in
> >>>>>>>>>>>> several batches, with different numbers of probes in each batch,
> >>>>>>>>>>>> alternating between 48803 and 48804. Can anyone tell me how to combine
> >>>>>>>>>>>> these different batches into the same file, please? I am trying to
> >>>>>>>>>>>> read the probe data using the read.ilmn() function in limma, but
> >>>>>>>>>>>> failing, because cbind complains the matrices are not the same length
> >>>>>>>>>>>> (precise error is "Error in cbind(out$E, objects[[i]]$E) : number of
> >>>>>>>>>>>> rows of matrices must match (see arg 2)").
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you in advance,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Gavin Koh
> >>>>>>>>>>>>
> >>>>>>>>>>>> _______________________________________________
> >>>>>>>>>>>> Bioconductor mailing list
> >>>>>>>>>>>> Bioconductor@r-project.org
> >>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> ______________________________________________________________________
> >>>>>>>>>>> The information in this email is confidential and intended solely for the addressee.
> >>>>>>>>>>> You must not disclose, forward, print or use it without the permission of the sender.
> >>>>>>>>>>> ______________________________________________________________________
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Hofstadter's Law: It always takes longer than you expect, even when
> >>>>>>>>>> you take into account Hofstadter's Law.
> >>>>>>>>>> - Douglas Hofstadter (in Gödel, Escher, Bach, 1979)