extracting specific columns in R


Fatemehsadat Seyednasrollah ▴ 260

@fatemehsadat-seyednasrollah-5367

Last seen 11.2 years ago

Hello all,

I have a tab-delimited text file from 100 biological samples, with the sample names as the column names. What is a memory-efficient way of extracting only some specific columns (samples) and working on them? Should I read the whole file and subset it, like this:

    new <- read.table(myfile, header = TRUE)[, c(column names)]

and then write this subset to a new file and work with that file?

Thank you in advance


ADD COMMENT • link updated 13.3 years ago by Aliaksei Holik ▴ 350 • written 13.3 years ago by Fatemehsadat Seyednasrollah ▴ 260


Sean Davis 21k

@sean-davis-490

Last seen 9 months ago

United States

On Wed, Aug 15, 2012 at 8:11 AM, Fatemehsadat Seyednasrollah wrote:

> What is the memory efficient way of extracting only some specific
> columns(samples) and working on them?
> Should I make a new file of that and work with the new file like :
> new <- read.table (myfile, header = T ) [ , c(column names)]

Have a look at the colClasses argument to read.table(). You could, for example, read the first few lines of the file to get the header (using nrows), figure out which columns to read based on that, and then set the colClasses accordingly to read the full table.

Sean
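The colClasses approach described above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: the file, the column names (gene, sampleA, sampleB, sampleC) and the `wanted` vector are hypothetical and exist only so the example runs on its own. It relies on the documented read.table() behaviour that a colClasses entry of "NULL" skips that column, while NA lets R guess the type.

```r
# Hypothetical demo file, created only so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
demo <- data.frame(
  gene    = c("g1", "g2", "g3"),
  sampleA = c(1.5, 2.5, 3.5),
  sampleB = c(4, 5, 6),
  sampleC = c(7, 8, 9)
)
write.table(demo, path, sep = "\t", quote = FALSE, row.names = FALSE)

# Columns we actually want to load (hypothetical names).
wanted <- c("gene", "sampleB")

# Step 1: read a single data row just to recover the header.
header <- names(read.table(path, header = TRUE, sep = "\t",
                           nrows = 1, check.names = FALSE))

# Step 2: "NULL" tells read.table() to skip a column; NA means "guess the type".
classes <- rep("NULL", length(header))
classes[header %in% wanted] <- NA

# Step 3: read the full table, materialising only the wanted columns.
subset_df <- read.table(path, header = TRUE, sep = "\t",
                        colClasses = classes, check.names = FALSE)
print(subset_df)
```

Skipped columns are still parsed as text while the file is scanned, but they are never stored in the resulting data frame, which is where most of the memory saving comes from.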

ADD COMMENT • link 13.3 years ago Sean Davis 21k


Aliaksei Holik ▴ 350

@aliaksei-holik-4992

Last seen 9.8 years ago

Spain/Barcelona/Centre for Genomic Regu…

Dear listers,

Apologies if my question is not strictly related to Bioconductor, though one never knows, maybe there's a package that does what I need.

I am analysing a list of differentially expressed genes from an Illumina microarray. In particular, I'm trying to compare the list of differentially expressed (DE) genes to an existing list of genes preferentially expressed in the stem cell population (stem cell signature). When I do so, 10% of DE genes belong to the stem cell signature. What I'd like to do now is to find out how likely that would be to happen by chance, i.e. put a p value on it.

At the moment I know:
There are 17119 unique genes in my dataset.
Of them, 86 are differentially expressed.

The stem cell signature contains 510 genes.
It is combined from several platforms, which makes it hard to establish the total number of unique genes, but it's at least 20819 (the size of the largest platform).

There are 9 overlapping genes between the DE genes and the stem cell signature.

So I wonder:

1) Is there an accepted way to calculate a p value using these data? For instance, could I run something like a chi-squared test? E.g. stem cell specific genes represent 510/20819=2.4% of the total dataset, so the expected number of stem cell genes among my DE genes would be 86x2.4%=2. My chi-squared test would then be based on 9 observed vs 2 expected.

2) Or do I have to generate a gene set based on the stem cell signature and go through GSEA algorithms to calculate enrichment and significance?

Any pointers in the right direction would be much appreciated.

Many thanks for your time and help!

Aliaksei.
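The arithmetic in point 1 can be written out as a short R sketch. It only restates the numbers given in the post; the variable names are illustrative, and using 20819 as the gene universe is the poster's own lower-bound assumption rather than an established figure.

```r
n_de     <- 86      # differentially expressed genes
n_sig    <- 510     # genes in the stem cell signature
overlap  <- 9       # observed genes shared by both lists
universe <- 20819   # assumed gene universe (size of the largest platform)

frac_sig <- n_sig / universe      # share of the universe in the signature
expected <- n_de * frac_sig       # overlap expected by chance alone
fold     <- overlap / expected    # observed / expected ratio

cat(sprintf("signature fraction: %.3f\n", frac_sig))
cat(sprintf("expected overlap:   %.2f\n", expected))
cat(sprintf("fold enrichment:    %.2f\n", fold))
```

With an expected count this small, the chi-squared approximation is usually considered unreliable, which is one reason exact tests are often preferred for this kind of overlap.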

ADD COMMENT • link 13.3 years ago Aliaksei Holik ▴ 350


Howdy,

Disclaimer: I am not a statistician and am always reluctant to give such advice, since I'd never really claim authority in this arena, but ... here goes ;-)

On Wednesday, August 15, 2012, Aliaksei Holik wrote:

> 1) If there's an accepted way to calculate a p value using these data. For
> instance could I run a like of a chi squared test? E.g. stem cell specific
> genes represent 510/20819=2.4% of total dataset. So expected number of the
> stem cell genes in my DE genes would be 86x2.4%=2. So my chi squared test
> would be based on 9 observed vs 2 expected.

Fisher's exact test would seem like the natural first choice. I'm also pretty sure that (for large enough N) the chi-squared test is a good approximation to it, so your intuition is spot on! Your choice of numbers (i.e. what the real size of "the urn" that you sample from is) is crucial, so some more care is required there.

> 2) Or do I have to generate a geneset based on the stem cell signature and
> go through GSEA algorithms to calculate enrichment and significance.

These aren't mutually exclusive, and sure: if you have a "signature set", why not add it to the pool you would compare against with GSEA and let it rip. The difference here is that you will need the expression values for your genes, and not just a list of DE genes, for this to work. (It wasn't clear to me whether you have those. It's also not clear whether your expression data or the gene set comes from different arrays: mixing expression from different platforms is tricky.)

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
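As a rough sketch of the Fisher's exact test suggested above, the counts from the question can be arranged in a 2x2 table. Two things here are assumptions, not facts from the thread: that the 17119 genes in the poster's dataset are the right "urn", and that all 510 signature genes are present in it. The variable names are illustrative.

```r
n_universe <- 17119  # assumption: genes in the poster's dataset define the urn
n_de       <- 86     # differentially expressed genes
n_sig      <- 510    # assumption: every signature gene is in the universe
overlap    <- 9      # DE genes that are also in the signature

# Rows: in signature (yes/no). Columns: DE (yes/no). matrix() fills by column.
tab <- matrix(
  c(overlap,         n_de - overlap,
    n_sig - overlap, n_universe - n_de - n_sig + overlap),
  nrow = 2,
  dimnames = list(signature = c("yes", "no"), DE = c("yes", "no"))
)
print(tab)

# One-sided test: is the overlap larger than expected by chance?
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same tail probability taken directly from the hypergeometric distribution.
p_hyper <- phyper(overlap - 1, n_de, n_universe - n_de, n_sig,
                  lower.tail = FALSE)

print(c(fisher = p_fisher, hypergeometric = p_hyper))
```

The one-sided Fisher p value and the hypergeometric upper tail are the same quantity, so the two numbers should agree. Changing n_universe or n_sig shows how strongly the result depends on the choice of urn.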

ADD REPLY • link 13.3 years ago Steve Lianoglou ★ 13k


On 15.08.2012 14:51, Aliaksei Holik wrote:

> 1) If there's an accepted way to calculate a p value using these
> data. For instance could I run a like of a chi squared test? E.g.
> stem cell specific genes represent 510/20819=2.4% of total dataset. So
> expected number of the stem cell genes in my DE genes would be
> 86x2.4%=2. So my chi squared test would be based on 9 observed vs 2
> expected.

Hypergeometric test?

    phyper(9 - 1, 86, 17119 - 86, 510, lower.tail = FALSE)
    [1] 0.001035456

For the total number of genes I used your lower estimate to be conservative. To be completely correct, I think you would need to remove any of the 510 genes that are not in your 17,119 gene dataset. That would only make the result more significant, though (genes that are not in your dataset cannot be called DE), and it is already 'significant' by most journals' standards.

--
Alex Gutteridge
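The dependence of this p value on the assumed gene universe and on the signature size can be explored with a small helper. This is a sketch only: `enrich_p` is a hypothetical wrapper around phyper(), and the trimmed signature size of 400 is an invented number used purely to illustrate what removing signature genes absent from the array would do.

```r
# Upper-tail hypergeometric probability of seeing at least `overlap`
# shared genes when drawing `n_sig` genes from a universe containing
# `n_de` differentially expressed genes.
enrich_p <- function(overlap, n_de, n_sig, n_universe) {
  phyper(overlap - 1, n_de, n_universe - n_de, n_sig, lower.tail = FALSE)
}

# Universe = genes in the poster's dataset (the call used in the reply).
p_array <- enrich_p(9, 86, 510, 17119)

# Universe = size of the largest platform (the poster's lower bound).
p_largest <- enrich_p(9, 86, 510, 20819)

# Hypothetical: only 400 of the 510 signature genes are on the array.
p_trimmed <- enrich_p(9, 86, 400, 17119)

print(c(array = p_array, largest = p_largest, trimmed = p_trimmed))
```

Both a larger universe and a smaller signature reduce the overlap expected by chance, so the same 9 shared genes yield a smaller p value in each case.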

ADD REPLY • link 13.3 years ago Alex Gutteridge ▴ 650


Hi Aliaksei,

I will ask two questions before I give any suggestions, as I am trying to work out which gene set testing methods suit your case.

First, based on the biology, would you treat the stem cell signature genes as the gene set, or the differentially expressed genes, if you had to choose one? Second, do you have the expression data for the two datasets? From your latest email, it seems you may have the expression data.

As you may know, a gene set test in your case needs a gene set from one dataset and expression data from another study. If we take the stem cell signature genes as the gene set, we need the expression data from your study. With those, our gene set testing methods "ROAST", "CAMERA" and "ROMER" in limma can work well. Which one you use depends on your statistical hypothesis. I suggest starting with ROAST. The R code for ROAST should be straightforward, and I will be happy to help if you have questions about using it. Reference: "ROAST: rotation gene set tests for complex microarray experiments".

On the other hand, if you don't have (or don't want to use) any expression data at the moment, do you have the genome-wide t statistics (or log fold changes) from your analysis? If we still take the stem cell signature genes as the gene set, you can use "geneSetTest" in limma (usually a rank-based p value) to run the test with the genome-wide t statistics or log fold changes from your own study. We mentioned in our "CAMERA" paper that this may be a bit optimistic, but you will still be able to draw a conclusion if you see a very significant p value. Reference: "Camera: a competitive gene set test accounting for inter-gene correlation".

The above gene set tests (like GSEA) only require one gene set; they don't need a cutoff to create the other gene list.
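To illustrate the idea of a rank-based test on genome-wide statistics without a DE cutoff, here is a base R sketch on simulated data. It does not call limma: a Wilcoxon rank-sum test is swapped in as a stand-in for a rank-based competitive test of the kind described above, so that the example runs without Bioconductor installed. The t statistics, the set membership and the size of the shift are all simulated assumptions.

```r
set.seed(1)

n_genes <- 2000
t_stat  <- rnorm(n_genes)            # simulated genome-wide t statistics

# Hypothetical signature: the first 50 genes, given a simulated upward shift.
in_set <- seq_len(n_genes) <= 50
t_stat[in_set] <- t_stat[in_set] + 1

# Rank-based competitive test: do genes in the set tend to have larger
# statistics than the genes outside it? No DE cutoff is involved.
res    <- wilcox.test(t_stat[in_set], t_stat[!in_set], alternative = "greater")
p_rank <- res$p.value

print(p_rank)
```

With real data, `t_stat` would be the moderated t statistics (or log fold changes) for every gene in the study and `in_set` would flag the signature genes. As noted above, tests of this kind ignore inter-gene correlation and can be optimistic.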
Finally, if you don't even have the genome-wide t statistics or log fold changes, the hypergeometric test or Fisher's test might be the only options, as others suggested. The R function "phyper" may help.

Di

----
Di Wu
Postdoctoral fellow
Harvard University, Statistics Department
Harvard Medical School
Science Center, 1 Oxford Street, Cambridge, MA 02138-2901 USA

ADD REPLY • link 13.3 years ago Wu, Di ▴ 120
