Multi-factor multi-level analysis of RNAseq data

0

Entering edit mode

Guest User ★ 13k

@guest-user-4897

Last seen 9.6 years ago

Hi all. This is my first post here. I wasn't sure of posting my question to this mailing list, as my competence in statistics is very poor. So, please, forgive me in advance for what I could say... A few words to describe my experimental design. I have RNAseq data from 54 samples. My experiment deals with apple fruits collected at two different times (H=at harvest, and PH=after postharvest storage) from trees of three different varieties (G, F, and P) grown under three different agronomic conditions (L, M, and H). I have done 3 biological replicates. So 2 times x 3 varieties x 3 conditions x 3 replicates = 54. Based upon the suggestions of some colleagues of mine, I have decided to start using EdgeR for the analysis of these data, as they told me (without knowing how!!!) that this package implements multifactor- multilevel pipelines. My first objective is to achieve a list of differentially expressed genes that are affected: 1) only by the genotype (variety) 2) only by the agronomic condition 3) only by storage or by an interaction of 4) genotype x agr. condition 5) genotype x storage 6) agr. condition x storage. I could not be interested by the triple interaction. I have read the EdgeR vignette and case studies. I have searched the internet but found just discussions in which the "statistical level" was too high for me. Would you please give me some code examples to get these lists? The first problem may be the setting up of the design matrix... May I use the same scheme of the paragraph 3.5 of the EdgeR vignette (Comparisons Both Between and Within Subjects)? Thanks in advance to whom will reply and sorry for this "low level" question. Alessandro -- output of sessionInfo(): R version 2.15.3 (2013-03-01) Platform: i386-apple-darwin9.8.0/i386 (32-bit) locale: [1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8 attached base packages: [1] grid stats graphics grDevices utils datasets methods base other attached packages: [1] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-8 caTools_1.14 gdata_2.12.0 gtools_2.7.1 RColorBrewer_1.0-5 edgeR_3.0.8 limma_3.14.4 DESeq_1.10.1 lattice_0.20-13 locfit_1.5-9 [13] Biobase_2.18.0 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] annotate_1.36.0 AnnotationDbi_1.20.7 bitops_1.0-4.2 DBI_0.2-5 genefilter_1.40.0 geneplotter_1.36.0 IRanges_1.16.6 parallel_2.15.3 RSQLite_0.11.2 splines_2.15.3 stats4_2.15.3 [12] survival_2.37-2 tools_2.15.3 XML_3.96-1.1 xtable_1.7-1 -- Sent via the guest posting facility at bioconductor.org.

RNASeq edgeR RNASeq edgeR • 1.9k views

ADD COMMENT • link updated 10.9 years ago by Sam McInturf ▴ 300 • written 10.9 years ago by Guest User ★ 13k

0

Entering edit mode

Sam McInturf ▴ 300

@sam-mcinturf-5291

Last seen 8.6 years ago

United States

Alessandro, That is a big question, 'how to do DE in NGS' can be a bit involved for a single question. So, here is a script! This is my current RNA seq project, where I have 3 replicates, 2 tissues, and 2 genotypes, a smaller factorial design than what you use. I have included the parts for setting up the DGEList, contrast/design matrices, and calling deferentially expression with some extra formatting that I use to make tables to send to my lab mates. There is a section called 'annotation' where I use biomaRt to get some gene descriptions, which may or may not be available to you. I used the method in 3.3.1 of edgeR vignette to do the grouping (each treatment combintation as a group). My stats are not good enough for me to comment on if this is the best / worst/ equivlent to any other method. I added some extra comments to make it easier to read. As your problems become more specific you will be able to find answers to your questions in the bioconductor list serve archive, just fyi. Hope this helps! ###################################################################### # SAMS edgeR SCRIPT ###################################################################### # Sample description, design, contrasts ###################################################################### #make design matrix - make targets matrix and give good row names. Then reorder the Genotype to include Columbia as ref targets <- data.frame(Genotype = factor(rep(c("Columbia", "OPT3"), each =6), levels = c("Columbia", "OPT3")), Tissue =factor(rep(rep(c("Root", "Shoot"), each =3), 2), levels = c("Root", "Shoot"))) #this is a manual way to make a 'targets' rownames(targets) <- substring(rownames(colData(olap)), 1,3) group <- factor(paste(targets$Genotype, targets$Tissue, sep = ".")) #assign each contrast combination as a group design <- model.matrix(~0+group) #make a design matrix, ~0 means no intercept colnames(design) <- levels(group) #pretty names #Make Contrast matrix Cont <- makeContrasts( Roots = Columbia.Root - OPT3.Root, Shoots = Columbia.Shoot - OPT3.Shoot, DifDif = (Columbia.Root - OPT3.Root) - (Columbia.Shoot - OPT3.Shoot), levels = design) # make a DGEList of the read counts and supply the group to make groups, calc Normilzation factors ###################################################################### # edger <- DGEList(assays(olap)$counts, group = group) #assays(olap)$counts extracts the count matrix rownames(edger$samples) <- substring(rownames(edger$samples),1,3) #make it pretty (remove .bam extension) keep <- rowSums(cpm(edger) > 2) >= 4 #a gene must have 2cpm at least four times in all the data to keep edgerfilter <- edger[keep,] #FALSE - 8204, TRUE - 19212 edgerN <- calcNormFactors(edgerfilter) #calculate normalization factors #calculate dispersion edgerN <- estimateCommonDisp(edgerN, verbose =TRUE) edgerN <- estimateTagwiseDisp(edgerN, trend = "none") #fit the model fit <- glmFit(edgerN, design) #fit the model cpmMat <- cpm(edgerN); colnames(cpmMat) <- substr(colnames(cpmMat), 1,3) #keeping a normalized count matrix is good for heatmaps later ###################################################################### # Differential Expression ###################################################################### #The next three sections are basically the same thing just for each contrast. Betters ways may exist (like a loop of some form?) but this is verbose and easy-ish to read. # Root DE ###################################################################### #Conduct the likelihood ratio test (LRT). Call significance at FDR moderated (benjamani and hochberg) at 0.05. #make this logical (logical of if each gene is DE). Get the names of these genes. #make a new data frame with a FDR column(same adjust as above). This is like the 'toptable' (from limma) RootsLrt <- glmLRT(fit, contrast=Cont[,"Roots"]) #call DE genes #decide testsDGE() looks at each gene in the contrast and labels it zero if it is not changing, -1/1 for -/+ fold change #Note that you have to supply arguements to p.adjust (adjust method) and a p.value that is your alpha level rdt <- decideTestsDGE(RootsLrt, adjust.method = "BH", p.value = 0.01);rownames(rdt) <- rownames(RootsLrt$table) #summary(rdt) isDEr <- as.logical(rdt) #convert the -1,0,1 (down, nc, up) to TRUE (DE) FALSE (nc) rDEnames <- rownames(RootsLrt$table[isDEr,]) #names of DE genes #there may be a better way, but it works. just adding a FDR column manually using p.adjust Rtable <- data.frame(RootsLrt$table[isDEr,], FDR = p.adjust(RootsLrt$table$PValue, "BH")[isDEr]) #organize the table by FDR Rtable <- Rtable[order(Rtable$FDR),] #get the gene names for each gene with logFC > 2 (either direction) and an FDR > 0.05, this is for heatmaps later rDEnames2 <- rownames(Rtable[Rtable$FDR < 0.05& abs(Rtable$logFC) > 2,]) ###################################################################### # Annotation ###################################################################### #Get annotation dbs for next sections library(biomaRt) athGene <- useMart("plants_mart_18",dataset="athaliana_eg_gene") # Roots annotation ###################################################### RLrt <- data.frame(tair_locus = rownames(RootsLrt$table)[isDEr], RootsLrt$table[isDEr,]) RALrt <- merge( RLrt, getBM(attributes = c("tair_locus", "tair_symbol", "transmembrane_domain","description"), filters = "tair_locus", values = rownames(RLrt), mart = athGene),by = "tair_locus") > sessionInfo() R version 3.0.1 (2013-05-16) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] biomaRt_2.16.0 RColorBrewer_1.0-5 gplots_2.11.0.1 [4] MASS_7.3-26 KernSmooth_2.23-10 caTools_1.14 [7] gdata_2.12.0.2 gtools_2.7.1 GenomicRanges_1.12.4 [10] IRanges_1.18.1 BiocGenerics_0.6.0 edgeR_3.2.3 [13] limma_3.16.5 loaded via a namespace (and not attached): [1] Biostrings_2.28.0 bitops_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3 [5] stats4_3.0.1 tcltk_3.0.1 tools_3.0.1 XML_3.96-1.1 [9] zlibbioc_1.6.0 On Thu, Jun 13, 2013 at 7:32 AM, Alessandro Botton [guest] < guest@bioconductor.org> wrote: > > Hi all. > This is my first post here. I wasn't sure of posting my question to this > mailing list, as my competence in statistics is very poor. So, please, > forgive me in advance for what I could say... > A few words to describe my experimental design. I have RNAseq data from 54 > samples. My experiment deals with apple fruits collected at two different > times (H=at harvest, and PH=after postharvest storage) from trees of three > different varieties (G, F, and P) grown under three different agronomic > conditions (L, M, and H). I have done 3 biological replicates. So 2 times x > 3 varieties x 3 conditions x 3 replicates = 54. > Based upon the suggestions of some colleagues of mine, I have decided to > start using EdgeR for the analysis of these data, as they told me (without > knowing how!!!) that this package implements multifactor-multilevel > pipelines. > My first objective is to achieve a list of differentially expressed genes > that are affected: > 1) only by the genotype (variety) > 2) only by the agronomic condition > 3) only by storage > or by an interaction of > 4) genotype x agr. condition > 5) genotype x storage > 6) agr. condition x storage. > I could not be interested by the triple interaction. > I have read the EdgeR vignette and case studies. I have searched the > internet but found just discussions in which the "statistical level" was > too high for me. > Would you please give me some code examples to get these lists? > The first problem may be the setting up of the design matrix... May I use > the same scheme of the paragraph 3.5 of the EdgeR vignette (Comparisons > Both Between and Within Subjects)? > Thanks in advance to whom will reply and sorry for this "low level" > question. > Alessandro > > > > -- output of sessionInfo(): > > R version 2.15.3 (2013-03-01) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8 > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods > base > > other attached packages: > [1] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-8 caTools_1.14 > gdata_2.12.0 gtools_2.7.1 RColorBrewer_1.0-5 edgeR_3.0.8 > limma_3.14.4 DESeq_1.10.1 lattice_0.20-13 locfit_1.5-9 > [13] Biobase_2.18.0 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] annotate_1.36.0 AnnotationDbi_1.20.7 bitops_1.0-4.2 > DBI_0.2-5 genefilter_1.40.0 geneplotter_1.36.0 > IRanges_1.16.6 parallel_2.15.3 RSQLite_0.11.2 > splines_2.15.3 stats4_2.15.3 > [12] survival_2.37-2 tools_2.15.3 XML_3.96-1.1 > xtable_1.7-1 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Sam McInturf [[alternative HTML version deleted]]

ADD COMMENT • link 10.9 years ago Sam McInturf ▴ 300

0

Entering edit mode

Dear Sam, thanks a lot for the script. I will adapt it to my specific needs and try it as soon as possible. I will let you know... My sensation is that I will need your help again very soon... Thanks again. Alessandro Il giorno 13/giu/2013, alle ore 16.58, Sam McInturf ha scritto: > Alessandro, > That is a big question, 'how to do DE in NGS' can be a bit involved for a single question. So, here is a script! This is my current RNA seq project, where I have 3 replicates, 2 tissues, and 2 genotypes, a smaller factorial design than what you use. > > I have included the parts for setting up the DGEList, contrast/design matrices, and calling deferentially expression with some extra formatting that I use to make tables to send to my lab mates. There is a section called 'annotation' where I use biomaRt to get some gene descriptions, which may or may not be available to you. > I used the method in 3.3.1 of edgeR vignette to do the grouping (each treatment combintation as a group). My stats are not good enough for me to comment on if this is the best / worst/ equivlent to any other method. > I added some extra comments to make it easier to read. > > As your problems become more specific you will be able to find answers to your questions in the bioconductor list serve archive, just fyi. > > Hope this helps! > ###################################################################### > # SAMS edgeR SCRIPT > ###################################################################### > > # Sample description, design, contrasts > ###################################################################### > #make design matrix - make targets matrix and give good row names. Then reorder the Genotype to include Columbia as ref > targets <- data.frame(Genotype = factor(rep(c("Columbia", "OPT3"), each =6), levels = c("Columbia", "OPT3")), Tissue =factor(rep(rep(c("Root", "Shoot"), each =3), 2), levels = c("Root", "Shoot"))) #this is a manual way to make a 'targets' > rownames(targets) <- substring(rownames(colData(olap)), 1,3) > group <- factor(paste(targets$Genotype, targets$Tissue, sep = ".")) #assign each contrast combination as a group > design <- model.matrix(~0+group) #make a design matrix, ~0 means no intercept > colnames(design) <- levels(group) #pretty names > #Make Contrast matrix > Cont <- makeContrasts( > Roots = Columbia.Root - OPT3.Root, > Shoots = Columbia.Shoot - OPT3.Shoot, > DifDif = (Columbia.Root - OPT3.Root) - (Columbia.Shoot - OPT3.Shoot), > levels = design) > > # make a DGEList of the read counts and supply the group to make groups, calc Normilzation factors > #################################################################### ### > edger <- DGEList(assays(olap)$counts, group = group) #assays(olap)$counts extracts the count matrix > rownames(edger$samples) <- substring(rownames(edger$samples),1,3) #make it pretty (remove .bam extension) > keep <- rowSums(cpm(edger) > 2) >= 4 #a gene must have 2cpm at least four times in all the data to keep > edgerfilter <- edger[keep,] #FALSE - 8204, TRUE - 19212 > edgerN <- calcNormFactors(edgerfilter) #calculate normalization factors > #calculate dispersion > edgerN <- estimateCommonDisp(edgerN, verbose =TRUE) > edgerN <- estimateTagwiseDisp(edgerN, trend = "none") > #fit the model > fit <- glmFit(edgerN, design) #fit the model > > cpmMat <- cpm(edgerN); colnames(cpmMat) <- substr(colnames(cpmMat), 1,3) #keeping a normalized count matrix is good for heatmaps later > ###################################################################### > # Differential Expression > ###################################################################### > #The next three sections are basically the same thing just for each contrast. Betters ways may exist (like a loop of some form?) but this is verbose and easy-ish to read. > > # Root DE > ###################################################################### > #Conduct the likelihood ratio test (LRT). Call significance at FDR moderated (benjamani and hochberg) at 0.05. > #make this logical (logical of if each gene is DE). Get the names of these genes. > #make a new data frame with a FDR column(same adjust as above). This is like the 'toptable' (from limma) > RootsLrt <- glmLRT(fit, contrast=Cont[,"Roots"]) #call DE genes > #decide testsDGE() looks at each gene in the contrast and labels it zero if it is not changing, -1/1 for -/+ fold change > #Note that you have to supply arguements to p.adjust (adjust method) and a p.value that is your alpha level > rdt <- decideTestsDGE(RootsLrt, adjust.method = "BH", p.value = 0.01);rownames(rdt) <- rownames(RootsLrt$table) #summary(rdt) > isDEr <- as.logical(rdt) #convert the -1,0,1 (down, nc, up) to TRUE (DE) FALSE (nc) > rDEnames <- rownames(RootsLrt$table[isDEr,]) #names of DE genes > #there may be a better way, but it works. just adding a FDR column manually using p.adjust > Rtable <- data.frame(RootsLrt$table[isDEr,], FDR = p.adjust(RootsLrt$table$PValue, "BH")[isDEr]) > #organize the table by FDR > Rtable <- Rtable[order(Rtable$FDR),] > #get the gene names for each gene with logFC > 2 (either direction) and an FDR > 0.05, this is for heatmaps later > rDEnames2 <- rownames(Rtable[Rtable$FDR < 0.05& abs(Rtable$logFC) > 2,]) > > ###################################################################### > # Annotation > ###################################################################### > #Get annotation dbs for next sections > library(biomaRt) > > athGene <- useMart("plants_mart_18",dataset="athaliana_eg_gene") > > # Roots annotation > ###################################################### > RLrt <- data.frame(tair_locus = rownames(RootsLrt$table)[isDEr], RootsLrt$table[isDEr,]) > RALrt <- merge( RLrt, getBM(attributes = c("tair_locus", "tair_symbol", "transmembrane_domain","description"), filters = "tair_locus", values = rownames(RLrt), mart = athGene),by = "tair_locus") > > > sessionInfo() > R version 3.0.1 (2013-05-16) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=C LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] grid parallel stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] biomaRt_2.16.0 RColorBrewer_1.0-5 gplots_2.11.0.1 > [4] MASS_7.3-26 KernSmooth_2.23-10 caTools_1.14 > [7] gdata_2.12.0.2 gtools_2.7.1 GenomicRanges_1.12.4 > [10] IRanges_1.18.1 BiocGenerics_0.6.0 edgeR_3.2.3 > [13] limma_3.16.5 > > loaded via a namespace (and not attached): > [1] Biostrings_2.28.0 bitops_1.0-5 RCurl_1.95-4.1 Rsamtools_1.12.3 > [5] stats4_3.0.1 tcltk_3.0.1 tools_3.0.1 XML_3.96-1.1 > [9] zlibbioc_1.6.0 > > > > > > On Thu, Jun 13, 2013 at 7:32 AM, Alessandro Botton [guest] <guest@bioconductor.org> wrote: > > Hi all. > This is my first post here. I wasn't sure of posting my question to this mailing list, as my competence in statistics is very poor. So, please, forgive me in advance for what I could say... > A few words to describe my experimental design. I have RNAseq data from 54 samples. My experiment deals with apple fruits collected at two different times (H=at harvest, and PH=after postharvest storage) from trees of three different varieties (G, F, and P) grown under three different agronomic conditions (L, M, and H). I have done 3 biological replicates. So 2 times x 3 varieties x 3 conditions x 3 replicates = 54. > Based upon the suggestions of some colleagues of mine, I have decided to start using EdgeR for the analysis of these data, as they told me (without knowing how!!!) that this package implements multifactor-multilevel pipelines. > My first objective is to achieve a list of differentially expressed genes that are affected: > 1) only by the genotype (variety) > 2) only by the agronomic condition > 3) only by storage > or by an interaction of > 4) genotype x agr. condition > 5) genotype x storage > 6) agr. condition x storage. > I could not be interested by the triple interaction. > I have read the EdgeR vignette and case studies. I have searched the internet but found just discussions in which the "statistical level" was too high for me. > Would you please give me some code examples to get these lists? > The first problem may be the setting up of the design matrix... May I use the same scheme of the paragraph 3.5 of the EdgeR vignette (Comparisons Both Between and Within Subjects)? > Thanks in advance to whom will reply and sorry for this "low level" question. > Alessandro > > > > -- output of sessionInfo(): > > R version 2.15.3 (2013-03-01) > Platform: i386-apple-darwin9.8.0/i386 (32-bit) > > locale: > [1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8 > > attached base packages: > [1] grid stats graphics grDevices utils datasets methods base > > other attached packages: > [1] gplots_2.11.0 MASS_7.3-23 KernSmooth_2.23-8 caTools_1.14 gdata_2.12.0 gtools_2.7.1 RColorBrewer_1.0-5 edgeR_3.0.8 limma_3.14.4 DESeq_1.10.1 lattice_0.20-13 locfit_1.5-9 > [13] Biobase_2.18.0 BiocGenerics_0.4.0 > > loaded via a namespace (and not attached): > [1] annotate_1.36.0 AnnotationDbi_1.20.7 bitops_1.0-4.2 DBI_0.2-5 genefilter_1.40.0 geneplotter_1.36.0 IRanges_1.16.6 parallel_2.15.3 RSQLite_0.11.2 splines_2.15.3 stats4_2.15.3 > [12] survival_2.37-2 tools_2.15.3 XML_3.96-1.1 xtable_1.7-1 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Sam McInturf [[alternative HTML version deleted]]

ADD REPLY • link 10.9 years ago Alessandro Botton ▴ 10

Login before adding your answer.