RNA-seq with patient samples without replicates with multiple timepoints
Entering edit mode
kevin • 0
Last seen 1 day ago

Hi everyone,

i'm currently struggle with my first RNA-seq analysis. The experiment design is quite complex and therefore it's hard for me to fit any existing tutorial. The experiment delivered samples from multiple patients from different timepoints, also with a different dose of treatment. For all patients i have a sample for time point 0, called screening. Additionally i have for some patients additional timepoints i.e. day 10, day 20, day 28 etc. For most of the patients i have a sample for the end of treatment time point. All samples have no replicates.

After reading a lot about different approaches to do an analysis without replicates, i ended up with the manual of edgeR (chapter 3.5) and tried to fit it for my experiment. What i have done so far (beside mapping the reads and gathering the counts):

I created a design matrix which looks like this:

Please follow the link for better readability: https://pastebin.com/P8nP7GCX

As you can see there are all my samples for each patient and a variable top rank or low rank and also responder status. Unfortunately i have no information about remission state for the patients, so the classification for responder has been done by using the median value of the overall survival. Furthermore the top / low rank is just a ranking order by the overall survival time. I'm not sure if this is a good approach to group the samples, but for now i couldn't think of another approach to group.

Then i created my counts matrix with the correct order of the counts:

matrix <- readDGE(files)

After that i created the DGEList object:

y <- DGEList(counts=matrix)
filtergroup = c(1,1,1,1,1,1,1,2,1,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1)
keep <- filterByExpr(y, group=filtergroup)
y <- y[keep, , keep.lib.sizes=FALSE]

Please note, that i used the classification of responder / non-responder as grouping.

Now i've created a contrast and run the analysis:

contrasts <- makeContrasts(ScrTopRankVsScrLowRank = Scr.LowRank-Scr.TopRank, levels = design)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, contrast=contrasts[,"ScrTopRankVsScrLowRank"])

So basically if compared all of my top ranked samples from screening time point (4) vs. the lower ranked (9).

For the next step i want to create a heatmap of the normalized counts of the top DGEs. Since the result set is quite huge, i was looking for a cutoff to minimize it. The problem is, that all approaches i've found so far are using the FDR value as a cutoff, which happens to be 1 for all my DGE.

So my questions are now:

  1. Is the general approach i'm following usable for the design of the experiment?

  2. How to minimize the DGE, or shouldn't i do this?

I'm also very open for any suggestions which contrasts to analyze, as i'm not a biologist but a medical student who is really struggling with this analysis.

If needed i can provide some screenshots of the samples table.

Thanks for help in advance!

Regards, Kevin

edgeR RNASeq multi • 319 views
Entering edit mode
Last seen 1 hour ago
WEHI, Melbourne, Australia

Your experiment is replicated because each patient is a replicate. Your experimental design is paired (also called blocked or repeated-measures), a standard type of design, although apparently with more time points for some patients than others. The edgeR User's Guide advice on non-replicated experiments isn't relevant here.

It is difficult to say more about your experiment without seeing the sample table and knowing what scientific hypotheses you are wanting to test. Show us the sample table as code (not as a screen shot) and show us the code you used to create the design matrix. How does the filtergroup variable correspond to your sample table?

If you're a medical student and you have a complex experiment, it is normally good to collaborate with a bioinformatician or statistician at your own institution. We may be able to give you some suggestions here (if you give more information) but ultimately we can't take responsibility for the analysis. We cannot make key research decisions for you such as what hypotheses to test. It's not the same as a proper collaboration.

I don't understand your question about minimizing DGE. On one hand you say that the FDR is always 1, meaning that you have no significant DE genes at all. On the other hand, you say that the results set is huge. The FDR is only significance cutoff that should be used. Heatmaps are always restricted to the top 100-200 genes regardless of the total number of DE genes so there is no need for a cutoff anyway -- all you need is the ranking by p-value.

Heatmaps and DE criteria are demonstrated in the edgeR workflow: https://www.bioconductor.org/packages/release/workflows/vignettes/RnaSeqGeneEdgeRQL/inst/doc/edgeRQL.html

Entering edit mode

Hi Gordon,

thank you very much for your reply. Here is the sample table:

Patient Timepoint   Dose    Id  Responder   Rank    Mutated Genes
T01001  scr 20  469 Y   14  STAG2,TET2,ASXL1,EZH2,ZRSR2
T01002  scr 20  479 Y   11  IDH1,STAG2,DNMT3A
T01002  d10 20  471 Y   11  IDH1,STAG2,DNMT3A
T01002  d20 20  473 Y   11  IDH1,STAG2,DNMT3A
T01002  d28 20  475 Y   11  IDH1,STAG2,DNMT3A
T01002  eot 20  477 Y   11  IDH1,STAG2,DNMT3A
T01004  scr 40  481 N   22  ASXL1,CSF3R,DNMT3A,IDH1,NPM1,STAG2
T01005  scr 40  483 Y   3   IDH1,NPM1,SMC1A,NRAS
T01007  scr 60  485 N   18  CBL,CUX1,STAG2
T01009  scr 80  465 Y   5   CEBPA,EZH2,NPM1,TET2
T01009  d56 80  461 Y   5   CEBPA,EZH2,NPM1,TET3
T01009  eot 80  447 Y   5   CEBPA,EZH2,NPM1,TET4
T01010  scr 80  467 N   12  DNMT3A,TET2,NPM1,FLT3-ITD 
T01010  eot 80  451 N   12  DNMT3A,TET2,NPM1,FLT3-ITD 
T02001  scr 80  455 N   19  NOTCH1,IDH1,ASXL1,FLT3,STAG2,IDH2
T02001  d10 80  453 N   19  NOTCH1,IDH1,ASXL1,FLT3,STAG2,IDH3
T02001  d20 80  459 N   19  NOTCH1,IDH1,ASXL1,FLT3,STAG2,IDH4
T02001  eot 80  457 N   19  NOTCH1,IDH1,ASXL1,FLT3,STAG2,IDH5
T03001  scr 60  501 N   20  ASXL1,DNMT3A,IDH1,RUXN1,SRSF2
T03001  eot 60  487 N   20  ASXL1,DNMT3A,IDH1,RUXN1,SRSF3
T05001  scr 40  495 Y   4   SRSF2,ASXL1,TET2,U2AF1
T05001  d20 40  489 Y   4   SRSF2,ASXL1,TET2,U2AF2
T05001  d28 40  491 Y   4   SRSF2,ASXL1,TET2,U2AF3
T05001  d56 40  493 Y   4   SRSF2,ASXL1,TET2,U2AF4
T05001  eot 40  499 Y   4   SRSF2,ASXL1,TET2,U2AF5
T05003  scr 40  445 Y   2   SRSF2,IDH1,FLT3,IDH2,NPM1
T05004  scr 60  497 N   21  IDH1
T05004  d10 60  463 N   21  IDH1
T05004  eot 60  449 N   21  IDH1
T05005  scr 60  443 Y   10  ASXL1,RUXN1

The property 'Responder' was defined by the median overall survival time of each patient. So if a patient survived longer than the median, he / she is considered to be a responder. I did this on suggestion of my Professor. The filtergroup variable represents this attribute. The ranking refers to the overall survival time, too.

For the design matrix i did:

design <- model.matrix(~Patient)
Scr.TopRank <- Timepoint == 'scr' & as.numeric(Rank) <= 5
Scr.LowRank <- Timepoint == 'scr' & as.numeric(Rank) > 5
Eot.TopRank <- Timepoint == 'eot' & as.numeric(Rank) <= 5
Eot.LowRank <- Timepoint == 'eot' & as.numeric(Rank) > 5
design <- cbind(design, Scr.TopRank, Scr.LowRank, Eot.TopRank, Eot.LowRank)

I ran an analysis comparing the top 4 ranked against the other samples, both for screening (time 0) time point and also screening vs. end of treatment. This has been suggested by a fellow in my lab.

contrasts <- makeContrasts(ScrResponderVsScrNonResponder = Scr.NonResponder-Scr.Responder, levels = design)
# do analysis
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, contrast=contrasts[,"ScrResponderVsScrNonResponder"])

This gives me the following heatmap:heatmap screening low rank vs. screening high rank

After reading more about the design matrix in the edgeR manual i found, that i maybe should create the initial design matrix like this:

design <- model.matrix(~0+Patient)

Without an intercept column. But i'm not sure which one is correct.

For the FDR, it's correct that it's always 1. By huge result set i mean, that there is a long list of genes reported by topTags. Here is an example output:

                    logFC     logCPM        F       PValue FDR
ENSG00000255435  2.812487 -1.3080662 16.83877 0.0006801096   1
ENSG00000145050 -2.036380  4.5508780 15.06660 0.0011121286   1
ENSG00000205755 -2.916518  0.5486569 13.46371 0.0017806606   1
ENSG00000279439  2.607781  1.4777663 13.30177 0.0018701749   1
ENSG00000134508 -1.443832  4.5556861 11.30776 0.0035070928   1
ENSG00000275560 -1.682709  3.0815607 11.19200 0.0036429485   1

I hope this additional data will clarify some things. Thanks for the advice to reach out for an bioinformatician or statistician. I'll definitely do this, but first i wanted to figure out / understand some basics so i can ask specific questions.

Regards, Kevin

Entering edit mode

Dear Kevin,

There a many issues here and I have little time to reply, so I will try to give you some brief feedback:

  1. Your analysis is not correct. Your experiment has six different time points. You are currently treating d10, d20, d28 and d56 as the same time-point, although clearly they are not. Furthermore your analysis is treating these five time points as the baseline, whereas scr would be a more meaningful baseline. I also don't think the filtering step is appropriate, but that is a more minor concern than the treatment of timepoints.

  2. Your dataset is very, very much underpowered for the type of analysis you are trying to do. You only have 13 patients, and most of them only have observations at one or two timepoints, so there are actually only 5 patients who are making any contribution at all to your current analysis. In my 20 years of working in genomics, I have never seen publishable results produced from such a small experiment with human subjects.

  3. There is no suggestion from the analysis results or the heatmap of any systematic differences in your data. The differences mostly appear to be patient-specific. There does not appear to be any significant differential expression, and that's what one would expect from such an underpowered experiment.

  4. I feel it is important that you consult an experienced RNA-seq analyst to trouble-shoot whether it is reasonable to expect this dataset to yield any useful results. This would start with some quality checking and graphical exploration of the data.

  5. You haven't articulated any scientific questions you are trying to answer and that is normally a starting point to designing an analysis. I suggest you have a discussion with your professor and with the experienced analyst as to what questions might be interesting to pose. I suspect you will need to tone down the complexity of the questions you are trying to answer in view of the limitations of the data.

  6. It makes no difference whether you use ~Patient or ~0+Patient. Both will give identical results.

  7. I don't think it would be fruitful for me to make detailed corrections to the code, but I will note that your contrast statement makes reference to columns that are not in the design matrix. Also the code does not explain what is shown in the heatmap.

  8. topTags gives the number of top genes that you ask for. It will only give a long list if you ask for a long list. By default it just shows you the top 10.

Entering edit mode

Hi Gordon,

thanks again for your reply. First my biological questions i want to to pursue:

Q1) To find genes differentially expressed (DE) between screening time point samples:

group <- factor(c(2,2,3,3,3,3,2,1,2,2,3,3,2,3,2,3,3,3,2,3,1,3,3,3,3,1,2,3,3,2))
design <- model.matrix(~group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)
qlf <- glmQLFTest(fit, coef=2)

So if i understood it right with my group i grouped the samples intro three groups: 1: screening time point samples with a high rank for overall survival, 2: screening time point samples with a low rank and 3: all other time points. With coef=2 i compared group 2 vs. 1.

Again this doesn't give me a list of DGEs with significance, but i understand that this maybe won't be the case for any contrast due to the experiment design / amount of samples.

The thing i want to know is, if at least the analysis for my question in this case is correct.

Q2/3: The overall goal is to identify genes / pathways which a differently regulated in screening time point (my professor calls this the baseline question) and later on to find DGEs for the contrast of screening time point vs. end of treatment time point (he calls it the dynamic question). After finding this genes / pathway i have done an experiment with two cell lines with comparable treatment and i'm going to try to compare the genes / pathways from the human samples with the results of this cell line experiments.

But currently i'm not sure how to find this genes / pathway. When i'm doing a heatmap on all counts of the screening time point samples i got this:

heatmap of screening time point counts

Which looks quite heterogeneous for me. Should i try to work with k-means here? If i understand this right k-means could help me to identify sub-clusters which i could you for more focused analysis for DGE. The grouping for responders / non-responder is again just classified by the overall survival median above / below status. Basically it's just to have a quick look if these samples cluster together, which in my opinion they don't do completely.

Is this approach basically the right way? Or am i completely misunderstanding the concepts?

Again, thanks for your advice to reach out for an analyst at my university to get help with this.



Entering edit mode

No your approaches are not right -- differences between patients can't be ignored when assessing time effects and clustering can't be used a precursor to DE analysis. It is not true that k-means can be used to identify a more focused DE analysis. Your original approach was closer to the mark. This is an unbalanced multilevel experiment (in the sense explained in the limma User's Guide).

This would be a very difficult dataset to analyse even for an experienced analyst and, to be honest, I am not sure that you are understanding the issues involved. I would personally analyse this data in limma rather than in edgeR because of it's ability to model random effects and quality weights.

Having said that, there is no analysis that can discover signal in a dataset that doesn't have any. The first exploration step normally would be to make an MDS plot to see whether the samples separate in the way you would expect. All the case studies in the limma and edgeR documentation show MDS plots.

I only have time to explain software syntax rather than to design analyses so I will have to leave it here. Your professor's comments about "baseline" and "dynamic" questions are on the ball. It is your professor's responsibility to make sure you have the necessary guidance and technical assistance.


Login before adding your answer.

Traffic: 376 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6