Question

Generation of QSEA TCGA 450K Human Methylation Calibration Dataset

2

Entering edit mode

justinburgener ▴ 20

@justinburgener-12755

Last seen 7.0 years ago

Hi all,

This question is directed to those who may have experience with the QSEA R package, which proceeds from MEDIPS.

I have MeDIP-seq data which I hope to begin analyzing very shortly, but unfortunately I'm not too familiar with processing genomic data. I've been able to following the tutorial for QSEA up until the normalization section, which suggests the use of a previously validated dataset. 450k TCGA data from lung is used in this example, and I such I presumed to use Head and Neck for mine. However I'm unable to figure out how to average the beta values within 500bp windows across all samples, as described within the tutorial, which looks like this:

I believe it may have been done by GRanges, or another R package which utilizes GRanges, but I am unable to generate the same object format. I've tried packages such as minfi, GRanges, but I lack the hands-on knowledge to further process the TCGA data after import.

I was wondering if anyone has any suggestions as to what R packages or software I can use to produce a similar R object.

Thank you,

Justin

illimina 450k methylation qsea MEDIPS TCGA • 2.9k views

ADD COMMENT • link updated 3 months ago by Matthias Lienhard ▴ 240 • written 7.0 years ago by justinburgener ▴ 20

2

Entering edit mode

Matthias Lienhard ▴ 240

@matthias-lienhard-6292

Last seen 3 months ago

Max Planck Institute for molecular Gene…

On this page you find the methylation values for several TCGA cohorts. https://oc-molgen.gnz.mpg.de/owncloud/s/YrQfJnAZLMbTcKb For each cohort you find 3 files:

1) Methylation levels on probe level. the fist 4 columns are the probe id, the chromosome, the position on hg38 and the associated gene: *_methylation450_level3_beta_values.table

2 & 3) Methylation levels summarized in genomic bins of size 250 and 500 bases, processed as described above: *_meth450k_level3_wd500.table and wd250.table

Use the TCGA sample barcodes from the header to find out which columns are control and which are tumor, as explained here: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/

e.g. the column ACC.HumanMethylation450.1..TCGA-OR-A5JO-01A-11D-A29J-05_Beta_value

is a normal control, as indicated by the highlighted part (11)

Note all positions refer to hg38!

Best, Matthias

ADD COMMENT • link 3.7 years ago Matthias Lienhard ▴ 240

0

Entering edit mode

Thanks Matthias, that looks great! Simon

ADD REPLY • link 3.7 years ago Simon Pearce • 0

0

Entering edit mode

Matthias Lienhard ▴ 240

@matthias-lienhard-6292

Last seen 3 months ago

Max Planck Institute for molecular Gene…

I noticed the link to the calibration data was broken, so I re-uploaded the data to https://nc.molgen.mpg.de/cloud/index.php/s/ZqfEQsJCJi6ANp8

ADD COMMENT • link 3 months ago Matthias Lienhard ▴ 240

score 2 · Accepted Answer · 2017-04-04

2

Entering edit mode

Matthias Lienhard ▴ 240

@matthias-lienhard-6292

Last seen 3 months ago

Max Planck Institute for molecular Gene…

Hi Justin,

I used two perl script to summarize the TCGA data into genome wide bins:

1) SummarizeTables.pl combines several sample files in a single table.

2) summarize_bs_in_bins.pl sumerizes bisulfite values in genome wide bins of fixed size by computing the average % methylation values.

I provide these scripts on my github repository: https://github.com/MatthiasLienhard/hts_scripts.git

The tables can then be averaged over normal and tumor samples in R directly.

For TCGA head and neck squamous cell carcinoma, I already prepared an 500 base overview table, in case you are interested, please contact me.

Best, Matthias

ADD COMMENT • link 7.0 years ago Matthias Lienhard ▴ 240

0

Entering edit mode

Thank you for all the useful information! I'll take a look at the script and try to reproduce the tables to gain some more experience. I'll definitely contact you should I have difficulties and require the table you already prepared, very convenient! :)

Thanks again,

Justin

ADD REPLY • link 7.0 years ago justinburgener ▴ 20

0

Entering edit mode

Hi Matthias,

Thanks for the detail steps. I generated the R table for COAD samples.

When I run the command: wd=findOverlaps(tcgacoad450kmeth, getRegions(qseaSet), select="first") I get some NULL values in wd.

When I pass this to the addEnrichmentParameters method I get below error: Error: subscript contains NAs

Shall I remove those windows from wd before running the addEnrichmentParameters?

Thanks

Gaurav

ADD REPLY • link 4.1 years ago gaurav4thareja ▴ 20

0

Entering edit mode

Hi Gaurav,

like that its difficult to judge whether this is normal and acceptable. Can you provide more details, that make it easier to reproduce the issue? Where exactly did the NAs (or NULLs?) appear? If you have a commented script, that contains the function output, that would be best.

Best, Matthias

ADD REPLY • link 4.1 years ago Matthias Lienhard ▴ 240

1

Entering edit mode

Hi Matthias,

I was able to resolve this issue by selecting only those chromosomes in tcgacoad450kmeth that are in qseaSet. Any extra regions or windows in tcgacoad450kmeth object will give this error.

I am getting values > 1 in output of getOffset(qseaSet, scale="fraction"). Ideally, these values should be less than 1 as per the tutorial. What could be the reason for >1 numbers.

Regards

Gaurav

ADD REPLY • link 4.1 years ago gaurav4thareja ▴ 20

0

Entering edit mode

Hi Gaurav,

large estimated offset values around 1 indicate that the coverage in regions without any CpG are as high as in regions with CpGs. This suggests that the MeDIP enrichment was not effective or the samples are completely demethylated. Do you see this for all your samples?

ADD REPLY • link 4.1 years ago Matthias Lienhard ▴ 240

0

Entering edit mode

Hi Matthias,

Yes, I am observing for all samples.

01-NT 04-NT 05-NT 08-NT 09-NT 10-NT 11-NT 12-NT 14-NT 0.9169249 1.1974449 1.1248954 1.0219849 0.9956591 0.5468642 0.9655438 1.1318748 1.2267099 22-NT 26-NT 38-NT 39-NT 40-NT 46-NT 50-NT 01-TT 04-TT 1.5692617 1.0610539 1.3977801 0.6516493 0.8128844 1.2260762 1.1285125 1.2004545 1.1669367 05-TT 08-TT 09-TT 10-TT 11-TT 12-TT 14-TT 22-TT 26-TT 1.0152239 1.0229919 1.2373184 0.7241613 1.2339220 1.4789610 1.2667688 1.1665162 1.0011935 38-TT 39-TT 40-TT 46-TT 50-TT 1.5121940 0.6874239 0.8598436 1.2930325 0.9949078

So, is the data has no enrichment or shall I same other signal matrix?

ADD REPLY • link 4.0 years ago gaurav4thareja ▴ 20

0

Entering edit mode

Data indicate that there is no enrichment, however, of course its worth digging into it, e.g. by looking at regions with / without expected enrichment in a genome browser. Of course there is a chance that the selected calibration regions are unfortunate.

ADD REPLY • link 4.0 years ago Matthias Lienhard ▴ 240

0

Entering edit mode

Data indicate that there is no enrichment, however, of course its worth digging into it, e.g. by looking at regions with / without expected enrichment in a genome browser. Of course there is a chance that the selected calibration regions are unfortunate.

ADD REPLY • link 4.0 years ago Matthias Lienhard ▴ 240

0

Entering edit mode

Hi Matthias, Is it possible for you to share the full result for LUAD/LUSC somewhere (not just the chr20-22), rather than me generating it again?

Thanks, Simon

ADD REPLY • link 3.8 years ago Simon Pearce • 0

0

Entering edit mode

Sorry, I missed that comment. Do you still need the TCGA calibration data?

ADD REPLY • link 3.7 years ago Matthias Lienhard ▴ 240

0

Entering edit mode

Yes please Matthias, that would be appreciated.

ADD REPLY • link 3.7 years ago Simon Pearce • 0

0

Entering edit mode

See my new answer for the link to the data for 35 TCGA cohorts

ADD REPLY • link 3.7 years ago Matthias Lienhard ▴ 240

0

Entering edit mode

Hi Matthias,

Is it possible to go through the workflow (stated above in your answer) in R? I am not familiar with the perl universe. I have an in-house 450K prostate cancer data set, which I would like to apply in the the calibration step, and as all my data map to hg19, I cannot use your TCGA cohort tables.

Best, Karoline

ADD REPLY • link 3.6 years ago Karoline • 0

0

Entering edit mode

Sure, but can you please make a new thread or answer this one, as this comments get quite confusing. Please specify what you already tried and what did not work. The scripts should work without knowing perl as they work with command line options.

ADD REPLY • link 3.6 years ago Matthias Lienhard ▴ 240