Is it possible to make a file in a BED-like format in which each row represents a region of the genome and the columns give features of that region: its length, GC content, DNA methylation, nucleosome occupancy, nucleosome repeat length (NRL), and as many other features as one could gather? (If anyone could suggest other features, that would be great.)
I ask because I would like to apply machine learning algorithms to predict genomic features.
If it is possible, how would I go about collecting the data from different GEO datasets so that each data point I collect is aligned with the region of DNA it is supposed to characterise?
I have two BED files: one gives the NRL of a set of regions, the other gives the 'mappability' of those regions. They are organised as follows:
head(file1)
chr start end mappability
chr1 3000066 3000100 1.0000
chr1 3000100 3000130 0.5000
chr1 3000130 3000199 0.0625
chr1 3000199 3000277 0.0500
head(file2)
chr start end NRL
chr1 3000000 3000067 250
chr1 3000067 3000079 300
chr1 3000079 3000084 200
chr1 3000084 3000099 130
So I am wondering if someone can help me with an R script to parse these files, so that I can make a new file containing the regions where the start-end intervals of the two files overlap, keeping the features from each of the original files.
So I would want my output to be something like this:
head(files_merged)
chr overlap mappability NRL GC_content more_features......
chr1 start-end 1.0000 250
chr1 start-end 0.5000 300
chr1 start-end 0.0625 200
I can see (obviously) how my plan is flawed, in that the regions specified in one file could be much smaller than those in another. Hence I am also open to suggestions for a better way to do this.
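To make the overlap I have in mind concrete, here is a minimal sketch of the logic. It is written in plain Python rather than R, purely to illustrate the idea. The function names `read_bed` and `intersect` are hypothetical. It assumes BED-style half-open coordinates and that the intervals within each file do not overlap one another. Established tools such as `bedtools intersect` or the Bioconductor GenomicRanges package are probably the more robust route for real data.

```python
from collections import defaultdict

def read_bed(lines, value_type=float):
    """Parse whitespace-separated 'chr start end value' lines, skipping headers."""
    by_chrom = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or not fields[1].isdigit():
            continue  # header or blank line
        chrom = fields[0]
        start, end = int(fields[1]), int(fields[2])
        by_chrom[chrom].append((start, end, value_type(fields[3])))
    for intervals in by_chrom.values():
        intervals.sort()  # the sweep below needs intervals sorted by start
    return by_chrom

def intersect(a, b):
    """Return (chrom, start, end, value_a, value_b) for every overlapping piece.

    Assumes intervals within each input do not overlap one another.
    """
    merged = []
    for chrom in sorted(set(a) & set(b)):
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        while i < len(xs) and j < len(ys):
            start = max(xs[i][0], ys[j][0])
            end = min(xs[i][1], ys[j][1])
            if start < end:  # half-open intervals share at least one base
                merged.append((chrom, start, end, xs[i][2], ys[j][2]))
            # advance whichever interval finishes first
            if xs[i][1] <= ys[j][1]:
                i += 1
            else:
                j += 1
    return merged

file1 = """chr start end mappability
chr1 3000066 3000100 1.0000
chr1 3000100 3000130 0.5000
chr1 3000130 3000199 0.0625
chr1 3000199 3000277 0.0500""".splitlines()

file2 = """chr start end NRL
chr1 3000000 3000067 250
chr1 3000067 3000079 300
chr1 3000079 3000084 200
chr1 3000084 3000099 130""".splitlines()

rows = intersect(read_bed(file1), read_bed(file2, value_type=int))
print("chr", "start", "end", "mappability", "NRL", sep="\t")
for row in rows:
    print(*row, sep="\t")
```

Each output row covers only the stretch shared by one region from each file, so regions of different sizes are split rather than forced to match. With the example rows above, every overlap falls inside the first mappability interval, because the NRL regions all end before the second one starts.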