How to remove INDELS/SNPs from a bed file
1
0
Entering edit mode
Cian • 0
@831b6808
Last seen 24 months ago
Ireland

I am trying to remove INDELS from a .bed file because I only need the SNP data. My data is set up so column 1 = chromosome, 2 = start position, 3 = stop position, 4-5 = reference-alternative nucleotide and 6 = patient ID. how do I remove the INDELS along with rest of that row from the file?

Code should be placed in three backticks as shown below


# include your problematic code here with any corresponding output 
# please also include the results of running the following in an R session 

sessionInfo( )
SNP • 1.1k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

That isn't a bed file, by definition. It's more of a free-form file with genomic positions. But still easy enough to deal with. Here's a self-contained example, where I am assuming that your 'bed' file follows the normal convention for a bed file and is 0-based, half open. In Bioconductor we use 1-based, fully closed numbering. If you don't know what that means, read this.

## Fake-o data
## You would have to read your data in using read.table
> starts <-  seq(234, 237, 1)
> d.f <- data.frame(chr = rep("chr1", 4), start = starts, end = starts +c(1,5,1,1), ref = c("A","T","C","G"), alt = c("C", "ACGTG","A","T"), user = 1:4)
> d.f
   chr start end ref   alt user
1 chr1   234 235   A     C    1
2 chr1   235 240   T ACGTG    2
3 chr1   236 237   C     A    3
4 chr1   237 238   G     T    4

## Note we add 1 to the start position to convert to 1-based, fully closed
> gr <- GRanges(d.f[,1], IRanges(d.f[,2] + 1L, d.f[,3]), ref = d.f[,4], alt = d.f[,5], id = d.f[,6])
> gr
GRanges object with 4 ranges and 3 metadata columns:
      seqnames    ranges strand |         ref         alt        id
         <Rle> <IRanges>  <Rle> | <character> <character> <integer>
  [1]     chr1       235      * |           A           C         1
  [2]     chr1   236-240      * |           T       ACGTG         2
  [3]     chr1       237      * |           C           A         3
  [4]     chr1       238      * |           G           T         4
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> width(gr)
[1] 1 5 1 1
> newgr <- gr[width(gr) == 1L]
> newgr
GRanges object with 3 ranges and 3 metadata columns:
      seqnames    ranges strand |         ref         alt        id
         <Rle> <IRanges>  <Rle> | <character> <character> <integer>
  [1]     chr1       235      * |           A           C         1
  [2]     chr1       237      * |           C           A         3
  [3]     chr1       238      * |           G           T         4
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> as(newgr, "data.frame")
  seqnames start end width strand ref alt id
1     chr1   235 235     1      *   A   C  1
2     chr1   237 237     1      *   C   A  3
3     chr1   238 238     1      *   G   T  4

I trust you know how to write that resulting data.frame to disk.

ADD COMMENT

Login before adding your answer.

Traffic: 446 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6