Question

extracting promoter region error for A.thaliana

0

Entering edit mode

sekawaiwai2006 ▴ 20

@sekawaiwai2006-9531

Last seen 9.5 years ago

Hello

Below are the codes that i used in order extract the upstream 1000 bp for every genes in A.thaliana. however, i get an error at the end when using the command :up1000seqs <- getSeq(genome, up1000), does anyone know what is the problem here??

thx

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘getSeq’ for signature ‘"character"’

My code is

library("TxDb.Athaliana.BioMart.plantsmart28")
txdb <- TxDb.Athaliana.BioMart.plantsmart28
genome<-library("BSgenome.Athaliana.TAIR.TAIR9")
gn <- sort(genes(txdb)) 
up1000 <- flank(gn, width=1000)
up1000seqs <- getSeq(genome, up1000)

Session info is

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Athaliana.TAIR.TAIR9_1.3.1000     
 [2] BSgenome_1.38.0                            
 [3] rtracklayer_1.30.1                         
 [4] Biostrings_2.38.3                          
 [5] XVector_0.10.0                             
 [6] TxDb.Scerevisiae.UCSC.sacCer2.sgdGene_3.2.2
 [7] TxDb.Athaliana.BioMart.plantsmart28_3.2.2  
 [8] GenomicFeatures_1.22.8                     
 [9] AnnotationDbi_1.32.3                       
[10] Biobase_2.30.0                             
[11] GenomicRanges_1.22.3                       
[12] GenomeInfoDb_1.6.2                         
[13] IRanges_2.4.6                              
[14] S4Vectors_0.8.7                            
[15] BiocGenerics_0.16.1                        

loaded via a namespace (and not attached):
 [1] zlibbioc_1.16.0            GenomicAlignments_1.6.3   
 [3] BiocParallel_1.4.3         tools_3.2.3               
 [5] SummarizedExperiment_1.0.2 DBI_0.3.1                 
 [7] lambda.r_1.1.7             futile.logger_1.4.1       
 [9] futile.options_1.0.0       bitops_1.0-6              
[11] RCurl_1.95-4.7             biomaRt_2.26.1            
[13] RSQLite_1.0.0              Rsamtools_1.22.0          
[15] XML_3.98-1.3

promoter region txdb • 2.0k views

ADD COMMENT • link updated 10.0 years ago by Martin Morgan 25k • written 10.0 years ago by sekawaiwai2006 ▴ 20

Martin Morgan · Answer 1 · 2016-01-28

0

Entering edit mode

Martin Morgan 25k

@martin-morgan-1513

Last seen 17 days ago

United States

In your code genome is the return value of library(). Instead you would like it to reference the genome that is present in the library

library("BSgenome.Athaliana.TAIR.TAIR9")
genome <- BSgenome.Athaliana.TAIR.TAIR9

ADD COMMENT • link 10.0 years ago Martin Morgan 25k

0

Entering edit mode

so i have changed the code now

library("TxDb.Athaliana.BioMart.plantsmart28")
txdb <- TxDb.Athaliana.BioMart.plantsmart28
library("BSgenome.Athaliana.TAIR.TAIR9")
genome<-BSgenome.Athaliana.TAIR.TAIR9

gn <- sort(genes(txdb)) 
up1000 <- flank(gn, width=1000)
up1000seqs <- getSeq(genome, up1000)

now the error appears like this:

Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width,  : 
  sequence 1 not found
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

ADD REPLY • link updated 10.0 years ago by Martin Morgan 25k • written 10.0 years ago by sekawaiwai2006 ▴ 20

0

Entering edit mode

I should have sat on my fingers and let others respond! The problem is that the sequence (chromosome) names in the TxDb are different from the sequence names in the BSgenome.

This could be because they are from the same genome build but using different naming conventions, or because the genome builds are different. From the following

> seqinfo(txdb)
Seqinfo object with 7 sequences (1 circular) from an unspecified genome:
  seqnames seqlengths isCircular genome
  1          30427671         NA   <NA>
  2          19698289         NA   <NA>
  3          23459830         NA   <NA>
  4          18585056         NA   <NA>
  5          26975502         NA   <NA>
  Mt           366924       TRUE   <NA>
  Pt           154478         NA   <NA>
> seqinfo(genome)
Seqinfo object with 7 sequences (2 circular) from TAIR9 genome:
  seqnames seqlengths isCircular genome
  Chr1       30427671      FALSE  TAIR9
  Chr2       19698289      FALSE  TAIR9
  Chr3       23459830      FALSE  TAIR9
  Chr4       18585056      FALSE  TAIR9
  Chr5       26975502      FALSE  TAIR9
  ChrM         366924       TRUE  TAIR9
  ChrC         154478       TRUE  TAIR9

it looks like the naming convention is different. One can rename seqlevels on either object (I think...)

> seqlevels(txdb) = seqlevels(genome)
> seqinfo(txdb)
Seqinfo object with 7 sequences (1 circular) from an unspecified genome:
  seqnames seqlengths isCircular genome
  Chr1       30427671         NA   <NA>
  Chr2       19698289         NA   <NA>
  Chr3       23459830         NA   <NA>
  Chr4       18585056         NA   <NA>
  Chr5       26975502         NA   <NA>
  ChrM         366924       TRUE   <NA>
  ChrC         154478         NA   <NA>

and then have some success

> gn <- sort(genes(txdb)) 
> up1000 <- flank(gn, width=1000)
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 1 out-of-bound range located on sequence Chr3.
  Note that only ranges located on a non-circular sequence whose length
  is not NA can be considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences). You can use trim() to trim these ranges. See
  ?`trim,GenomicRanges-method` for more information.

I guess this occurs because some genes are within 1000 nucleotides of the start / end of the sequence, so the flank extends outside the chromosome bounds.

Following the advice

> getSeq(genome, trim(up1000))
  A DNAStringSet instance of length 33602
        width seq                                           names               
    [1]  1000 ATATTGCTATTTCTGCCAATA...CTTCACTGTCTTCCTCCCTCC AT1G01010
    [2]  1000 ATGCGTGTTGGATGCTTACAA...AAAACAGACCAGAAGAGAGAG AT1G01040
    [3]  1000 CGCGATTCTTTTTGGCAATGA...CTAAGTTTCATTAGATACTAA AT1G01046
    [4]  1000 AGTTCTAATTTTGCTTTCTGT...TATTAGTAAAGCAATTAGAAT AT1G01073
    [5]  1000 GATGGGCCGCATAAATATTCG...AAGTAATTAAACACATTACAC AT1G01110
    ...   ... ...
[33598]  1000 TCGACCCGTGCAGTGCTGTAG...CCAGCAGGAGGCCCGCACGAC ATCG01200
[33599]  1000 GATCCGAATGAATCATCTTTT...TATAAGTAATGCAACTATGAA ATCG01210
[33600]  1000 ATCCGATCAATTGCGTAAAGC...GTTCTATTTCTTGATGGGGGA ATCG01220
[33601]  1000 GTGCAAGGCGCTTTGGTGGGA...CACTTAGCTCCATGGAACAAT ATCG01270
[33602]  1000 TTTTTATAGGAACTTCTGTAC...AATGCAATTTAGGAGGAATCA ATCG01280

Returning to the warning message, I went looking for flanking sequences that were outside the end of the sequence length

> idx = end(up1000) > seqlengths(up1000)[as.character(seqnames(up1000))]
> table(idx)
idx
FALSE  TRUE 
33600     2 
> up1000[idx]
GRanges object with 2 ranges and 1 metadata column:
            seqnames               ranges strand |     gene_id
               <Rle>            <IRanges>  <Rle> | <character>
  AT3G63540     Chr3 [23459805, 23460804]      - |   AT3G63540
  ATMG01410     ChrM [  366701,   367700]      - |   ATMG01410
  -------
  seqinfo: 7 sequences (1 circular) from an unspecified genome

ADD REPLY • link 10.0 years ago Martin Morgan 25k