extracting promoter region error for A.thaliana
1
0
Entering edit mode
@sekawaiwai2006-9531
Last seen 8.3 years ago

Hello 

Below are the codes that i used in order extract the upstream 1000 bp for every genes in A.thaliana. however, i get an error at the end when using the command :up1000seqs <- getSeq(genome, up1000), does anyone know what is the problem here?? 

thx

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘getSeq’ for signature ‘"character"’

My code is

library("TxDb.Athaliana.BioMart.plantsmart28")
txdb <- TxDb.Athaliana.BioMart.plantsmart28
genome<-library("BSgenome.Athaliana.TAIR.TAIR9")
gn <- sort(genes(txdb)) 
up1000 <- flank(gn, width=1000)
up1000seqs <- getSeq(genome, up1000)

Session info is

> sessionInfo()
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BSgenome.Athaliana.TAIR.TAIR9_1.3.1000     
 [2] BSgenome_1.38.0                            
 [3] rtracklayer_1.30.1                         
 [4] Biostrings_2.38.3                          
 [5] XVector_0.10.0                             
 [6] TxDb.Scerevisiae.UCSC.sacCer2.sgdGene_3.2.2
 [7] TxDb.Athaliana.BioMart.plantsmart28_3.2.2  
 [8] GenomicFeatures_1.22.8                     
 [9] AnnotationDbi_1.32.3                       
[10] Biobase_2.30.0                             
[11] GenomicRanges_1.22.3                       
[12] GenomeInfoDb_1.6.2                         
[13] IRanges_2.4.6                              
[14] S4Vectors_0.8.7                            
[15] BiocGenerics_0.16.1                        

loaded via a namespace (and not attached):
 [1] zlibbioc_1.16.0            GenomicAlignments_1.6.3   
 [3] BiocParallel_1.4.3         tools_3.2.3               
 [5] SummarizedExperiment_1.0.2 DBI_0.3.1                 
 [7] lambda.r_1.1.7             futile.logger_1.4.1       
 [9] futile.options_1.0.0       bitops_1.0-6              
[11] RCurl_1.95-4.7             biomaRt_2.26.1            
[13] RSQLite_1.0.0              Rsamtools_1.22.0          
[15] XML_3.98-1.3       
promoter region txdb • 1.7k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 months ago
United States

In your code genome is the return value of library(). Instead you would like it to reference the genome that is present in the library

library("BSgenome.Athaliana.TAIR.TAIR9")
genome <- BSgenome.Athaliana.TAIR.TAIR9
ADD COMMENT
0
Entering edit mode

so i have changed the code now 

library("TxDb.Athaliana.BioMart.plantsmart28")
txdb <- TxDb.Athaliana.BioMart.plantsmart28
library("BSgenome.Athaliana.TAIR.TAIR9")
genome<-BSgenome.Athaliana.TAIR.TAIR9

gn <- sort(genes(txdb)) 
up1000 <- flank(gn, width=1000)
up1000seqs <- getSeq(genome, up1000)

now the error appears like this:

Error in .getOneSeqFromBSgenomeMultipleSequences(x, name, start, NA, width,  : 
  sequence 1 not found
In addition: Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
ADD REPLY
0
Entering edit mode

I should have sat on my fingers and let others respond! The problem is that the sequence (chromosome) names in the TxDb are different from the sequence names in the BSgenome.

This could be because they are from the same genome build but using different naming conventions, or because the genome builds are different. From the following

> seqinfo(txdb)
Seqinfo object with 7 sequences (1 circular) from an unspecified genome:
  seqnames seqlengths isCircular genome
  1          30427671         NA   <NA>
  2          19698289         NA   <NA>
  3          23459830         NA   <NA>
  4          18585056         NA   <NA>
  5          26975502         NA   <NA>
  Mt           366924       TRUE   <NA>
  Pt           154478         NA   <NA>
> seqinfo(genome)
Seqinfo object with 7 sequences (2 circular) from TAIR9 genome:
  seqnames seqlengths isCircular genome
  Chr1       30427671      FALSE  TAIR9
  Chr2       19698289      FALSE  TAIR9
  Chr3       23459830      FALSE  TAIR9
  Chr4       18585056      FALSE  TAIR9
  Chr5       26975502      FALSE  TAIR9
  ChrM         366924       TRUE  TAIR9
  ChrC         154478       TRUE  TAIR9

it looks like the naming convention is different. One can rename seqlevels on either object (I think...)

> seqlevels(txdb) = seqlevels(genome)
> seqinfo(txdb)
Seqinfo object with 7 sequences (1 circular) from an unspecified genome:
  seqnames seqlengths isCircular genome
  Chr1       30427671         NA   <NA>
  Chr2       19698289         NA   <NA>
  Chr3       23459830         NA   <NA>
  Chr4       18585056         NA   <NA>
  Chr5       26975502         NA   <NA>
  ChrM         366924       TRUE   <NA>
  ChrC         154478         NA   <NA>

and then have some success

> gn <- sort(genes(txdb)) 
> up1000 <- flank(gn, width=1000)
Warning message:
In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
  GRanges object contains 1 out-of-bound range located on sequence Chr3.
  Note that only ranges located on a non-circular sequence whose length
  is not NA can be considered out-of-bound (use seqlengths() and
  isCircular() to get the lengths and circularity flags of the underlying
  sequences). You can use trim() to trim these ranges. See
  ?`trim,GenomicRanges-method` for more information.

I guess this occurs because some genes are within 1000 nucleotides of the start / end of the sequence, so the flank extends outside the chromosome bounds.

Following the advice

> getSeq(genome, trim(up1000))
  A DNAStringSet instance of length 33602
        width seq                                           names               
    [1]  1000 ATATTGCTATTTCTGCCAATA...CTTCACTGTCTTCCTCCCTCC AT1G01010
    [2]  1000 ATGCGTGTTGGATGCTTACAA...AAAACAGACCAGAAGAGAGAG AT1G01040
    [3]  1000 CGCGATTCTTTTTGGCAATGA...CTAAGTTTCATTAGATACTAA AT1G01046
    [4]  1000 AGTTCTAATTTTGCTTTCTGT...TATTAGTAAAGCAATTAGAAT AT1G01073
    [5]  1000 GATGGGCCGCATAAATATTCG...AAGTAATTAAACACATTACAC AT1G01110
    ...   ... ...
[33598]  1000 TCGACCCGTGCAGTGCTGTAG...CCAGCAGGAGGCCCGCACGAC ATCG01200
[33599]  1000 GATCCGAATGAATCATCTTTT...TATAAGTAATGCAACTATGAA ATCG01210
[33600]  1000 ATCCGATCAATTGCGTAAAGC...GTTCTATTTCTTGATGGGGGA ATCG01220
[33601]  1000 GTGCAAGGCGCTTTGGTGGGA...CACTTAGCTCCATGGAACAAT ATCG01270
[33602]  1000 TTTTTATAGGAACTTCTGTAC...AATGCAATTTAGGAGGAATCA ATCG01280

Returning to the warning message, I went looking for flanking sequences that were outside the end of the sequence length

> idx = end(up1000) > seqlengths(up1000)[as.character(seqnames(up1000))]
> table(idx)
idx
FALSE  TRUE 
33600     2 
> up1000[idx]
GRanges object with 2 ranges and 1 metadata column:
            seqnames               ranges strand |     gene_id
               <Rle>            <IRanges>  <Rle> | <character>
  AT3G63540     Chr3 [23459805, 23460804]      - |   AT3G63540
  ATMG01410     ChrM [  366701,   367700]      - |   ATMG01410
  -------
  seqinfo: 7 sequences (1 circular) from an unspecified genome
ADD REPLY

Login before adding your answer.

Traffic: 887 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6