Question

ISO statistical test for obs-exp A,T,C,G counts by genomic range/position

0

Entering edit mode

mat149 ▴ 80

@mat149-11450

Last seen 3 months ago

United States

Hello,

My question is somewhat open-ended, contextual, and difficult to phrase.

I am looking for clarity on what statistical test is most appropriate for my biological question. I suspect that a chi-square test is most suitable, but I am not certain.

I would like to demonstrate/test that a specific set of genomic ranges (ex. the -125 upstream genomic ranges of all tx's (n=10,897) in a genome) have reliable differences in A,T,C,G abundances compared to A,T,C,G abundances from the entire genome (collectively).

Are there best practices/methods for this situation?

To provide context (img link below), the observed 'C' abundances and 'G' abundances at the target genomic ranges deviate from the expected C/G abundances (as calculated from the entire genome) at approx. -125 to +0; the trend may continue downstream. What statistical test(s) should I use to determine if observed C/G abundance is reliably different in the -125 upstream genomic ranges of transcripts versus expected C/G abundance [calculated from the entire genome]?

https://imgur.com/a/iSZ8Bae

Looking forward to thoughts/insight on this. Thanks,

##########################################################
## A,T,C,G counts/freqs on the genome ##
##########################################################
seq1<-getSeq(genome)                            # BSgenome

head(seq1)
DNAStringSet object of length 6:
      width seq                                                          names               
[1] 4119637 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042185.1
[2] 2929485 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042186.1
[3] 2900960 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042187.1
[4] 2873850 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042188.1
[5] 2784243 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042189.1
[6] 2581745 TAACCCTAACCCAACCCTAACCCAACCCT...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042190.1

length(seq1)
[1] 21

frequency1<-alphabetFrequency(seq1,as.prob=TRUE)
frequency1<-frequency1[,1:4]                        
count1<-alphabetFrequency(seq1,as.prob=FALSE)
count1<-count1[,1:4]
count1                                                        # expected ATCG counts -- genome
            A      C      G       T
 [1,] 1085873 976295 976627 1080842
 [2,]  815780 647099 654133  812473
 [3,]  783045 666510 661625  789780
 [4,]  759804 675299 676552  762195
 [5,]  749268 640940 640652  753383
 [6,]  679659 612020 614759  675307
 [7,]  689116 600150 600712  684001
 [8,]  676965 555146 556370  683715
 [9,]  695608 542313 540795  693477
[10,]  675126 560809 563092  672300
[11,]  618650 529356 529676  623740
[12,]  552427 491128 491360  553398
[13,]  533863 484380 484601  536334
[14,]  548054 453910 453529  548248
[15,]  548831 436579 436427  545076
[16,]  451486 400734 399571  452419
[17,]  440559 372603 371746  441287
[18,]  389881 346353 349550  390750
[19,]  325980 295454 294424  325582
[20,]  182840 110590 113550  177699
[21,]   49416  20349  22782   46488

frequency1                                                   # expected ATCG freqs -- genome
              A         C         G         T
 [1,] 0.2635846 0.2369857 0.2370663 0.2623634
 [2,] 0.2784722 0.2208917 0.2232928 0.2773433
 [3,] 0.2699262 0.2297550 0.2280711 0.2722478
 [4,] 0.2643854 0.2349806 0.2354166 0.2652174
 [5,] 0.2691101 0.2302026 0.2300992 0.2705881
 [6,] 0.2632557 0.2370567 0.2381176 0.2615700
 [7,] 0.2677240 0.2331604 0.2333787 0.2657368
 [8,] 0.2738314 0.2245558 0.2250509 0.2765618
 [9,] 0.2813729 0.2193652 0.2187511 0.2805109
[10,] 0.2731836 0.2269263 0.2278501 0.2720401
[11,] 0.2688121 0.2300126 0.2301516 0.2710237
[12,] 0.2645327 0.2351793 0.2352904 0.2649976
[13,] 0.2618030 0.2375369 0.2376453 0.2630148
[14,] 0.2735154 0.2265313 0.2263411 0.2736122
[15,] 0.2790317 0.2219615 0.2218842 0.2771226
[16,] 0.2649239 0.2351436 0.2344611 0.2654714
[17,] 0.2709140 0.2291257 0.2285987 0.2713617
[18,] 0.2640515 0.2345716 0.2367368 0.2646400
[19,] 0.2625822 0.2379930 0.2371633 0.2622616
[20,] 0.3127186 0.1891465 0.1942091 0.3039257
[21,] 0.3554213 0.1463588 0.1638580 0.3343619



gen<-sum(seqlengths(genome))
A<-sum(count1[,1])/gen      # A = .2701629
C<-sum(count1[,2])/gen      # C = .2297183
G<-sum(count1[,3])/gen      # G = .2300384
T<-sum(count1[,4])/gen      # T = .2700805

##########################################################
## A,T,C,G counts/freqs on -125 to +0 genomic ranges for all tx's in the genome ##
##########################################################
tb<-transcriptsBy(tx,by="gene")                                                                # txdb
tb<-unlist(tb)
seq2<-getPromoterSeq(tb,genome,upstream=125,downstream=0,use.names=TRUE)
frequency2<-alphabetFrequency(seq2,as.prob=TRUE)
frequency2<-frequency2[,1:4]                                    
count2<-alphabetFrequency(seq2,as.prob=FALSE)                       
count2<-count2[,1:4]                                        


head(count2)                                                                                         # observed counts at tx's
      A  C  G  T
[1,] 37 40 21 27
[2,] 32 38 25 30
[3,] 34 42 21 28
[4,] 26 44 26 29
[5,] 36 32 27 30
[6,] 34 32 36 23

dim(count2)
[1] 10897     4


head(frequency2)                                                   # observed freqs at tx's
         A     C     G     T
[1,] 0.296 0.320 0.168 0.216
[2,] 0.256 0.304 0.200 0.240
[3,] 0.272 0.336 0.168 0.224
[4,] 0.208 0.352 0.208 0.232
[5,] 0.288 0.256 0.216 0.240
[6,] 0.272 0.256 0.288 0.184


dim(frequency2)
[1] 10897     4


A2<-sum(count2[,1])/sum(count2[,1:4])       # A = 0.2607969
C2<-sum(count2[,2])/sum(count2[,1:4])       # C = 0.3042863
G2<-sum(count2[,3])/sum(count2[,1:4])       # G = 0.1835566
T2<-sum(count2[,4])/sum(count2[,1:4])       # T = 0.2513602

GenomicRanges • 672 views

ADD COMMENT • link 22 months ago mat149 ▴ 80