ISO statistical test for obs-exp A,T,C,G counts by genomic range/position
0
0
Entering edit mode
mat149 ▴ 80
@mat149-11450
Last seen 5 days ago
United States

Hello,

My question is somewhat open-ended, contextual, and difficult to phrase.

I am looking for clarity on what statistical test is most appropriate for my biological question. I suspect that a chi-square test is most suitable, but I am not certain.

I would like to demonstrate/test that a specific set of genomic ranges (ex. the -125 upstream genomic ranges of all tx's (n=10,897) in a genome) have reliable differences in A,T,C,G abundances compared to A,T,C,G abundances from the entire genome (collectively).

Are there best practices/methods for this situation?

To provide context (img link below), the observed 'C' abundances and 'G' abundances at the target genomic ranges deviate from the expected C/G abundances (as calculated from the entire genome) at approx. -125 to +0; the trend may continue downstream. What statistical test(s) should I use to determine if observed C/G abundance is reliably different in the -125 upstream genomic ranges of transcripts versus expected C/G abundance [calculated from the entire genome]?

https://imgur.com/a/iSZ8Bae

Looking forward to thoughts/insight on this. Thanks,

##########################################################
## A,T,C,G counts/freqs on the genome ##
##########################################################
seq1<-getSeq(genome)                            # BSgenome

head(seq1)
DNAStringSet object of length 6:
      width seq                                                          names               
[1] 4119637 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042185.1
[2] 2929485 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042186.1
[3] 2900960 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042187.1
[4] 2873850 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042188.1
[5] 2784243 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042189.1
[6] 2581745 TAACCCTAACCCAACCCTAACCCAACCCT...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042190.1

length(seq1)
[1] 21

frequency1<-alphabetFrequency(seq1,as.prob=TRUE)
frequency1<-frequency1[,1:4]                        
count1<-alphabetFrequency(seq1,as.prob=FALSE)
count1<-count1[,1:4]
count1                                                        # expected ATCG counts -- genome
            A      C      G       T
 [1,] 1085873 976295 976627 1080842
 [2,]  815780 647099 654133  812473
 [3,]  783045 666510 661625  789780
 [4,]  759804 675299 676552  762195
 [5,]  749268 640940 640652  753383
 [6,]  679659 612020 614759  675307
 [7,]  689116 600150 600712  684001
 [8,]  676965 555146 556370  683715
 [9,]  695608 542313 540795  693477
[10,]  675126 560809 563092  672300
[11,]  618650 529356 529676  623740
[12,]  552427 491128 491360  553398
[13,]  533863 484380 484601  536334
[14,]  548054 453910 453529  548248
[15,]  548831 436579 436427  545076
[16,]  451486 400734 399571  452419
[17,]  440559 372603 371746  441287
[18,]  389881 346353 349550  390750
[19,]  325980 295454 294424  325582
[20,]  182840 110590 113550  177699
[21,]   49416  20349  22782   46488

frequency1                                                   # expected ATCG freqs -- genome
              A         C         G         T
 [1,] 0.2635846 0.2369857 0.2370663 0.2623634
 [2,] 0.2784722 0.2208917 0.2232928 0.2773433
 [3,] 0.2699262 0.2297550 0.2280711 0.2722478
 [4,] 0.2643854 0.2349806 0.2354166 0.2652174
 [5,] 0.2691101 0.2302026 0.2300992 0.2705881
 [6,] 0.2632557 0.2370567 0.2381176 0.2615700
 [7,] 0.2677240 0.2331604 0.2333787 0.2657368
 [8,] 0.2738314 0.2245558 0.2250509 0.2765618
 [9,] 0.2813729 0.2193652 0.2187511 0.2805109
[10,] 0.2731836 0.2269263 0.2278501 0.2720401
[11,] 0.2688121 0.2300126 0.2301516 0.2710237
[12,] 0.2645327 0.2351793 0.2352904 0.2649976
[13,] 0.2618030 0.2375369 0.2376453 0.2630148
[14,] 0.2735154 0.2265313 0.2263411 0.2736122
[15,] 0.2790317 0.2219615 0.2218842 0.2771226
[16,] 0.2649239 0.2351436 0.2344611 0.2654714
[17,] 0.2709140 0.2291257 0.2285987 0.2713617
[18,] 0.2640515 0.2345716 0.2367368 0.2646400
[19,] 0.2625822 0.2379930 0.2371633 0.2622616
[20,] 0.3127186 0.1891465 0.1942091 0.3039257
[21,] 0.3554213 0.1463588 0.1638580 0.3343619



gen<-sum(seqlengths(genome))
A<-sum(count1[,1])/gen      # A = .2701629
C<-sum(count1[,2])/gen      # C = .2297183
G<-sum(count1[,3])/gen      # G = .2300384
T<-sum(count1[,4])/gen      # T = .2700805

##########################################################
## A,T,C,G counts/freqs on -125 to +0 genomic ranges for all tx's in the genome ##
##########################################################
tb<-transcriptsBy(tx,by="gene")                                                                # txdb
tb<-unlist(tb)
seq2<-getPromoterSeq(tb,genome,upstream=125,downstream=0,use.names=TRUE)
frequency2<-alphabetFrequency(seq2,as.prob=TRUE)
frequency2<-frequency2[,1:4]                                    
count2<-alphabetFrequency(seq2,as.prob=FALSE)                       
count2<-count2[,1:4]                                        


head(count2)                                                                                         # observed counts at tx's
      A  C  G  T
[1,] 37 40 21 27
[2,] 32 38 25 30
[3,] 34 42 21 28
[4,] 26 44 26 29
[5,] 36 32 27 30
[6,] 34 32 36 23

dim(count2)
[1] 10897     4


head(frequency2)                                                   # observed freqs at tx's
         A     C     G     T
[1,] 0.296 0.320 0.168 0.216
[2,] 0.256 0.304 0.200 0.240
[3,] 0.272 0.336 0.168 0.224
[4,] 0.208 0.352 0.208 0.232
[5,] 0.288 0.256 0.216 0.240
[6,] 0.272 0.256 0.288 0.184


dim(frequency2)
[1] 10897     4


A2<-sum(count2[,1])/sum(count2[,1:4])       # A = 0.2607969
C2<-sum(count2[,2])/sum(count2[,1:4])       # C = 0.3042863
G2<-sum(count2[,3])/sum(count2[,1:4])       # G = 0.1835566
T2<-sum(count2[,4])/sum(count2[,1:4])       # T = 0.2513602
GenomicRanges • 385 views
ADD COMMENT

Login before adding your answer.

Traffic: 950 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6