Hello,
My question is somewhat open-ended, contextual, and difficult to phrase.
I am looking for clarity on what statistical test is most appropriate for my biological question. I suspect that a chi-square test is most suitable, but I am not certain.
I would like to demonstrate/test that a specific set of genomic ranges (ex. the -125 upstream genomic ranges of all tx's (n=10,897) in a genome) have reliable differences in A,T,C,G abundances compared to A,T,C,G abundances from the entire genome (collectively).
Are there best practices/methods for this situation?
To provide context (img link below), the observed 'C' abundances and 'G' abundances at the target genomic ranges deviate from the expected C/G abundances (as calculated from the entire genome) at approx. -125 to +0; the trend may continue downstream. What statistical test(s) should I use to determine if observed C/G abundance is reliably different in the -125 upstream genomic ranges of transcripts versus expected C/G abundance [calculated from the entire genome]?
Looking forward to thoughts/insight on this. Thanks,
##########################################################
## A,T,C,G counts/freqs on the genome ##
##########################################################
seq1<-getSeq(genome) # BSgenome
head(seq1)
DNAStringSet object of length 6:
width seq names
[1] 4119637 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042185.1
[2] 2929485 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042186.1
[3] 2900960 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042187.1
[4] 2873850 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042188.1
[5] 2784243 TAACCCTAACCCTAACCCTAACCCTAACC...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042189.1
[6] 2581745 TAACCCTAACCCAACCCTAACCCAACCCT...GTTAGGGTTAGGGTTAGGGTTAGGGTTA CP042190.1
length(seq1)
[1] 21
frequency1<-alphabetFrequency(seq1,as.prob=TRUE)
frequency1<-frequency1[,1:4]
count1<-alphabetFrequency(seq1,as.prob=FALSE)
count1<-count1[,1:4]
count1 # expected ATCG counts -- genome
A C G T
[1,] 1085873 976295 976627 1080842
[2,] 815780 647099 654133 812473
[3,] 783045 666510 661625 789780
[4,] 759804 675299 676552 762195
[5,] 749268 640940 640652 753383
[6,] 679659 612020 614759 675307
[7,] 689116 600150 600712 684001
[8,] 676965 555146 556370 683715
[9,] 695608 542313 540795 693477
[10,] 675126 560809 563092 672300
[11,] 618650 529356 529676 623740
[12,] 552427 491128 491360 553398
[13,] 533863 484380 484601 536334
[14,] 548054 453910 453529 548248
[15,] 548831 436579 436427 545076
[16,] 451486 400734 399571 452419
[17,] 440559 372603 371746 441287
[18,] 389881 346353 349550 390750
[19,] 325980 295454 294424 325582
[20,] 182840 110590 113550 177699
[21,] 49416 20349 22782 46488
frequency1 # expected ATCG freqs -- genome
A C G T
[1,] 0.2635846 0.2369857 0.2370663 0.2623634
[2,] 0.2784722 0.2208917 0.2232928 0.2773433
[3,] 0.2699262 0.2297550 0.2280711 0.2722478
[4,] 0.2643854 0.2349806 0.2354166 0.2652174
[5,] 0.2691101 0.2302026 0.2300992 0.2705881
[6,] 0.2632557 0.2370567 0.2381176 0.2615700
[7,] 0.2677240 0.2331604 0.2333787 0.2657368
[8,] 0.2738314 0.2245558 0.2250509 0.2765618
[9,] 0.2813729 0.2193652 0.2187511 0.2805109
[10,] 0.2731836 0.2269263 0.2278501 0.2720401
[11,] 0.2688121 0.2300126 0.2301516 0.2710237
[12,] 0.2645327 0.2351793 0.2352904 0.2649976
[13,] 0.2618030 0.2375369 0.2376453 0.2630148
[14,] 0.2735154 0.2265313 0.2263411 0.2736122
[15,] 0.2790317 0.2219615 0.2218842 0.2771226
[16,] 0.2649239 0.2351436 0.2344611 0.2654714
[17,] 0.2709140 0.2291257 0.2285987 0.2713617
[18,] 0.2640515 0.2345716 0.2367368 0.2646400
[19,] 0.2625822 0.2379930 0.2371633 0.2622616
[20,] 0.3127186 0.1891465 0.1942091 0.3039257
[21,] 0.3554213 0.1463588 0.1638580 0.3343619
gen<-sum(seqlengths(genome))
A<-sum(count1[,1])/gen # A = .2701629
C<-sum(count1[,2])/gen # C = .2297183
G<-sum(count1[,3])/gen # G = .2300384
T<-sum(count1[,4])/gen # T = .2700805
##########################################################
## A,T,C,G counts/freqs on -125 to +0 genomic ranges for all tx's in the genome ##
##########################################################
tb<-transcriptsBy(tx,by="gene") # txdb
tb<-unlist(tb)
seq2<-getPromoterSeq(tb,genome,upstream=125,downstream=0,use.names=TRUE)
frequency2<-alphabetFrequency(seq2,as.prob=TRUE)
frequency2<-frequency2[,1:4]
count2<-alphabetFrequency(seq2,as.prob=FALSE)
count2<-count2[,1:4]
head(count2) # observed counts at tx's
A C G T
[1,] 37 40 21 27
[2,] 32 38 25 30
[3,] 34 42 21 28
[4,] 26 44 26 29
[5,] 36 32 27 30
[6,] 34 32 36 23
dim(count2)
[1] 10897 4
head(frequency2) # observed freqs at tx's
A C G T
[1,] 0.296 0.320 0.168 0.216
[2,] 0.256 0.304 0.200 0.240
[3,] 0.272 0.336 0.168 0.224
[4,] 0.208 0.352 0.208 0.232
[5,] 0.288 0.256 0.216 0.240
[6,] 0.272 0.256 0.288 0.184
dim(frequency2)
[1] 10897 4
A2<-sum(count2[,1])/sum(count2[,1:4]) # A = 0.2607969
C2<-sum(count2[,2])/sum(count2[,1:4]) # C = 0.3042863
G2<-sum(count2[,3])/sum(count2[,1:4]) # G = 0.1835566
T2<-sum(count2[,4])/sum(count2[,1:4]) # T = 0.2513602