Hello,
I'm using R to analyse RNA-seq data. As a training, I work on normalized counts of RNA-seq data from the article Recurrent SPI1 (PU.1) fusions in high-risk pediatric T cell acute lymphoblastic leukemia, Nat Genet 2017, Seki et al.
I'm using the edgeR package. When I use the glmQLFit
function I have the error Error in xy.coords(x, y, setLab = FALSE) : 'x' and 'y' lengths differ
Here is my script (I'm a beginner, it is probably not perfect) :
> norm.counts <- read.delim("normalized_counts.tsv",
+ header = T, sep = "\t", row.names = 1)
>
> colSums(norm.counts)
TALL001 TALL002 TALL003 TALL004 TALL006 TALL007 TALL008 TALL009
33722027 30093322 34278280 37848244 25315474 23804890 35201963 36456236
TALL010 TALL011 TALL013 TALL014 TALL015 TALL016 TALL017 TALL018
38172855 35779360 32727294 43894216 34418442 36482047 34938025 37990134
TALL019 TALL020 TALL024 TALL025 TALL030 TALL031 TALL032 TALL033
38960315 25773150 30495549 32640739 35919181 43539093 42042277 41543688
TALL034 TALL035 TALL036 TALL037 TALL038 TALL039 TALL040 TALL042
30707698 35396923 43606066 33754135 37889274 36725286 36119062 35467025
TALL044 TALL045 TALL046 TALL047 TALL048 TALL049 TALL050 TALL051
29878225 41663520 32872072 57717587 41764592 36364493 35096191 43286063
TALL052 TALL053 TALL054 TALL055 TALL056 TALL057 TALL061 TALL062
36609160 31687412 31257274 28374736 35096045 40710641 45411284 58606528
TALL063 TALL064 TALL065 TALL066 TALL067 TALL068 TALL069 TALL071
63632920 37907728 31373424 48992507 31730288 52096648 18664742 53593920
TALL072 TALL073 TALL074 TALL075 TALL076 TALL077 TALL078 TALL079
18677207 17232588 29134564 50632375 14496115 59028227 110368139 44027517
TALL080 TALL081 TALL_VC001 TALL_VC003 TALL_VC005 TALL_VC010 TALL_VC011 TALL_VC013
49852029 39005896 23015426 57767116 34416753 20725098 33770006 23086144
TALL_VC014 TALL_VC020 TALL_VC021 TALL_VC022 TALL_VC023 TALL_VC024 TALL_VC025 TALL_VC028
22658885 17152179 7989542 7992537 27377480 19922179 27544197 24572406
TALL_VC050 TALL_VC058 TALL_VC059 TALL_VC062 TALL_VC065 TALL_VC067 TALL_VC068 TALL_VC069
38924366 35155377 24276703 23203659 30098537 13527342 40998764 27662921
TALL_VC070 TALL_VC072 TALL_VC073 TALL_VC075 TALL_VC076 TALL_VC078 TALL_VC080 TALL_VC081
36701352 16129274 30069837 25375765 20900437 25040643 26063947 20015569
TALL_VC082 TALL_VC083 TALL_VC084 TALL_VC086 TALL_VC087 TALL_VC088 TALL_VC090 TALL_VC091
4885073 8292829 27656565 34987607 25828277 33419992 16890930 32962403
TALL_VC092 TALL_VC093 TALL_VC094 TALL_VC095 TALL_VC096 TALL_VC099 TALL_VC100 TALL_VC101
19414351 34597332 29012888 25723041 29418583 29752930 28421256 30689661
TALL_VC102 TALL_VC103 TALL_VC104 TALL_VC105 TALL023 TALL028 TALL027 TALL041
35541147 28443857 28729420 25288812 35183906 37119618 37099201 40949315
TALL059 TALL060 TALL070
39259066 48815589 9645969
>
> # I create au norm.counts2 file to work on it
> norm.counts2 <- norm.counts
>
> # I add a column with genes symbols
> Symbol = row.names(norm.counts2)
> head(Symbol)
[1] "5S_rRNA" "7SK" "A1BG" "A1BG-AS1" "A1CF" "A2M"
> norm.counts2 <- cbind(norm.counts2, Symbol)
> norm.counts2 <- norm.counts2[, colnames(norm.counts2)[c(124, 1:123)]]
>
> # I create the matrix
> group <- factor(colnames(norm.counts2), exclude = "Symbol")
>
>
> design <- model.matrix(~0 + group)
> colnames(design) <- levels(group)
>
>
> # I create the DGEList object, verify librairies sizes and use plotMDS
> norm.counts2 <- norm.counts2[2:124]
>
> dgeObj <- DGEList(norm.counts2)
>
>
> dgeObj$samples$lib.size
[1] 33722027 30093322 34278280 37848244 25315474 23804890 35201963 36456236 38172855
[10] 35779360 32727294 43894216 34418442 36482047 34938025 37990134 38960315 25773150
[19] 30495549 32640739 35919181 43539093 42042277 41543688 30707698 35396923 43606066
[28] 33754135 37889274 36725286 36119062 35467025 29878225 41663520 32872072 57717587
[37] 41764592 36364493 35096191 43286063 36609160 31687412 31257274 28374736 35096045
[46] 40710641 45411284 58606528 63632920 37907728 31373424 48992507 31730288 52096648
[55] 18664742 53593920 18677207 17232588 29134564 50632375 14496115 59028227 110368139
[64] 44027517 49852029 39005896 23015426 57767116 34416753 20725098 33770006 23086144
[73] 22658885 17152179 7989542 7992537 27377480 19922179 27544197 24572406 38924366
[82] 35155377 24276703 23203659 30098537 13527342 40998764 27662921 36701352 16129274
[91] 30069837 25375765 20900437 25040643 26063947 20015569 4885073 8292829 27656565
[100] 34987607 25828277 33419992 16890930 32962403 19414351 34597332 29012888 25723041
[109] 29418583 29752930 28421256 30689661 35541147 28443857 28729420 25288812 35183906
[118] 37119618 37099201 40949315 39259066 48815589 9645969
>
> barplot(dgeObj$samples$lib.size, names = colnames(dgeObj), las = 2)
>
> plotMDS(dgeObj)
>
> # I estimate dispersion
> dgeObj<- estimateCommonDisp(dgeObj)
> dgeObj <- estimateGLMTrendedDisp(dgeObj)
> dgeObj <- estimateTagwiseDisp(dgeObj)
> plotBCV(dgeObj)
> # I use glmQLFit
> fit <- glmQLFit(dgeObj, design, robust = TRUE)
Error in xy.coords(x, y, setLab = FALSE) : 'x' and 'y' lengths differ
I don't know what to do. Thank you for your help. Tell me if you need more information.
Thanks a lot for your answers Gordon and Yunshun,
Indeed, rethinking about it, my code is a little bit stupid. I'll rework on it and read better the EdgeR user guide. I'm taking the opportunity to thank you for this user guide too. It is really complete with lot of case studies. You, Yunshun and your colleagues did a great job.