Using Multi-Dimensional Scaling (MDS) to produce a vector in order to account for patient bias when constructing DGE lists from RNA-seq datasets in R?
@e506e96b

I am currently working on my PhD and, as part of my thesis, I intend to analyse gene expression within multiple sclerosis lesions using RNA-seq datasets from Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), which I have already selected. Part of my analysis uses R to turn the count matrices provided on GEO into edgeR DGEList objects (edgeR's container for digital gene expression data) and, from those, lists of differentially expressed genes (which I refer to below as DGE lists). To construct my own code I have been using both the edgeR User's Guide and an R script provided by one of the groups whose dataset I am including in my analysis, as I have had to teach myself R programming from scratch.

My current issue is as follows. In the script provided by the lab group, they appear to have calculated a "PCA vector", which they use when creating their design matrix. As far as I can tell, PCA (Principal Component Analysis) reduces the dimensionality of a dataset: it re-expresses the data along a small number of axes that capture the most variation, so that the samples can be plotted on two axes and the variability between them seen visually. When I contacted the group, they said that they used PCA to find the biggest outliers and to identify the largest patient bias, which they then normalised for.

The group has also provided their data, which can be opened in R. When I use their "PCA vector" in the construction of my design matrix, I get the same values as the group. However, none of my attempts to produce this PCA vector myself, by performing PCA with R packages, give values that are in any way close to those in the group's DGE lists (also provided).

My question therefore is this: Does anybody have an idea as to how one could create a "PCA vector" - a value of variation between patients - which could then be incorporated into a design matrix in order to account for patient variation? I feel I should mention - patient age, sex and MS lesion type are already accounted for as they are also used in the construction of the design matrix. Would anyone have any experience with this?
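To make the idea concrete, here is a minimal sketch in base R of one way a per-sample "PCA vector" could be derived and placed in a design matrix. It uses simulated log-expression values, not the group's data, and every name in it (`logexpr`, `patient`, `lesion`, `pc1`) is a hypothetical placeholder. It does not attempt to reproduce the group's vector. Note that principal component scores depend on the transformation applied to the counts, on gene filtering, and on centring and scaling, and that their sign is arbitrary, so scores from a fresh PCA need not match another analysis numerically. The sketch also shows that classical MDS on Euclidean distances gives the same coordinates as PCA, up to sign.

```r
set.seed(1)

# Simulated log-expression matrix: 200 genes x 12 samples (hypothetical data)
n_genes   <- 200
n_samples <- 12
patient <- factor(rep(paste0("P", 1:4), each = 3))              # 4 patients, 3 samples each
lesion  <- factor(rep(c("typeA", "typeB", "typeC"), times = 4)) # placeholder lesion types

# Add a per-patient offset so that patient identity drives the main axis of variation
patient_shift <- rnorm(4, sd = 2)[as.integer(patient)]
gene_loading  <- rnorm(n_genes)
logexpr <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples) +
  outer(gene_loading, patient_shift)
colnames(logexpr) <- paste0("S", seq_len(n_samples))

# PCA across samples: prcomp() expects samples in rows, so transpose
pca <- prcomp(t(logexpr), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, 1]   # one score per sample: a candidate "PCA vector"

# Classical MDS on Euclidean distances between the same samples
mds <- cmdscale(dist(t(logexpr)), k = 2)

# The first MDS coordinate matches PC1 up to an arbitrary sign flip
same_up_to_sign <- isTRUE(all.equal(abs(unname(pc1)),
                                    abs(unname(mds[, 1])),
                                    tolerance = 1e-6))

# Design matrix with the known covariate plus PC1 as a continuous covariate
design <- model.matrix(~ lesion + pc1)
```

With real data, `logexpr` would be replaced by suitably normalised log-expression values, and age and sex would be added to the formula alongside the lesion type.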

I ask that you please be gentle/patient with me - I am still very new to the world of bioinformatics (and eager to learn!) and I apologise if I have not explained my issue very well. I will do my very utmost to clear up any confusion, and I would be incredibly grateful to anyone who can help!

For those interested, the link below is to the journal article, which includes links to the GEO repository hosting their DGE lists and count matrix, as well as to their GitHub repository containing their R script and data.

https://actaneurocomms.biomedcentral.com/articles/10.1186/s40478-019-0855-7#availability-of-data-and-materials

PrincipalComponent StatisticalMethod edgeR MultidimensionalScaling Normalization
@james-w-macdonald-5106

Use either sva or RUVSeq instead. Both packages estimate hidden factors of unwanted variation that you can add to the design matrix, and both have vignettes that you can peruse.
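As a rough illustration of the sva route, the sketch below builds a full model and a null model, estimates one surrogate variable with `svaseq()`, and appends it to the design matrix. The counts and sample annotation are simulated, the column names (`lesion`, `sex`, `age`) are hypothetical, and fixing `n.sv = 1` is an assumption made to keep the example small; the vignette describes how to let the package estimate the number. The call is guarded so that the script still runs if sva is not installed.

```r
set.seed(42)

# Hypothetical sample annotation for 12 samples
n_samples <- 12
pheno <- data.frame(
  lesion = factor(rep(c("typeA", "typeB", "typeC"), times = 4)),
  sex    = factor(rep(c("F", "M"), each = 6)),
  age    = round(runif(n_samples, 40, 70))
)

# Simulated counts: 500 genes x 12 samples with an unrecorded (hidden) factor
n_genes <- 500
hidden  <- rep(c(0, 1), times = 6)
base_mu <- runif(n_genes, 50, 500)
mu      <- base_mu * exp(outer(rnorm(n_genes, sd = 0.5), hidden))
counts  <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 10),
                  n_genes, n_samples)

# Full model: variable of interest plus known covariates
mod  <- model.matrix(~ lesion + sex + age, data = pheno)
# Null model: known covariates only (no variable of interest)
mod0 <- model.matrix(~ sex + age, data = pheno)

if (requireNamespace("sva", quietly = TRUE)) {
  # Estimate one surrogate variable from the count matrix (n.sv = 1 is an assumption)
  fit <- sva::svaseq(counts, mod, mod0, n.sv = 1)
  sv  <- as.matrix(fit$sv)
  colnames(sv) <- paste0("SV", seq_len(ncol(sv)))
  design <- cbind(mod, sv)   # design matrix including the surrogate variable
} else {
  design <- mod              # fallback: known covariates only
}
```

The resulting `design` can then be used in the usual edgeR steps in place of a design matrix built from the known covariates alone.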


Thank you James, I will take a look into this!
