Question: Using Human Protein Atlas annotations with pRoloc to interpret subcellular location
18 months ago, moldach (Canada/Montreal/Douglas Mental Health Institute) wrote:

As I understand it, the pRoloc package allows one to study the localisation of proteins inside cells, using relative quantitation of known organelle residents, termed organelle markers.

The vignette uses the tan2009r1 data, with markers that have been obtained by mining the pRolocdata datasets and by curation by various members of the Cambridge Centre for Proteomics:

table(pRolocmarkers("dmel"))
##
##  Cytoskeleton            ER         Golgi      Lysosome       Nucleus
##             7            24             7             8            21
##            PM    Peroxisome    Proteasome  Ribosome 40S  Ribosome 60S
##            25             4            14            22            32
## mitochondrion
##            15

This shows markers for only a subset of organelles: Cytoskeleton, ER, Golgi, Lysosome, Nucleus, PM (Plasma Membrane, I assume), Peroxisome, Proteasome (I am not sure what this is), Ribosome 40S, Ribosome 60S, and Mitochondrion. It misses many other sub-cellular compartments such as Cytosol, Actin Filaments, Vesicles, etc.

Are these markers proteins that are specific to the organelle of interest, meaning that they are not found in other subcellular compartments (i.e. they are not multi-localizing proteins)?

What I would like to do is use information from the Human Protein Atlas with pRoloc's addMarkers() function. This way, instead of 11 organelles, I would have data on more subcellular compartments. The subcellular location data from their website includes many proteins which multi-localize. Can these data be used as marker data instead? Or can you only use pRolocmarkers (Homo sapiens only has 872, for example) with their UniProt protein identifiers?

Answer: Using Human Protein Atlas annotations with pRoloc to interpret subcellular location
18 months ago, Laurent Gatto (Belgium) wrote:

Markers are defined as proteins that localise to a sub-cellular niche with high confidence. They are subsequently used to infer the localisation of non-marker proteins (of unknown localisation), hence the need for high confidence for all or most markers. They are often defined based on a set of organelles of interest; in the case of whole proteome maps, one would want to document the sub-cellular diversity as much as possible. But it isn't always possible to reliably identify enough markers for all (or the majority of) known organelles. As a result, it is important to set thresholds after classification, so that assignment to this limited (but trustworthy) number of classes is restricted to high-confidence results (this is described in the main pRoloc vignette).

Choosing markers can be done in multiple ways: expert curation, or using GO and other resources, such as HPA or the markers we provide via pRolocmarkers, as a guide. The important point is to validate these marker proteins. A cellular compartment documented in a generic resource might not reflect the localisation of that protein in your data (i.e. your cell type under specific conditions). Validation is generally done by visualising the data (on a PCA plot for example - see plot2D) with highlighted markers, to make sure that they define credible clusters. If a set of markers is scattered around, it might be because they actually aren't good markers, because that sub-cellular niche isn't resolved in the data, or because the data are noisy (in that case, all organelle clusters will suffer from the lack of resolution).
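To make the validation step concrete, here is a minimal sketch in base R. It does not use pRoloc; the three classes (A, B, C), the six fractions and the simulated profiles are all made up for illustration. With real data, plot2D on an MSnSet that carries a markers column would play the role of the prcomp/plot calls.

```r
set.seed(1)

# Simulated quantitation profiles (assumption: 3 hypothetical marker classes,
# 10 proteins each, measured over 6 fractions).
centres <- rbind(A = c(5, 3, 1, 1, 1, 1),
                 B = c(1, 1, 5, 3, 1, 1),
                 C = c(1, 1, 1, 1, 3, 5))
cls <- rep(rownames(centres), each = 10)
profiles <- centres[cls, ] + matrix(rnorm(30 * 6, sd = 0.3), nrow = 30)
rownames(profiles) <- paste0("prot", seq_len(30))

# Project the profiles onto the first two principal components.
pca <- prcomp(profiles, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Colour each candidate marker by its class; credible markers should form
# tight, well separated groups rather than being scattered around.
plot(scores, col = as.integer(factor(cls)), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Candidate markers on a PCA plot")
legend("topright", legend = levels(factor(cls)), col = 1:3, pch = 19)
```

If one class is spread all over such a plot while the others are compact, that class is the one to curate further or drop.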

The paper "A foundation for reliable spatial proteomics data analysis" describes this in more detail.

Regarding the use of multi-localised markers, there is no reason not to define any, as long as you can find enough that share the same multi-localisation pattern. Subsequently, any proteins assigned to that class would be a candidate localising to that set of organelles.

PS: From Wikipedia: Proteasomes are protein complexes which degrade unneeded or damaged proteins by proteolysis, a chemical reaction that breaks peptide bonds. Enzymes that help such reactions are called proteases.

Thank you for your prompt and detailed reply Laurent.

Thanks for clearing up that markers can be selected in a number of ways (e.g. pRolocmarkers, GO CC [with additional curation], HPA, etc.).

There are 12 high-confidence marker categories in pRolocmarkers:

> table(pRolocmarkers("hsap"))

         40S Ribosome          60S Ribosome    actin cytoskeleton               Cytosol
                   33                    46                    45                    77
Endoplasmic Reticulum       Golgi Apparatus              Lysosome          Mitochondria
                   92                    34                    42                   134
              Nucleus            PEROXISOME            plasma mem            Proteasome
                  116                    27                   190                    36

I want to use the Human Protein Atlas (HPA) instead because it has a finer scale of information for the subcellular location. For example, instead of just Nucleus like pRolocmarkers, the HPA lists sub-compartments:

1. Nuclear membrane
   1. Nuclear membrane
2. Nucleoli
   1. Nucleoli
   2. Nucleoli fibrillar center
3. Nucleoplasm
   1. Nuclear bodies
   2. Nuclear speckles
   3. Nucleoplasm
   4. Nucleus

We designed a "Chromatome" assay and want to assess the efficacy of this technique, i.e. how successfully it enriches for chromatin-bound proteins in the nucleus. Would the number of marker categories have an effect on the pRoloc algorithm, in terms of how well it separates categories? pRolocmarkers has 12 while HPA has 34. Your manuscript says "failure to extract organelle markers that cover the whole subcellular diversity in the data; this leads to prediction errors, as protein profiles of unknown localization can only be associated with organelles that appear in the labeled training data", so shouldn't more organelle markers be better?

HPA has four categories of "reliability"/confidence for where proteins localize:
1) Validated: i) genetic methods using siRNA silencing or CRISPR/Cas9 knockout, ii) expression of a fluorescent protein-tagged protein at endogenous levels, iii) independent antibodies targeting different epitopes.
2) Supported: Agreement with external experimental data from the Uniprot database
3) Approved: Lack of external experimental information (i.e. only found by HPA method: integrating transcriptomics data and antibody-based image profiling approach)
4) Uncertain: HPA showed contradictory results compared to complementary information about the protein location.

So in theory, if I were to take only those proteins with high-reliability (e.g. Validated or Supported) which did not multi-localize and use those as markers I should get good classification for which parts of the nucleus our proteins are localizing in?
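A minimal sketch of that filtering step in base R. The data frame below is a toy stand-in: the column names (Gene, Reliability, Main, Additional), the protein identifiers and the ";" separator for multiple locations are assumptions for illustration, not the layout of the actual HPA download.

```r
# Toy stand-in for an HPA subcellular location table.
# Column names, identifiers and rows are illustrative assumptions.
hpa <- data.frame(
  Gene        = c("P1", "P2", "P3", "P4", "P5", "P6"),
  Reliability = c("Validated", "Supported", "Approved",
                  "Uncertain", "Validated", "Supported"),
  Main        = c("Nucleoli", "Nuclear speckles", "Nucleoplasm",
                  "Nuclear membrane", "Nucleoplasm;Cytosol", "Nuclear bodies"),
  Additional  = c("", "", "", "", "", "Vesicles"),
  stringsAsFactors = FALSE
)

# Keep only the two highest reliability levels.
reliable <- hpa$Reliability %in% c("Validated", "Supported")

# Single-localised: exactly one main location and no additional location.
single <- !grepl(";", hpa$Main, fixed = TRUE) & hpa$Additional == ""

keep <- hpa[reliable & single, ]

# Named character vector: names are protein identifiers, values are classes.
hpa_markers <- setNames(keep$Main, keep$Gene)
print(hpa_markers)
print(table(hpa_markers))
```

As far as I can tell from its documentation, addMarkers() accepts a named character vector of this shape, provided the names match the feature names of the MSnSet (or the feature variable given via fcol). The per-class counts from table() show quickly which sub-nuclear classes have too few markers to be worth keeping.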

Another caveat is how many of these markers are used. There are 116 high-confidence nucleus markers in pRolocmarkers. I haven't yet checked how many are in the HPA data, but you do mention in the manuscript "an inevitable trade-off...increasing the number of markers to better characterize the multivariate data." I'm guessing the best way to find out is in a heuristic fashion (like manual selection of the perplexity hyperparameter for t-SNE clustering), or would it be a faux pas, like trying out different statistical methods until you get the best result (p-hacking)?

P.S. I really should have searched Wikipedia for proteasome before asking here, sorry! After a bit more reading I now know the differences between proteasome/lysosome/endosome/phagosome/aggresome - thanks!


My decision on whether I should use multi-localizing markers for my biological question:
From Gatto et al., 2014 "Although proteins with genuine multiple localizations are of particular interest (see below), one must be careful when assessing multiple GO CC terms and distinguish proteins present in more than one subcellular niche (multilocalization) from changes in localization under different conditions and incorrect annotation." According to the HPA, half of all proteins localize to multiple locations. This reflects spatial restriction and ordering of timing of molecular function in one compartment; some proteins may have context specific function in different parts of the cell (moonlighting).

With this information in hand I think, for my application, that I should only use high-confidence single localization markers.

• Re when you say "I want to use the Human Protein Atlas (HPA) instead because it has a finer scale of information for the subcellular location."
This is of course a good approach, but whether this can be done also depends on the resolution in your data. If there aren't enough sub-nuclear markers or they don't form clusters, over-annotation will end up being counterproductive.

• Re the number of organelles and the risk of prediction errors: The number of classes won't affect the results from an algorithmic point of view. You just need to appreciate that the classifier will only be able to assign proteins to annotated classes, and it is wrong to presume that all sub-cellular niches are annotated. That's also why setting thresholds on the final classification results is important - proteins with low classification scores might not belong to any of the annotated classes in the first place.
If you focus on a set of classes of interest, and limit the annotation for others, that's fine.
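A minimal base R sketch of such a threshold. The proteins, classes, scores and the 0.75 cut-off are invented for illustration; in practice the cut-off has to be chosen from your own data, possibly per class.

```r
# Toy classification results (assumption: one predicted class and one
# classification score per protein, as a classifier would return).
pred <- data.frame(
  protein = paste0("prot", 1:6),
  class   = c("Nucleoli", "Nucleoli", "Nucleoplasm",
              "Nucleoplasm", "Nucleoli", "Nucleoplasm"),
  score   = c(0.95, 0.40, 0.88, 0.55, 0.70, 0.91),
  stringsAsFactors = FALSE
)

# Hypothetical global cut-off on the classification score.
threshold <- 0.75

# Proteins below the cut-off are left unassigned instead of being forced
# into one of the annotated classes.
pred$final <- ifelse(pred$score >= threshold, pred$class, "unknown")

print(pred)
print(table(pred$final))
```

In pRoloc itself, getPredictions() takes a threshold argument t that serves the same purpose; the main vignette describes how to set it.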

• Re using HPA, I would suggest starting with validated/supported (and focusing on single localisation, as you mention in another comment). Still, it is important to consider/curate these in the light of your data. This is generally done by visualising your markers on a PCA/t-SNE plot and convincing yourself (and collaborators) that they are valid. The trade-off is between more but ambiguous markers and fewer high-confidence markers.

Hope this helps.