Unweighted pair group method with arithmetic mean (UPGMA)

Key takeaways:
UPGMA (unweighted pair group method with arithmetic mean) is a hierarchical clustering method used in bioinformatics for constructing evolutionary trees by progressively merging clusters based on pairwise distance metrics among sequences.
The process involves calculating a pairwise distance matrix, merging the closest clusters, updating the distance matrix using the arithmetic mean, and repeating these steps until all sequences are grouped into a single hierarchical tree.
While UPGMA is simple and computationally efficient, it assumes a constant evolutionary rate across lineages, so it may give misleading trees for highly divergent datasets or when rates differ between lineages.

UPGMA (unweighted pair group method with arithmetic mean) is a hierarchical clustering method commonly used in bioinformatics, particularly in phylogenetics, for constructing evolutionary trees based on molecular sequence data. It is a bottom-up agglomerative clustering algorithm that builds a tree by progressively merging clusters (groups of sequences) based on their pairwise distances.

How does UPGMA work?

Pairwise distance matrix: The first step in UPGMA involves calculating the pairwise distances between all sequences in the dataset. These distances can be based on various dissimilarity measures, such as genetic distances; similarity scores must first be converted into dissimilarities.
Initialization: Initially, each sequence is considered an individual cluster (leaf) in the tree, and the pairwise distances between them form the initial distance matrix.
Cluster merging: At each iteration, UPGMA identifies the two closest clusters based on the pairwise distance matrix and merges them into a new cluster. The distance between the new cluster and any other cluster is the arithmetic mean of all pairwise distances between the sequences in the new cluster and the sequences in the other cluster.
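Updating distance matrix: After merging clusters, the distance matrix is updated to reflect the new distances between the merged cluster and the remaining clusters. If clusters A and B are merged into AB, the new distance to any other cluster X is the size-weighted mean of the old distances: $dist(AB,X) = \frac{|A| \cdot dist(A,X) + |B| \cdot dist(B,X)}{|A| + |B|}$, where $|A|$ and $|B|$ are the numbers of sequences in A and B.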

Repeat: The cluster merging and distance matrix update steps are repeated until all sequences are clustered into a single group, forming a complete hierarchical tree structure.
Tree construction: The hierarchical tree structure obtained from the clustering process represents the estimated evolutionary relationships between the sequences. The height of each internal node is half the distance between the two clusters it joins, so that the leaves on both sides are equally far from the node, and the branching pattern reflects how similar or dissimilar the sequences are.
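
The steps above can be sketched in Python. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the dictionary-based distance matrix, and the four-taxon toy distances (P, Q, R, S) are assumptions made for this sketch and are not the bird data used in the example that follows.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` with UPGMA.

    `dist` maps a pair (a, b) of labels to their distance.
    Returns (root, height): `root` is the tree as nested tuples, and
    `height` maps every leaf and internal node to its height.
    """
    size = {name: 1 for name in labels}        # active clusters and their sizes
    height = {name: 0.0 for name in labels}    # leaves sit at height 0
    d = {frozenset(pair): value for pair, value in dist.items()}

    while len(size) > 1:
        # Cluster merging: pick the closest pair of active clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Assumed convention: node height is half the merge distance.
        height[merged] = d[frozenset((a, b))] / 2
        n_a, n_b = size.pop(a), size.pop(b)

        # Distance matrix update: size-weighted arithmetic mean.
        for x in size:
            d[frozenset((merged, x))] = (
                n_a * d[frozenset((a, x))] + n_b * d[frozenset((b, x))]
            ) / (n_a + n_b)
        size[merged] = n_a + n_b

    (root,) = size
    return root, height


# Hypothetical toy distances, chosen only to exercise the function.
toy = {
    ("P", "Q"): 2, ("P", "R"): 4, ("Q", "R"): 4,
    ("P", "S"): 6, ("Q", "S"): 6, ("R", "S"): 6,
}
tree, heights = upgma(["P", "Q", "R", "S"], toy)
print(tree)
print(heights[tree])
```

Using the nested tuple itself as the cluster key keeps the sketch short: the key of the last remaining cluster is the finished tree. Ties between equally close pairs are broken by whichever pair `min` encounters first, which is an arbitrary choice.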

Let’s see an example!

Example of UPGMA

Imagine a research study focused on understanding the evolutionary relationships between different species of birds based on their DNA sequences. Researchers collect DNA samples from various bird species and sequence specific genetic markers in order to compare them across species.

After collecting the data, they calculate the pairwise distances between the DNA sequences of all the sampled bird species, and they represent the distances as the following matrix:

Now they will perform clustering analysis with UPGMA on the matrix. The idea is to progressively cluster the bird species based on their distances: merge the closest species into clusters, update the distance matrix, and ultimately construct a hierarchical tree structure.

Let’s apply UPGMA clustering on the matrix above!

Step 1: Choose the smallest distance

The first thing we will do is choose the two species with the smallest distance between them, in our case, species A and B, with a distance of 2.

After finding the smallest distance, we will cluster the two species together like so:

Let’s calculate the distance of our cluster with every other species:

Distance AB with C: $dist(AB,C) = \frac{4+4}{2} = 8/2 = 4$
Distance AB with D: $dist(AB,D) = \frac{6+6}{2} = 12/2 = 6$
Distance AB with E: $dist(AB,E) = \frac{6+6}{2} = 12/2 = 6$
Distance AB with F: $dist(AB,F) = \frac{8+8}{2} = 16/2 = 8$
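
The four averages above can be checked with a few lines of Python. The sketch uses only the distances quoted in this step; the variable names are assumptions made for illustration.

```python
# Distances from A and from B to each remaining species, as quoted above.
d_a = {"C": 4, "D": 6, "E": 6, "F": 8}
d_b = {"C": 4, "D": 6, "E": 6, "F": 8}

# A and B are single species, so the new cluster AB weights them equally.
d_ab = {species: (d_a[species] + d_b[species]) / 2 for species in d_a}
print(d_ab)  # {'C': 4.0, 'D': 6.0, 'E': 6.0, 'F': 8.0}
```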

The updated distance matrix will be as follows:

Conclusion

UPGMA’s simplicity and computational efficiency make it a valuable method for analyzing large datasets. However, it’s important to note that UPGMA assumes a constant evolutionary rate across lineages, so it may give misleading trees for highly divergent datasets or when rates differ between lineages. It remains widely used in fields such as evolutionary biology, genetics, and ecology.
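
One way to probe the constant-rate assumption is to test whether a distance matrix is ultrametric: in every triple of taxa, the two largest of the three pairwise distances should be equal. The sketch below is a minimal check under that definition; the function name `is_ultrametric`, the tolerance, and both toy matrices are hypothetical.

```python
from itertools import combinations


def is_ultrametric(labels, d, tol=1e-9):
    """Return True if every triple satisfies the three-point condition."""
    for i, j, k in combinations(labels, 3):
        low, mid, high = sorted(
            (d[frozenset((i, j))], d[frozenset((i, k))], d[frozenset((j, k))])
        )
        if high - mid > tol:  # the two largest distances must match
            return False
    return True


# Hypothetical toy matrices: the first is clock-like, the second is not.
clock_like = {frozenset("PQ"): 2, frozenset("PR"): 4, frozenset("QR"): 4}
not_clock = {frozenset("PQ"): 2, frozenset("PR"): 4, frozenset("QR"): 5}

ok = is_ultrametric("PQR", clock_like)
bad = is_ultrametric("PQR", not_clock)
print(ok, bad)
```

When the check fails by a wide margin, the tree UPGMA returns should be read with caution, since the algorithm forces an ultrametric tree onto the data.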

Frequently asked questions

What is the difference between UPGMA and WPGMA?

UPGMA (unweighted pair group method with arithmetic mean) and WPGMA (weighted pair group method with arithmetic mean) differ in how they update distances after two clusters are merged. In simple terms, taxa are the items or groups you’re trying to compare and organize, such as cats, dogs, and birds sorted by how similar they are. UPGMA averages the two clusters’ distances to every other cluster in proportion to cluster size, so every taxon contributes equally. WPGMA takes a plain average of the two clusters’ distances regardless of their sizes, so taxa in the smaller cluster carry more weight, which can affect the tree structure.
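
The two update rules can be compared side by side in a short Python sketch. The function names and the numbers are hypothetical and chosen only to show that the rules diverge when the merged clusters differ in size.

```python
def upgma_update(d_ax, d_bx, n_a, n_b):
    """Size-weighted mean: every taxon in A and B counts equally."""
    return (n_a * d_ax + n_b * d_bx) / (n_a + n_b)


def wpgma_update(d_ax, d_bx):
    """Plain mean of the two cluster distances, ignoring cluster sizes."""
    return (d_ax + d_bx) / 2


# Hypothetical case: cluster A holds 3 taxa at distance 10 from X,
# cluster B holds 1 taxon at distance 6 from X.
u = upgma_update(10, 6, n_a=3, n_b=1)
w = wpgma_update(10, 6)
print(u, w)  # 9.0 8.0
```

When both clusters have the same size, the two rules give the same value.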

What is the function of UPGMA?

UPGMA is a clustering algorithm used to construct phylogenetic trees based on genetic distance data. It groups taxa into clusters and builds the tree by progressively merging the closest clusters.

Is UPGMA rooted or unrooted?

UPGMA generates a rooted tree. Because it assumes a constant evolutionary rate, the result is ultrametric: the root sits at the final merge, and every leaf is the same distance from it.
