Unweighted pair group method with arithmetic mean (UPGMA)

Key takeaways:

  • UPGMA (unweighted pair group method with arithmetic mean) is a hierarchical clustering method used in bioinformatics for constructing evolutionary trees by progressively merging clusters based on pairwise distance metrics among sequences.

  • The process involves calculating a pairwise distance matrix, merging the closest clusters, updating the distance matrix using the arithmetic mean, and repeating these steps until all sequences are grouped into a single hierarchical tree.

  • While UPGMA is simple and computationally efficient, it assumes a constant evolutionary rate across lineages (a molecular clock), so it may produce misleading trees for highly divergent datasets or for lineages that evolve at different rates.

UPGMA (unweighted pair group method with arithmetic mean) is a hierarchical clustering method commonly used in bioinformatics, particularly in phylogenetics, for constructing evolutionary trees based on molecular sequence data. It is a bottom-up agglomerative clustering algorithm that builds a tree by progressively merging clusters (groups of sequences) based on their pairwise distances.

How does UPGMA work?

  1. Pairwise distance matrix: The first step in UPGMA involves calculating the pairwise distances between all sequences in the dataset. These distances can be based on various metrics such as genetic distances, sequence similarities, or dissimilarities.

  2. Initialization: Initially, each sequence is considered an individual cluster (leaf) in the tree, and the pairwise distances between them form the initial distance matrix.

  3. Cluster merging: At each iteration, UPGMA identifies the two closest clusters in the distance matrix and merges them into a new cluster. The distance between the new cluster and any other cluster is the arithmetic mean of the pairwise distances between the sequences in the new cluster and the sequences in the other cluster.

  4. Updating distance matrix: After merging, the distance matrix is updated with the distances between the new cluster and the remaining clusters. If clusters X and Y are merged into XY, the distance to any other cluster Z is calculated through the formula:

     $\text{dist}(XY, Z) = \frac{|X| \cdot \text{dist}(X, Z) + |Y| \cdot \text{dist}(Y, Z)}{|X| + |Y|}$

     where |X| and |Y| are the numbers of sequences in X and Y.

  5. Repeat: Steps 3 and 4 are repeated until all sequences are clustered into a single group, forming a complete hierarchical tree structure.

  6. Tree construction: The hierarchical tree structure obtained from the clustering process represents the evolutionary relationships between the sequences. The height of each internal node is half the distance between the two clusters it joins, and the branching pattern reflects how similar the sequences are.
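
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `upgma`, the nested-tuple tree format, and the naming of merged clusters by concatenating labels are all assumptions made for this example. Ties are broken by iteration order, and each new node is placed at half the merge distance.

```python
from itertools import combinations


def upgma(labels, dist):
    """Cluster `labels` with UPGMA; return a nested tuple (left, right, height)."""
    # Each active cluster maps its name to (subtree, number of leaves).
    clusters = {name: (name, 1) for name in labels}
    # Symmetric distances keyed by an unordered pair of cluster names.
    d = {frozenset(pair): value for pair, value in dist.items()}

    while len(clusters) > 1:
        # Step 3: pick the closest pair of active clusters.
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_x, n_x), (tree_y, n_y) = clusters.pop(x), clusters.pop(y)
        merged = x + y  # assumed naming scheme, e.g. "A" + "B" -> "AB"
        # Step 4: size-weighted arithmetic mean to every remaining cluster.
        for z in clusters:
            d[frozenset((merged, z))] = (
                n_x * d[frozenset((x, z))] + n_y * d[frozenset((y, z))]
            ) / (n_x + n_y)
        # The new node sits at half the distance between the merged clusters.
        height = d[frozenset((x, y))] / 2
        clusters[merged] = ((tree_x, tree_y, height), n_x + n_y)

    tree, _ = next(iter(clusters.values()))
    return tree


# Tiny usage example with three hypothetical sequences.
print(upgma(["A", "B", "C"], {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4}))
```

Storing the cluster sizes alongside the subtrees is what lets the update weight each distance by the number of sequences, so every sequence contributes equally to the mean.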

Let’s see an example!

Example of UPGMA

Imagine a research study focused on understanding the evolutionary relationships between different species of birds based on their DNA sequences. Researchers collect DNA samples from various bird species and sequence specific genetic markers so that the sequences can be compared.

After collecting the data, they calculate the pairwise distances between the DNA sequences of all the sampled bird species, and they represent the distances as the following matrix:

Pairwise distance matrix of all distances.
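
The matrix figure is not reproduced in this text. The sketch below rebuilds it from the distances used in the worked calculations of this example (for instance, A to B is 2 and A to C is 4). The D to E value of 4 is inferred from the tie described in the second cycle, so treat the table as a reconstruction rather than the original figure. The names `species`, `upper`, and `distance` are illustrative.

```python
species = ["A", "B", "C", "D", "E", "F"]

# Upper triangle of the pairwise distance matrix (reconstructed values).
upper = {
    ("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 6, ("A", "E"): 6, ("A", "F"): 8,
    ("B", "C"): 4, ("B", "D"): 6, ("B", "E"): 6, ("B", "F"): 8,
    ("C", "D"): 6, ("C", "E"): 6, ("C", "F"): 8,
    ("D", "E"): 4, ("D", "F"): 8,
    ("E", "F"): 8,
}


def distance(x, y):
    """Symmetric lookup with zeros on the diagonal."""
    if x == y:
        return 0
    return upper[(x, y)] if (x, y) in upper else upper[(y, x)]


# Print the full square matrix row by row.
print("  " + " ".join(species))
for row in species:
    print(row, " ".join(str(distance(row, col)) for col in species))
```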

Now they will perform clustering analysis with UPGMA on the matrix. The idea is to progressively cluster the bird species based on their distances: merge the closest species into clusters, update the distance matrix, and ultimately construct a hierarchical tree.

Let’s apply UPGMA clustering on the matrix above!

Step 1: Choose the smallest distance

The first thing we will do is choose the two species with the smallest distance between them, in our case, species A and B, with a distance of 2.

After finding the smallest distance, we will cluster the two species together like so:

Cluster of two species.

Step 2: Update the distance matrix

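Now that the cluster AB is formed, we will calculate its distance to each of the other species (C, D, E, F). Because A and B are single species, the distance from AB to another species X is the average of the distances from A and from B:

$\text{dist}(AB, X) = \frac{\text{dist}(A, X) + \text{dist}(B, X)}{2}$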

Let’s calculate the distance of our cluster with every other species:

  • Distance AB with C: $\text{dist}(AB, C) = \frac{4+4}{2} = \frac{8}{2} = 4$

  • Distance AB with D: $\text{dist}(AB, D) = \frac{6+6}{2} = \frac{12}{2} = 6$

  • Distance AB with E: $\text{dist}(AB, E) = \frac{6+6}{2} = \frac{12}{2} = 6$

  • Distance AB with F: $\text{dist}(AB, F) = \frac{8+8}{2} = \frac{16}{2} = 8$
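
The four averages above can be checked with a few lines of Python. The dictionary names are illustrative; the input distances are the ones used in the calculations above.

```python
# Distances from A and from B to each remaining species (taken from the text).
from_a = {"C": 4, "D": 6, "E": 6, "F": 8}
from_b = {"C": 4, "D": 6, "E": 6, "F": 8}

# A and B are single species, so the cluster distance is the plain mean.
from_ab = {other: (from_a[other] + from_b[other]) / 2 for other in from_a}
print(from_ab)  # {'C': 4.0, 'D': 6.0, 'E': 6.0, 'F': 8.0}
```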

The updated distance matrix will be as follows:

Updated pairwise distance matrix of all distances.

Now that we have an updated matrix, all we have to do is repeat the above steps. Those two steps made up the first cycle, so let’s move on to the second one!

Second cycle

In the updated matrix, two pairs are tied for the minimum distance of 4: D and E, and AB and C. We can either make a cluster of DE or a cluster of ABC. Let’s make one with DE for simplicity:

Cluster of two species.
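
A tie like this one is easy to detect programmatically. The sketch below uses the matrix after the first merge. Values not stated explicitly in the text (such as D to E being 4) are inferred from the worked calculations and the tie itself, and the variable names are illustrative.

```python
# Distance matrix after the first merge (values from the worked example).
updated = {
    ("AB", "C"): 4, ("AB", "D"): 6, ("AB", "E"): 6, ("AB", "F"): 8,
    ("C", "D"): 6, ("C", "E"): 6, ("C", "F"): 8,
    ("D", "E"): 4, ("D", "F"): 8,
    ("E", "F"): 8,
}

# Collect every pair that shares the minimum distance.
smallest = min(updated.values())
tied_pairs = [pair for pair, value in updated.items() if value == smallest]
print(smallest, tied_pairs)  # 4 [('AB', 'C'), ('D', 'E')]
```

How ties are broken is an implementation choice. In this example either order leads to the same final tree, because both tied merges happen at the same distance and involve disjoint clusters.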

Next, let’s calculate the distances from DE to the other clusters:

  • Distance DE with AB: $\text{dist}(DE, AB) = \frac{6+6}{2} = \frac{12}{2} = 6$

  • Distance DE with C: $\text{dist}(DE, C) = \frac{6+6}{2} = \frac{12}{2} = 6$

  • Distance DE with F: $\text{dist}(DE, F) = \frac{8+8}{2} = \frac{16}{2} = 8$

Let’s update the distance matrix with the new distances:

Updated pairwise distance matrix of all distances.

Let’s repeat the process again!

Third cycle

Now, the smallest distance is between AB and C, so let’s create a cluster of ABC:

Cluster of two species.

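We will now calculate the distances from ABC to the other clusters. AB contains two species and C contains one, so each distance is weighted by cluster size:

  • Distance ABC with DE: $\text{dist}(ABC, DE) = \frac{2 \cdot 6 + 1 \cdot 6}{3} = \frac{18}{3} = 6$

  • Distance ABC with F: $\text{dist}(ABC, F) = \frac{2 \cdot 8 + 1 \cdot 8}{3} = \frac{24}{3} = 8$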
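
Equivalently, a UPGMA cluster distance is the mean of all species-to-species distances between the two clusters. A minimal sketch of that view follows. The function name and dictionary are illustrative, and the species-level distances are the ones that appear in the calculations of this example.

```python
from itertools import product
from statistics import mean

# Species-level distances needed here (from the worked example).
pairwise = {
    ("A", "D"): 6, ("A", "E"): 6, ("A", "F"): 8,
    ("B", "D"): 6, ("B", "E"): 6, ("B", "F"): 8,
    ("C", "D"): 6, ("C", "E"): 6, ("C", "F"): 8,
}


def cluster_distance(cluster_x, cluster_y):
    """Mean of all species-to-species distances between two clusters."""
    return mean(pairwise[(x, y)] for x, y in product(cluster_x, cluster_y))


print(cluster_distance("ABC", "DE"))  # averages six pairs
print(cluster_distance("ABC", "F"))   # averages three pairs
```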

Let’s update the distance matrix, too:

Updated pairwise distance matrix of all distances.

Our last cycle is left, so let’s do that!

Fourth cycle

Now, the smallest distance is between ABC and DE, so let’s create a cluster of ABCDE:

Cluster of two species.

We will now calculate the distance from ABCDE to the remaining species, F. ABC contains three species and DE contains two:

  • Distance ABCDE with F: $\text{dist}(ABCDE, F) = \frac{3 \cdot 8 + 2 \cdot 8}{5} = \frac{40}{5} = 8$

Let’s update the distance matrix, too:

Updated pairwise distance matrix of all distances.

Only ABCDE and F remain, so the last merge joins them at a distance of 8. With every species in a single cluster, we can create our final tree!

Create a final hierarchical tree

The final hierarchical tree, or the phylogenetic tree, summarizes how similar the species are to one another. We have created the tree in parts, so let’s merge them all:

Final phylogenetic tree.
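
Since the tree figure is not reproduced here, the sketch below writes the same tree in Newick notation, a common plain-text format for phylogenetic trees. It follows the merge order and distances of the worked example and assumes the usual UPGMA convention that each node sits at half the distance at which its clusters were joined. The tuple layout and the helper name `to_newick` are illustrative.

```python
def to_newick(node, parent_height):
    """Render a (left, right, height) tree as Newick with branch lengths."""
    if isinstance(node, str):  # leaf: branch runs from height 0 to the parent
        return f"{node}:{parent_height:g}"
    left, right, height = node
    inner = f"({to_newick(left, height)},{to_newick(right, height)})"
    return f"{inner}:{parent_height - height:g}"


# Merges from the example: AB at distance 2, DE and ABC at 4,
# ABCDE at 6, and the final merge with F at 8 (heights are halved).
ab = ("A", "B", 1.0)
abc = (ab, "C", 2.0)
de = ("D", "E", 2.0)
abcde = (abc, de, 3.0)
root = (abcde, "F", 4.0)

left, right, root_height = root
newick = f"({to_newick(left, root_height)},{to_newick(right, root_height)});"
print(newick)  # ((((A:1,B:1):1,C:2):1,(D:2,E:2):1):1,F:4);
```

Every path from the root to a leaf sums to 4, which reflects the constant-rate assumption built into UPGMA.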

Conclusion

UPGMA’s simplicity and computational efficiency make it a practical method for analyzing large datasets. However, it assumes a constant evolutionary rate across sequences (a molecular clock), so it may not be suitable for highly divergent datasets or for lineages that evolve at different rates. Through its application in fields such as evolutionary biology, genetics, and ecology, UPGMA continues to contribute to our understanding of evolutionary relationships.

Quiz

Test your knowledge with the quiz below.

1. What is the primary application of UPGMA in bioinformatics?

  A) Data compression

  B) Constructing evolutionary trees based on molecular sequence data

  C) Predicting gene function

  D) Aligning DNA sequences


Frequently asked questions



What is the difference between UPGMA and WPGMA?

UPGMA (unweighted pair group method with arithmetic mean) averages the distances over all pairs of taxa in the two clusters, so every taxon contributes equally regardless of which cluster it came from. (In simple terms, taxa are the items or groups you’re trying to compare and organize. Imagine you’re sorting animals like cats, dogs, and birds based on how similar they are.) WPGMA (weighted pair group method with arithmetic mean) instead takes the simple average of the two merged clusters’ distances and ignores their sizes, so taxa in the smaller cluster carry more weight. The two methods can therefore produce different trees.
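
The two update rules can be compared side by side. In this sketch the function names and the numbers are hypothetical, chosen only so that the two rules give different results.

```python
def upgma_update(size_x, dist_x, size_y, dist_y):
    """UPGMA: mean over all taxa, so larger clusters count for more."""
    return (size_x * dist_x + size_y * dist_y) / (size_x + size_y)


def wpgma_update(dist_x, dist_y):
    """WPGMA: plain mean of the two cluster distances, sizes ignored."""
    return (dist_x + dist_y) / 2


# A 2-taxon cluster at distance 6 and a 1-taxon cluster at distance 9
# from some third cluster (made-up values for illustration).
print(upgma_update(2, 6, 1, 9))  # 7.0
print(wpgma_update(6, 9))        # 7.5
```

When the two merged clusters have the same size, or the same distance to the third cluster, both rules return the same value.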


What is the function of UPGMA?

UPGMA is a clustering algorithm used to construct phylogenetic trees based on genetic distance data. It groups taxa into clusters and builds the tree by progressively merging the closest clusters.


Is UPGMA rooted or unrooted?

UPGMA generates a rooted tree. Because it assumes a constant evolutionary rate, the root is placed at the final merge and all leaves are equidistant from it (the tree is ultrametric).




Copyright ©2025 Educative, Inc. All rights reserved