Hardware Constraints for Transformer Models

Transformer models could not exist without optimized hardware. Memory and disk management remain critical design components, but computing power is just as much a prerequisite. It would be nearly impossible to train the original Transformer described earlier in this course without GPUs. GPUs are at the center of the battle for efficient transformer models.

This appendix lesson goes over the importance of GPUs in three steps:

  • The architecture and scale of transformers.

  • CPUs vs. GPUs.

  • Implementing GPUs in PyTorch as an example of how other optimized frameworks use them (a brief sketch follows this list).
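
As a preview of the third step, the following is a minimal sketch of how PyTorch selects and uses a GPU. The tensor sizes are illustrative assumptions, not code from this course.

```python
import torch

# Pick a GPU if one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")

# Any tensor created on (or moved to) the selected device is processed there.
x = torch.randn(3, 512, device=device)
y = x @ x.transpose(-2, -1)  # matrix multiplication, the workload GPUs excel at
print(y.shape, y.device)
```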

The architecture and scale of transformers

A hint about hardware-driven design appeared in Chapter 3 of this course:

"However, we would only get one point of view at a time by analyzing the sequence with one dmodeld_{model} block. Furthermore, it would take quite some calculation time to find other perspectives."

A better way is to divide the $d_{model} = 512$ dimensions of each word $x_n$ of $x$ (all the words of a sequence) into eight $d_k = 64$ dimensions.

We can then run the eight “heads” in parallel to speed up training and obtain eight different representation subspaces of how each word relates to another.
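
To make the split concrete, here is a minimal PyTorch sketch, assuming a single example sequence of 10 tokens; the tensor names and the toy attention scores are illustrative, not the course's implementation.

```python
import torch

d_model, n_heads = 512, 8
d_k = d_model // n_heads           # 512 / 8 = 64 dimensions per head

seq_len = 10                       # an example sequence of 10 tokens
x = torch.randn(seq_len, d_model)  # one d_model vector per token

# Split each 512-dimensional vector into 8 heads of 64 dimensions
# and move the head axis to the front: (n_heads, seq_len, d_k).
heads = x.view(seq_len, n_heads, d_k).permute(1, 0, 2)
print(heads.shape)   # torch.Size([8, 10, 64])

# The 8 subspaces can now be processed in parallel, e.g. a
# scaled dot-product score matrix per head:
scores = heads @ heads.transpose(-2, -1) / d_k ** 0.5
print(scores.shape)  # torch.Size([8, 10, 10])
```

On a GPU, the eight head computations are batched into a single tensor operation rather than looped over one by one, which is exactly the kind of parallel workload the CPUs vs. GPUs comparison in the next step focuses on.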
