I present an overview of the ‘transformer’ neural network architecture and how it interacts with text and image modalities.