As the title suggests, in this article I'm going to implement the Transformer architecture from scratch with PyTorch (yes, really from scratch). Before we get into it, let me give a brief overview of the architecture. Transformer was first introduced in the paper titled "Attention Is All You Need" written by Vaswani et al. back in 2017 [1]. This neural network model is designed to perform seq2seq (Sequence-to-Sequence) tasks, where it accepts a sequence as the input and is expected to return another sequence as the output, such as in machine translation and question answering.
Before Transformer was introduced, we typically used RNN-based models like LSTM or GRU to accomplish seq2seq tasks. These models are indeed capable of capturing context, but they do so sequentially. This approach makes it challenging to capture long-range dependencies, especially when the important context lies very far behind the current timestep. In contrast, Transformer can freely attend to any part of the sequence it considers important without being constrained by sequential processing.
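To make that difference concrete, here is a minimal sketch of scaled dot-product attention, the operation at the heart of the architecture we will build properly later on. The tensor sizes and the choice to reuse the raw input as queries, keys, and values (i.e., no learned projections yet) are assumptions made purely for illustration; the point is only that every position computes weights over every other position in a single step, with no recurrence.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy sizes, chosen only for this illustration.
batch_size, seq_len, d_model = 1, 6, 8
x = torch.randn(batch_size, seq_len, d_model)  # a toy input sequence

# For simplicity, reuse the input as queries, keys, and values
# (self-attention without the learned projection layers).
queries, keys, values = x, x, x

# Similarity between every pair of positions: shape (batch, seq_len, seq_len).
scores = torch.matmul(queries, keys.transpose(-2, -1)) / d_model ** 0.5

# Each row is a distribution over all positions in the sequence,
# so position i can draw information from any position j directly.
weights = F.softmax(scores, dim=-1)
output = torch.matmul(weights, values)  # shape (1, 6, 8)

print(weights[0])  # 6x6 attention map: every token attends to every token
```

Notice that the attention weights relate all positions to all others at once, which is exactly why the model is not limited by how far back the relevant context sits, unlike an RNN that must carry information forward step by step.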