On this page, we will implement "Attention Is All You Need" from scratch. This is a simple implementation and is not optimized for speed; the goal is to understand the architecture of the transformer. If you have a basic understanding of linear algebra and Python, you should be able to follow along.
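
To fix ideas before breaking the architecture into sections, here is a minimal NumPy sketch of scaled dot-product attention, the core operation everything else is built around. The function and variable names are my own and this is not the final implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (seq_len, d_k) matrices; mask (optional): True where attention is allowed
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query with every key
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # positions where mask is False are blocked
    weights = softmax(scores, axis=-1)         # each row is a probability distribution
    return weights @ V                         # weighted average of the values

# toy usage: self-attention over 4 random token embeddings of size 8
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```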

Subcomponents of the transformer that need to be coded out in separate sections:

Architecture

Training

  • [[notes/optimizers||Optimizers]]
    • Why does Adam take into account both the expected value and the root mean square of the gradient? (see the sketch after this list)
  • Loss functions
  • Regularization
  • Weight decay
  • Learning rate schedulers
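
To make the Adam bullet concrete: Adam keeps an exponential moving average of the gradient (its expected value) and of the squared gradient, and divides the update by the root of the latter. Here is a rough NumPy sketch, with names and a toy loss that are mine rather than the final code:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # m: running mean of the gradient (first moment, ~ expected value)
    # v: running mean of the squared gradient (second moment, ~ mean square)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero-initialised moments
    v_hat = v / (1 - beta2 ** t)
    # dividing by the root mean square shrinks steps along noisy, large-gradient directions
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# toy usage: minimise (w - 3)^2 starting from w = 0
w, m, v = np.array(0.0), np.array(0.0), np.array(0.0)
for t in range(1, 101):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.1)
print(w)  # close to 3.0
```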

Decoding

  • Beam search
  • Greedy search (sketched after this list)
  • Temperature sampling (sketched after this list)
  • Evaluation metrics
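
As a teaser for the greedy and temperature bullets above, here is a rough sketch of a single decoding step given a vector of next-token logits. The function names and the toy logits are made-up illustrations:

```python
import numpy as np

def greedy_decode_step(logits):
    # always pick the single most likely token
    return int(np.argmax(logits))

def temperature_sample_step(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    scaled = logits / temperature          # < 1 sharpens the distribution, > 1 flattens it
    scaled = scaled - scaled.max()         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])   # toy next-token scores
print(greedy_decode_step(logits))          # always 0
print(temperature_sample_step(logits, temperature=0.7))  # usually 0, sometimes 1 or 2
```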

I am speed. - Lightning McQueen

Deep Learning go brrrrrrr

  • Gradient Checkpointing
  • Mixed Precision Training (sketched after this list)
  • Distributed Training
    • Model Parallelism
    • Data Parallelism
  • Quantization
  • Pruning
  • Knowledge Distillation
  • Transfer Learning
  • Fine Tuning
  • Pretraining
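
For the mixed precision bullet, this is roughly what one training step looks like with PyTorch's torch.cuda.amp (it needs a CUDA GPU). The model, optimizer, and loss below are placeholder assumptions, not the transformer we will build:

```python
import torch
from torch import nn
from torch.cuda.amp import GradScaler, autocast

model = nn.Linear(512, 512).cuda()                          # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.MSELoss()
scaler = GradScaler()  # rescales the loss so fp16 gradients don't underflow to zero

def train_step(inputs, targets):
    optimizer.zero_grad()
    with autocast():                   # run the forward pass in fp16 where it is safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()      # backprop through the scaled loss
    scaler.step(optimizer)             # unscales the gradients, then updates the weights
    scaler.update()                    # adjust the scale factor for the next step
    return loss.item()
```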

There are several topics where you need a basic intuition to understand why the model learns at all. Here are the topics I will be covering that explain why this blob of code learns: