Line-By-Line, Let’s Reproduce GPT-2: Part 2 — Hardware Optimization | by Matthew Gunton | Jul 2024

Using PyTorch, we don’t have to change our code dramatically to use the new data type. The documentation advises applying autocast only to the forward pass of your model and the loss calculation. As our code does both of these in one line, we can modify our loop as below:

for i in range(50):
    t0 = time.time()
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=torch.bfloat16): # bf16 change
        logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()
    t1 = time.time()
    dt = (t1 - t0) * 1000
    print(f"loss {loss.item()}, step {i}, dt {dt:.2f}ms")
    loss_arr.append(loss.item())
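As an aside, raw milliseconds per step depend on your batch size and sequence length, so a throughput number is often easier to compare across hardware. The lines below are a small sketch of my own (not part of the original loop) that you could add after the dt calculation:

tokens_processed = x.shape[0] * x.shape[1]  # batch size * sequence length
tokens_per_sec = tokens_processed / (t1 - t0)
print(f"throughput: {tokens_per_sec:.0f} tokens/sec")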

Just like that, our code is now running in BF16.

Running on our A100, we now see that the average step takes about 330 ms! We’ve already reduced our runtime by about 70%, and we’re just getting started!

Image by Author — A100 training after the data type optimizations

We can further improve our training time by using PyTorch’s compile feature. This gives us fairly large performance gains with barely any change to our code.

To come at it from a high level, every computer program is ultimately executed as binary. Because most people find it difficult to code in binary, we have created higher-level languages that let us write code in forms that are easier to think in. When we compile these languages, they are transformed back into the binary that we actually run. Sometimes during this translation we can figure out faster ways to do the same calculation, such as reusing a certain variable or simply eliminating one altogether.

# ...
model = GPT(GPTConfig(vocab_size=50304))
model.to(device)
model = torch.compile(model) # new line here
# ...

This brings us to machine learning and PyTorch. Python is a high-level language, yet we are still doing computationally intense calculations with it. When we use torch.compile, we spend extra time compiling our code up front, but our runtime (the training loop, in our case) goes a lot faster thanks to the optimizations that extra work uncovers.

Karpathy gives the following example of how PyTorch can improve the calculations. Our GELU activation function can be written out as below:

class TanhGELU(nn.Module):
    def forward(self, input):
        return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))

For each operation you see in the above function, we have to dispatch a kernel on the GPU. That means when we start off by taking input to the third power, we pull input from high-bandwidth memory (HBM) into the GPU cores and do our calculation. We then write the result back to HBM before we start the next calculation and begin the whole process over again. Naturally, this sequencing causes us to spend a lot of time waiting for memory transfers to take place.

torch.compile can spot an inefficiency like this and be more careful about when it spins up new kernels, resulting in dramatic speed-ups. This is called kernel fusion.
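To see the fusion effect in isolation, you can compile just this activation and time it against the eager version. The sketch below is my own micro-benchmark under assumed tensor sizes and iteration counts, not something from the original post, but it illustrates what torch.compile does with the TanhGELU above:

import time
import torch

device = "cuda"
gelu = TanhGELU().to(device)          # the class defined above
gelu_compiled = torch.compile(gelu)   # its pointwise ops get fused into far fewer kernels

x = torch.randn(8192, 8192, device=device)

def bench(fn, iters=50):
    fn(x)                             # warm-up (triggers compilation for the compiled module)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.time() - t0) / iters * 1000  # ms per call

print(f"eager:    {bench(gelu):.3f} ms")
print(f"compiled: {bench(gelu_compiled):.3f} ms")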

While on this topic, I’d like to point out a great open-source project called Luminal that takes this idea a little further. Luminal is a separate framework that you write your training / inference in. By using this framework, you get access to its compiler, which finds many more optimizations for you by virtue of having a more limited set of computations to consider. If you like the idea of improving runtime by compiling fast GPU code, give the project a look.

When we run the code now, each step takes roughly 145 ms (a 50% cut from before and ~86% from the original). We pay for this with the first iteration, which took roughly 40,000 ms to run! As most training runs have far more than 50 steps, this is a tradeoff we’re willing to make.

Image by Author — A100 training run after the torch.compile optimizations

Another optimization we make is using Flash Attention (see the paper here). The code change itself is very simple for us, but the thinking behind it is worth exploring.

y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

Similar to how we condensed the TanhGELU class into as few kernels as we could, we apply the same thinking to attention. In their paper, “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”, the authors show how a 7.6x speed-up can be achieved by fusing the kernel. While in theory torch.compile should be able to find optimizations like this, in practice we haven’t seen it find this one yet.
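For comparison, here is a sketch of the manual attention computation that this single call replaces, roughly the forward pass of the CausalSelfAttention class from Part 1. The function name and the bias mask argument are illustrative stand-ins for that class’s internals:

import math
import torch
import torch.nn.functional as F

def manual_causal_attention(q, k, v, bias):
    # q, k, v: (B, n_head, T, head_size); bias: (1, 1, block_size, block_size) lower-triangular mask
    T = q.size(-2)
    att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))  # materializes a (B, n_head, T, T) matrix in HBM
    att = att.masked_fill(bias[:, :, :T, :T] == 0, float('-inf'))    # another kernel, another HBM round trip
    att = F.softmax(att, dim=-1)                                     # and another
    return att @ v                                                   # FlashAttention fuses all of this into one kernel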

Figure 1 from “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”

The paper is worth a deep dive, but to give a quick synopsis, FlashAttention is written to be IO-aware, preventing unnecessary (and time-consuming) trips to memory. By reducing these, it can radically speed up the calculation.
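If you want to confirm that the flash backend is actually the one being picked, PyTorch exposes context managers for restricting which scaled-dot-product-attention implementation is allowed. Here is a sketch (the exact API has moved around between PyTorch versions, so treat this as version-dependent; q, k, v are the same tensors as above):

# allow only the FlashAttention backend; this raises an error if flash cannot be used here
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)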

After implementing this, we find that the average step is now about 104 ms.

Image by Author — A100 training after the Flash Attention optimization

Finally, we can go through all the numbers we have hard-coded and evaluate how “nice” they are. When we do this, we find that the vocabulary size is not divisible by many powers of 2, and so is more time-consuming for the GPU’s memory system to load. We fix this by going from a vocab size of 50,257 to the next “nice” number, 50,304. This is a nice number because it is cleanly divisible by 2, 4, 8, 16, 32, 64, and 128.

model = GPT(GPTConfig(vocab_size=50304))
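A quick throwaway check (not part of the training script) shows the difference in how the two numbers factor:

for vocab_size in (50257, 50304):
    largest_power_of_two = vocab_size & -vocab_size  # largest power of 2 that divides vocab_size
    print(f"{vocab_size}: divisible by powers of 2 up to {largest_power_of_two}")
# 50257 is odd, so only divisible by 1
# 50304 = 128 * 393, so divisible by 2, 4, 8, 16, 32, 64, and 128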

Now, you may remember from the last blog post that our vocab size is not an arbitrary value: it is determined by the tokenizer we’re using. That begs the question: what happens when we arbitrarily pad the vocab size with extra values? During training, the model will notice that these new vocabulary entries never appear, so it will push the probabilities of those tokens towards 0, and our performance is safe. That doesn’t mean there is no tradeoff, though. By loading vocabulary into memory that is never used, we are wasting time. Empirically, however, the benefit of the “nice” number more than compensates for this cost.

With this last optimization in place, we average about 100 ms per step.

Image by Author — A100 training after the vocab size optimization

All told, our training is now roughly 10x faster than when we started!

If you’ve been following along but only have access to a consumer-grade T4 GPU, you may wonder which optimizations you can use. To recap, we cannot use the BF16 representation, but we can use the vocabulary size change, Flash Attention, and torch.compile. To see this code in action, check out my Google Colab notebook, which is optimized for T4 usage.
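One way to keep a single training script working on both cards is to choose the autocast behaviour at runtime based on the GPU’s compute capability (BF16 needs an Ampere-class GPU, compute capability 8.0 or higher, which the T4 at 7.5 lacks). This is a minimal sketch of my own, not something taken from the notebook:

import contextlib
import torch

# the A100 reports compute capability (8, 0); the T4 reports (7, 5)
use_bf16 = torch.cuda.get_device_capability()[0] >= 8
autocast_ctx = (torch.autocast(device_type=device, dtype=torch.bfloat16)
                if use_bf16 else contextlib.nullcontext())

with autocast_ctx:
    logits, loss = model(x, y)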

We will see from the graph under that whereas the torch compile does take a variety of time for the primary spherical, the following rounds will not be considerably higher than the unoptimized variations (roughly an 8% drop on T4 vs 90% drop on A100).

Image by Author — Optimized run on a T4 GPU

Still, when OpenAI trained GPT-2, it was running on far more advanced hardware than the T4. The fact that we can run this workload on a T4 today shows that hardware requirements are becoming less onerous, helping create a future where hardware is not a barrier to ML work.

By optimizing our code, we’ve seen major speed-ups and also learned a bit about where the big bottlenecks in training occur. First and foremost, data types are critically important for speed, as that change alone contributed the largest share of the speed-up. Second, we see that hardware-aware optimizations can play a major role in speeding up calculations, so an understanding of GPU hardware is worth its weight in gold. Finally, compiler optimizations have a major part to play here as well.

To see the code I ran on the A100, check out this gist here. If you have any suggestions for optimizing the hardware usage further, I’d love to see them in the comments!

It’s an exciting time to be building!
