Training Language Models on Google Colab | by John Hawkins | Dec, 2024

A guide to iterative fine-tuning and serialisation

Image by Shio Yang on Unsplash

So, you recently discovered Hugging Face and its host of open source models like BERT, Llama, BART, and a whole range of generative language models from Mistral AI, Facebook, Salesforce and other companies. Now you want to experiment with fine-tuning some Large Language Models for your side projects. Things start off great, but then you discover how computationally greedy they are, and you don't have a GPU processor handy.

Google Colab generously offers you a way to access free computation so you can solve this problem. The downside is that you need to do it all within a transitory, browser-based environment. To make matters worse, the whole thing is time limited, so it seems that no matter what you do, you will lose your precious fine-tuned model and all your results when the kernel is eventually shut down and the environment nuked.

Never fear. There is a way around this: use Google Drive to save any of your intermediate results or model parameters. This will allow you to continue experimenting at a later stage, or to take a trained model and use it for inference elsewhere.

To do this you will need a Google account with sufficient Google Drive space for both your training data and your model checkpoints. I will presume you have created a folder called data in Google Drive containing your dataset, and another called checkpoints that is empty.

Inside your Google Colab notebook you then mount your Drive with the following command:

from google.colab import drive
drive.mount('/content/drive')

You can now list the contents of your data and checkpoints directories with the following two commands in a new cell:

!ls /content/drive/MyDrive/data
!ls /content/drive/MyDrive/checkpoints

If these commands work, you now have access to those directories from within your notebook. If they don't work, you may have missed the authorisation step. The drive.mount command above should have spawned a pop-up window that requires you to click through and authorise access. You may have missed the pop-up, or not granted all of the required access rights. Try re-running the cell and checking.
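If the mount itself seems to be in a stale state, you can also force the authorisation flow to run again. The force_remount option is part of the standard drive.mount call:

from google.colab import drive

# Re-mount the Drive and force the authorisation prompt to appear again
drive.mount('/content/drive', force_remount=True)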

Once you have that access sorted out, you can write your scripts so that models and results are serialised into the Google Drive directories and therefore persist across sessions. In an ideal world, you would code your training job so that any script that takes too long to run can load a partially trained model from the previous session and continue training from that point.

A simple way of achieving this is to create a save and a load function that are used by your training scripts. The training process should always check whether there is a partially trained model before initialising a new one. Here is an example save function:

import os
import torch

def save_checkpoint(epoch, model, optimizer, scheduler, loss, model_name, overwrite=True):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'scheduler_state_dict': scheduler.state_dict(),
        'loss': loss
    }
    direc = get_checkpoint_dir(model_name)
    if overwrite:
        file_path = direc + '/checkpoint.pth'
    else:
        file_path = direc + '/epoch_' + str(epoch) + '_checkpoint.pth'
    if not os.path.isdir(direc):
        try:
            os.mkdir(direc)
        except OSError:
            print("Error: directory does not exist and cannot be created")
            file_path = direc + '_epoch_' + str(epoch) + '_checkpoint.pth'
    torch.save(checkpoint, file_path)
    print(f"Checkpoint saved at epoch {epoch}")

In this instance we are saving the model state along with some metadata (epoch and loss) inside a dictionary structure. We include an option to overwrite a single checkpoint file, or to create a new file for every epoch. We are using the torch save function, but in principle you could use other serialisation methods. The key idea is that your program opens the file and determines how many epochs of training were used to produce it. This allows the program to decide whether to continue training or move on.
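The get_checkpoint_dir helper is not shown here. A minimal sketch, assuming each model gets its own sub-directory underneath the checkpoints folder we created on Google Drive, could be as simple as:

def get_checkpoint_dir(model_name):
    # Assumed location: the 'checkpoints' folder on the mounted Drive
    base_dir = '/content/drive/MyDrive/checkpoints'
    return os.path.join(base_dir, model_name)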

Similarly, in the load function we pass in a reference to the model we wish to use. If there is already a serialised model, we load its parameters into our model and return the number of epochs it was trained for. This epoch value will determine how many additional epochs are required. If there is no model, then we get the default value of zero epochs and we know the model still has the parameters it was initialised with.

def load_checkpoint(model_name, model, optimizer, scheduler):
    direc = get_checkpoint_dir(model_name)
    if os.path.exists(direc):
        file_path = get_path_with_max_epochs(direc)
        checkpoint = torch.load(file_path, map_location=torch.device('cpu'))
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
        epoch = checkpoint['epoch']
        loss = checkpoint['loss']
        print(f"Checkpoint loaded from epoch {epoch}")
        return epoch, loss
    else:
        print("No checkpoint found, starting from epoch 1.")
        return 0, None
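The get_path_with_max_epochs helper is also left undefined. A rough sketch, assuming the file naming scheme used in save_checkpoint above, might look like this:

import re

def get_path_with_max_epochs(direc):
    # Prefer the single overwritten checkpoint file if it exists
    single = os.path.join(direc, 'checkpoint.pth')
    if os.path.isfile(single):
        return single
    # Otherwise pick the per-epoch file with the highest epoch number
    best_epoch, best_path = -1, None
    for name in os.listdir(direc):
        match = re.match(r'epoch_(\d+)_checkpoint\.pth', name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(direc, name)
    return best_path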

These two functions need to be called inside your training loop, and you need to make sure that the returned epoch value is used to update the number of epochs in your training iterations. The result is that you now have a training process that can be restarted when a kernel dies, and it will pick up and continue from where it left off.

That core training loop might look something like the following:


EPOCHS = 10

for exp in experiments:
    model, optimizer, scheduler = initialise_model_components(exp)
    train_loader, val_loader = generate_data_loaders(exp)
    start_epoch, prev_loss = load_checkpoint(exp, model, optimizer, scheduler)
    for epoch in range(start_epoch, EPOCHS):
        print(f'Epoch {epoch + 1}/{EPOCHS}')
        # ALL YOUR TRAINING CODE HERE (producing train_loss)
        save_checkpoint(epoch + 1, model, optimizer, scheduler, train_loss, exp)

Note: In this example I am experimenting with training a number of different model setups (held in a list called experiments), potentially using different training datasets. The supporting functions initialise_model_components and generate_data_loaders take care of ensuring that I get the correct model and data for each experiment.
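Neither of those helpers appears in this article, so purely for illustration, here is a hedged sketch of what they might look like for a Hugging Face classification model. The experiment names, the EXPERIMENT_CONFIGS registry and the dummy tensors are assumptions made only to keep the example self-contained:

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification

# Hypothetical registry mapping experiment names to their settings
EXPERIMENT_CONFIGS = {
    'bert_base_sentiment': {'pretrained': 'bert-base-uncased', 'lr': 2e-5, 'batch_size': 16},
}
experiments = list(EXPERIMENT_CONFIGS.keys())

def initialise_model_components(exp):
    cfg = EXPERIMENT_CONFIGS[exp]
    model = AutoModelForSequenceClassification.from_pretrained(cfg['pretrained'])
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg['lr'])
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
    return model, optimizer, scheduler

def generate_data_loaders(exp):
    cfg = EXPERIMENT_CONFIGS[exp]
    # In practice these tensors would be built from the dataset saved in the
    # Drive 'data' folder; random tensors keep the sketch self-contained.
    input_ids = torch.randint(0, 30000, (64, 128))
    labels = torch.randint(0, 2, (64,))
    dataset = TensorDataset(input_ids, labels)
    train_loader = DataLoader(dataset, batch_size=cfg['batch_size'], shuffle=True)
    val_loader = DataLoader(dataset, batch_size=cfg['batch_size'])
    return train_loader, val_loader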

The core training loop above allows us to reuse the overall code structure that trains and serialises these models, ensuring that each model reaches the desired number of epochs of training. If we restart the process, it will iterate through the experiment list again, but it will skip over any experiments that have already reached the maximum number of epochs.

Hopefully you can use this boilerplate code to set up your own process for experimenting with training deep learning language models inside Google Colab. Please comment and let me know what you are building and how you use this code.