In this article, we will see how to replace softmax self-attention in Llama-3.2-1B with hybrid attention that combines softmax sliding window attention and linear attention. This implementation will help us better understand the growing interest in linear attention research, while also examining its limitations and potential future directions.
This walkthrough builds upon the following works:
This article will be mostly a recreation of the LoLCATs paper using Llama 3.2 1B, where we will replace 50% of the self-attention layers in a pretrained Llama model. The article consists of 4 main parts:
- Hybrid Attention Block
- Attention Transfer
- LoRA finetuning
- Evaluation
The main question of this article is whether we can somehow replace softmax attention in already trained models so that we can speed up inference without losing too much accuracy. If we can achieve this, then we can bring the cost of using LLMs down drastically!
Let’s see what the Llama-3.2-1B model looks like:
As we can see, we have 16 repeating decoder blocks. Our focus will be on the self_attn part, so the goal of this section is to understand how the LlamaSdpaAttention block works! Let’s see what the definition of LlamaSdpaAttention is:
class LlamaSdpaAttention(LlamaAttention):
    """
    Llama attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
    `LlamaAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
    SDPA API.
    """
You can check what this class looks like using the following code:
import inspect

attention_layer = model.model.layers[0].self_attn
print(inspect.getsource(attention_layer.__class__))
Let’s go over the main parts of this code, understand what each part is doing, and see where we need to make a change.
Let’s take a dummy input of shape [2,4,2048] → [batch_size, seq_len, embedding dimension]. Llama uses multi-headed attention with 32 heads.
Block 1:
After proj → query_states is a tensor of [2,4,2048], key_states is a tensor of [2,4,512] and value_states is a tensor of [2,4,512].
After view and transpose it’s: query_states → [2,32,4,64] key_states → [2,8,4,64] value_states → [2,8,4,64]
Here 64 is the per-head dimension (2048 / 32). The key and value tensors have only 8 heads because Llama uses key-value groups: the 32 query heads are split into 8 groups of 4, and the heads in each group share the same key_states and value_states.
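The shapes in Block 1 follow directly from the head counts. Below is a minimal sketch of that arithmetic in plain Python, assuming the sizes quoted above (hidden size 2048, 32 query heads, 8 key-value heads). The variable names are illustrative and are not read from the model config.

```python
# Shape arithmetic for grouped key-value heads, using the sizes quoted above.
# All names here are illustrative, not taken from the Llama source.
batch_size, seq_len, hidden_size = 2, 4, 2048
num_heads = 32          # query heads
num_kv_heads = 8        # key/value heads shared across groups of query heads

head_dim = hidden_size // num_heads            # per-head dimension
kv_proj_dim = num_kv_heads * head_dim          # output width of k_proj / v_proj
num_kv_groups = num_heads // num_kv_heads      # query heads per key/value head

# Shapes right after the linear projections
q_proj_shape = [batch_size, seq_len, num_heads * head_dim]
k_proj_shape = [batch_size, seq_len, kv_proj_dim]

# Shapes after view(bsz, q_len, -1, head_dim).transpose(1, 2)
q_shape = [batch_size, num_heads, seq_len, head_dim]
k_shape = [batch_size, num_kv_heads, seq_len, head_dim]

print(head_dim, kv_proj_dim, num_kv_groups)
print(q_proj_shape, k_proj_shape, q_shape, k_shape)
```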
Block 2:
In this block we just apply positional encoding; specifically, Llama uses Rotary Position Embeddings (RoPE). I won’t go into detail about why this is needed, but you can read the following article to get a better idea:
Block 3:
Here we apply the repeat_kv function, which repeats each key-value head 4 times so that the 8 key-value heads line up with the 32 query heads. We also use past_key_value so that previously computed key and value states can be reused instead of being recomputed.
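The effect of repeating key-value heads can be illustrated without tensors. The sketch below uses string labels in place of real heads, and `repeat_kv_heads` is a hypothetical helper written for this illustration (it is not the library's `repeat_kv`). It assumes the copies of each head are placed next to each other, so that query head j pairs with key-value head j // 4.

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each key/value head n_rep times, keeping the copies adjacent."""
    repeated = []
    for head in kv_heads:
        repeated.extend([head] * n_rep)
    return repeated

# 8 key/value heads, labelled by index, expanded to match 32 query heads
kv_heads = [f"kv{i}" for i in range(8)]
expanded = repeat_kv_heads(kv_heads, n_rep=4)

# Query head j ends up paired with key/value head j // 4
pairing = {q_head: expanded[q_head] for q_head in (0, 3, 4, 31)}
print(len(expanded), pairing)
```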
Block 4:
Block 4 handles two important preparation steps for attention: setting up the causal mask to ensure tokens only attend to previous positions, and optimizing memory layout with contiguous tensors for efficient GPU operations.
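As a rough picture of what the causal mask encodes, here is a small sketch that builds it as a boolean grid in plain Python. This is a simplified stand-in: `causal_mask` is a name made up for this example, and real implementations commonly use an additive mask with large negative values rather than booleans.

```python
def causal_mask(seq_len):
    """Return mask[i][j] == True where query i may attend to key j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
for row in mask:
    # "x" marks an allowed (query, key) pair, "." a masked one
    print(" ".join("x" if allowed else "." for allowed in row))
```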
Block 5:
This is where we apply softmax attention, the component we’ll be replacing in our implementation.
Block 6:
The attention output will be a tensor of shape [2, 32, 4, 64]. We convert it back to [2, 4, 2048] and apply the final output projection.
And that’s the journey of an input through Llama self-attention!
So now let’s look at our HybridAttention block:
import torch
import torch.nn.functional as F
from typing import Optional, Tuple

# These imports assume a transformers release that still ships LlamaSdpaAttention
from transformers.cache_utils import Cache
from transformers.models.llama.modeling_llama import (
    LlamaSdpaAttention,
    apply_rotary_pos_emb,
    repeat_kv,
)


class HybridAttention(LlamaSdpaAttention):
    def __init__(self, config, layer_idx=None):
        super().__init__(config, layer_idx=layer_idx)
        self.window_size = 64
        # self.layer_idx = layer_idx

        # Initialize learnable factors
        # Create one factor pair per attention head
        num_heads = config.num_attention_heads
        self.window_factors = torch.nn.Parameter(torch.ones(1, num_heads, 1, 1) * 0.5)
        self.linear_factors = torch.nn.Parameter(torch.ones(1, num_heads, 1, 1) * 0.5)
        self.factor_activation = torch.nn.Sigmoid()

    def sliding_window_attention(self, query_states, key_states, value_states, window_size, window_factor):
        """Compute sliding window attention"""
        batch_size, num_heads, seq_len, head_dim = query_states.shape

        # Left-pad the sequence axis, then unfold it into windows: [b, h, l, d, w]
        key_windows = F.pad(key_states, (0, 0, window_size - 1, 0), value=0)
        key_windows = key_windows.unfold(2, window_size, 1)
        value_windows = F.pad(value_states, (0, 0, window_size - 1, 0), value=0)
        value_windows = value_windows.unfold(2, window_size, 1)

        attn_weights = torch.einsum('bhld,bhldw->bhlw', query_states, key_windows) * (head_dim ** -0.5)
        # Treat exact-zero scores as padding and mask them out before the softmax
        attn_weights = torch.where(attn_weights == 0,
                                   torch.tensor(-float('inf'), device=attn_weights.device),
                                   attn_weights)

        # Apply learnable window factor (with sigmoid to ensure positivity)
        attn_weights = self.factor_activation(window_factor) * F.softmax(attn_weights, dim=-1)
        attn_output = torch.einsum('bhlw,bhldw->bhld', attn_weights, value_windows)
        sum_weights = attn_weights.sum(dim=-1, keepdim=True)
        return attn_output, sum_weights

    def linear_attention(self, query_states, key_states, value_states, window_size, linear_factor):
        """Compute linear attention with cumsum"""
        def feature_map(x):
            return F.elu(x) + 1

        query_prime = feature_map(query_states)
        key_prime = feature_map(key_states)

        # Shift keys and values right by window_size so this branch only sees older tokens
        key_prime = F.pad(key_prime, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]
        value_padded = F.pad(value_states, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]

        # Compute KV
        kv = torch.einsum('bhlf,bhld->bhlfd', key_prime, value_padded)

        # Apply learnable linear factor (with sigmoid to ensure positivity)
        qkv = self.factor_activation(linear_factor) * torch.einsum('bhlf,bhlfd->bhld',
                                                                   query_prime,
                                                                   kv.cumsum(dim=2))
        sum_k = key_prime.cumsum(dim=2)
        sum_qk = self.factor_activation(linear_factor) * torch.einsum('bhld,bhld->bhl',
                                                                      query_prime,
                                                                      sum_k)[..., None]
        # Avoid an exactly-zero normalizer
        sum_qk = torch.where(sum_qk == 0, torch.tensor(1e-12, device=sum_qk.device), sum_qk)
        return qkv, sum_qk

    def hybrid_attention(self, query_states, key_states, value_states):
        """Combine sliding window and linear attention with learnable factors"""
        qkv_window, sum_window = self.sliding_window_attention(
            query_states, key_states, value_states,
            self.window_size, self.window_factors
        )
        qkv_linear, sum_linear = self.linear_attention(
            query_states, key_states, value_states,
            self.window_size, self.linear_factors
        )
        # Normalize both branches jointly
        output = (qkv_window + qkv_linear) / (sum_window + sum_linear)
        return output

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        cache_position: Optional[torch.LongTensor] = None,
        position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
        **kwargs,
    ):
        bsz, q_len, _ = hidden_states.size()

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, -1, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, -1, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, -1, self.head_dim).transpose(1, 2)

        if position_embeddings is None:
            cos, sin = self.rotary_emb(value_states, position_ids)
        else:
            cos, sin = position_embeddings
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

        if past_key_value is not None:
            cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_output = self.hybrid_attention(
            query_states,
            key_states,
            value_states
        )

        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(bsz, q_len, -1)
        attn_output = self.o_proj(attn_output)
        return attn_output, None, past_key_value
We only made one change in forward(): we replaced block 5 with the following:
attn_output = self.hybrid_attention(
    query_states,
    key_states,
    value_states
)
We basically partitioned the attention mechanism into sliding window and linear attention blocks.
Sliding Window Attention:
def sliding_window_attention(self, query_states, key_states, value_states, window_size, window_factor):
    """Compute sliding window attention"""
    batch_size, num_heads, seq_len, head_dim = query_states.shape

    # Left-pad the sequence axis, then unfold it into windows: [b, h, l, d, w]
    key_windows = F.pad(key_states, (0, 0, window_size - 1, 0), value=0)
    key_windows = key_windows.unfold(2, window_size, 1)
    value_windows = F.pad(value_states, (0, 0, window_size - 1, 0), value=0)
    value_windows = value_windows.unfold(2, window_size, 1)

    attn_weights = torch.einsum('bhld,bhldw->bhlw', query_states, key_windows) * (head_dim ** -0.5)
    # Treat exact-zero scores as padding and mask them out before the softmax
    attn_weights = torch.where(attn_weights == 0,
                               torch.tensor(-float('inf'), device=attn_weights.device),
                               attn_weights)

    # Apply learnable window factor (with sigmoid to ensure positivity)
    attn_weights = self.factor_activation(window_factor) * F.softmax(attn_weights, dim=-1)
    attn_output = torch.einsum('bhlw,bhldw->bhld', attn_weights, value_windows)
    sum_weights = attn_weights.sum(dim=-1, keepdim=True)
    return attn_output, sum_weights
For a deeper understanding of window attention concepts, I recommend referring to this paper:
The idea I’ve implemented here is that instead of calculating attention over all key-value pairs together (where each token attends to every other token), we break the sequence into windows of size ‘w’ and calculate the attention within each window. With this in the above code, the time complexity comes down from O(n²) to O(n*w), since each token only needs to attend to w tokens instead of all n tokens. It can be made even better by using concepts such as attention sinks and applying the window only to the last w tokens, which I might implement in future updates.
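To make the O(n²) versus O(n*w) comparison concrete, here is a small sketch that counts query-key score computations for full causal attention and for a causal sliding window. The helper names are made up for this illustration, and the window convention (each query sees itself plus up to w - 1 predecessors) is an assumption chosen to match the left padding of window_size - 1 in the code above.

```python
def window_indices(query_pos, window_size):
    """Key positions a query attends to: itself and up to window_size - 1 predecessors."""
    start = max(0, query_pos - window_size + 1)
    return list(range(start, query_pos + 1))

def count_scores(seq_len, window_size=None):
    """Count query-key dot products for causal attention, optionally windowed."""
    total = 0
    for i in range(seq_len):
        if window_size is None:
            total += i + 1                      # all previous tokens plus itself
        else:
            total += len(window_indices(i, window_size))
    return total

n, w = 1024, 64
full = count_scores(n)
windowed = count_scores(n, w)
print(window_indices(5, 3))      # keys seen by the query at position 5
print(full, windowed, n * w)     # windowed count is bounded by n * w
```

The windowed count grows linearly with the sequence length once n exceeds w, while the full causal count grows quadratically.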
Linear Attention:
def linear_attention(self, query_states, key_states, value_states, window_size, linear_factor):
    """Compute linear attention with cumsum"""
    def feature_map(x):
        return F.elu(x) + 1

    query_prime = feature_map(query_states)
    key_prime = feature_map(key_states)

    # Shift keys and values right by window_size so this branch only sees older tokens
    key_prime = F.pad(key_prime, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]
    value_padded = F.pad(value_states, (0, 0, window_size, 0), value=0)[:, :, :-window_size, :]

    # Compute KV
    kv = torch.einsum('bhlf,bhld->bhlfd', key_prime, value_padded)

    # Apply learnable linear factor (with sigmoid to ensure positivity)
    qkv = self.factor_activation(linear_factor) * torch.einsum('bhlf,bhlfd->bhld',
                                                               query_prime,
                                                               kv.cumsum(dim=2))
    sum_k = key_prime.cumsum(dim=2)
    sum_qk = self.factor_activation(linear_factor) * torch.einsum('bhld,bhld->bhl',
                                                                  query_prime,
                                                                  sum_k)[..., None]
    # Avoid an exactly-zero normalizer
    sum_qk = torch.where(sum_qk == 0, torch.tensor(1e-12, device=sum_qk.device), sum_qk)
    return qkv, sum_qk
For linear attention, I use a very simple feature map of elu(x) + 1, but the main part to note is the initial padding. The idea here is that we only need linear attention for the first [sequence length - window size] tokens, as we already have the sliding window to keep track of recent context.
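The padding trick and the cumulative-sum form can be checked on a toy example. The sketch below uses one scalar per token (a head dimension of 1) and plain Python lists, so it is only an illustration of the idea and not the tensor implementation; all function names are hypothetical. It compares the running-sum version against an explicit sum over keys at positions at or before i - window_size.

```python
import math

def feature_map(x):
    """elu(x) + 1 for a scalar: x + 1 when x > 0, exp(x) otherwise."""
    return x + 1.0 if x > 0 else math.exp(x)

def shift_right(seq, window_size):
    """Pad window_size zeros at the front and drop the last window_size items."""
    return ([0.0] * window_size + list(seq))[:len(seq)]

def linear_attention_cumsum(q, k, v, window_size):
    """Running-sum form: a single pass over the sequence."""
    phi_k = shift_right([feature_map(x) for x in k], window_size)
    v_shifted = shift_right(v, window_size)
    kv_sum, k_sum, numerators, denominators = 0.0, 0.0, [], []
    for i in range(len(q)):
        kv_sum += phi_k[i] * v_shifted[i]
        k_sum += phi_k[i]
        phi_q = feature_map(q[i])
        numerators.append(phi_q * kv_sum)
        denominators.append(phi_q * k_sum)
    return numerators, denominators

def linear_attention_explicit(q, k, v, window_size):
    """Quadratic reference: sum over keys at positions <= i - window_size."""
    numerators, denominators = [], []
    for i in range(len(q)):
        phi_q = feature_map(q[i])
        keys = range(0, i - window_size + 1)
        numerators.append(sum(phi_q * feature_map(k[j]) * v[j] for j in keys))
        denominators.append(sum(phi_q * feature_map(k[j]) for j in keys))
    return numerators, denominators

q = [0.5, -1.0, 2.0, 0.1, -0.3, 1.5]
k = [1.0, 0.2, -0.7, 0.9, -1.2, 0.4]
v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
num_fast, den_fast = linear_attention_cumsum(q, k, v, window_size=2)
num_ref, den_ref = linear_attention_explicit(q, k, v, window_size=2)
print(den_fast[:2])   # the first window_size positions get no linear-attention mass
```

Because the shifted keys are zero for the first window_size positions, those tokens rely entirely on the sliding window branch.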
The combination of these two types of attention becomes our new hybrid attention, and we use window_factor and linear_factor as learnable parameters that control how much each type of attention contributes to the final output.
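The joint normalization can be sketched for a single query with scalar values. This is a simplified illustration under stated assumptions: `hybrid_output` is a hypothetical helper, the linear weights stand in for the non-negative products of query and key feature maps, and the default factors of 0.5 mirror the initial parameter values used above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hybrid_output(window_scores, window_values, linear_weights, linear_values,
                  window_factor=0.5, linear_factor=0.5):
    """Combine both attention types for one query and normalize them jointly."""
    # Softmax over the scores inside the sliding window
    max_score = max(window_scores)
    exps = [math.exp(s - max_score) for s in window_scores]
    softmax = [e / sum(exps) for e in exps]

    # Scale each branch by its sigmoid-activated factor
    w_weights = [sigmoid(window_factor) * p for p in softmax]
    l_weights = [sigmoid(linear_factor) * a for a in linear_weights]

    numerator = (sum(w * v for w, v in zip(w_weights, window_values))
                 + sum(a * v for a, v in zip(l_weights, linear_values)))
    denominator = sum(w_weights) + sum(l_weights)
    effective = [w / denominator for w in w_weights + l_weights]
    return numerator / denominator, effective

out, weights = hybrid_output(
    window_scores=[0.2, 1.0], window_values=[4.0, 5.0],
    linear_weights=[0.8, 1.3], linear_values=[1.0, 2.0],
)
print(round(sum(weights), 6), out)
```

After the shared division, the effective weights over window tokens and older tokens sum to one, so the output is a weighted average of the values. Note that the window branch is already softmax-normalized before scaling while the linear branch is not, so the balance between the two depends on the magnitude of the linear weights as well as on the learned factors.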