4M Tokens? MiniMax-Text-01 Outperforms DeepSeek V3

Chinese AI labs are making steady progress in the AI race. Models like DeepSeek-V3 and Qwen 2.5 are giving strong competition to GPT-4o, Claude, and Grok. What makes these Chinese models stand out? Cost efficiency, openness, and high performance. Many are open-source and available under commercially permissive licenses, making them accessible to a wide range of developers and businesses.

MiniMax-Text-01 is the latest addition to this wave of Chinese LLMs. With a 4 million token context length, far exceeding the industry standard of 128K-256K tokens, it sets a new benchmark for long-context tasks. The model’s hybrid attention architecture keeps inference efficient, and its open-source, commercially permissive license enables innovation without the burden of hefty costs.

Let’s explore MiniMax-Text-01!

Hybrid Architecture

MiniMax-Text-01 combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE) to strike a balance between efficiency and performance.

Source: MiniMax-Text-01
  • 7/8 Linear Attention (Lightning Attention-2):
    • Lightning Attention is a linear attention mechanism that reduces computational complexity from O(n²d) to O(d²n), making it highly efficient for long-context tasks.
    • The mechanism involves (a minimal sketch follows this list):
      1. Input transformation using SiLU activation.
      2. Matrix operations to compute attention scores.
      3. Normalization and scaling using RMSNorm and sigmoid.
  • 1/8 Softmax Attention:
    • Traditional attention with RoPE (Rotary Position Embedding) applied to half of the attention head dimension, enabling length extrapolation without performance degradation.
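
To make the linear-attention idea concrete, here is a minimal PyTorch sketch of the right-product trick that gives it roughly O(nd²) cost instead of O(n²d). The function name and shapes are illustrative, and it ignores causal masking, block tiling, decay factors, and the learned gating of Lightning Attention-2, so treat it as a sketch of the computation pattern rather than the model’s actual layer.

import torch
import torch.nn.functional as F

def linear_attention(x, w_q, w_k, w_v, eps=1e-6):
    # x: (n, d_model); w_q/w_k/w_v: (d_model, d_head)
    q = F.silu(x @ w_q)                  # input transformation with SiLU
    k = F.silu(x @ w_k)
    v = x @ w_v
    kv = k.transpose(-2, -1) @ v         # (d_head, d_head) summary, cost O(n·d²)
    out = q @ kv                         # avoids materializing the O(n²) score matrix
    rms = out.pow(2).mean(dim=-1, keepdim=True).add(eps).sqrt()
    return out / rms                     # RMSNorm-style scaling (the real layer also adds a sigmoid gate)

x = torch.randn(16, 64)
w = [torch.randn(64, 32) for _ in range(3)]
print(linear_attention(x, *w).shape)     # torch.Size([16, 32])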

Mixture-of-Experts (MoE) Strategy

MiniMax-Text-01 employs a distinctive MoE architecture that differs from models like DeepSeek-V3:

Source: MiniMax-Text-01
  • Token Drop Strategy: Uses an auxiliary loss to balance token distribution across experts, unlike DeepSeek’s dropless strategy.
  • Global Router: Optimizes token allocation to ensure balanced workloads across expert groups.
  • Top-k Routing: Selects the top-2 experts per token, compared to DeepSeek’s top-8 plus 1 shared expert (a minimal routing sketch follows this list).
  • Expert Configuration:
    • 32 experts (vs. DeepSeek’s 256 + 1 shared).
    • Expert hidden dimension: 9216 (vs. DeepSeek’s 2048).
    • Total activated parameters per layer: 18,432 (same as DeepSeek).
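
Below is a rough sketch of what top-2 routing with an auxiliary load-balancing loss can look like. The function name, the Switch-style form of the auxiliary loss, and the gating details are illustrative assumptions, not MiniMax’s released code.

import torch
import torch.nn.functional as F

def top2_route(hidden, gate_weight, num_experts=32, top_k=2):
    # hidden: (num_tokens, d_model); gate_weight: (d_model, num_experts) — the global router
    logits = hidden @ gate_weight
    probs = F.softmax(logits, dim=-1)                       # router probabilities
    weights, experts = probs.topk(top_k, dim=-1)            # pick the top-2 experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the two gate values
    # Auxiliary load-balancing loss (Switch-style): pushes toward an even token share per expert
    dispatch_share = F.one_hot(experts[:, 0], num_experts).float().mean(dim=0)
    prob_share = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(dispatch_share * prob_share)
    return experts, weights, aux_loss

tokens = torch.randn(128, 512)
gate = torch.randn(512, 32)
experts, weights, aux = top2_route(tokens, gate)
print(experts.shape, weights.shape, round(aux.item(), 3))   # (128, 2) gate shapes, aux near 1.0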

Training and Scaling Strategies

  • Training Infrastructure:
    • Trained on ~2000 H100 GPUs using advanced parallelism techniques such as Expert Tensor Parallelism (ETP) and Linear Attention Sequence Parallelism Plus (LASP+).
    • Optimized for 8-bit quantization, ensuring efficient inference on 8x80GB H100 nodes.
  • Training Data:
    • Trained on ~12 trillion tokens with a WSD-like learning rate schedule.
    • Data consists of a mix of high-quality and lower-quality sources, with global deduplication and 4x repetition of high-quality data.
  • Long-Context Training:
    • Base training followed by three extension phases:
      1. Base training: 8K context length with RoPE base 10K.
      2. Phase 1: 128K context length, 5M RoPE base, 30% short sequences, 70% medium sequences.
      3. Phase 2: 512K context length, 10M RoPE base, 35% short, 35% medium, 30% long sequences.
      4. Phase 3: 1M context length, 10M RoPE base, 30% short, 30% medium, 40% long sequences.
    • Linear interpolation mitigates distribution shifts during context length scaling (see the sketch after this list).
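
The RoPE base and the context length interact as sketched below: a larger base slows the rotation frequencies so distant positions remain distinguishable, while linear (position) interpolation compresses longer sequences into the position range seen during training. The helper names and exact scaling rule are assumptions for illustration, not code from the paper.

import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    # Standard RoPE inverse frequencies; raising the base (10K -> 5M -> 10M)
    # slows the rotations, which is what supports 128K-1M token contexts
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def interpolated_positions(seq_len: int, trained_len: int) -> torch.Tensor:
    # Linear interpolation: compress positions so a longer sequence
    # reuses the position range the model was already trained on
    scale = min(1.0, trained_len / seq_len)
    return torch.arange(seq_len).float() * scale

print(rope_inv_freq(128, 10_000)[-3:])      # smallest frequencies with the 10K base
print(rope_inv_freq(128, 10_000_000)[-3:])  # the same dimensions with the 10M base
print(interpolated_positions(8, 4))         # positions 0..7 mapped into 0..3.5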

Post-Training Optimization

  • Iterative Fine-Tuning:
    • Combines Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) in cycles.
    • RL uses Offline DPO and Online GRPO for alignment (a minimal DPO loss sketch follows this list).
  • Long-Context Fine-Tuning:
    • Short-Context SFT → Long-Context SFT → Short-Context RL → Long-Context RL.
    • This phased approach is critical for achieving superior long-context performance.
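
For reference, the offline DPO objective that the RL stage builds on fits in a few lines. This is the generic textbook form rather than MiniMax’s exact implementation, and the Online GRPO stage is not shown.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each argument is the summed log-probability of a full response under
    # the trainable policy or the frozen reference model
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy example: the policy prefers the chosen response more strongly than the reference does
loss = dpo_loss(torch.tensor([-5.0]), torch.tensor([-9.0]),
                torch.tensor([-6.0]), torch.tensor([-7.0]))
print(loss.item())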

Key Innovations

  • DeepNorm: A post-norm architecture that scales residual connections to improve training stability (sketched after this list).
  • Batch Size Warmup: Gradually increases the batch size from 16M to 128M tokens to optimize training dynamics.
  • Efficient Parallelism:
    • Ring Attention: Reduces memory overhead for long sequences.
    • Padding Optimization: Minimizes wasted computation during training.
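
DeepNorm is essentially a post-norm residual block whose residual branch is scaled before normalization. The sketch below is illustrative: the alpha value and sublayer are placeholders, and the matching beta-scaled weight initialization that DeepNorm prescribes is omitted.

import torch
import torch.nn as nn

class DeepNormBlock(nn.Module):
    # Post-norm residual block in the DeepNorm style (illustrative sketch)
    def __init__(self, d_model: int, sublayer: nn.Module, alpha: float):
        super().__init__()
        self.sublayer = sublayer       # e.g. an attention or feed-forward module
        self.alpha = alpha             # > 1, scales the residual path for stability
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # LayerNorm applied after the residual sum, with the residual scaled by alpha
        return self.norm(self.alpha * x + self.sublayer(x))

block = DeepNormBlock(64, nn.Linear(64, 64), alpha=2.0)
print(block(torch.randn(4, 64)).shape)   # torch.Size([4, 64])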

Core Academic Benchmarks

Source: MiniMax-Text-01

General Tasks Benchmarks

Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01
MMLU* | 85.7 | 88.3 | 86.8 | 86.5 | 86.1 | 88.5 | 88.6 | 88.5
MMLU-Pro* | 74.4 | 78.0 | 75.8 | 76.4 | 71.1 | 75.9 | 73.3 | 75.7
SimpleQA | 39.0 | 28.1 | 23.4 | 26.6 | 10.3 | 24.9 | 23.2 | 23.7
C-SimpleQA | 64.6 | 56.8 | 59.4 | 63.3 | 52.2 | 64.8 | 54.7 | 67.4
IFEval (avg) | 84.1 | 90.1 | 89.4 | 88.4 | 87.2 | 87.3 | 86.4 | 89.1
Arena-Hard | 92.4 | 87.6 | 85.3 | 72.7 | 81.2 | 91.4 | 63.5 | 89.1

Reasoning Tasks Benchmarks

Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01
GPQA* | 46.0 | 65.0 | 59.1 | 62.1 | 49.0 | 59.1 | 50.7 | 54.4
DROP* | 89.2 | 88.8 | 89.2 | 89.3 | 85.0 | 91.0 | 92.5 | 87.8

Arithmetic & Coding Duties Benchmarks

Task | GPT-4o | Claude-3.5-Sonnet | Gemini-1.5-Pro | Gemini-2.0-Flash | Qwen2.5-72B-Inst. | DeepSeek-V3 | Llama-3.1-405B-Inst. | MiniMax-Text-01
GSM8k* | 95.6 | 96.9 | 95.2 | 95.4 | 95.8 | 96.7 | 96.7 | 94.8
MATH* | 76.6 | 74.1 | 84.6 | 83.9 | 81.8 | 84.6 | 73.8 | 77.4
MBPP+ | 76.2 | 75.1 | 75.4 | 75.9 | 77.0 | 78.8 | 73.0 | 71.7
HumanEval | 90.2 | 93.7 | 86.6 | 89.6 | 86.6 | 92.1 | 89.0 | 86.9
Source: MiniMax-Text-01

You can check out the other evaluation parameters here.

Let’s Get Started with MiniMax-Text-01

This script sets up and runs the MiniMax-Text-01 language model using the Hugging Face transformers library. It includes steps to configure the model for a multi-GPU environment (it assumes an 8-GPU node, as recommended for int8 inference), apply quantization for efficiency, and generate a response from a user-provided input prompt.

from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, GenerationConfig

# Ensure QuantoConfig is importable; the fallback is a minimal stand-in
# (real int8 quantization requires a transformers release with Quanto support)
try:
    from transformers import QuantoConfig
except ImportError:
    class QuantoConfig:
        def __init__(self, weights, modules_to_not_convert):
            self.weights = weights
            self.modules_to_not_convert = modules_to_not_convert

# Load the Hugging Face config
hf_config = AutoConfig.from_pretrained("MiniMaxAI/MiniMax-Text-01", trust_remote_code=True)

# Quantization config (int8 recommended); keep the output head, embeddings,
# and MoE gating/coefficient layers in full precision
quantization_config = QuantoConfig(
    weights="int8",
    modules_to_not_convert=[
        "lm_head",
        "embed_tokens",
    ] + [f"model.layers.{i}.coefficient" for i in range(hf_config.num_hidden_layers)]
    + [f"model.layers.{i}.block_sparse_moe.gate" for i in range(hf_config.num_hidden_layers)]
)

# Set the device map for a multi-GPU setup
world_size = 8  # Assume 8 GPUs
device_map = {
    'model.embed_tokens': 'cuda:0',
    'model.norm': f'cuda:{world_size - 1}',
    'lm_head': f'cuda:{world_size - 1}'
}
layers_per_device = hf_config.num_hidden_layers // world_size
for i in range(world_size):
    for j in range(layers_per_device):
        device_map[f'model.layers.{i * layers_per_device + j}'] = f'cuda:{i}'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-Text-01")

# Prepare the input prompt
prompt = "Hello!"
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
if hasattr(tokenizer, 'apply_chat_template'):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
else:
    raise NotImplementedError("The tokenizer does not support 'apply_chat_template'. Check the documentation or update the tokenizer version.")

# Tokenize and move to device
model_inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Load the model with quantization (the quantization_config defined above is passed here)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "MiniMaxAI/MiniMax-Text-01",
    torch_dtype="bfloat16",
    device_map=device_map,
    quantization_config=quantization_config,
    trust_remote_code=True,
    offload_buffers=True,
)

# Generate a response
generation_config = GenerationConfig(
    max_new_tokens=20,
    eos_token_id=200020,
    use_cache=True,
)
generated_ids = quantized_model.generate(**model_inputs, generation_config=generation_config)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

End Note

MiniMax-Text-01 is a highly capable model with state-of-the-art performance on long-context and general-purpose tasks. While it has some areas for improvement, its open-source nature, cost efficiency, and innovative architecture make it a strong contender in the AI landscape. It is particularly well suited to applications requiring extensive memory and complex reasoning, but may need further refinement for coding-specific tasks.

Stay tuned to Analytics Vidhya News for more such insightful content!

Hi, I’m Nitika, a tech-savvy Content Creator and Marketer. Creativity and learning new things come naturally to me. I have expertise in creating result-driven content strategies. I am well versed in SEO Management, Keyword Operations, Web Content Writing, Communication, Content Strategy, Editing, and Writing.