Speculative Decoding on SGLang

What is Tree-Attention?

Tree-based Speculative Decoding (EAGLE) Implementation Summary


  1. Prefill: This step is the same as the prefill phase of standard auto-regressive LLM decoding.
  2. Generate Draft Tree Candidates: Given the current prefix, the draft model (SLM) generates draft tokens for gamma time steps, branching into a tree of candidate continuations rather than a single chain.
  3. Tree Decoding: The draft tree is flattened into a 1-dimensional input_ids sequence. Along with it, both attention_mask and tree_attention_mask ensure that each draft token attends only to its own ancestors in the tree, so attention between unrelated branches is skipped (see the mask sketch after this list).
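
To make the tree_attention_mask concrete, here is a minimal sketch of how such a mask can be built from a flattened tree. It assumes each draft token carries the index of its parent (a hypothetical `parents` encoding, not sglang's actual data structure): a token may attend only to itself and its ancestors.

```python
import torch

def build_tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build a boolean tree attention mask for flattened draft tokens.

    parents[i] is the index of token i's parent in the flattened tree,
    or -1 if token i is a root (child of the last verified token).
    Token i may attend to token j only if j is i itself or an ancestor of i.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up the tree until the root is reached
            mask[i, j] = True
            j = parents[j]
    return mask

# Example tree: token 0 has two children (1 and 2); token 2 has child 3.
print(build_tree_attention_mask([-1, 0, 0, 2]))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True, False,  True, False],
#         [ True, False,  True,  True]])
```

Note how the rows for tokens 1 and 3 do not attend to each other: they lie on different branches, which is exactly the cross-branch attention the tree mask suppresses.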

As a result of this single tree-decoding forward pass, the target model's logits at each tree position determine which draft tokens are verified (accepted), and the output is updated with the accepted path.
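
A minimal sketch of greedy verification along one root-to-leaf path, assuming `target_argmax` holds the target model's greedy prediction for each path position (all names here are illustrative, not sglang's API):

```python
import torch

def accepted_prefix_len(draft_path: torch.Tensor,
                        target_argmax: torch.Tensor) -> int:
    """Greedy verification along one root-to-leaf path.

    draft_path[k]    : k-th draft token on the path.
    target_argmax[k] : the target model's greedy prediction for position k,
                       read from the logits produced during tree decoding.
    A draft token counts as accepted only if every earlier token on the
    path was also accepted.
    """
    matches = (draft_path == target_argmax)
    # cumprod zeroes everything after the first mismatch: [1,1,0,1] -> [1,1,0,0]
    return int(matches.to(torch.int64).cumprod(dim=0).sum().item())

# The scheduler would score every root-to-leaf path this way, keep the path
# with the most accepted tokens, and append the target model's own prediction
# at the first rejected position as a bonus token.
draft_path = torch.tensor([42, 7, 99])
target_argmax = torch.tensor([42, 7, 13])   # target disagrees at position 2
print(accepted_prefix_len(draft_path, target_argmax))  # -> 2
```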

(Important implementation detail: the KV cache must be updated selectively, using the indices of the accepted tokens, because the cache entries written for rejected tree branches are invalid.)
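
As an illustration of that detail, here is a simplified compaction sketch for a contiguous (non-paged) KV cache. Real engines, sglang included, use paged or radix-tree caches, so this only conveys the idea; every name below is hypothetical.

```python
import torch

def compact_kv_cache(k_cache: torch.Tensor, v_cache: torch.Tensor,
                     prefix_len: int, accepted_tree_idx: torch.Tensor) -> int:
    """Keep only the KV entries of accepted draft tokens.

    k_cache / v_cache : [seq_len, num_heads, head_dim]; positions
                        prefix_len .. prefix_len + num_draft - 1 hold the
                        flattened draft tree written during tree decoding.
    accepted_tree_idx : indices (within the flattened tree) of the accepted
                        path, in order, e.g. tensor([0, 2, 3]).
    Returns the new sequence length after compaction.
    """
    src = prefix_len + accepted_tree_idx                  # absolute positions
    dst = prefix_len + torch.arange(len(accepted_tree_idx))
    k_cache[dst] = k_cache[src]   # gather accepted entries to the front
    v_cache[dst] = v_cache[src]   # of the draft region; the rest is reusable
    return prefix_len + len(accepted_tree_idx)
```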

Issue: Speculative Decoding on vLLM
