Problem Statement - Existing Method (basic speculative decoding)
To address the memory-bound issue, a small draft model generates multiple (gamma) tokens in advance, and the target LLM verifies these draft tokens in parallel.
The major issue with existing sequence-based speculative decoding is that if even a single token in the sequence is rejected, all subsequent tokens are discarded.
Example) With gamma = 5, if the target model rejects the 2nd draft token, the 3rd through 5th draft tokens are thrown away, even if some of them would have been accepted on their own.
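In code, a minimal sketch of this draft-and-verify loop, assuming hypothetical draft_model and target_model callables that return logits over the whole sequence; rejection sampling is reduced to greedy agreement for brevity (the actual algorithm uses probability-ratio acceptance):

```python
import torch

def speculative_step(draft_model, target_model, input_ids, gamma=4):
    """One sequence-based draft-and-verify step (greedy agreement)."""
    # Draft model proposes gamma tokens autoregressively (cheap but serial).
    draft_ids = input_ids
    for _ in range(gamma):
        logits = draft_model(draft_ids)                 # (1, seq, vocab)
        next_tok = logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    draft_tokens = draft_ids[:, -gamma:]

    # Target model scores every draft position in ONE parallel forward pass.
    target_logits = target_model(draft_ids)             # (1, seq+gamma, vocab)
    target_preds = target_logits[:, -gamma - 1:, :].argmax(-1)  # gamma+1 preds

    # Accept the longest matching prefix: the first mismatch discards
    # ALL subsequent draft tokens, which is exactly the issue above.
    matches = (target_preds[:, :gamma] == draft_tokens)[0]
    n_accept = int(matches.long().cumprod(0).sum())
    bonus = target_preds[:, n_accept:n_accept + 1]      # target's own token
    return torch.cat([input_ids, draft_tokens[:, :n_accept], bonus], dim=-1)
```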
To address this issue, a new technique called "Tree attention-based Speculative Decoding" was introduced.
Tree-based Speculative Decoding
Unlike the traditional sequence-based method, tree-attention generates draft tokens in a tree structure and passes them through the target model in parallel for verification.
Verification is still a single parallel pass through the target model, but the key advantage is that verifying multiple candidate branches at once increases the acceptance rate.
However, since the target model verifies tokens in parallel, proper handling of the tree_mask is essential.
Key components: tree_mask, tree_structure
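One common encoding of tree_structure is a parent-index list, from which the tree_mask follows directly. A minimal sketch, assuming this encoding (not any specific library's format):

```python
import torch

def build_tree_mask(parents):
    """Build a tree attention mask from a parent-index list.

    parents[i] is the index of node i's parent, or -1 for a node
    attached directly to the last committed token. Node i may attend
    only to its ancestors and itself, keeping sibling branches
    independent of each other.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Two branches under node 0: [0, 1, 3] and [0, 2, 4].
parents = [-1, 0, 0, 1, 2]
print(build_tree_mask(parents).int())
```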
Example) The draft model builds a candidate tree over gamma time steps, and the tree is flattened into the input_ids format. Along with this, both attention_mask and tree_attention_mask are used to ensure that attention operations are skipped for unrelated draft tokens. As a result, the target model determines which tokens are verified (accepted) and updates the output accordingly.
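Under the same assumed parent-index encoding, verification reduces to a single walk over the tree after the parallel forward pass; node_preds and root_pred below are hypothetical names for the target's greedy predictions:

```python
def longest_accepted_path(parents, tree_tokens, node_preds, root_pred):
    """Walk the draft tree, accepting children that match the target.

    parents[i]    : parent index of node i (-1 = child of committed text).
    tree_tokens[i]: draft token at node i (flattened tree order).
    node_preds[i] : target's greedy prediction for the token AFTER node i,
                    read from the single tree-masked forward pass.
    root_pred     : target's prediction after the last committed token.
    Returns accepted node indices in path order, plus the bonus token.
    """
    children = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    accepted, cur, want = [], -1, root_pred
    while True:
        # Accept a child whose draft token equals the target's prediction.
        nxt = next((c for c in children.get(cur, [])
                    if tree_tokens[c] == want), None)
        if nxt is None:
            break               # no matching child: `want` is the bonus token
        accepted.append(nxt)
        cur, want = nxt, node_preds[nxt]
    return accepted, want

# Tiny usage example with hypothetical tokens/predictions:
print(longest_accepted_path([-1, 0, 0], [5, 7, 9], [9, 3, 4], 5))
```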
(Important implementation detail: it is necessary to selectively update the KV cache based on the indices of the accepted tokens.)
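A minimal sketch of that detail, assuming a dense (batch, heads, seq, head_dim) cache rather than vLLM's paged layout: keep the committed prefix and gather in only the cache rows of accepted tree nodes.

```python
import torch

def compact_kv_cache(k_cache, v_cache, n_committed, accepted_nodes):
    """Keep committed KV rows plus the rows of accepted tree nodes.

    k_cache, v_cache: (batch, heads, seq, head_dim), where the last
    positions hold KV states for the flattened draft tree.
    accepted_nodes: indices into the flattened tree, in path order.
    """
    keep = torch.cat([
        torch.arange(n_committed),                                    # prefix
        n_committed + torch.tensor(accepted_nodes, dtype=torch.long), # accepted
    ])
    return k_cache[:, :, keep, :], v_cache[:, :, keep, :]
```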
Currently, vLLM only implements a top-1 proposer (i.e., sequence-based) engine. It is also necessary to update cache_position or position_ids for the corresponding indices in the next decoding step. Tree attention is left as future work (no significant progress for over six months).