Problem Statement - Existing Method (basic speculative decoding)
To address the memory-bound issue, a small draft model generates multiple (gamma) tokens in advance, and the target LLM verifies these draft tokens in parallel.
The major issue with existing sequence-based speculative decoding is that if even a single token in the sequence is rejected, all subsequent tokens are discarded.
Example) With gamma = 5, if the target model rejects the 2nd draft token, the 3rd through 5th draft tokens are thrown away, even if some of them would have been accepted on their own.
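In code, a minimal sketch of this draft-and-verify loop, assuming hypothetical draft_model and target_model callables that return logits over the whole sequence; rejection sampling is reduced to greedy agreement for brevity (the actual algorithm uses probability-ratio acceptance):

```python
import torch

def speculative_step(draft_model, target_model, input_ids, gamma=4):
    """One sequence-based draft-and-verify step (greedy agreement)."""
    # Draft model proposes gamma tokens autoregressively (cheap but serial).
    draft_ids = input_ids
    for _ in range(gamma):
        logits = draft_model(draft_ids)                 # (1, seq, vocab)
        next_tok = logits[:, -1, :].argmax(-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
    draft_tokens = draft_ids[:, -gamma:]

    # Target model scores every draft position in ONE parallel forward pass.
    target_logits = target_model(draft_ids)             # (1, seq+gamma, vocab)
    target_preds = target_logits[:, -gamma - 1:, :].argmax(-1)  # gamma+1 preds

    # Accept the longest matching prefix: the first mismatch discards
    # ALL subsequent draft tokens, which is exactly the issue above.
    matches = (target_preds[:, :gamma] == draft_tokens)[0]
    n_accept = int(matches.long().cumprod(0).sum())
    bonus = target_preds[:, n_accept:n_accept + 1]      # target's own token
    return torch.cat([input_ids, draft_tokens[:, :n_accept], bonus], dim=-1)
```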
To address this issue, a new technique called "Tree attention-based Speculative Decoding" was introduced.
Tree-based Speculative Decoding
Unlike the traditional sequence-based method, tree-attention generates draft tokens in a tree structure and passes them through the target model in parallel for verification.
Verification is still a single parallel pass through the target model, but the key advantage is that verifying multiple candidate branches at once increases the acceptance rate.
However, since the target model verifies tokens in parallel, proper handling of the tree_mask is essential.
Key components: tree_mask, tree_structure
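One common encoding of tree_structure is a parent-index list, from which the tree_mask follows directly. A minimal sketch, assuming this encoding (not any specific library's format):

```python
import torch

def build_tree_mask(parents):
    """Build a tree attention mask from a parent-index list.

    parents[i] is the index of node i's parent, or -1 for a node
    attached directly to the last committed token. Node i may attend
    only to its ancestors and itself, keeping sibling branches
    independent of each other.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Two branches under node 0: [0, 1, 3] and [0, 2, 4].
parents = [-1, 0, 0, 1, 2]
print(build_tree_mask(parents).int())
```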
Example) The draft model builds a candidate tree over gamma time steps, and the tree is flattened into the input_ids format. Along with this, both attention_mask and tree_attention_mask are used to ensure that attention operations are skipped for unrelated draft tokens. As a result, the target model determines which tokens are verified (accepted) and updates the output accordingly.
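Under the same assumed parent-index encoding, verification reduces to a single walk over the tree after the parallel forward pass; node_preds and root_pred below are hypothetical names for the target's greedy predictions:

```python
def longest_accepted_path(parents, tree_tokens, node_preds, root_pred):
    """Walk the draft tree, accepting children that match the target.

    parents[i]    : parent index of node i (-1 = child of committed text).
    tree_tokens[i]: draft token at node i (flattened tree order).
    node_preds[i] : target's greedy prediction for the token AFTER node i,
                    read from the single tree-masked forward pass.
    root_pred     : target's prediction after the last committed token.
    Returns accepted node indices in path order, plus the bonus token.
    """
    children = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    accepted, cur, want = [], -1, root_pred
    while True:
        # Accept a child whose draft token equals the target's prediction.
        nxt = next((c for c in children.get(cur, [])
                    if tree_tokens[c] == want), None)
        if nxt is None:
            break               # no matching child: `want` is the bonus token
        accepted.append(nxt)
        cur, want = nxt, node_preds[nxt]
    return accepted, want

# Tiny usage example with hypothetical tokens/predictions:
print(longest_accepted_path([-1, 0, 0], [5, 7, 9], [9, 3, 4], 5))
```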
(Important implementation detail: it is necessary to selectively update the KV cache based on the indices of the accepted tokens.)
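A minimal sketch of that detail, assuming a dense (batch, heads, seq, head_dim) cache rather than vLLM's paged layout: keep the committed prefix and gather in only the cache rows of accepted tree nodes.

```python
import torch

def compact_kv_cache(k_cache, v_cache, n_committed, accepted_nodes):
    """Keep committed KV rows plus the rows of accepted tree nodes.

    k_cache, v_cache: (batch, heads, seq, head_dim), where the last
    positions hold KV states for the flattened draft tree.
    accepted_nodes: indices into the flattened tree, in path order.
    """
    keep = torch.cat([
        torch.arange(n_committed),                                    # prefix
        n_committed + torch.tensor(accepted_nodes, dtype=torch.long), # accepted
    ])
    return k_cache[:, :, keep, :], v_cache[:, :, keep, :]
```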
Currently, vLLM only implements a top-1 proposer (i.e., sequence-based) engine. It is also necessary to update cache_position or position_ids for the corresponding indices in the next decoding step. Tree attention is left as future work (no significant progress for over six months).