Report - Support Torch compile @ispobock

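To add to my earlier explanation of extend/decode, which I think was incomplete: "EXTEND" appears to denote the phase that prefills the remaining tokens when part of the sequence has already been prefilled, and "DECODE" denotes decoding.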

You can think of EXTEND as what runs when the KV cache has already been computed for something like the system prompt, but the user prompt has no KV cache yet.
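To make the distinction concrete, here is a minimal sketch in plain Python. The `Request` record and both helper functions are hypothetical names made up for illustration and are not part of sglang; the sketch only shows which token positions each phase would have to process.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record, for illustration only.
    token_ids: list          # full prompt: system prompt + user prompt
    cached_prefix_len: int   # number of leading tokens whose KV cache already exists


def extend_positions(req: Request) -> range:
    """Positions whose KV entries still have to be computed in the EXTEND phase."""
    return range(req.cached_prefix_len, len(req.token_ids))


def decode_position(req: Request) -> int:
    """DECODE processes exactly one new position per step, right after the prompt."""
    return len(req.token_ids)


# A 10-token prompt whose first 6 tokens (e.g. the system prompt) are already cached.
req = Request(token_ids=list(range(10)), cached_prefix_len=6)
print(list(extend_positions(req)))  # [6, 7, 8, 9]
print(decode_position(req))         # 10
```

With `cached_prefix_len=0` the same function covers the whole prompt, which is the ordinary prefill case.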

#1422: Enable torch.compile for triton backend

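Concepts and classes such as CudaGraphRunner and AttentionBackend were already defined so that torch.compile could work across several backends. Given that, this commit mostly fixes bugs in the triton backend so that torch.compile works with it.
- Apart from a type change and one added reshape, there is nothing worth looking at.
Result of asking Copilot to summarize the changes, excluding README/test
Also, the overall content and the file/folder structure have changed between then and now.
Rather than reading this PR, it will probably be more useful to read the latest branch and work out how torch.compile is made possible for the triton backend.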

#1442: Fix torch compile for deepseek-v2

This appears to be the PR that made torch.compile work for deepseek.
The changes are really minor.
- In practice, the only deepseek-related changes are the following:
  - In model_executor/cuda_graph_runner.py, FusedMoE is made not to use the torch native impl
  - In models/deepseek_v2.py, a torch.no_grad() decorator is added to forward
- The other changes make the batch size for torch compile configurable and are not deepseek-specific.
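As a rough illustration of what a configurable batch size for torch compile could look like, the sketch below filters a list of candidate batch sizes by a user-supplied cap. This is an assumption about the general idea only: the function name and both parameter names are hypothetical, and the PR's actual option name and selection logic are not shown in these notes.

```python
def batch_sizes_to_compile(candidate_batch_sizes, max_compile_bs):
    """Return the batch sizes that would go through torch.compile.

    Both parameter names are hypothetical; `max_compile_bs` stands in for
    whatever configurable setting the PR introduced.
    """
    return [bs for bs in candidate_batch_sizes if bs <= max_compile_bs]


candidates = [1, 2, 4, 8, 16, 32]
print(batch_sizes_to_compile(candidates, max_compile_bs=4))   # [1, 2, 4]
print(batch_sizes_to_compile(candidates, max_compile_bs=32))  # [1, 2, 4, 8, 16, 32]
```

A cap like this trades compile time against coverage: each compiled batch size adds compilation work, so limiting the set keeps startup manageable.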

How torch.compile works in sglang (as of the latest commit)

[Figure: structure of python/sglang/srt/layers]

Of the above, it should be enough to look at the torch native backend, the flashinfer backend, and the triton backend. The comment below also appears in flashinfer_backend.py.

Support different attention backends.
Now there are two backends: FlashInfer and Triton.
FlashInfer is faster and Triton is easier to customize.
Each backend supports two operators: extend (i.e. prefill with cached prefix) and decode.
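The "two operators per backend" idea in that comment can be sketched as a small abstract interface. This is an illustrative sketch only: the class names, method signatures, and string-based mode dispatch are assumptions made for this example and should not be read as sglang's actual AttentionBackend definition.

```python
from abc import ABC, abstractmethod


class AttentionBackendSketch(ABC):
    """Illustrative interface only; not sglang's actual class or signatures."""

    @abstractmethod
    def forward_extend(self, new_tokens, cached_prefix):
        """Attend over new prompt tokens given an already-cached prefix."""

    @abstractmethod
    def forward_decode(self, new_token, cached_tokens):
        """Attend for a single newly generated token."""

    def forward(self, mode, *args):
        # Dispatch on the phase of the current batch.
        if mode == "extend":
            return self.forward_extend(*args)
        if mode == "decode":
            return self.forward_decode(*args)
        raise ValueError(f"unknown mode: {mode!r}")


class EchoBackend(AttentionBackendSketch):
    """Toy backend that only reports which operator ran and on how many tokens."""

    def forward_extend(self, new_tokens, cached_prefix):
        return ("extend", len(new_tokens), len(cached_prefix))

    def forward_decode(self, new_token, cached_tokens):
        return ("decode", 1, len(cached_tokens))


backend = EchoBackend()
print(backend.forward("extend", [7, 8, 9], [1, 2, 3, 4]))     # ('extend', 3, 4)
print(backend.forward("decode", 10, [1, 2, 3, 4, 7, 8, 9]))   # ('decode', 1, 7)
```

Keeping both operators behind one interface is what lets a FlashInfer-style and a Triton-style implementation be swapped without touching the model code.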

The actual attention op implementations for triton appear to live in the triton_ops folder.

AttentionBackend