• Support CUDA Graph @HandH1998 @ispobock
  • Support Torch compile @ispobock
  • Use BF16 for bmm @zhyncs
  • Improve FP8 accuracy @HandH1998 @zhyncs @ispobock
  • Tune FP8 GEMM @HandH1998 @zhyncs
  • Replace `moe_align_block_size` @HandH1998 @zhyncs @BBuf
  • FusedMoE tuning for H200 (`E=256,N=256,device_name=NVIDIA_H200,dtype=fp8_w8a8.json`) @BBuf
  • TP+DP Attention @Ying1123
  • Support overlap scheduler with DP attention @merrymercy
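As context for the `moe_align_block_size` item above: kernels of this kind group token indices by expert and pad each expert's group up to a multiple of the GEMM block size, so the subsequent grouped GEMM operates on aligned tiles. A minimal pure-Python sketch of that padding logic, assuming a flat list of top-k expert assignments; the function name matches the item above, but the signature, return values, and sentinel convention here are illustrative, not the actual kernel interface:

```python
def moe_align_block_size(topk_ids, num_experts, block_size):
    """Group token positions by expert and pad each group to a
    multiple of block_size (padding uses a sentinel index).

    topk_ids: expert id per (token, top-k slot), flattened.
    Returns (sorted_ids, padded_total): original positions grouped
    by expert, with len(topk_ids) used as the padding sentinel.
    NOTE: illustrative sketch only, not the real kernel API.
    """
    sentinel = len(topk_ids)
    buckets = [[] for _ in range(num_experts)]
    for pos, expert in enumerate(topk_ids):
        buckets[expert].append(pos)

    sorted_ids = []
    for bucket in buckets:
        sorted_ids.extend(bucket)
        pad = (-len(bucket)) % block_size  # round up to a block multiple
        sorted_ids.extend([sentinel] * pad)
    return sorted_ids, len(sorted_ids)


# Example: 5 assignments across 3 experts, block size 4.
ids, total = moe_align_block_size([0, 1, 0, 2, 1], num_experts=3, block_size=4)
```

Each expert's segment then occupies a whole number of blocks, so every GEMM tile reads tokens for exactly one expert, with sentinel slots masked out.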