Kawrakow 847e191936
Graph parallel for Gemma4 MoE (#1600)
* Use build_std_attention for Gemma4 when possible

This is possible for the 26b MoE and 31b dense models.
It is not possible for the E4B/E2B variants because they
don't have a KV cache in every layer.

* Standardize Gemma4 dense ffn

* WIP: Gemma4 split mode graph

Runs but produces NaNs

* WIP: Gemma4 split mode graph

Runs, but with very high PPL. At least it is no longer NaN.

* WIP

* This works!

* Put attn_norm, attn_post_norm, ffn_norm, ffn_post_norm on all GPUs

* Fix crash when saving/loading KV cache

* WIP: split mode graph for Gemma4-MoE - crashes

* Split mode graph for Gemma4-MoE - this works

* Disable SWA optimization

Something goes wrong there

* Consolidate MoE and dense graph parallel
2026-04-09 14:07:29 +02:00