sycl : support --split-mode tensor (#24152)

* Sycl tp stage1 (#1)

* SYCL: tensor parallelism (--split-mode tensor) for dual-GPU

Adds the comm_init/comm_free/comm_allreduce_tensor trio that the
meta-backend queries via get_proc_address to enable backend-specific
all-reduce, mirroring the pattern used by ggml-cuda.cu.

For N=2 (the common dual-GPU case) implements a degenerate ring
all-reduce with two size-branched paths:

  * Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel
    chained via depends_on(memcpy_event). 4 SYCL submissions/call.

  * Large (nelem >= 32768): BF16-compressed. Each device compresses
    FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's
    inbox (HALF the PCIe bytes), then decompresses + adds into the
    local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved
    -- wins for any tensor where PCIe dominates kernel time.

Threshold and BF16 path pattern mirror the CUDA NCCL allreduce.

Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes
(matches both path layouts: FP32 nelem floats; BF16 outbox+inbox =
2 * nelem uint16_t each). Single alloc+free per device keeps the
SYCL pool's strict-LIFO invariant trivial.

Initial impl handles N=2 FP32 contiguous tensors. Other cases return
false, causing the meta-backend to use its generic butterfly fallback.

Per-call sync is intentionally omitted. SYCL in-order queue semantics
ensure that the meta-backend's next compute on the same per-device
queue waits for our final ADD, and the next allreduce's first op on
the same persistent buffer waits via the same queue. Only comm_free
does an explicit final wait.

OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process
in communicator_impl.hpp:47 (condition devices.size() == 1), which is
incompatible with llama.cpp's single-process multi-GPU model.

Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 +
DPC++ nightly):

  Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
    pp512 = 377.08 t/s  (vs 313.65 layer mode = +20.2%)
    tg128 = 17.40 t/s   (vs   9.74 layer mode = +78.6%)

  Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
    pp512 = 216.56 t/s  (vs 156.58 meta-backend butterfly = +38.3%)
    tg128 = 17.60 t/s   (vs  14.31 meta-backend butterfly = +23.0%)

  Qwen3-4B Q4_K_M:
    pp64  = 984.51 t/s, tg16 = 49.29 t/s

Llama-3.3-70B in SYCL TP now comfortably beats production layer mode
on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on
both — the BF16 path is what unlocks the many-medium-allreduces
prefill pattern.

Build/CMake: no changes. No new dependencies. ~210 lines added across
ggml-sycl.h and ggml-sycl.cpp.

* Fix comments

* documentation update to address PR feedback

* Bring over my device-to-device memcpy chagnes

* move the dev2dev_memcpy calls to the upstream 7-parameter variety

* Fix a typo and remove a trailing whitespace
This commit is contained in:
David Spruill 2026-06-25 01:35:21 -04:00 committed by GitHub
parent 9c10954865
commit e9fb3b3fc0
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
3 changed files with 281 additions and 0 deletions

View File

@ -413,6 +413,15 @@ In two device selection modes, the default SYCL backend is level_zero, you can c
|------------------|----------------------------------------| |------------------|----------------------------------------|
| Single device | --split-mode none --main-gpu DEVICE_ID | | Single device | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default) | | Multiple devices | --split-mode layer (default) |
| Multiple devices | --split-mode tensor (tensor parallelism) |
`--split-mode tensor` (tensor parallelism) shards each layer across the selected
GPUs. It requires flash attention, which is auto-enabled when `--flash-attn` is
left at its default `auto`, so `--split-mode tensor` works out of the box.
Passing `--flash-attn off` together with `--split-mode tensor` is rejected at
context creation. The default `f16` KV cache is recommended. Tensor parallelism
is currently optimized for 2 GPUs; other device counts fall back to a generic
all-reduce.
Examples: Examples:
@ -715,6 +724,15 @@ In two device selection modes, the default SYCL backend is level_zero, you can c
|------------------|----------------------------------------| |------------------|----------------------------------------|
| Single device | --split-mode none --main-gpu DEVICE_ID | | Single device | --split-mode none --main-gpu DEVICE_ID |
| Multiple devices | --split-mode layer (default) | | Multiple devices | --split-mode layer (default) |
| Multiple devices | --split-mode tensor (tensor parallelism) |
`--split-mode tensor` (tensor parallelism) shards each layer across the selected
GPUs. It requires flash attention, which is auto-enabled when `--flash-attn` is
left at its default `auto`, so `--split-mode tensor` works out of the box.
Passing `--flash-attn off` together with `--split-mode tensor` is rejected at
context creation. The default `f16` KV cache is recommended. Tensor parallelism
is currently optimized for 2 GPUs; other device counts fall back to a generic
all-reduce.
Examples: Examples:

View File

@ -27,6 +27,14 @@ GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_buffer_type(int de
// split tensor buffer that splits matrices by rows across multiple devices // split tensor buffer that splits matrices by rows across multiple devices
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_split_buffer_type(const float * tensor_split); GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_split_buffer_type(const float * tensor_split);
// Tensor parallelism (--split-mode tensor): comm_init/free/allreduce_tensor
// trio queried by the meta-backend via ggml_backend_reg_get_proc_address.
// See typedefs in ggml/include/ggml-backend.h. Mirrors the CUDA backend's
// pattern (ggml_backend_cuda_comm_*).
GGML_BACKEND_API void * ggml_backend_sycl_comm_init(ggml_backend_t * backends, size_t n_backends);
GGML_BACKEND_API void ggml_backend_sycl_comm_free(void * comm_ctx);
GGML_BACKEND_API bool ggml_backend_sycl_comm_allreduce_tensor(void * comm_ctx, struct ggml_tensor ** tensors);
// pinned host buffer for use with the CPU backend for faster copies between CPU and GPU // pinned host buffer for use with the CPU backend for faster copies between CPU and GPU
GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_host_buffer_type(void); GGML_BACKEND_API ggml_backend_buffer_type_t ggml_backend_sycl_host_buffer_type(void);

View File

@ -5859,6 +5859,250 @@ static ggml_backend_dev_t ggml_backend_sycl_reg_get_device(ggml_backend_reg_t re
return ctx->devices[index]; return ctx->devices[index];
} }
// ==========================================================================
// Tensor parallelism (--split-mode tensor) for the SYCL backend.
//
// The meta-backend invokes these three entry points via get_proc_address:
// * ggml_backend_sycl_comm_init - one-time per-graph setup
// * ggml_backend_sycl_comm_allreduce_tensor - per-allreduce step
// * ggml_backend_sycl_comm_free - tear-down
//
// For N=2 (dual-GPU), this is a degenerate ring allreduce with dual paths
// chosen by tensor size:
//
// * Small (nelem < 32K): FP32 direct memcpy + per-device ADD
// kernel. The kernel depends_on() its corresponding memcpy event
// so it doesn't read partial data. Both devices run in parallel.
//
// * Large (nelem >= 32K): BF16-compressed. Each device compresses
// its FP32 partial to BF16 locally, cross-device memcpys
// to the peer (half the PCI bandwidth), where it is decompressed
// and added into the local FP32 partial. 6 SYCL submissions per
// allreduce (2 compress + 2 memcpy + 2 decompress-add) vs the
// 4 for the small path, but the bandwidth saving > 6 GB/s PCIe x 2
// dominates for larger tensors.
//
// Storage: A persistent uint8_t buffer per device, sized to
// 4 * nelem bytes. Both paths reinterpret the same bytes (small path
// as nelem floats; large path as outbox + inbox = 2*nelem uint16_t
// each, using the full 4*nelem byte budget either way). Single
// alloc+free per device keeps the SYCL pool's strict-LIFO invariant
// trivial.
//
// For non-(N=2 FP32 contiguous) cases, comm_init or comm_allreduce_tensor
// returns null/false, causing the meta-backend to use its generic
// butterfly all-reduce fallback.
// ==========================================================================
struct ggml_backend_sycl_comm_context {
std::vector<ggml_backend_t> backends;
// ONE persistent per-device byte buffer, 4*nelem bytes. Both the
// FP32 small-tensor path and the BF16 large-tensor path share it
// by reinterpreting.
std::unique_ptr<ggml_sycl_pool_alloc<uint8_t>> buf0;
std::unique_ptr<ggml_sycl_pool_alloc<uint8_t>> buf1;
int64_t buf_nelem = 0;
};
void * ggml_backend_sycl_comm_init(ggml_backend_t * backends, size_t n_backends) try {
for (size_t i = 0; i < n_backends; ++i) {
if (!ggml_backend_is_sycl(backends[i])) {
return nullptr;
}
}
// Initial version: N=2 only. For N!=2, returning null makes the
// meta-backend skip this backend-specific allreduce entirely.
if (n_backends != 2) {
return nullptr;
}
auto * ctx = new ggml_backend_sycl_comm_context;
ctx->backends.assign(backends, backends + n_backends);
auto * sctx0 = (ggml_backend_sycl_context *) backends[0]->context;
auto * sctx1 = (ggml_backend_sycl_context *) backends[1]->context;
ctx->buf0 = std::make_unique<ggml_sycl_pool_alloc<uint8_t>>(sctx0->pool());
ctx->buf1 = std::make_unique<ggml_sycl_pool_alloc<uint8_t>>(sctx1->pool());
return ctx;
}
catch (const sycl::exception &) { return nullptr; }
catch (...) { return nullptr; }
void ggml_backend_sycl_comm_free(void * comm_ctx_v) {
auto * comm_ctx = static_cast<ggml_backend_sycl_comm_context *>(comm_ctx_v);
if (comm_ctx == nullptr) {
return;
}
// Sync both per-device queues so the pool_alloc destructors don't
// return memory still in use by the last kernel.
if (comm_ctx->backends.size() == 2) {
auto * sctx0 = (ggml_backend_sycl_context *) comm_ctx->backends[0]->context;
auto * sctx1 = (ggml_backend_sycl_context *) comm_ctx->backends[1]->context;
try {
sctx0->stream()->wait();
sctx1->stream()->wait();
} catch (...) { /* best effort during shutdown */ }
}
delete comm_ctx;
}
bool ggml_backend_sycl_comm_allreduce_tensor(void * comm_ctx_v, struct ggml_tensor ** tensors) try {
if (comm_ctx_v == nullptr) {
return false;
}
auto * comm_ctx = static_cast<ggml_backend_sycl_comm_context *>(comm_ctx_v);
const size_t n_backends = comm_ctx->backends.size();
// Fast path: N=2, F32/F16, contiguous, matching shapes.
if (n_backends != 2) {
return false;
}
// Accept F32 or F16 inputs natively (types must match). F16 takes the
// direct 2-byte memcpy + add path below; other types return false so the
// meta-backend uses its generic all-reduce.
if (tensors[0]->type != tensors[1]->type) {
return false;
}
if (tensors[0]->type != GGML_TYPE_F32 && tensors[0]->type != GGML_TYPE_F16) {
return false;
}
if (!ggml_is_contiguous(tensors[0]) || !ggml_is_contiguous(tensors[1])) {
return false;
}
if (ggml_nelements(tensors[0]) != ggml_nelements(tensors[1])) {
return false;
}
const int64_t nelem = ggml_nelements(tensors[0]);
const size_t nbytes = ggml_nbytes(tensors[0]);
if (nelem == 0) {
return true;
}
auto * ctx0 = (ggml_backend_sycl_context *) comm_ctx->backends[0]->context;
auto * ctx1 = (ggml_backend_sycl_context *) comm_ctx->backends[1]->context;
queue_ptr q0 = ctx0->stream();
queue_ptr q1 = ctx1->stream();
// Grow per-device byte buffers if needed (4 * nelem bytes each).
if (comm_ctx->buf_nelem < nelem) {
comm_ctx->buf0->realloc(nelem * 4);
comm_ctx->buf1->realloc(nelem * 4);
comm_ctx->buf_nelem = nelem;
}
uint8_t * buf0 = comm_ctx->buf0->get();
uint8_t * buf1 = comm_ctx->buf1->get();
// F16 native path: direct 2-byte cross-device copy + add, skipping the
// F32 round-trip the meta-backend fallback would force. Cross-device copies
// go through dev2dev_memcpy because the two devices are in separate SYCL
// contexts (a raw peer-USM q->memcpy would be a silent no-op).
if (tensors[0]->type == GGML_TYPE_F16) {
sycl::half * f16_out0 = (sycl::half *) tensors[0]->data;
sycl::half * f16_out1 = (sycl::half *) tensors[1]->data;
sycl::half * f16_tmp0 = (sycl::half *) buf0;
sycl::half * f16_tmp1 = (sycl::half *) buf1;
q0->wait();
q1->wait();
dev2dev_memcpy(ctx0->device, *q0, ctx1->device, *q1, f16_tmp0, tensors[1]->data, nbytes);
dev2dev_memcpy(ctx1->device, *q1, ctx0->device, *q0, f16_tmp1, tensors[0]->data, nbytes);
q0->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
f16_out0[i] = (sycl::half) ((float) f16_out0[i] + (float) f16_tmp0[i]);
});
});
q1->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
f16_out1[i] = (sycl::half) ((float) f16_out1[i] + (float) f16_tmp1[i]);
});
});
return true;
}
float * out0 = (float *) tensors[0]->data;
float * out1 = (float *) tensors[1]->data;
// BF16 threshold: above this, the PCIe savings from halving the
// cross-device bytes outweigh the 2 extra compress kernels.
// Below: stay on the FP32 fast path. Threshold mirrors the CUDA
// NCCL allreduce pattern for n_backends=2.
static constexpr int64_t BF16_THRESHOLD = 32768;
if (nelem < BF16_THRESHOLD) {
// FP32 small path: 4 SYCL submissions per allreduce.
float * tmp0 = (float *) buf0;
float * tmp1 = (float *) buf1;
// COMM-D2D-FIX: the two devices are in SEPARATE SYCL contexts, so a raw
// q->memcpy of a peer USM pointer is a silent no-op. Route cross-device
// copies through dev2dev_memcpy (L0 direct copy / host staging). It is
// synchronous, so wait for the local partials to be produced first.
q0->wait();
q1->wait();
dev2dev_memcpy(ctx0->device, *q0, ctx1->device, *q1, tmp0, tensors[1]->data, nbytes);
dev2dev_memcpy(ctx1->device, *q1, ctx0->device, *q0, tmp1, tensors[0]->data, nbytes);
q0->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out0[i] += tmp0[i];
});
});
q1->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out1[i] += tmp1[i];
});
});
return true;
}
// BF16 large path: 6 SYCL submissions per allreduce, but the
// cross-device memcpy is HALF the bytes. Pure bit-shift
// conversion (no rounding) — matches ggml's truncating fp32->bf16.
uint16_t * outbox0 = (uint16_t *) buf0;
uint16_t * inbox0 = outbox0 + nelem;
uint16_t * outbox1 = (uint16_t *) buf1;
uint16_t * inbox1 = outbox1 + nelem;
// Phase A: compress each device's local partial in parallel.
sycl::event c0 = q0->parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
outbox0[i] = (uint16_t) (sycl::bit_cast<uint32_t>(out0[i]) >> 16);
});
sycl::event c1 = q1->parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
outbox1[i] = (uint16_t) (sycl::bit_cast<uint32_t>(out1[i]) >> 16);
});
// Phase B: COMM-D2D-FIX-BF16 cross-device copy of compressed bytes via
// dev2dev_memcpy (separate SYCL contexts; sync copy after compress).
const size_t bf16_bytes = nelem * sizeof(uint16_t);
c0.wait();
c1.wait();
dev2dev_memcpy(ctx0->device, *q0, ctx1->device, *q1, inbox0, outbox1, bf16_bytes);
dev2dev_memcpy(ctx1->device, *q1, ctx0->device, *q0, inbox1, outbox0, bf16_bytes);
// Phase C: decompress + add into local FP32 partial.
q0->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out0[i] += sycl::bit_cast<float>(((uint32_t) inbox0[i]) << 16);
});
});
q1->submit([&](sycl::handler & h) {
h.parallel_for(sycl::range<1>(nelem), [=](sycl::id<1> i) {
out1[i] += sycl::bit_cast<float>(((uint32_t) inbox1[i]) << 16);
});
});
return true;
}
catch (const sycl::exception &) { return false; }
catch (...) { return false; }
static void *ggml_backend_sycl_reg_get_proc_address(ggml_backend_reg_t reg, const char *name) { static void *ggml_backend_sycl_reg_get_proc_address(ggml_backend_reg_t reg, const char *name) {
GGML_UNUSED(reg); GGML_UNUSED(reg);
@ -5866,6 +6110,17 @@ static void *ggml_backend_sycl_reg_get_proc_address(ggml_backend_reg_t reg, cons
return (void *)ggml_backend_sycl_split_buffer_type; return (void *)ggml_backend_sycl_split_buffer_type;
} }
// Tensor parallelism (--split-mode tensor) entry points.
if (strcmp(name, "ggml_backend_comm_init") == 0) {
return (void *)ggml_backend_sycl_comm_init;
}
if (strcmp(name, "ggml_backend_comm_free") == 0) {
return (void *)ggml_backend_sycl_comm_free;
}
if (strcmp(name, "ggml_backend_comm_allreduce_tensor") == 0) {
return (void *)ggml_backend_sycl_comm_allreduce_tensor;
}
// SYCL doesn't support registering host memory, left here for reference // SYCL doesn't support registering host memory, left here for reference
// "ggml_backend_register_host_buffer" // "ggml_backend_register_host_buffer"
// "ggml_backend_unregister_host_buffer" // "ggml_backend_unregister_host_buffer"