llama.cpp

mirror of https://github.com/ggml-org/llama.cpp.git synced 2026-06-27 23:50:20 -05:00


[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

* [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy

Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that reduce to 2D pitched block copies.
When the tensors are not fully contiguous but each row is contiguous, ggml_cuda_cpy now uses cudaMemcpy2DAsync instead of the slower element-wise scalar copy kernel.

This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.
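The idea of a "2D pitched block copy" can be illustrated with a minimal host-only sketch. It is not the code from the commit: `PitchedView`, `can_use_pitched_copy`, `pitched_copy`, and `demo_strided_copy` are hypothetical names, the eligibility test is a simplified assumption (the real check presumably also compares tensor types and ggml strides), and a per-row `memcpy` loop stands in for what `cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height, kind, stream)` does on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical, simplified 2D view of a tensor: `rows` rows of `row_bytes`
// contiguous bytes, with consecutive row starts `pitch` bytes apart.
struct PitchedView {
    std::size_t row_bytes; // contiguous bytes in each row
    std::size_t rows;      // number of rows
    std::size_t pitch;     // byte distance between row starts
};

// Assumed eligibility test: both sides describe the same block shape and
// each row fits inside its pitch. Same element type is taken for granted.
bool can_use_pitched_copy(const PitchedView & src, const PitchedView & dst) {
    return src.row_bytes == dst.row_bytes && src.rows == dst.rows &&
           src.pitch >= src.row_bytes && dst.pitch >= dst.row_bytes;
}

// Host stand-in for a pitched 2D copy: copies `height` rows of `width`
// bytes and leaves the padding between rows untouched.
void pitched_copy(void * dst, std::size_t dpitch,
                  const void * src, std::size_t spitch,
                  std::size_t width, std::size_t height) {
    assert(dpitch >= width && spitch >= width);
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(static_cast<char *>(dst) + r * dpitch,
                    static_cast<const char *>(src) + r * spitch,
                    width);
    }
}

// Three 4-byte rows, 6 bytes apart in the source and 8 bytes apart in the
// destination, loosely mimicking slots separated by stride gaps.
std::vector<unsigned char> demo_strided_copy() {
    const PitchedView src_view{4, 3, 6};
    const PitchedView dst_view{4, 3, 8};

    std::vector<unsigned char> src(src_view.rows * src_view.pitch);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<unsigned char>(i);
    }
    // 0xFF marks destination bytes that the copy must not overwrite.
    std::vector<unsigned char> dst(dst_view.rows * dst_view.pitch, 0xFF);

    if (can_use_pitched_copy(src_view, dst_view)) {
        pitched_copy(dst.data(), dst_view.pitch, src.data(), src_view.pitch,
                     src_view.row_bytes, src_view.rows);
    }
    return dst;
}
```

In this sketch the whole block is moved with one call per copy rather than one operation per element, which is the property the fast path relies on; on the GPU the per-row loop would be replaced by a single `cudaMemcpy2DAsync` call on the backend's stream.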

* Add new tests that execute the new optimized strided copy path

* Return unsupported for strided copy in OpenVINO, as new tests are failing

2026-06-27 17:46:21 +05:30

cmake

ggml : Parallelize quant LUT init (#23595)

2026-05-25 10:15:46 +03:00

include

sycl : support --split-mode tensor (#24152)

2026-06-25 08:35:21 +03:00

src

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

2026-06-27 17:46:21 +05:30

.gitignore

vulkan : cmake integration (#8119)

2024-07-13 18:12:39 +02:00

CMakeLists.txt

ggml : bump version to 0.15.3 (ggml/1550)

2026-06-26 15:04:42 +03:00