* [CUDA] Add a cudaMemcpy2DAsync fast path to ggml_cuda_cpy
Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that reduce to 2D pitched block copies.
When the tensors are not fully contiguous but each row is, the copy now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.
This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.
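
For illustration, the eligibility check and the pitched copy can be sketched on the host as below. This is a simplified sketch, not the code in this change: `View2D`, `rows_contiguous`, `can_use_pitched_copy`, and `copy_pitched_host` are hypothetical names, the view is reduced to two dimensions, and the row loop only emulates what `cudaMemcpy2DAsync` does on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical 2D stand-in for a tensor view (not the real ggml_tensor).
struct View2D {
    size_t elem_size;  // bytes per element
    size_t ne0, ne1;   // elements per row, number of rows
    size_t nb0, nb1;   // byte stride between elements, between rows
};

// A row is contiguous when elements are packed and rows do not overlap.
static bool rows_contiguous(const View2D & v) {
    return v.nb0 == v.elem_size && v.nb1 >= v.ne0 * v.elem_size;
}

// Same element size, same shape, contiguous rows on both sides:
// the copy is then a plain 2D pitched block copy.
static bool can_use_pitched_copy(const View2D & src, const View2D & dst) {
    return src.elem_size == dst.elem_size &&
           src.ne0 == dst.ne0 && src.ne1 == dst.ne1 &&
           rows_contiguous(src) && rows_contiguous(dst);
}

// Host emulation of a 2D pitched copy. On the device the equivalent call
// would be:
//   cudaMemcpy2DAsync(dst, dpitch, src, spitch, width, height,
//                     cudaMemcpyDeviceToDevice, stream);
static void copy_pitched_host(void * dst, size_t dpitch,
                              const void * src, size_t spitch,
                              size_t width, size_t height) {
    assert(width <= dpitch && width <= spitch);
    for (size_t r = 0; r < height; ++r) {
        std::memcpy(static_cast<char *>(dst) + r * dpitch,
                    static_cast<const char *>(src) + r * spitch,
                    width);
    }
}
```

Here `width` is the contiguous row size in bytes (`ne0 * elem_size`) and the pitches are the row strides (`nb1`), so any gap between rows, such as the cache stride gap between slots, is skipped rather than copied.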
* Add tests that exercise the new strided copy fast path
* Return unsupported for strided copies in the OpenVINO backend, because the new tests fail on it