mirror of
https://github.com/ggml-org/llama.cpp.git
synced 2026-06-27 23:50:20 -05:00
* hex-mm: new weight layout and fusion updates * hvx-mm: unroll the new tiled vec_dots to optimize hvx register util * hex-mm: optimize dyn.quant format for q8_0 and q8_1 to reduce overhead in vec_dots. * hvx-mm: parallel quantizer per block for large rows * hvx-mm: simplify and futher optimize dyn.quant and vec_dots * hvx-mm: keep intermediate per tile accumulators in fp16 * hmx-mm: optimize weight dequant by aligning the repacked tiles with the DMA * hmx-mm: remove qweight scratch and just use vtcm_weight * hmx-mm: remove all unused and obsolete code * hmx-mm: the new tiled repack format is here to stay -- rename all x4x2 to _tiled * hmx-mm: improve activation processing with dma prefetch * hex-mm: fix hmx/hvx fallback logic and MUL_MAT_ID allocation (unbreaks OLMoE) * hex-mm: align the weight tiles with dma just like we did in hmx-mm * hex-mm: factor out common mm bits into htp/matmul-ops.h * hex-mm: start moving mm kernel selection to the host * hex-mm: move all of the matmul param compute into the host * hmx-mm: restore pipelined mode * hmx-mm: unroll the dequant functions to optimize register usage * hmx-mm: further improve activation process * hex-mm: use vtcm_seq_alloc for all vtcm allocations and define more common functions * hex-mm: improve mm optimizer to acount for number of activation threads * hex-mm: fix matmul-id kernel params selection (unbreaks OLMoE and LFM) * hexagon: remove support for arch < v73 since HMX is now required for most use-cases * hex-mm: cleanup naming for consistency * hex-mm: make sure matmul fusion accounts for vtcm allocation * hex-mm: minor cleanup for kernel_params definition * hex-mm: replace hardcoded limits with proper checks for vtcm requirements * hex-mm: add support for non-tiled mm as a fallback option and factor out hvx kernels into separate header * hex-mm: remove unused functions * hex-mm: add shorthand for MM_SELECT in run-tool script * hvx-mm: factor out hvx/hmx microkernels and unify matmul entry and dispatch * hex-mm: further cleanup matmul fallback path * hex-mm: refactor matmul entry point and dispatch a bit further * hexagon: update cmake build to enable hmx for everything * hex-ops: optimize kernel_param updates and include summary in the logs * hex-mm: add support for GGML_HEXAGON_MM_SELECT * hex-mm: add hex-common header * hex-mm: pass correct number of tasks to workpool * hex-mm: add proper checks for no-work in dyn.quant tasks * hex-mm: convert all quantizers into a macro * hex-mm: fix hvx-flat fallback to pass all MUL_MAT tests * hex-mm: vectorize q8_1 quantizer * hex-mm: improve fused ffn mm stride handling * hex-mm: consistent use of n_threads and pipeline in kernel_params * hexagon: minor formatting * hex-mm: update MUL_MAT_ID kernel_param handling to make sure host/npu are in sync * hvx-mm: go back to accumulating in fp32 in tiled hvx kernels, more accurate and same perf * hvx-mm: unroll the loops and remove masking that is not needed for tiled accums * hmx-mm: optimize activation processing (slit loops, some unrolling, etc) * hmx-mm: minor optimization for output processing * hex-mm: consistent use of uint32_t and size_t in mm kernels * hex-mm: remove legacy restrictions for rows to be multiple of 256 * hexagon: replace sprintf with snprintf * hex-mm: relax hardcoded nrows checks and rely on VTCM size requirements * hexagon: minor alignment fix * hexagon: fix trailing spaces * hex-mm: relax padding from 256 to 128 (leftovers) * hex-mm: remove redundant checks for weight align to 128 we always use 2D dma for the weights and align them properly * hmx-mm: MUL_MAT_ID better work distribution between hvx threads and hmx tracing * hex-mm: specialize per-token mmid activation handling * hex-profile: update python scripts to handle kernel-params section in the logging output * hex-mm: move n_prefetch (aka dma_depth) into kernel params and remove unused fields * hex-trace: use easier to parse format, simply and fix post-proc scripts * hmx-mm: relax 32 row limit for output processing which helps utilization * hmx-mm: use start-chunk idx for tracing info * hmx-mm: parameterize activation dma pipeline * hexagon: add support for simple graph caching to avoid recomputing kernel-params * hex-mm: remove left-over repack functions * hex-mm: tighten n_prefetch asserts * hex-mm: remove duplicate round/align_up helper * hexagon: cleanup common header used in host/npu * hexagon: update early wakeup threshold * hmx-mm: define cost constants and update solver to assume that repacked ne[1] is padded to 32 * hmx-mm: make precompute_matmul a bit more readable (split into smaller functions, etc) * hex-mm: remove n_threads constraint * hex-mm: minor formatting updates * hex-mm: remove obsolete profiling logs * hex-mm: restore hardcode gate to refuse lm-head to avoid repacking that tensor
84 lines
1.9 KiB
Bash
Executable File
84 lines
1.9 KiB
Bash
Executable File
#!/bin/sh
|
|
#
|
|
|
|
# Basedir on device
|
|
basedir=/data/local/tmp/llama.cpp
|
|
|
|
cli_opts=
|
|
|
|
branch=.
|
|
[ "$B" != "" ] && branch=$B
|
|
|
|
adbserial=
|
|
[ "$S" != "" ] && adbserial="-s $S"
|
|
|
|
adbhost=
|
|
[ "$H" != "" ] && adbhost="-H $H"
|
|
|
|
model="Llama-3.2-3B-Instruct-Q4_0.gguf"
|
|
[ "$M" != "" ] && model="$M"
|
|
|
|
device="HTP0"
|
|
[ "$D" != "" ] && device="$D"
|
|
|
|
verbose=
|
|
[ "$V" != "" ] && verbose="GGML_HEXAGON_VERBOSE=$V" cli_opts="$cli_opts -v"
|
|
|
|
sched=
|
|
[ "$SCHED" != "" ] && sched="GGML_SCHED_DEBUG=2" cli_opts="$cli_opts -v"
|
|
|
|
profile=
|
|
[ "$PROF" != "" ] && profile="GGML_HEXAGON_PROFILE=$PROF" cli_opts="$cli_opts -v"
|
|
|
|
opmask=
|
|
[ "$OPSTAGE" != "" ] && opmask="GGML_HEXAGON_OPSTAGE=$OPSTAGE"
|
|
|
|
nhvx=
|
|
[ "$NHVX" != "" ] && nhvx="GGML_HEXAGON_NHVX=$NHVX"
|
|
|
|
hmx=
|
|
[ "$HMX" != "" ] && hmx="GGML_HEXAGON_USE_HMX=$HMX"
|
|
|
|
ndev=
|
|
[ "$NDEV" != "" ] && ndev="GGML_HEXAGON_NDEV=$NDEV"
|
|
|
|
hb=
|
|
[ "$HB" != "" ] && hb="GGML_HEXAGON_HOSTBUF=$HB"
|
|
|
|
opbatch=
|
|
[ "$OB" != "" ] && opbatch="GGML_HEXAGON_OPBATCH=$OB"
|
|
|
|
opqueue=
|
|
[ "$OQ" != "" ] && opqueue="GGML_HEXAGON_OPQUEUE=$OQ"
|
|
|
|
oppoll=
|
|
[ "$OP" != "" ] && oppoll="GGML_HEXAGON_OPPOLL=$OP"
|
|
|
|
opflt=
|
|
[ "$OF" != "" ] && opflt="GGML_HEXAGON_OPFILTER=$OF"
|
|
|
|
opfuse=
|
|
[ "$OC" != "" ] && opfuse="GGML_HEXAGON_OPFUSION=$OC"
|
|
|
|
vmem=
|
|
[ "$VM" != "" ] && vmem="GGML_HEXAGON_VMEM=$VM"
|
|
|
|
mbuf=
|
|
[ "$MB" != "" ] && mbuf="GGML_HEXAGON_MBUF=$MB"
|
|
|
|
mmsel=
|
|
[ "$MM" != "" ] && mmsel="GGML_HEXAGON_MM_SELECT=$MM"
|
|
|
|
set -x
|
|
|
|
adb $adbserial $adbhost shell " \
|
|
cd $basedir; ulimit -c unlimited; \
|
|
LD_LIBRARY_PATH=$basedir/$branch/lib \
|
|
ADSP_LIBRARY_PATH=$basedir/$branch/lib \
|
|
$verbose $sched $opmask $profile $nhvx $hmx $ndev $hb $opbatch $opqueue $oppoll $opflt $opfuse $vmem $mbuf $mmsel \
|
|
./$branch/bin/llama-completion --no-mmap -m $basedir/../gguf/$model \
|
|
--poll 1000 -t 6 --cpu-mask 0xfc --cpu-strict 1 \
|
|
--ctx-size 8192 --ubatch-size 1024 -fa on \
|
|
-ngl 99 --device $device $cli_opts $@ \
|
|
"
|