ACL Pack Performance Whitepaper
Version: 1.0.3
Date: May 2026 (v1.0.3 release benchmark)
Test Platform: Android arm64-v8a, 4 flagship devices cross-validated
Comparison Baseline: OpenCV 4.13.0
Executive Summary
ACL Pack is a high-performance image processing library for Android arm64-v8a, deeply optimized with hand-written ARM NEON SIMD intrinsics. The metrics below are aggregated from the v1.0.3 release regression run (4 devices × 7 size tiers, see Changelog). All four devices produce bit-identical output on every size tier. Per-call timings on sub-millisecond operators carry normal mobile-OS jitter and may fluctuate run-to-run; the headline aggregate metrics are stable.
- 163 callable entries across 11 benchmarked categories: analysis, arithmetic, contour, cvtcolor, draw, feature, filter, geometric, math, memory, transform
- 4 devices × 7 size tiers full matrix coverage — bit-identical output across all four devices
- NEON path, all 7 size tiers × 4 devices combined: 72.3% of operator samples (n=6864) are faster than OpenCV 4.13.0 (>5%), aggregate speedup 4.72×
- Peak NEON speedup ~78× on M tier (resize_nn_up4x_1ch on Snapdragon 8 Gen 3, ACL 0.046 ms vs OCV 3.598 ms), with other notable peaks at inRange_1ch 76.42×, threshold_binary 61.79×, resize_area_up4x_1ch 49.51×, bitwise_and_1ch 29.59×, resize_area_up2x_1ch 28.03×
- Static library ~6 MB (paid tiers 6.2–7.5 MB, Trial 0.24 MB) vs OpenCV's 15–50 MB shared library
- Zero external runtime dependencies — just add the include path
Competitive Landscape
NEON Operator Count Comparison
The ACL Pack column is counted against the authoritative operator_tiers.json manifest — each row is a distinct standard documented operator family (CPP and NEON variants of the same operator share one row, overloads and type specializations are not counted separately) so the row count lines up with how the competitor libraries report their own coverage. Total ACL Pack documented operator families: 113, of which 75 have a NEON implementation; the rest are CPP-only (mostly in the contour, draw, and math utility categories). Counted as customer-facing callable entries (including N-image variants, kernel-size dispatchers, and CPP/NEON pairs that ship as separate functions), the SDK exposes 163 callable entries.
| Category | ACL Pack (NEON) | ACL Pack (Total) | ppl.cv (SenseTime) | ARM Compute Library | Simd Library | libyuv |
|---|---|---|---|---|---|---|
| Filter | 16 | 20 | 5 | 0 | 8 | 0 |
| Color Convert | 7 | 11 | 10 | 0 | 20+ | ~40 |
| Geometric | 7 | 7 | 4 | 5 | 4 | ~6 |
| Arithmetic | 12 | 14 | 8 | 10 | 5 | ~8 |
| Analysis | 8 | 18 | 3 | 0 | 7 | ~3 |
| Feature Detection | 16 | 16 | 0 | 0 | 3 | 0 |
| Transform | 4 | 10 | 5 | 0 | 0–1 | 0 |
| Math (DFT) | 5 | 5 | — | — | — | 0 |
| Total NEON | 75 | 113 | ~35 | ~15 | ~50 | ~57 |
OpenCV is not in this table — it is a mixed-architecture CV development platform (scalar + cv::hal multi-path SIMD + 10+ external dependencies), a different category from pure NEON acceleration libraries. The page already compares against OpenCV 4.13.0 extensively on speed — see NEON-only Aggregate and Top NEON Operators.
Competitor numbers are external-research estimates based on each project's public headers / documentation. Counting conventions may differ slightly from ACL Pack's. For a strict per-operator comparison, email zangotech@163.com for the raw mapping table.
Unique Differentiators
- The only ARM NEON library with a complete SIFT / SURF / ORB / Harris feature-detection suite — ppl.cv and ARM Compute Library ship zero feature-detection operators
- About 3× the filter coverage of ppl.cv (16 NEON filter operators vs 5 in the table above); bilateral, guided, NL-Means, Gabor, and stack blur are all absent from ppl.cv
- Complete transform pipeline — warpAffine, warpPerspective, findHomography, remap all NEON-optimized
- NV21 fused pipeline — resize + rotate + color conversion in a single pass (for AI inference preprocessing)
Test Methodology
Hardware (4 flagship arm64-v8a devices)
| SoC | Role |
|---|---|
| MediaTek Dimensity 9400 | Reference baseline |
| MediaTek Dimensity 9500 | Fastest absolute timings |
| Qualcomm Snapdragon 8 Gen 2 | Slowest absolute timings |
| Qualcomm Snapdragon 8 Gen 3 | Mid-range |
Size Coverage
Benchmarks cover 7 input-size tiers spanning 640×480 (mobile preview) → 1920×1280 (primary FHD) → 4096×3072 (4K stress), plus odd-dimension and tiny-image edge cases. The FHD tier (M, 1920×1280) is the canonical reference on this page.
Test Protocol
- Warm-up: CPU frequency stabilization phase before timing capture
- Measurement: per-call timing in ms, matched ACL vs OpenCV 4.13.0 on identical inputs
- Classification: ±5% tolerance band for tie, faster, slower
- Cross-validation: 4 independent devices run the full matrix; output PNGs are bit-identical across all four on every size tier
- Commercial regression: 4 commercial tier builds (Trial / Starter / Pro / Business) reproduce the release baseline bit-identically
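The warm-up, per-call timing, and ±5% classification steps can be sketched roughly as follows. This is a minimal illustration, not the harness used for the release run: the names `median_ms`, `classify`, and `Verdict` are hypothetical, the warm-up and repeat counts are arbitrary defaults, taking the median of the samples is an assumption, and the band is assumed to apply to the ratio of OpenCV time to ACL time.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: run `op` a few times untimed (warm-up), then return
// the median per-call time in milliseconds over `repeats` timed calls.
double median_ms(const std::function<void()>& op, int warmup = 20, int repeats = 51) {
    for (int i = 0; i < warmup; ++i) op();  // untimed warm-up calls
    std::vector<double> samples;
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        op();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    if (samples.empty()) return 0.0;
    const auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());  // partial sort up to the median
    return *mid;
}

enum class Verdict { Faster, Tie, Slower };

// Assumed reading of the tolerance band: ratio = OCV time / ACL time.
// ratio > 1 + band -> ACL faster; ratio < 1 - band -> ACL slower; otherwise tie.
// Both timings are expected to be positive.
Verdict classify(double acl_ms, double ocv_ms, double band = 0.05) {
    const double ratio = ocv_ms / acl_ms;
    if (ratio > 1.0 + band) return Verdict::Faster;
    if (ratio < 1.0 - band) return Verdict::Slower;
    return Verdict::Tie;
}
```

Both libraries would be timed with the same helper on the same input buffer, and the two medians passed to `classify`.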
NEON-only Aggregate (All 7 Size Tiers, 4 Devices Combined)
The headline: ACL Pack's NEON path vs OpenCV 4.13.0 across the full size matrix (7 size tiers × 4 devices, 6864 operator samples).
| Metric | Value |
|---|---|
| Faster than OpenCV (>5%) | 72.3% (4963 / 6864) |
| Tie (±5%) | 5.0% (342) |
| Slower than OpenCV (>5%) | 22.7% (1559) |
| Aggregate speedup (∑OCV / ∑ACL) | 4.72× |
| Peak speedup (ACL ≥ 0.05 ms) | 76.4× (inRange_1ch on SD8 Gen 2) |
The aggregate baseline includes a small number of denoise / Gabor / guided-filter / Scharr operators (~7% of rows) where the comparison row is ACL's own scalar C++ path rather than a 1:1 OpenCV call (OpenCV either lacks the operator, ships it in opencv_contrib, or uses a different algorithm class). Excluding those rows, the operator-sample "faster" share is 72.4% (4632 / 6397). The per-sample faster ratio is robust to the choice of baseline; the ∑/∑ ratio is more sensitive because a few large-ms denoise rows dominate the sum. Per-row truth is in the raw CSV; see Full Data Access.
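To make the difference between the two headline metrics concrete, the sketch below computes both from a list of timing pairs. It is illustrative only: `Sample`, `Summary`, and `summarize` are hypothetical names, and the ±5% band is assumed to be applied as OCV time > 1.05 × ACL time.

```cpp
#include <cstddef>
#include <vector>

// One matched measurement: the same operator and input timed in both libraries.
struct Sample { double acl_ms; double ocv_ms; };

struct Summary {
    double faster_share;       // fraction of samples where ACL wins by more than the band
    double aggregate_speedup;  // sum(OCV ms) / sum(ACL ms)
};

Summary summarize(const std::vector<Sample>& rows, double band = 0.05) {
    double sum_acl = 0.0, sum_ocv = 0.0;
    std::size_t faster = 0;
    for (const Sample& r : rows) {
        sum_acl += r.acl_ms;
        sum_ocv += r.ocv_ms;
        if (r.ocv_ms > r.acl_ms * (1.0 + band)) ++faster;  // counts once, whatever its size in ms
    }
    Summary s{0.0, 0.0};
    if (!rows.empty()) s.faster_share = static_cast<double>(faster) / static_cast<double>(rows.size());
    if (sum_acl > 0.0) s.aggregate_speedup = sum_ocv / sum_acl;
    return s;
}
```

Each row contributes exactly one vote to `faster_share`, but contributes to `aggregate_speedup` in proportion to its milliseconds, which is why a handful of slow operators can move the ∑/∑ figure while leaving the share almost unchanged.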
M tier (1920×1280) — primary FHD reference
| Metric | Value |
|---|---|
| Faster than OpenCV (>5%) | 70.7% |
| Aggregate speedup (∑OCV / ∑ACL) | 4.56× |
Size-class speedup profile
- Tiny images (E3, 17×17 — ROI tiles, thumbnails): 22–59× aggregate speedup across the four devices. OpenCV's per-call fixed overhead dominates at microsecond workloads; ACL's header-only path has almost none.
- Mid-size (S 640×480 to M 1920×1280 — the realistic mobile camera / video regime): stable 4.4–4.6× aggregate speedup across all four devices.
- 4K (4096×3072): aggregate speedup is 4.80× as both libraries enter the memory-bandwidth-bound regime.
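The tiny-image result is consistent with a simple two-term cost model, time = fixed per-call overhead + per-pixel cost × pixel count. The sketch below is only that model, with hypothetical names (`CostModel`, `modeled_speedup`); it is not fitted to the benchmark data, and any constants passed in are the caller's own assumptions.

```cpp
// Assumed cost model: t(pixels) = fixed_ms + per_pixel_ms * pixels.
struct CostModel {
    double fixed_ms;      // per-call overhead that does not depend on image size
    double per_pixel_ms;  // marginal cost of each pixel
};

double modeled_ms(const CostModel& m, double pixels) {
    return m.fixed_ms + m.per_pixel_ms * pixels;
}

// Speedup of `candidate` over `baseline` at a given pixel count.
// As pixels -> 0 this tends to baseline.fixed_ms / candidate.fixed_ms;
// as pixels grows it tends to baseline.per_pixel_ms / candidate.per_pixel_ms.
double modeled_speedup(const CostModel& baseline, const CostModel& candidate, double pixels) {
    return modeled_ms(baseline, pixels) / modeled_ms(candidate, pixels);
}
```

Under this model, a library with a much smaller fixed overhead shows its largest ratio on the smallest images, and the ratio settles toward the per-pixel cost ratio as images grow.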
Per-device, per-operator timings across all 4 devices × 7 size tiers — including the complete raw CSV — are available to purchased-tier customers on request; see Full Data Access at the bottom of this page.
Top NEON Operators (M Tier)
| Operator | Speedup |
|---|---|
| Resize NN Up 4× | up to 78.22× |
| inRange (Mask) | up to 76.42× |
| Threshold Binary | up to 61.79× |
| Resize AREA Up 4× | up to 49.51× |
| BitwiseAnd | up to 29.59× |
| Resize AREA Up 2× | up to 28.03× |
| Sobel gradX | up to 26.03× |
| Alpha Fusion (α=0.5) | up to 22.20× |
| Gaussian Blur 5×5 | up to 15.71× |
| Median Filter 3×3 | up to 14.99× |
| normalize MINMAX | up to 14.39× |
| Rotate 180° | up to 13.98× |
| BGR → Lab | up to 12.86× |
Numbers are the per-operator peak across the 4-device cohort.
Highlights (peak across the 4-device cohort, M tier):
- Resize NN upscale 4× (1ch): 78.22× on Snapdragon 8 Gen 3, 51.58× on SD8 Gen 2 — camera-pipeline upscaling
- Resize AREA upscale 4× (1ch): 49.51× on SD8 Gen 2
- gammaTransform: LUT-based implementation beats OpenCV's per-pixel pow()
- bgr2Lab (NEON): fixed-point + LUT, 12.86× over OpenCV's floating-point version on SD8 Gen 3
- Threshold / inRange / bitwise (1ch u8): 27–76× — these are the cheapest per-pixel ops where OCV's overhead dominates
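The LUT idea behind the gammaTransform highlight can be illustrated with a scalar sketch. This is not ACL Pack's implementation or API: `make_gamma_lut` and `apply_lut` are hypothetical names, and the code only shows why a 256-entry table replaces one pow() per pixel with one pow() per possible 8-bit value.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Build the table once: out = round(255 * (in / 255)^gamma) for every 8-bit input.
std::array<std::uint8_t, 256> make_gamma_lut(double gamma) {
    std::array<std::uint8_t, 256> lut{};
    for (int v = 0; v < 256; ++v) {
        const double out = 255.0 * std::pow(v / 255.0, gamma);
        lut[static_cast<std::size_t>(v)] = static_cast<std::uint8_t>(std::lround(out));
    }
    return lut;
}

// Per-pixel work is a single table lookup, whatever the image size.
void apply_lut(const std::uint8_t* src, std::uint8_t* dst, std::size_t n,
               const std::array<std::uint8_t, 256>& lut) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = lut[src[i]];
}
```

The table costs 256 pow() calls regardless of resolution, so for an FHD frame the transcendental work drops from millions of calls to a few hundred.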
Performance Highlights for Key Use Cases
AI Inference Preprocessing
Camera NV21 / NV12 → model input in one call:
- YUV_utilities_float: resize + rotate + YUV→RGB + float output, all fused
- Eliminates 3 intermediate buffers and 3 memory passes
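A scalar sketch of what "fused" means here, under stated assumptions: it is not the YUV_utilities_float implementation or signature. The function name `nv21_to_rgb_float_nn` is hypothetical, rotation is omitted for brevity, resizing is nearest-neighbor, the source width and height are assumed even, and the BT.601 full-range coefficients are an assumption about the camera output.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp a channel to [0, 255] and scale it to [0, 1].
static float to_unit(float c) {
    return std::min(std::max(c, 0.0f), 255.0f) / 255.0f;
}

// NV21 layout: src_w * src_h luma bytes, then interleaved V,U pairs,
// one pair per 2x2 luma block. Output is packed RGB floats in [0, 1].
std::vector<float> nv21_to_rgb_float_nn(const std::uint8_t* nv21, int src_w, int src_h,
                                        int dst_w, int dst_h) {
    const std::uint8_t* y_plane = nv21;
    const std::uint8_t* vu_plane = nv21 + static_cast<std::size_t>(src_w) * src_h;
    std::vector<float> out(static_cast<std::size_t>(dst_w) * dst_h * 3);
    for (int dy = 0; dy < dst_h; ++dy) {
        const int sy = std::min(src_h - 1, dy * src_h / dst_h);      // nearest source row
        for (int dx = 0; dx < dst_w; ++dx) {
            const int sx = std::min(src_w - 1, dx * src_w / dst_w);  // nearest source column
            const float y = y_plane[static_cast<std::size_t>(sy) * src_w + sx];
            const std::size_t vu = static_cast<std::size_t>(sy / 2) * src_w
                                 + static_cast<std::size_t>(sx / 2) * 2;
            const float v = vu_plane[vu] - 128.0f;      // V comes first in NV21
            const float u = vu_plane[vu + 1] - 128.0f;
            float* px = &out[(static_cast<std::size_t>(dy) * dst_w + dx) * 3];
            px[0] = to_unit(y + 1.402f * v);                   // R
            px[1] = to_unit(y - 0.344136f * u - 0.714136f * v); // G
            px[2] = to_unit(y + 1.772f * u);                   // B
        }
    }
    return out;
}
```

Each output pixel is produced directly from the source planes, so no resized YUV image, no 8-bit RGB image, and no separate float-conversion buffer are ever materialized.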
Real-Time Video Processing (1920×1280)
- Sobel3x3 gradient family: ~26× on 1-channel u8
- Resize NN upscale 4× (1ch): 78× on Snapdragon 8 Gen 3, 52× on SD8 Gen 2 (camera upscale)
- Per-pixel ops (threshold / inRange / bitwise / LUT): 27–76× — OCV's per-call overhead dominates these
Tiny-Image Pipelines (E3, 17×17)
- ROI tile processing, thumbnail generation, vision-preprocess microbatches: 22–59× aggregate speedup across the 4 devices (Dimensity 9500 29.6×, SD8 Gen 3 22.7×, SD8 Gen 2 39.3×, Dimensity 9400 58.9×). OpenCV's fixed per-call overhead dominates at this size; ACL's header-only path has almost none.
Size Comparison
| Metric | ACL Pack | OpenCV |
|---|---|---|
| Delivery format | Header-only (.hpp) or static lib (.a) | Shared lib (.so) or static (.a) |
| Source size | ~43,000 LOC | ~500,000+ LOC |
| Compiled size (Android arm64) | Trial 0.24 MB / Starter 6.17 MB / Pro 7.12 MB / Business 7.51 MB (.a) | 15–50 MB (.so, depending on modules) |
| Compiled size (Linux aarch64) | Starter 3.94 MB / Pro 4.88 MB / Business 5.28 MB (.a) | same (.so) |
| External dependencies | None (zero) | zlib, libpng, libjpeg, libtiff, protobuf, etc. |
| Integration effort | Add include path | CMake find_package or prebuilt download |
| Runtime memory | User-managed (zero hidden allocation) | cv::Mat with implicit allocation + ref counting |
Conclusion
ACL Pack delivers measurable, reproducible performance advantages over OpenCV for the majority of image processing operations on Android ARM64. Across all 7 size tiers and 4 devices combined (6864 operator samples), 72.3% of NEON operator samples are faster than OpenCV 4.13.0, with an aggregate ∑OCV/∑ACL of 4.72×. The M-tier peak is 78× (resize_nn_up4x_1ch on Snapdragon 8 Gen 3), per-pixel ops like inRange reach 76×, and tiny (≤ 32×32) images see a 22–59× edge. These results position ACL Pack as the highest-performance image processing option for Android ARM64 applications. It is delivered as a paid-tier static library of ~6–7.5 MB, with zero external runtime dependencies and bit-identical output across devices.
Full Data Access
This page presents the public aggregate view. The complete per-device, per-operator, per-size-tier dataset — including raw CSV (per-sample timing, accuracy, error, and pass status across all 4 devices × 7 size tiers × 163 callable entries) — is available to purchased-tier customers on request:
- Email: zangotech@163.com
- Subject: [Perf Data] + your license ID
- Response within 1–2 business days
Trial-tier users can request a representative sample dataset for one device × one size tier by emailing with subject [Trial Perf Sample].
Benchmark data: v1.0.3 release regression run (May 2026), 4 devices × 7 size tiers.