ACL Pack Performance Whitepaper

Version: 1.0.3
Date: May 2026 (v1.0.3 release benchmark)
Test Platform: Android arm64-v8a, 4 flagship devices cross-validated
Comparison Baseline: OpenCV 4.13.0

Executive Summary

ACL Pack is a high-performance image processing library for Android arm64-v8a, deeply optimized with hand-written ARM NEON SIMD intrinsics. The metrics below are aggregated from the v1.0.3 release regression run (4 devices × 7 size tiers, see Changelog). All four devices produce bit-identical output on every size tier. Per-call timings on sub-millisecond operators carry normal mobile-OS jitter and may fluctuate run-to-run; the headline aggregate metrics are stable.

  • 163 callable entries across 11 benchmarked categories: analysis, arithmetic, contour, cvtcolor, draw, feature, filter, geometric, math, memory, transform
  • 4 devices × 7 size tiers full matrix coverage — bit-identical output across all four devices
  • NEON path, all 7 size tiers × 4 devices combined: 72.3% of operator samples (n=6864) are faster than OpenCV 4.13.0 (>5%), aggregate speedup 4.72×
  • Peak NEON speedup ~78× on M tier (resize_nn_up4x_1ch on Snapdragon 8 Gen 3, ACL 0.046 ms vs OCV 3.598 ms), with other notable peaks at inRange_1ch 76.42×, threshold_binary 61.79×, resize_area_up4x_1ch 49.51×, bitwise_and_1ch 29.59×, resize_area_up2x_1ch 28.03×
  • Static library of 6.2–7.5 MB on paid tiers (Trial 0.24 MB) vs OpenCV's 15–50 MB shared library
  • Zero external runtime dependencies — just add the include path

Competitive Landscape

NEON Operator Count Comparison

The ACL Pack column is counted against the authoritative operator_tiers.json manifest. Each row is a distinct documented operator family: CPP and NEON variants of the same operator share one row, and overloads and type specializations are not counted separately, so the row count lines up with how the competitor libraries report their own coverage. ACL Pack has 113 documented operator families in total, of which 75 have a NEON implementation; the rest are CPP-only (mostly in the contour, draw, and math utility categories). Counted instead as customer-facing callable entries (including N-image variants, kernel-size dispatchers, and CPP/NEON pairs that ship as separate functions), the SDK exposes 163 callable entries.

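| Category | ACL Pack (NEON) | ACL Pack (Total) | ppl.cv (SenseTime) | ARM Compute Library | Simd Library | libyuv |
|---|---|---|---|---|---|---|
| Filter | 16 | 20 | 5 | 0 | 8 | 0 |
| Color Convert | 7 | 11 | 10 | 0 | 20+ | ~40 |
| Geometric | 7 | 7 | 4 | 5 | 4 | ~6 |
| Arithmetic | 12 | 14 | 8 | 10 | 5 | ~8 |
| Analysis | 8 | 18 | 3 | 0 | 7 | ~3 |
| Feature Detection | 16 | 16 | 0 | 0 | 3 | 0 |
| Transform | 4 | 10 | 5 | 0 | 0–1 | 0 |
| Math (DFT) | 5 | 5 | 0 | | | |
| Total NEON | 75 | 113 | ~35 | ~15 | ~50 | ~57 |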

OpenCV is not in this table — it is a mixed-architecture CV development platform (scalar + cv::hal multi-path SIMD + 10+ external dependencies), a different category from pure NEON acceleration libraries. The page already compares against OpenCV 4.13.0 extensively on speed — see NEON-only Aggregate and Top NEON Operators.

Competitor numbers are external-research estimates based on each project's public headers / documentation. Counting conventions may differ slightly from ACL Pack's. For a strict per-operator comparison, email zangotech@163.com for the raw mapping table.

Unique Differentiators

  1. The only ARM NEON library with a complete SIFT / SURF / ORB / Harris feature-detection suite — ppl.cv and ARM Compute Library ship zero feature-detection operators
  2. About 3× the filter coverage of ppl.cv (16 NEON filter families vs 5): bilateral, guided, NL-Means, Gabor, and stack blur are all absent from ppl.cv
  3. Complete transform pipeline — warpAffine, warpPerspective, findHomography, remap all NEON-optimized
  4. NV21 fused pipeline — resize + rotate + color conversion in a single pass (for AI inference preprocessing)

Test Methodology

Hardware (4 flagship arm64-v8a devices)

| SoC | Role |
|---|---|
| MediaTek Dimensity 9400 | Reference baseline |
| MediaTek Dimensity 9500 | Fastest absolute timings |
| Qualcomm Snapdragon 8 Gen 2 | Slowest absolute timings |
| Qualcomm Snapdragon 8 Gen 3 | Mid-range |

Size Coverage

Benchmarks cover 7 input-size tiers spanning 640×480 (mobile preview), 1920×1280 (primary FHD), and 4096×3072 (4K stress), plus odd-dimension and tiny-image edge cases. The FHD tier (M, 1920×1280) is the canonical reference on this page.

Test Protocol

  1. Warm-up: CPU frequency stabilization phase before timing capture
  2. Measurement: per-call timing in ms, matched ACL vs OpenCV 4.13.0 on identical inputs
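  3. Classification: each matched sample is labeled faster or slower when ACL's time differs from OpenCV's by more than 5%, and a tie when it falls within the ±5% tolerance band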
  4. Cross-validation: 4 independent devices run the full matrix; output PNGs are bit-identical across all four on every size tier
  5. Commercial regression: 4 commercial tier builds (Trial / Starter / Pro / Business) reproduce the release baseline bit-identically
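
The warm-up, measurement, and classification steps above can be sketched as follows. This is a minimal illustration, not the actual benchmark harness: the names `time_ms` and `classify`, the warm-up and run counts, the use of the median, and the choice to apply the 5% band to the OCV/ACL time ratio are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

enum class Verdict { Faster, Tie, Slower };

// Classify one matched sample. Assumption: the band is applied to the
// speedup ratio ocv_ms / acl_ms, so > 1.05 is "faster" and < 0.95 is "slower".
inline Verdict classify(double acl_ms, double ocv_ms, double band = 0.05) {
    assert(acl_ms > 0.0 && ocv_ms > 0.0);
    const double ratio = ocv_ms / acl_ms;  // > 1 means ACL took less time
    if (ratio > 1.0 + band) return Verdict::Faster;
    if (ratio < 1.0 - band) return Verdict::Slower;
    return Verdict::Tie;
}

// Call fn `warmup` times untimed (lets clocks and caches settle), then
// return the median of `runs` timed calls in milliseconds.
// Assumption: the median is used here only to damp scheduler jitter.
inline double time_ms(const std::function<void()>& fn, int warmup = 10, int runs = 30) {
    assert(warmup >= 0 && runs > 0);
    for (int i = 0; i < warmup; ++i) fn();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    const auto mid = samples.begin() + runs / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

In a real run, `fn` would wrap one ACL call or the matching OpenCV call on the same input buffer, and the two resulting timings would be passed to `classify`.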

NEON-only Aggregate (All 7 Size Tiers, 4 Devices Combined)

The headline: ACL Pack's NEON path vs OpenCV 4.13.0 across the full size matrix (7 size tiers × 4 devices, 6864 operator samples).

| Metric | Value |
|---|---|
| Faster than OpenCV (>5%) | 72.3% (4963 / 6864) |
| Tie (±5%) | 5.0% (342) |
| Slower than OpenCV (>5%) | 22.7% (1559) |
| Aggregate speedup (∑OCV / ∑ACL) | 4.72× |
| Peak speedup (ACL ≥ 0.05 ms) | 76.4× (inRange_1ch on SD8 Gen 2) |

The aggregate baseline includes a small number of denoise / Gabor / guided-filter / Scharr operators (~7% of rows) where the comparison row is ACL's own scalar C++ path rather than a 1:1 OpenCV call (OpenCV either lacks the operator, ships it in opencv_contrib, or uses a different algorithm class). Excluding those rows, the operator-sample "faster" share is 72.4% (4632 / 6397) — the per-sample faster ratio is robust to the choice of baseline; the ∑/∑ ratio is more sensitive because a few large-ms denoise rows dominate the sum. Per-row truth is in the raw CSV — see Full Data Access.
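
The difference between the two headline metrics can be illustrated with a short sketch. The `Sample` and `Summary` types and the `summarize` function are hypothetical names for this example, and the timings in the usage note are made-up values, not benchmark data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Sample {
    double acl_ms;  // ACL timing for one operator sample
    double ocv_ms;  // matched baseline timing for the same sample
};

struct Summary {
    double faster_share;  // fraction of samples with ocv_ms / acl_ms > 1.05
    double sum_ratio;     // sum(ocv_ms) / sum(acl_ms)
};

inline Summary summarize(const std::vector<Sample>& rows) {
    assert(!rows.empty());
    std::size_t faster = 0;
    double sum_acl = 0.0;
    double sum_ocv = 0.0;
    for (const Sample& s : rows) {
        assert(s.acl_ms > 0.0);
        if (s.ocv_ms / s.acl_ms > 1.05) ++faster;  // counts every row equally
        sum_acl += s.acl_ms;                       // large-ms rows weigh more here
        sum_ocv += s.ocv_ms;
    }
    return Summary{static_cast<double>(faster) / static_cast<double>(rows.size()),
                   sum_ocv / sum_acl};
}
```

With three illustrative rows of 0.1 ms vs 0.2 ms, `sum_ratio` is 2.0. Adding a single 100 ms vs 1000 ms row leaves `faster_share` at 1.0 but pulls `sum_ratio` to just under 10, which is the sensitivity to large-ms rows described above.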

M tier (1920×1280) — primary FHD reference

| Metric | Value |
|---|---|
| Faster than OpenCV (>5%) | 70.7% |
| Aggregate speedup (∑OCV / ∑ACL) | 4.56× |

Size-class speedup profile

  • Tiny images (E3, 17×17 — ROI tiles, thumbnails): 22–59× aggregate speedup across the four devices. OpenCV's per-call fixed overhead dominates at microsecond workloads; ACL's header-only path has almost none.
  • Mid-size (S 640×480 to M 1920×1280 — the realistic mobile camera / video regime): stable 4.4–4.6× aggregate speedup across all four devices.
  • 4K (4096×3072): aggregate speedup is 4.80× as both libraries enter the memory-bandwidth-bound regime.
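
One way to read the size-class profile is through a simple linear cost model. This model is an assumption for illustration, not something fitted to the benchmark data: each library's time for an N-pixel image is taken as a fixed per-call overhead o plus a per-pixel cost c.

```latex
\[
T_{\mathrm{lib}}(N) \approx o_{\mathrm{lib}} + c_{\mathrm{lib}}\,N,
\qquad
S(N) = \frac{T_{\mathrm{OCV}}(N)}{T_{\mathrm{ACL}}(N)}
     = \frac{o_{\mathrm{OCV}} + c_{\mathrm{OCV}}\,N}{o_{\mathrm{ACL}} + c_{\mathrm{ACL}}\,N}
\]
\[
\lim_{N \to 0} S(N) = \frac{o_{\mathrm{OCV}}}{o_{\mathrm{ACL}}},
\qquad
\lim_{N \to \infty} S(N) = \frac{c_{\mathrm{OCV}}}{c_{\mathrm{ACL}}}
\]
```

Under this model, tiny images expose the ratio of fixed overheads, while large images expose the ratio of per-pixel costs.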

Per-device, per-operator timings across all 4 devices × 7 size tiers — including the complete raw CSV — are available to purchased-tier customers on request; see Full Data Access at the bottom of this page.

Top NEON Operators (M Tier)

| Operator | Speedup (ACL vs SOTA) |
|---|---|
| Resize NN Up 4× | up to 78.22× |
| inRange (Mask) | up to 76.42× |
| Threshold Binary | up to 61.79× |
| Resize AREA Up 4× | up to 49.51× |
| BitwiseAnd | up to 29.59× |
| Resize AREA Up 2× | up to 28.03× |
| Sobel gradX | up to 26.03× |
| Alpha Fusion (α=0.5) | up to 22.20× |
| Gaussian Blur 5×5 | up to 15.71× |
| Median Filter 3×3 | up to 14.99× |
| normalize MINMAX | up to 14.39× |
| Rotate 180° | up to 13.98× |
| BGR → Lab | up to 12.86× |

Bar length follows log(speedup). Numbers are the per-operator peak across the 4-device cohort.

Highlights (peak across the 4-device cohort, M tier):

  • Resize NN upscale 4× (1ch): 78.22× on Snapdragon 8 Gen 3, 51.58× on SD8 Gen 2 — camera-pipeline upscaling
  • Resize AREA upscale 4× (1ch): 49.51× on SD8 Gen 2
  • gammaTransform: LUT-based implementation beats OpenCV's per-pixel pow()
  • bgr2Lab (NEON): fixed-point + LUT, 12.86× over OpenCV's floating-point version on SD8 Gen 3
  • Threshold / inRange / bitwise (1ch u8): 27–76× — these are the cheapest per-pixel ops where OCV's overhead dominates
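
The LUT idea behind the gammaTransform highlight can be sketched in scalar C++. This is a minimal illustration of the general technique, not ACL Pack's implementation; `make_gamma_lut` and `gamma_u8` are hypothetical names, and the mapping out = 255 · (in / 255)^gamma is an assumed definition of the transform.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Build the 256-entry table once: pow() is evaluated 256 times in total,
// instead of once per pixel.
inline std::array<std::uint8_t, 256> make_gamma_lut(double gamma) {
    assert(gamma > 0.0);
    std::array<std::uint8_t, 256> lut{};
    for (int i = 0; i < 256; ++i) {
        const double v = 255.0 * std::pow(i / 255.0, gamma);  // stays in [0, 255]
        lut[static_cast<std::size_t>(i)] = static_cast<std::uint8_t>(std::lround(v));
    }
    return lut;
}

// Apply the table to n 8-bit samples: one load and one store per sample.
inline void gamma_u8(const std::uint8_t* src, std::uint8_t* dst, std::size_t n,
                     const std::array<std::uint8_t, 256>& lut) {
    assert(src != nullptr && dst != nullptr);
    for (std::size_t i = 0; i < n; ++i) dst[i] = lut[src[i]];
}
```

For 8-bit input the table covers every possible value, so the result matches a per-pixel pow() followed by the same rounding.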

Performance Highlights for Key Use Cases

AI Inference Preprocessing

Camera NV21 / NV12 → model input in one call:

  • YUV_utilities_float: resize + rotate + YUV→RGB + float output, all fused
  • Eliminates 3 intermediate buffers and 3 memory passes
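
The fusion idea can be sketched in scalar C++: each output pixel is sampled, converted, and written as float in one loop, with no intermediate image. This is not the YUV_utilities_float API, whose signature is not shown on this page. The function name `nv21_to_rgb_float_nn`, the nearest-neighbor sampling, the BT.601 limited-range coefficients, and the [0, 1] output scaling are assumptions for the example, and rotation is omitted to keep it short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

inline float clamp01(float x) { return std::min(1.0f, std::max(0.0f, x)); }

// NV21 layout: a full-resolution Y plane followed by an interleaved V,U plane
// at half resolution in both directions. Output is interleaved RGB float.
inline std::vector<float> nv21_to_rgb_float_nn(const std::uint8_t* nv21,
                                               int src_w, int src_h,
                                               int dst_w, int dst_h) {
    assert(nv21 != nullptr && src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    assert(src_w % 2 == 0 && src_h % 2 == 0);

    const std::uint8_t* y_plane = nv21;
    const std::uint8_t* vu_plane = nv21 + static_cast<std::size_t>(src_w) * src_h;
    std::vector<float> out(static_cast<std::size_t>(dst_w) * dst_h * 3);

    for (int dy = 0; dy < dst_h; ++dy) {
        const int sy = dy * src_h / dst_h;      // nearest-neighbor source row
        for (int dx = 0; dx < dst_w; ++dx) {
            const int sx = dx * src_w / dst_w;  // nearest-neighbor source column
            const float y = 1.164f * (y_plane[sy * src_w + sx] - 16);
            const std::uint8_t* vu = vu_plane + (sy / 2) * src_w + (sx / 2) * 2;
            const float v = vu[0] - 128.0f;     // V comes first in NV21
            const float u = vu[1] - 128.0f;

            float* px = &out[(static_cast<std::size_t>(dy) * dst_w + dx) * 3];
            px[0] = clamp01((y + 1.596f * v) / 255.0f);               // R
            px[1] = clamp01((y - 0.813f * v - 0.391f * u) / 255.0f);  // G
            px[2] = clamp01((y + 2.018f * u) / 255.0f);               // B
        }
    }
    return out;
}
```

An unfused pipeline would write a resized YUV image, then an 8-bit RGB image, then a float image. Here the only buffer written is the final float output.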

Real-Time Video Processing (1920×1280)

  • Sobel3x3 gradient family: ~26× on 1-channel u8
  • Resize NN upscale 4× (1ch): 78× on Snapdragon 8 Gen 3, 52× on SD8 Gen 2 (camera upscale)
  • Per-pixel ops (threshold / inRange / bitwise / LUT): 27–76× — OCV's per-call overhead dominates these
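
As an illustration of why these per-pixel ops map well to NEON, a binary threshold on 1-channel u8 data needs only a compare and a mask per 16 pixels. The sketch below is not ACL Pack's code: `threshold_binary_u8` is a hypothetical name, and the semantics (dst = maxval where src > thresh, else 0) are assumed to follow the usual binary-threshold convention. It falls back to a scalar loop when NEON is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// dst[i] = (src[i] > thresh) ? maxval : 0
inline void threshold_binary_u8(const std::uint8_t* src, std::uint8_t* dst,
                                std::size_t n, std::uint8_t thresh,
                                std::uint8_t maxval) {
    assert(src != nullptr && dst != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vthresh = vdupq_n_u8(thresh);
    const uint8x16_t vmax = vdupq_n_u8(maxval);
    for (; i + 16 <= n; i += 16) {
        const uint8x16_t v = vld1q_u8(src + i);        // load 16 pixels
        const uint8x16_t mask = vcgtq_u8(v, vthresh);  // 0xFF where v > thresh
        vst1q_u8(dst + i, vandq_u8(mask, vmax));       // maxval or 0
    }
#endif
    // Scalar tail, and the full path on non-NEON builds.
    for (; i < n; ++i) dst[i] = src[i] > thresh ? maxval : 0;
}
```

The vector loop handles 16 pixels per iteration with one load, one compare, one AND, and one store. With so little work per pixel, any fixed per-call cost is a large share of the total time.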

Tiny-Image Pipelines (E3, 17×17)

  • ROI tile processing, thumbnail generation, vision-preprocess microbatches: 22–59× aggregate speedup across the 4 devices (Dimensity 9500 29.6×, SD8 Gen 3 22.7×, SD8 Gen 2 39.3×, Dimensity 9400 58.9×). OpenCV's fixed per-call overhead dominates at this size; ACL's header-only path has almost none.

Size Comparison

| Metric | ACL Pack | OpenCV |
|---|---|---|
| Delivery format | Header-only (.hpp) or static lib (.a) | Shared lib (.so) or static (.a) |
| Source size | ~43,000 LOC | ~500,000+ LOC |
| Compiled size (Android arm64) | Trial 0.24 MB / Starter 6.17 MB / Pro 7.12 MB / Business 7.51 MB (.a) | 15–50 MB (.so, depending on modules) |
| Compiled size (Linux aarch64) | Starter 3.94 MB / Pro 4.88 MB / Business 5.28 MB (.a) | same (.so) |
| External dependencies | None (zero) | zlib, libpng, libjpeg, libtiff, protobuf, etc. |
| Integration effort | Add include path | CMake find_package or prebuilt download |
| Runtime memory | User-managed (zero hidden allocation) | cv::Mat with implicit allocation + ref counting |

Conclusion

ACL Pack delivers measurable, reproducible performance advantages over OpenCV for the majority of image processing operations on Android ARM64. Across all 7 size tiers and 4 devices combined (6864 operator samples), 72.3% of NEON operator samples are faster than OpenCV 4.13.0, with an aggregate ∑OCV/∑ACL of 4.72×. The M-tier peak is 78× (resize_nn_up4x_1ch on Snapdragon 8 Gen 3), per-pixel ops such as inRange reach 76×, and tiny (17×17) images see a 22–59× aggregate speedup. These results position ACL Pack as the highest-performance image processing option for Android ARM64 applications. It ships as a paid-tier static library of 6.2–7.5 MB, with zero external runtime dependencies and bit-identical output across devices.

Full Data Access

This page presents the public aggregate view. The complete per-device, per-operator, per-size-tier dataset — including raw CSV (per-sample timing, accuracy, error, and pass status across all 4 devices × 7 size tiers × 163 callable entries) — is available to purchased-tier customers on request:

  • Email: zangotech@163.com
  • Subject: [Perf Data] + your license ID
  • Response within 1–2 business days

Trial-tier users can request a representative sample dataset for one device × one size tier by emailing with subject [Trial Perf Sample].

Benchmark data: v1.0.3 release regression run (May 2026), 4 devices × 7 size tiers.