ACL Pack Performance Whitepaper

Version: 1.0.3
Date: May 2026 (v1.0.3 release benchmark)
Test Platform: Android arm64-v8a, 4 flagship devices cross-validated
Comparison Baseline: OpenCV 4.13.0

Executive Summary

ACL Pack is a high-performance image processing library for Android arm64-v8a, deeply optimized with hand-written ARM NEON SIMD intrinsics. The metrics below are aggregated from the v1.0.3 release regression run (4 devices × 7 size tiers, see Changelog). All four devices produce bit-identical output on every size tier. Per-call timings on sub-millisecond operators carry normal mobile-OS jitter and may fluctuate run-to-run; the headline aggregate metrics are stable.

163 callable entries across 11 benchmarked categories: analysis, arithmetic, contour, cvtcolor, draw, feature, filter, geometric, math, memory, transform
4 devices × 7 size tiers full matrix coverage — bit-identical output across all four devices
NEON path, all 7 size tiers × 4 devices combined: 72.3% of operator samples (n=6864) are more than 5% faster than OpenCV 4.13.0; aggregate speedup 4.72×
Peak NEON speedup ~78× on M tier (resize_nn_up4x_1ch on Snapdragon 8 Gen 3, ACL 0.046 ms vs OCV 3.598 ms), with other notable peaks at inRange_1ch 76.42×, threshold_binary 61.79×, resize_area_up4x_1ch 49.51×, bitwise_and_1ch 29.59×, resize_area_up2x_1ch 28.03×
Static library of 6.2–7.5 MB for paid tiers (Trial: 0.24 MB) vs OpenCV's 15–50 MB shared library
Zero external runtime dependencies — just add the include path

Competitive Landscape

NEON Operator Count Comparison

The ACL Pack columns are counted against the authoritative operator_tiers.json manifest. Each counted entry is a distinct, documented operator family: CPP and NEON variants of the same operator count once, and overloads and type specializations are not counted separately, so the counts line up with how the competitor libraries report their own coverage. ACL Pack documents 113 operator families in total, of which 75 have a NEON implementation; the remaining 38 are CPP-only (mostly in the contour, draw, and math utility categories). The table breaks out 8 of the 11 benchmarked categories, so its per-category Total cells sum to 101 rather than 113. Counted instead as customer-facing callable entries (including N-image variants, kernel-size dispatchers, and CPP/NEON pairs that ship as separate functions), the SDK exposes 163 callable entries.

Category	ACL Pack (NEON)	ACL Pack (Total)	ppl.cv (SenseTime)	ARM Compute Library	Simd Library	libyuv
Filter	16	20	5	0	8	0
Color Convert	7	11	10	0	20+	~40
Geometric	7	7	4	5	4	~6
Arithmetic	12	14	8	10	5	~8
Analysis	8	18	3	0	7	~3
Feature Detection	16	16	0	0	3	0
Transform	4	10	5	0	0–1	0
Math (DFT)	5	5	—	—	—	0
Total	75	113	~35	~15	~50	~57

OpenCV is not in this table — it is a mixed-architecture CV development platform (scalar + cv::hal multi-path SIMD + 10+ external dependencies), a different category from pure NEON acceleration libraries. The page already compares against OpenCV 4.13.0 extensively on speed — see NEON-only Aggregate and Top NEON Operators.
Competitor numbers are external-research estimates based on each project's public headers / documentation. Counting conventions may differ slightly from ACL Pack's. For a strict per-operator comparison, email zangotech@163.com for the raw mapping table.

Unique Differentiators

The only ARM NEON library with a complete SIFT / SURF / ORB / Harris feature-detection suite — ppl.cv and ARM Compute Library ship zero feature-detection operators
Roughly 3× the filter coverage of ppl.cv (16 NEON filters vs 5) and 2× that of Simd Library (8): bilateral, guided, NL-Means, Gabor, and stack blur are all absent from ppl.cv
Complete transform pipeline — warpAffine, warpPerspective, findHomography, remap all NEON-optimized
NV21 fused pipeline — resize + rotate + color conversion in a single pass (for AI inference preprocessing)

Test Methodology

Hardware (4 flagship arm64-v8a devices)

SoC	Role
MediaTek Dimensity 9400	Reference baseline
MediaTek Dimensity 9500	Fastest absolute timings
Qualcomm Snapdragon 8 Gen 2	Slowest absolute timings
Qualcomm Snapdragon 8 Gen 3	Mid-range

Size Coverage

Benchmarks cover 7 input-size tiers spanning 640×480 (mobile preview) → 1920×1280 (primary FHD) → 4096×3072 (4K stress), plus odd-dimension and tiny-image edge cases. The FHD tier (M, 1920×1280) is the canonical reference on this page.

Test Protocol

Warm-up: CPU frequency stabilization phase before timing capture
Measurement: per-call timing in ms, matched ACL vs OpenCV 4.13.0 on identical inputs
Classification: samples within a ±5% tolerance band count as a tie; samples outside it count as faster or slower
Cross-validation: 4 independent devices run the full matrix; output PNGs are bit-identical across all four on every size tier
Commercial regression: 4 commercial tier builds (Trial / Starter / Pro / Business) reproduce the release baseline bit-identically
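
The warm-up, measurement, and classification steps above can be sketched as follows. This is a minimal illustration, not the benchmark harness used for this page: the names `classify` and `time_ms`, the warm-up and iteration counts, and the choice to apply the ±5% band to the ratio OCV time / ACL time are all assumptions made for the example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Outcome of one matched ACL-vs-OpenCV sample.
enum class Verdict { Faster, Tie, Slower };

// Assumption: the band is applied to the speedup ratio ocv_ms / acl_ms,
// so a ratio inside [1 - band, 1 + band] counts as a tie.
inline Verdict classify(double acl_ms, double ocv_ms, double band = 0.05) {
    assert(acl_ms > 0.0 && ocv_ms > 0.0);
    const double speedup = ocv_ms / acl_ms;
    if (speedup > 1.0 + band) return Verdict::Faster;
    if (speedup < 1.0 - band) return Verdict::Slower;
    return Verdict::Tie;
}

// Hypothetical timing helper: run `warmup` untimed calls so clocks and caches
// settle, then return the mean wall time of `iters` timed calls in milliseconds.
inline double time_ms(const std::function<void()>& op, int warmup = 10, int iters = 50) {
    assert(warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; ++i) op();
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) op();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}
```

With the headline M-tier figures from the Executive Summary, `classify(0.046, 3.598)` returns `Verdict::Faster`, since 3.598 / 0.046 is about 78.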

NEON-only Aggregate (All 7 Size Tiers, 4 Devices Combined)

The headline: ACL Pack's NEON path vs OpenCV 4.13.0 across the full size matrix (7 size tiers × 4 devices, 6864 operator samples).

Metric	Value
Faster than OpenCV (>5%)	72.3% (4963 / 6864)
Tie (±5%)	5.0% (342)
Slower than OpenCV (>5%)	22.7% (1559)
Aggregate speedup (∑OCV / ∑ACL)	4.72×
Peak speedup (ACL ≥ 0.05 ms)	76.4× (`inRange_1ch` on SD8 Gen 2); the ~78× `resize_nn_up4x_1ch` peak has an ACL time of 0.046 ms and falls below this cutoff

The aggregate baseline includes a small number of denoise / Gabor / guided-filter / Scharr operators (~7% of rows, 467 of 6864) where the comparison row is ACL's own scalar C++ path rather than a 1:1 OpenCV call, because OpenCV either lacks the operator, ships it in opencv_contrib, or uses a different algorithm class. Excluding those rows, the operator-sample "faster" share is 72.4% (4632 / 6397), so the per-sample faster ratio is robust to the choice of baseline. The ∑/∑ ratio is more sensitive, because a few large-ms denoise rows dominate the sum. Per-row truth is in the raw CSV; see Full Data Access.
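
To make the difference between the two headline metrics concrete, here is a small sketch that computes both from a list of samples. The `Sample` struct and function names are hypothetical, and the numbers in the usage note are invented for illustration; they are not benchmark data.

```cpp
#include <cassert>
#include <vector>

// One matched measurement: ACL time and baseline (OpenCV) time in milliseconds.
struct Sample { double acl_ms; double ocv_ms; };

// Aggregate speedup as defined in the table: sum of baseline times over sum of ACL times.
inline double aggregate_speedup(const std::vector<Sample>& s) {
    double acl = 0.0, ocv = 0.0;
    for (const Sample& x : s) { acl += x.acl_ms; ocv += x.ocv_ms; }
    assert(acl > 0.0);
    return ocv / acl;
}

// Share of samples whose individual speedup exceeds 1 + band (the "faster" class).
inline double faster_share(const std::vector<Sample>& s, double band = 0.05) {
    assert(!s.empty());
    int faster = 0;
    for (const Sample& x : s)
        if (x.ocv_ms / x.acl_ms > 1.0 + band) ++faster;
    return static_cast<double>(faster) / static_cast<double>(s.size());
}
```

For three invented light rows `{0.1, 0.5}`, `{0.2, 0.4}`, `{0.1, 0.08}`, the aggregate is about 2.45× and the faster share is 2/3. Adding a single heavy row `{50.0, 400.0}` moves the aggregate to about 7.96×, while the faster share only moves to 3/4. This is the sense in which a few large-ms rows dominate the ∑/∑ ratio but barely affect the per-sample share.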

M tier (1920×1280) — primary FHD reference

Metric	Value
Faster than OpenCV (>5%)	70.7%
Aggregate speedup (∑OCV / ∑ACL)	4.56×

Size-class speedup profile

Tiny images (E3, 17×17 — ROI tiles, thumbnails): 22–59× aggregate speedup across the four devices. OpenCV's per-call fixed overhead dominates at microsecond workloads; ACL's header-only path has almost none.
Mid-size (S 640×480 to M 1920×1280 — the realistic mobile camera / video regime): stable 4.4–4.6× aggregate speedup across all four devices.
4K (4096×3072): aggregate speedup is 4.80× as both libraries enter the memory-bandwidth-bound regime.
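
The shape of this profile (very large speedups on tiny images, settling to a single-digit ratio on large ones) is what a simple fixed-overhead cost model predicts. The sketch below is a toy model, not a fit to the benchmark data: the `CostModel` struct and every constant used with it are hypothetical.

```cpp
#include <cassert>

// Toy cost model: time = fixed per-call overhead + per-pixel cost * pixel count.
// The constants supplied by the caller are hypothetical, not measured values.
struct CostModel {
    double overhead_us;   // fixed cost per call, in microseconds
    double per_pixel_us;  // marginal cost per pixel, in microseconds

    double time_us(long long pixels) const {
        assert(pixels >= 0);
        return overhead_us + per_pixel_us * static_cast<double>(pixels);
    }
};

// Ratio of modeled baseline time to modeled fast-path time at a given image size.
inline double modeled_speedup(const CostModel& baseline, const CostModel& fast,
                              long long pixels) {
    return baseline.time_us(pixels) / fast.time_us(pixels);
}
```

With an invented baseline of `{5.0, 0.0004}` and an invented fast path of `{0.1, 0.0001}`, the modeled speedup is roughly 40× at 17×17 (289 pixels), where the fixed overhead dominates, and roughly 4× at 1920×1280, where the per-pixel term dominates.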

Per-device, per-operator timings across all 4 devices × 7 size tiers — including the complete raw CSV — are available to purchased-tier customers on request; see Full Data Access at the bottom of this page.

Top NEON Operators (M Tier)

Operator	Speedup
`Resize NN Up 4×`	up to 78.22×
`inRange (Mask)`	up to 76.42×
`Threshold Binary`	up to 61.79×
`Resize AREA Up 4×`	up to 49.51×
`BitwiseAnd`	up to 29.59×
`Resize AREA Up 2×`	up to 28.03×
`Sobel gradX`	up to 26.03×
`Alpha Fusion (α=0.5)`	up to 22.20×
`Gaussian Blur 5×5`	up to 15.71×
`Median Filter 3×3`	up to 14.99×
`normalize MINMAX`	up to 14.39×
`Rotate 180°`	up to 13.98×
`BGR → Lab`	up to 12.86×

Numbers are the per-operator peak across the 4-device cohort. (In the original chart, bar length follows log(speedup).)

Highlights (peak across the 4-device cohort, M tier):

Resize NN upscale 4× (1ch): 78.22× on Snapdragon 8 Gen 3, 51.58× on SD8 Gen 2 — camera-pipeline upscaling
Resize AREA upscale 4× (1ch): 49.51× on SD8 Gen 2
gammaTransform: LUT-based implementation beats OpenCV's per-pixel pow()
bgr2Lab (NEON): fixed-point + LUT, 12.86× over OpenCV's floating-point version on SD8 Gen 3
Threshold / inRange / bitwise (1ch u8): 27–76× — these are the cheapest per-pixel ops where OCV's overhead dominates
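
The LUT approach mentioned for gammaTransform can be sketched in a few lines. This is a generic illustration of the technique for 8-bit input, not ACL Pack's implementation; the function names `build_gamma_lut` and `apply_lut` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Precompute out = round(255 * (in / 255)^gamma) for all 256 possible u8 inputs.
inline std::array<std::uint8_t, 256> build_gamma_lut(double gamma) {
    assert(gamma > 0.0);
    std::array<std::uint8_t, 256> lut{};
    for (int i = 0; i < 256; ++i) {
        const double v = 255.0 * std::pow(i / 255.0, gamma);
        lut[static_cast<std::size_t>(i)] = static_cast<std::uint8_t>(std::lround(v));
    }
    return lut;
}

// Apply the table: one table load per pixel instead of one pow() per pixel.
inline void apply_lut(const std::uint8_t* src, std::uint8_t* dst, std::size_t n,
                      const std::array<std::uint8_t, 256>& lut) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = lut[src[i]];
}
```

The table costs 256 `pow()` calls once, regardless of image size, so the per-pixel work at 1920×1280 drops from about 2.4 million `pow()` calls to the same number of byte loads.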

Performance Highlights for Key Use Cases

AI Inference Preprocessing

Camera NV21 / NV12 → model input in one call:

YUV_utilities_float: resize + rotate + YUV→RGB + float output, all fused
Eliminates 3 intermediate buffers and 3 memory passes
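
As an illustration of what fusing these stages means, the scalar sketch below samples, converts, and normalizes each output pixel in one loop, so no intermediate image is ever written. It is not the YUV_utilities_float implementation or its signature: the function name, the nearest-neighbour sampling, the full-range BT.601 coefficients, the [0, 1] output scaling, and the omission of rotation are all assumptions made to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fused nearest-neighbour resize + NV21 -> RGB + float normalisation in one pass.
// Assumptions: even source dimensions, full-range BT.601 coefficients,
// interleaved RGB output scaled to [0, 1]. Rotation is omitted for brevity.
inline std::vector<float> nv21_to_rgb_float(const std::uint8_t* nv21, int src_w, int src_h,
                                            int dst_w, int dst_h) {
    assert(nv21 && src_w > 0 && src_h > 0 && dst_w > 0 && dst_h > 0);
    assert(src_w % 2 == 0 && src_h % 2 == 0);
    const std::uint8_t* y_plane = nv21;
    // NV21 stores the full-resolution Y plane first, then interleaved V,U pairs.
    const std::uint8_t* vu_plane = nv21 + static_cast<std::size_t>(src_w) * src_h;
    std::vector<float> out(static_cast<std::size_t>(dst_w) * dst_h * 3);

    auto to_unit = [](float c) { return std::min(std::max(c, 0.0f), 255.0f) / 255.0f; };

    for (int dy = 0; dy < dst_h; ++dy) {
        const int sy = dy * src_h / dst_h;  // source row (floor mapping)
        for (int dx = 0; dx < dst_w; ++dx) {
            const int sx = dx * src_w / dst_w;  // source column (floor mapping)
            const float y = y_plane[static_cast<std::size_t>(sy) * src_w + sx];
            // One V,U pair is shared by each 2x2 block of luma samples.
            const std::size_t vu = static_cast<std::size_t>(sy / 2) * src_w
                                 + static_cast<std::size_t>(sx / 2) * 2;
            const float v = vu_plane[vu] - 128.0f;
            const float u = vu_plane[vu + 1] - 128.0f;
            float* px = &out[(static_cast<std::size_t>(dy) * dst_w + dx) * 3];
            px[0] = to_unit(y + 1.402f * v);                  // R
            px[1] = to_unit(y - 0.344136f * u - 0.714136f * v);  // G
            px[2] = to_unit(y + 1.772f * u);                  // B
        }
    }
    return out;
}
```

An unfused pipeline would write a resized image, a rotated image, and an 8-bit RGB image before producing the float tensor; here each output value is computed directly from the NV21 source and written once.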

Real-Time Video Processing (1920×1280)

Sobel3x3 gradient family: ~26× on 1-channel u8
Resize NN upscale 4× (1ch): 78× on Snapdragon 8 Gen 3, 52× on SD8 Gen 2 (camera upscale)
Per-pixel ops (threshold / inRange / bitwise / LUT): 27–76× — OCV's per-call overhead dominates these
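
Per-pixel u8 operators map naturally onto NEON because one 128-bit register holds 16 pixels. The sketch below shows a binary threshold written that way, with a scalar path for non-NEON builds and for the tail. It is a generic example of the technique, not ACL Pack's code; the function name `threshold_binary_u8` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Binary threshold on 1-channel u8: dst = (src > thresh) ? maxval : 0.
inline void threshold_binary_u8(const std::uint8_t* src, std::uint8_t* dst, std::size_t n,
                                std::uint8_t thresh, std::uint8_t maxval) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vthresh = vdupq_n_u8(thresh);
    const uint8x16_t vmax = vdupq_n_u8(maxval);
    for (; i + 16 <= n; i += 16) {
        const uint8x16_t v = vld1q_u8(src + i);        // load 16 pixels
        const uint8x16_t mask = vcgtq_u8(v, vthresh);  // 0xFF where src > thresh
        vst1q_u8(dst + i, vandq_u8(mask, vmax));       // maxval where mask is set, else 0
    }
#endif
    // Scalar tail on NEON builds; full scalar fallback elsewhere.
    for (; i < n; ++i) dst[i] = (src[i] > thresh) ? maxval : 0;
}
```

The vector loop needs one load, one compare, one AND, and one store per 16 pixels, with no branches, which is why fixed per-call overhead becomes a visible share of the total for operators this cheap.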

Tiny-Image Pipelines (E3, 17×17)

ROI tile processing, thumbnail generation, vision-preprocess microbatches: 22–59× aggregate speedup across the 4 devices (Dimensity 9500 29.6×, SD8 Gen 3 22.7×, SD8 Gen 2 39.3×, Dimensity 9400 58.9×). OpenCV's fixed per-call overhead dominates at this size; ACL's header-only path has almost none.

Size Comparison

Metric	ACL Pack	OpenCV
Delivery format	Header-only (.hpp) or static lib (.a)	Shared lib (.so) or static (.a)
Source size	~43,000 LOC	~500,000+ LOC
Compiled size (Android arm64)	Trial 0.24 MB / Starter 6.17 MB / Pro 7.12 MB / Business 7.51 MB (.a)	15–50 MB (.so, depending on modules)
Compiled size (Linux aarch64)	Starter 3.94 MB / Pro 4.88 MB / Business 5.28 MB (.a)	Same as above (.so)
External dependencies	None (zero)	zlib, libpng, libjpeg, libtiff, protobuf, etc.
Integration effort	Add include path	CMake find_package or prebuilt download
Runtime memory	User-managed (zero hidden allocation)	cv::Mat with implicit allocation + ref counting

Conclusion

ACL Pack delivers measurable, reproducible performance advantages over OpenCV for the majority of image processing operations on Android ARM64. Across all 7 size tiers and 4 devices combined (6864 operator samples), 72.3% of NEON operator samples are faster than OpenCV 4.13.0, with an aggregate ∑OCV/∑ACL of 4.72×. The M-tier peak is 78× (resize_nn_up4x_1ch on Snapdragon 8 Gen 3), per-pixel ops such as inRange reach 76×, and tiny (E3, 17×17) images show a 22–59× aggregate edge. These results position ACL Pack as the highest-performance image processing option for Android ARM64 applications. It is delivered as a paid-tier static library of 6.2–7.5 MB, with zero external runtime dependencies and bit-identical output across devices.

Full Data Access

This page presents the public aggregate view. The complete per-device, per-operator, per-size-tier dataset — including raw CSV (per-sample timing, accuracy, error, and pass status across all 4 devices × 7 size tiers × 163 callable entries) — is available to purchased-tier customers on request:

Email: zangotech@163.com
Subject: [Perf Data] + your license ID
Response within 1–2 business days

Trial-tier users can request a representative sample dataset for one device × one size tier by emailing with subject [Trial Perf Sample].

Benchmark data: v1.0.3 release regression run (May 2026), 4 devices × 7 size tiers.
