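Image Recognition with the Android NDK

Preface

Real-time image recognition on mobile involves three technical stages: camera frame capture (Camera2 API / ImageReader), a preprocessing pipeline (YUV→RGB conversion plus scaling to the model's input size), and an inference engine (TF Lite / OpenCV DNN). Moving these stages into the NDK layer avoids GC jitter from the Java/Kotlin layer, keeps JNI crossings to a minimum (ideally one call per frame), and lets you use NEON SIMD instructions to speed up pixel transforms and matrix multiplication.

This article uses real-time object classification (MobileNetV2) and face detection (OpenCV CascadeClassifier) as two running examples. It covers TF Lite's inference backends (CPU with XNNPACK, GPU, and NNAPI), OpenCV's DNN module, and the full Camera2→YUV→RGB→inference pipeline, with C++/JNI code throughout.

Toolchain versions: Android NDK r26, OpenCV 4.8.0, TensorFlow Lite 2.14, libyuv r1830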

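1. TF Lite on Android: Integration and Architecture

1.1 AAR Dependency Configuration

TensorFlow Lite is distributed as Android AAR packages through Maven (the org.tensorflow artifacts are published on Maven Central):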

// build.gradle (app module)
android {
    defaultConfig {
        ndk {
            abiFilters 'armeabi-v7a', 'arm64-v8a', 'x86_64'
        }
    }
}

dependencies {
    // TF Lite base runtime
    implementation 'org.tensorflow:tensorflow-lite:2.14.0'
    // Optional: GPU delegate
    implementation 'org.tensorflow:tensorflow-lite-gpu:2.14.0'
    // Optional: GPU delegate API (for fine-grained control)
    implementation 'org.tensorflow:tensorflow-lite-gpu-api:2.14.0'
    // Optional: Support Library (image preprocessing, TensorBuffer, and other utilities)
    implementation 'org.tensorflow:tensorflow-lite-support:0.4.4'
}

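Check which ABIs each delegate library covers before narrowing abiFilters:

tensorflow-lite: ships .so files for armeabi-v7a, arm64-v8a, x86, and x86_64
tensorflow-lite-gpu: arm64-v8a is the main target (the GPU delegate needs OpenCL or OpenGL ES 3.1+); inspect the AAR's jni/ directory to confirm which other ABIs your version includes
NNAPI delegate: built into the base runtime (enabled through Interpreter.Options)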

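1.2 Interpreter API Lifecycle

The core of TF Lite's C++ inference path is tflite::Interpreter. Its lifecycle:

┌──────────────────┐   AllocateTensors()   ┌──────────────────┐
│     Created      │ ────────────────────→ │    Allocated     │
│  (model loaded)  │                       │ (tensors ready)  │
└──────────────────┘                       └────────┬─────────┘
         ↑                                          │
         │                                      Invoke()
         │                                          │
┌────────┴─────────┐  destructor / close() ┌────────▼─────────┐
│      Closed      │ ←──────────────────── │     Running      │
│ (resources freed)│                       │ (repeat Invoke)  │
└──────────────────┘                       └──────────────────┘

Key methods (C++ name, with the Java equivalent in parentheses):

AllocateTensors() (Java: allocateTensors()): determines the memory layout of all intermediate tensors from the model graph and allocates them
Invoke() (Java: run()): performs one complete forward pass
Destructor (Java: close()): releases the interpreter; external resources such as a GPU delegate are released separately, after the interpreter is gone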

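2. Model Loading Strategy: Memory Mapping vs Buffered I/O

2.1 MappedByteBuffer (recommended)

The standard way to load a TF Lite model on Android is a MappedByteBuffer, which uses mmap() underneath to map the model file into virtual memory instead of copying the whole model onto the heap. AssetManager.openFd() only works when the asset is stored uncompressed in the APK, so exclude .tflite files from compression (for example with noCompress in the Gradle android block):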

// Kotlin layer
@Throws(IOException::class)
fun loadModelFile(context: Context, modelFileName: String): MappedByteBuffer {
    // openFd() requires the asset to be stored uncompressed in the APK
    context.assets.openFd(modelFileName).use { afd ->
        FileInputStream(afd.fileDescriptor).use { input ->
            // The mapping stays valid after the stream and descriptor are closed
            return input.channel.map(
                FileChannel.MapMode.READ_ONLY,
                afd.startOffset,
                afd.declaredLength
            )
        }
    }
}

// Hand the mapped buffer to the Java/Kotlin Interpreter
val interpreter = Interpreter(loadModelFile(context, "mobilenet_v2.tflite"))

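2.2 Mapping Directly from a File Path (C++ layer)

In the NDK you can load a model file with the mmap() system call directly. FlatBufferModel::BuildFromBuffer() does not copy, so the mapping has to outlive the model; the struct below keeps the two together. When a plain file path is available, FlatBufferModel::BuildFromFile() is the shorter route.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <memory>
#include "tensorflow/lite/model.h"

// Owns the mapping together with the model built on top of it
struct MappedModel {
    void *addr = nullptr;
    size_t size = 0;
    std::unique_ptr<tflite::FlatBufferModel> model;

    ~MappedModel() {
        model.reset();                  // destroy the model first
        if (addr) munmap(addr, size);   // then release the mapping
    }
};

std::unique_ptr<MappedModel> load_model(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    // Get the file size
    struct stat st {};
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return nullptr; }

    // Read-only mapping
    void *mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (mapped == MAP_FAILED) return nullptr;

    auto result = std::make_unique<MappedModel>();
    result->addr = mapped;
    result->size = static_cast<size_t>(st.st_size);

    // FlatBufferModel is built on the mapped memory (no copy)
    result->model = tflite::FlatBufferModel::BuildFromBuffer(
        static_cast<const char *>(mapped), result->size);
    if (!result->model) return nullptr;  // ~MappedModel unmaps
    return result;
}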

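2.3 Memory Mapping vs Conventional Reads

Method	Physical memory use	Load time	When to use
`ByteBuffer.allocateDirect()`	1 × model size (off-heap copy)	File read time + copy time	Small models / files outside assets
`MappedByteBuffer`	Page-cache granularity (4 KB pages loaded on demand)	Page-table setup only	Recommended
`mmap()`	Same as MappedByteBuffer	Same as MappedByteBuffer	Direct use from the NDK

The real advantage of mmap() is that the file's pages are mapped into the process address space and backed by the kernel page cache: disk I/O happens only when a page is first touched, and no second copy lands on the heap. Inference does eventually touch every weight, so for a 10 to 50 MB model the resident size can approach the file size, but those pages are clean and file-backed, so the kernel can drop them under memory pressure and reload them later instead of counting them against the app's private heap.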
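The ownership side of mmap() is easy to get wrong, because the mapping must stay alive as long as anything points into it. A minimal RAII sketch using only POSIX calls; MappedFile is a hypothetical helper name, not part of TF Lite or the NDK:

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: owns a read-only file mapping and unmaps it on destruction.
class MappedFile {
public:
    explicit MappedFile(const char *path) {
        assert(path != nullptr);
        int fd = open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st {};
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            void *p = mmap(nullptr, static_cast<size_t>(st.st_size),
                           PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = p;
                size_ = static_cast<size_t>(st.st_size);
            }
        }
        close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (data_) munmap(data_, size_);
    }
    MappedFile(const MappedFile &) = delete;
    MappedFile &operator=(const MappedFile &) = delete;

    bool ok() const { return data_ != nullptr; }
    const char *data() const { return static_cast<const char *>(data_); }
    size_t size() const { return size_; }

private:
    void *data_ = nullptr;
    size_t size_ = 0;
};
```

An object like this would be declared before the model that reads from it, so that C++ destruction order (reverse of declaration) releases the model first and the mapping last.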

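3. Inference Backends Compared: GPU / NNAPI / XNNPACK

3.1 GPU Delegate (OpenCL / OpenGL ES)

The GPU delegate accelerates inference by mapping the compute graph onto OpenCL kernels or GL shader programs. For convolution-heavy deep networks such as ResNet or Inception, speedups of up to 5 to 10× over the CPU path are possible, depending on the device and on how much of the graph the delegate accepts.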

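// Java API
Interpreter.Options options = new Interpreter.Options();
GpuDelegate delegate = new GpuDelegate(
    new GpuDelegateFactory.Options()
        .setPrecisionLossAllowed(true)  // allow FP16 arithmetic
);
options.addDelegate(delegate);
Interpreter interpreter = new Interpreter(model, options);
// On shutdown: interpreter.close() first, then delegate.close()

// C++ NDK API
#include "tensorflow/lite/delegates/gpu/delegate.h"

TfLiteGpuDelegateOptionsV2 opts = TfLiteGpuDelegateOptionsV2Default();
opts.inference_priority1 = TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY;
// FAST_SINGLE_ANSWER favors one-shot use; SUSTAINED_SPEED is the alternative
// for repeated inference on a camera stream
opts.inference_preference = TFLITE_GPU_INFERENCE_PREFERENCE_FAST_SINGLE_ANSWER;
opts.is_precision_loss_allowed = 1;  // allow FP16 for better performance

TfLiteDelegate *gpu_delegate = TfLiteGpuDelegateV2Create(&opts);
if (interpreter->ModifyGraphWithDelegate(gpu_delegate) != kTfLiteOk) {
    // Delegation failed: release the delegate and fall back
    // (rebuilding the interpreter is the safest option)
    TfLiteGpuDelegateV2Delete(gpu_delegate);
    gpu_delegate = nullptr;
}
// On shutdown: destroy the interpreter first, then call TfLiteGpuDelegateV2Delete()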

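Choosing between FP16 and FP32 precision:

Precision	GPU memory	Inference speed	Accuracy impact
FP32	1×	Baseline	None
FP16	0.5×	1.5 to 3× faster	Very small (< 0.1% top-1 difference)

On mobile GPUs (Adreno / Mali), FP16 arithmetic throughput is often about twice that of FP32, which is why FP16 mode raises throughput noticeably.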

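3.2 NNAPI Delegate (DSP / NPU acceleration)

The Android Neural Networks API (NNAPI) is a system-level inference acceleration framework introduced in Android 8.1 (API 27). It lets TF Lite route operators to a vendor's DSP, NPU, or custom accelerator. NNAPI is deprecated as of Android 15, so treat this path as an option for devices already in the field rather than a long-term target.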

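// Select a specific accelerator
#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"

tflite::StatefulNnApiDelegate::Options nnapi_options;
nnapi_options.accelerator_name = "qti-dsp";  // Qualcomm DSP (names are driver-specific)
// Compilation caching needs an app-writable directory and a model token.
// app_cache_dir is a placeholder for the path from Context.getCacheDir();
// /data/local/tmp is not writable by a regular app.
nnapi_options.cache_dir = app_cache_dir;
nnapi_options.model_token = "mobilenet_v2";  // placeholder token
nnapi_options.max_number_delegated_partitions = 10;
nnapi_options.allow_fp16 = true;

// The delegate must outlive the interpreter; delete it only afterwards
auto *nnapi_delegate = new tflite::StatefulNnApiDelegate(nnapi_options);
if (interpreter->ModifyGraphWithDelegate(nnapi_delegate) != kTfLiteOk) {
    delete nnapi_delegate;
    nnapi_delegate = nullptr;
}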

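Examples of NNAPI accelerator names. The names are defined by each vendor's driver and vary across devices and OS versions; on API 29+ they can be listed at runtime with ANeuralNetworks_getDeviceCount() and ANeuralNetworksDevice_getName(). Treat the entries below as illustrative:

Device	Accelerator name	Supported ops
Qualcomm (Hexagon DSP)	`qti-dsp`	Conv2D, DepthwiseConv, Pooling, ReLU
Qualcomm (Hexagon NPU)	`qti-npu`	The above + Softmax, FC
MediaTek (APU)	`mtk-apu`	Conv, FC, BN, Pooling
Samsung (NPU)	`exynos-npu`	Most main-graph ops
Google Tensor	`google-nnapi`	Full op coverage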

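3.3 XNNPACK Delegate (CPU optimization)

XNNPACK is a cross-platform CPU inference library developed by Google. It focuses on operator fusion (such as Conv2D+ReLU and Conv2D+BatchNorm), memory-layout optimization (NHWC to an internal optimized layout), and NEON/SIMD acceleration.

// XNNPACK is the default CPU backend in TF Lite 2.9+
// In Java it can be toggled explicitly with Interpreter.Options.setUseXNNPACK()
// In C++ the thread count is set on the interpreter itself
interpreter->SetNumThreads(4);  // use up to 4 worker threads (does not pin cores)

// XNNPACK applies these fusions automatically:
//   Conv2D + BiasAdd + ReLU → a single kernel call
//   DepthwiseConv + ReLU → fused
//   FullyConnected + ReLU → fused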

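Backend performance (MobileNetV2, Pixel 6, 224×224, single inference; indicative figures, so re-measure on your target device):

Backend	Latency (ms)	Power (mW)	Notes
XNNPACK (4 threads)	8.2	850	Stable; no dependence on the hardware vendor
NNAPI (DSP)	5.1	420	Needs a vendor implementation; some ops may be unsupported
GPU (FP16)	4.3	680	First inference pays a warm-up cost
GPU (FP32)	6.8	720	-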

4. OpenCV NDK Integration and the Preprocessing Pipeline

4.1 CMake Configuration

cmake_minimum_required(VERSION 3.18)
project(image_recognition)

# Point CMake at the OpenCV Android SDK's package config
set(OpenCV_DIR ${CMAKE_SOURCE_DIR}/third_party/OpenCV-android-sdk/sdk/native/jni)
find_package(OpenCV REQUIRED COMPONENTS core imgproc dnn objdetect)
include_directories(${OpenCV_INCLUDE_DIRS})

add_library(native_recognition SHARED
    native_recognition.cpp
    camera_processor.cpp
    classifier.cpp)
# OpenCV_LIBS resolves to the imported targets for the requested modules.
# TF Lite and libyuv have to be linked as well (omitted here).
target_link_libraries(native_recognition
    ${OpenCV_LIBS}
    ${CMAKE_DL_LIBS} android log)

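4.2 CascadeClassifier: Loading an LBP Cascade from Assets

OpenCV's CascadeClassifier::load() takes a file path. XML cascade files packaged as Android assets live inside the APK and have no path of their own, so they must first be copied to an app-writable location. A file descriptor works only when it refers to a standalone file:

// Method 1: copy the asset into the app's cache directory
// cache_dir: path obtained on the Java side from Context.getCacheDir()
bool load_cascade_from_asset(AAssetManager *mgr, const char *filename,
                              const std::string &cache_dir,
                              cv::CascadeClassifier &classifier) {
    AAsset *asset = AAssetManager_open(mgr, filename, AASSET_MODE_BUFFER);
    if (!asset) return false;

    off_t size = AAsset_getLength(asset);
    const void *data = AAsset_getBuffer(asset);
    if (!data || size <= 0) { AAsset_close(asset); return false; }

    // Write to a temporary file (a regular app cannot write to /data/local/tmp)
    std::string tmp_path = cache_dir + "/" + filename;
    FILE *fp = fopen(tmp_path.c_str(), "wb");
    if (!fp) { AAsset_close(asset); return false; }
    size_t written = fwrite(data, 1, size, fp);
    fclose(fp);
    AAsset_close(asset);
    if (written != static_cast<size_t>(size)) return false;

    return classifier.load(tmp_path);
}

// Method 2: load through a file descriptor
// The fd is opened on the Java side and passed down through JNI
bool load_cascade_from_fd(int fd, off64_t start, off64_t length,
                           cv::CascadeClassifier &classifier) {
    // An asset fd refers to the whole APK, with the asset at [start, start + length).
    // load() reopens the path and reads from offset 0, so this only works for
    // a standalone file whose content begins at offset 0.
    if (start != 0) return false;
    (void)length;

    // Use the /proc/self/fd/{fd} path
    char fd_path[64];
    snprintf(fd_path, sizeof(fd_path), "/proc/self/fd/%d", fd);
    return classifier.load(fd_path);
}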

4.3 Forward Inference with the DNN Module

OpenCV's cv::dnn::Net loads models in Caffe, TensorFlow, ONNX, Darknet, and other formats, and offers a uniform preprocessing and post-processing interface:

#include <opencv2/dnn.hpp>

cv::dnn::Net net;

// Use ONE of the loaders below; each assignment replaces the previous net
// Option 1: load a Caffe model
net = cv::dnn::readNetFromCaffe("deploy.prototxt", "model.caffemodel");

// Option 2: load a TensorFlow model
net = cv::dnn::readNetFromTensorflow("frozen_graph.pb", "graph.pbtxt");

// Option 3: load an ONNX model
net = cv::dnn::readNetFromONNX("model.onnx");

// Preprocessing: blobFromImage
cv::Mat blob = cv::dnn::blobFromImage(
    input_mat,          // input image (H×W×C, 3-channel BGR)
    1.0 / 127.5,        // scale factor: maps to [-1, 1] after mean subtraction
    cv::Size(224, 224), // target size
    cv::Scalar(127.5, 127.5, 127.5),  // mean subtraction: R G B
    true,               // swapRB: BGR→RGB
    false               // crop: false = plain resize, aspect ratio not preserved
);

// Forward pass
net.setInput(blob);
cv::Mat output = net.forward("prob");  // output layer name (model-specific)

// Top-K extraction
cv::Mat sorted_idx;
cv::sortIdx(output.reshape(1, 1), sorted_idx,
            cv::SORT_EVERY_ROW | cv::SORT_DESCENDING);

for (int i = 0; i < 5; i++) {
    int class_id = sorted_idx.at<int>(0, i);
    float prob = output.at<float>(0, class_id);
    LOGI("Top-%d: class=%d, prob=%.4f", i + 1, class_id, prob);
}

The sequence of operations blobFromImage performs internally:

Input H×W×C (BGR u8) → resize to target Size (center crop when crop = true)
→ optional swapRB → mean subtraction → multiply by scale factor
→ HWC to CHW layout → (1, C, H, W) 4D blob

In formula form: blob[c][y][x] = scale * (img[y][x][c] - mean[c])
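The arithmetic and layout change in that formula can be shown without OpenCV. The sketch below is a hypothetical stand-in (hwc_to_chw_blob is not an OpenCV function) that skips resizing and only performs the channel swap, mean subtraction, scaling, and HWC→CHW reordering on a 3-channel 8-bit image:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the arithmetic/layout part of blobFromImage:
// interleaved 3-channel u8 (HWC) in, planar float (CHW) out. No resizing.
std::vector<float> hwc_to_chw_blob(const uint8_t *img, int height, int width,
                                   float scale, const float mean[3],
                                   bool swap_rb) {
    assert(img != nullptr && height > 0 && width > 0);
    const size_t plane = static_cast<size_t>(height) * width;
    std::vector<float> blob(3 * plane);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const size_t px = static_cast<size_t>(y) * width + x;
            for (int c = 0; c < 3; ++c) {
                // With swap_rb, output channel 0 reads input channel 2 and vice versa.
                const int src_c = swap_rb ? 2 - c : c;
                // mean[c] is indexed by OUTPUT channel, i.e. R,G,B order after the swap.
                blob[c * plane + px] =
                    scale * (static_cast<float>(img[px * 3 + src_c]) - mean[c]);
            }
        }
    }
    return blob;
}
```

The output is one full plane per channel, which is the (C, H, W) part of the 4D blob; the leading batch dimension of 1 adds no data.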

5. Camera2 → YUV → RGB Real-Time Pipeline

5.1 Camera2 / ImageReader Configuration

// Kotlin layer: configure the ImageReader
val imageReader = ImageReader.newInstance(
    previewSize.width, previewSize.height,
    ImageFormat.YUV_420_888,  // recommended format
    3  // maxImages: number of buffers (3 allows pipelining)
)

// Register the OnImageAvailableListener
imageReader.setOnImageAvailableListener({ reader ->
    val image = reader.acquireLatestImage() ?: return@setOnImageAvailableListener
    try {
        // Hand the frame to native code
        nativeProcessImage(image, nativeHandle)
    } finally {
        image.close()  // always return the buffer to the reader
    }
}, backgroundHandler)

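5.2 YUV_420_888 → RGB Conversion

YUV_420_888 is the recommended Camera2 output format. It is a generic 4:2:0 container: the underlying layout may be semi-planar (NV21 / NV12) or planar (I420 / YV12), as decided by the device's camera HAL. Conversion code therefore has to honor each plane's rowStride and pixelStride: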

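#include <libyuv.h>

// Converts an android.media.Image (YUV_420_888) into an RGBA cv::Mat.
// Returns false if the plane buffers cannot be accessed or conversion fails.
bool yuv_to_rgb_libyuv(JNIEnv *env, jobject image, cv::Mat &rgba) {
    jclass image_class = env->GetObjectClass(image);
    jmethodID get_width  = env->GetMethodID(image_class, "getWidth", "()I");
    jmethodID get_height = env->GetMethodID(image_class, "getHeight", "()I");
    int width  = env->CallIntMethod(image, get_width);
    int height = env->CallIntMethod(image, get_height);

    // Fetch the three planes
    jmethodID get_planes = env->GetMethodID(image_class, "getPlanes",
        "()[Landroid/media/Image$Plane;");
    jobjectArray planes = (jobjectArray)env->CallObjectMethod(image, get_planes);

    jobject y_plane  = env->GetObjectArrayElement(planes, 0);
    jobject u_plane  = env->GetObjectArrayElement(planes, 1);
    jobject v_plane  = env->GetObjectArrayElement(planes, 2);

    jclass plane_class = env->GetObjectClass(y_plane);
    jmethodID get_buffer = env->GetMethodID(plane_class, "getBuffer", "()Ljava/nio/ByteBuffer;");
    jmethodID get_pixel_stride = env->GetMethodID(plane_class, "getPixelStride", "()I");
    jmethodID get_row_stride = env->GetMethodID(plane_class, "getRowStride", "()I");

    jobject y_buf = env->CallObjectMethod(y_plane, get_buffer);
    jobject u_buf = env->CallObjectMethod(u_plane, get_buffer);
    jobject v_buf = env->CallObjectMethod(v_plane, get_buffer);

    uint8_t *y_data = (uint8_t *)env->GetDirectBufferAddress(y_buf);
    uint8_t *u_data = (uint8_t *)env->GetDirectBufferAddress(u_buf);
    uint8_t *v_data = (uint8_t *)env->GetDirectBufferAddress(v_buf);

    int y_row_stride    = env->CallIntMethod(y_plane, get_row_stride);
    int u_row_stride    = env->CallIntMethod(u_plane, get_row_stride);
    int v_row_stride    = env->CallIntMethod(v_plane, get_row_stride);
    int uv_pixel_stride = env->CallIntMethod(u_plane, get_pixel_stride);

    bool ok = y_data && u_data && v_data;
    if (ok) {
        rgba.create(height, width, CV_8UC4);
        // Android420ToABGR covers planar and semi-planar layouts through the
        // pixel stride. libyuv "ABGR" is R,G,B,A in memory, i.e. RGBA for OpenCV.
        ok = libyuv::Android420ToABGR(
                 y_data, y_row_stride,
                 u_data, u_row_stride,
                 v_data, v_row_stride,
                 uv_pixel_stride,
                 rgba.data, static_cast<int>(rgba.step[0]),
                 width, height) == 0;
    }

    // Release local references (this function runs once per frame)
    env->DeleteLocalRef(y_buf);
    env->DeleteLocalRef(u_buf);
    env->DeleteLocalRef(v_buf);
    env->DeleteLocalRef(y_plane);
    env->DeleteLocalRef(u_plane);
    env->DeleteLocalRef(v_plane);
    env->DeleteLocalRef(planes);
    env->DeleteLocalRef(plane_class);
    env->DeleteLocalRef(image_class);
    return ok;
}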
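The role of rowStride and pixelStride is easier to see without JNI. The sketch below repacks generic YUV_420_888 planes into a tightly packed NV21 buffer; PlaneView and pack_nv21 are hypothetical names introduced for illustration, and the code assumes even width and height and a luma pixel stride of 1:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical view of one Image.Plane: base pointer plus the two strides.
struct PlaneView {
    const uint8_t *data;
    int row_stride;    // bytes between the starts of consecutive rows
    int pixel_stride;  // bytes between consecutive samples within a row
};

// Repack YUV_420_888 planes into packed NV21: Y plane, then interleaved V,U pairs.
std::vector<uint8_t> pack_nv21(const PlaneView &y, const PlaneView &u,
                               const PlaneView &v, int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    assert(y.pixel_stride == 1);
    std::vector<uint8_t> out(static_cast<size_t>(width) * height * 3 / 2);
    uint8_t *dst = out.data();

    // Luma: copy row by row, dropping any padding at the end of each row.
    for (int row = 0; row < height; ++row) {
        std::memcpy(dst, y.data + static_cast<size_t>(row) * y.row_stride, width);
        dst += width;
    }
    // Chroma: one U and one V sample per 2x2 block, located through the strides.
    for (int row = 0; row < height / 2; ++row) {
        const uint8_t *u_row = u.data + static_cast<size_t>(row) * u.row_stride;
        const uint8_t *v_row = v.data + static_cast<size_t>(row) * v.row_stride;
        for (int col = 0; col < width / 2; ++col) {
            *dst++ = v_row[static_cast<size_t>(col) * v.pixel_stride];
            *dst++ = u_row[static_cast<size_t>(col) * u.pixel_stride];
        }
    }
    return out;
}
```

The same function handles a planar source (pixel stride 1, separate U and V buffers) and a semi-planar one (pixel stride 2, U and V views pointing one byte apart into the same buffer), which is the point of reading the strides instead of assuming a layout.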

5.3 Manual Fixed-Point YUV→RGB (without libyuv)

When libyuv is not available, lookup tables give an efficient fixed-point pixel transform:

// BT.601 full-range YUV→RGB conversion (Q8 fixed point)
// R = Y + 1.402 * (V-128)
// G = Y - 0.344 * (U-128) - 0.714 * (V-128)
// B = Y + 1.772 * (U-128)

// Precomputed tables: cache the chroma term for every possible U and V value.
// int32_t is required: 1.772 * 127 * 256 is about 57610, which overflows int16_t.
static int32_t g_yr_table[256];  // 1.402 * (V-128), Q8
static int32_t g_ug_table[256];  // -0.344 * (U-128), Q8
static int32_t g_vg_table[256];  // -0.714 * (V-128), Q8
static int32_t g_ub_table[256];  // 1.772 * (U-128), Q8

void init_yuv_tables() {
    for (int i = 0; i < 256; i++) {
        g_yr_table[i] = (int32_t)(1.402f  * (i - 128) * 256.0f);
        g_ug_table[i] = (int32_t)(-0.344f * (i - 128) * 256.0f);
        g_vg_table[i] = (int32_t)(-0.714f * (i - 128) * 256.0f);
        g_ub_table[i] = (int32_t)(1.772f  * (i - 128) * 256.0f);
    }
}

// Call init_yuv_tables() once before the first conversion
inline uint32_t yuv_to_argb(uint8_t y, uint8_t u, uint8_t v) {
    // +128 rounds to nearest before the Q8 shift
    int r = y + ((g_yr_table[v] + 128) >> 8);
    int g = y + ((g_ug_table[u] + g_vg_table[v] + 128) >> 8);
    int b = y + ((g_ub_table[u] + 128) >> 8);

    // clamp to [0, 255]
    r = (r < 0) ? 0 : (r > 255) ? 255 : r;
    g = (g < 0) ? 0 : (g > 255) ? 255 : g;
    b = (b < 0) ? 0 : (b > 255) ? 255 : b;

    // Packed as 0xAARRGGBB
    return 0xFF000000u | ((uint32_t)r << 16) | ((uint32_t)g << 8) | (uint32_t)b;
}
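To show how a per-pixel conversion is applied to a whole frame, here is a table-free sketch that walks a packed NV21 buffer. It uses integer Q8 coefficients (359, 88, 183, 454, the rounded values of the BT.601 factors times 256) instead of lookup tables; yuv_to_argb_q8 and nv21_to_argb are illustrative names, and the code relies on arithmetic right shift of negative values, which holds on Android toolchains:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

static inline int clamp_u8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

// Full-range BT.601 in Q8: 359/256 ~ 1.402, 88/256 ~ 0.344,
// 183/256 ~ 0.714, 454/256 ~ 1.772. Returns 0xAARRGGBB.
static inline uint32_t yuv_to_argb_q8(int y, int u, int v) {
    const int d = u - 128;
    const int e = v - 128;
    const int r = clamp_u8(y + ((359 * e + 128) >> 8));
    const int g = clamp_u8(y - ((88 * d + 183 * e + 128) >> 8));
    const int b = clamp_u8(y + ((454 * d + 128) >> 8));
    return 0xFF000000u | (static_cast<uint32_t>(r) << 16) |
           (static_cast<uint32_t>(g) << 8) | static_cast<uint32_t>(b);
}

// Convert a tightly packed NV21 frame (Y plane, then interleaved V,U) to ARGB.
std::vector<uint32_t> nv21_to_argb(const uint8_t *nv21, int width, int height) {
    assert(nv21 != nullptr && width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);
    const uint8_t *vu = nv21 + static_cast<size_t>(width) * height;
    std::vector<uint32_t> out(static_cast<size_t>(width) * height);
    for (int row = 0; row < height; ++row) {
        // Each chroma row serves two luma rows.
        const uint8_t *vu_row = vu + static_cast<size_t>(row / 2) * width;
        for (int col = 0; col < width; ++col) {
            const int y = nv21[static_cast<size_t>(row) * width + col];
            const int v = vu_row[(col / 2) * 2];      // V comes first in NV21
            const int u = vu_row[(col / 2) * 2 + 1];
            out[static_cast<size_t>(row) * width + col] = yuv_to_argb_q8(y, u, v);
        }
    }
    return out;
}
```

A scalar loop like this is mainly useful as a reference implementation for checking a NEON or libyuv path; it is not expected to match their speed.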

5.4 The Complete Inference Pipeline

class CameraInferencePipeline {
public:
    explicit CameraInferencePipeline(tflite::Interpreter *interp)
        : interpreter(interp) {}

    void process(cv::Mat &yuv_nv21, int width, int height) {
        // Step 1: YUV → BGRA (libyuv "ARGB" is B,G,R,A in memory)
        cv::Mat bgra(height, width, CV_8UC4);
        libyuv::NV21ToARGB(yuv_nv21.data, width,
                           yuv_nv21.data + width * height, width,
                           bgra.data, width * 4, width, height);

        // Step 2: center-crop to 1:1 and scale to the model input size (224×224)
        const int model_size = 224;
        const int side = std::min(width, height);  // works for portrait frames too
        cv::Rect roi((width - side) / 2, (height - side) / 2, side, side);
        cv::Mat resized, rgb;
        cv::resize(bgra(roi), resized, cv::Size(model_size, model_size));
        cv::cvtColor(resized, rgb, cv::COLOR_BGRA2RGB);  // drop alpha, reorder to RGB

        // Steps 3-4: normalize to [-1, 1] as float, writing straight into the
        // TF Lite input tensor (NHWC matches an H×W×3 cv::Mat, so no copy loop)
        float *input_tensor = interpreter->typed_input_tensor<float>(0);
        cv::Mat float_input(model_size, model_size, CV_32FC3, input_tensor);
        rgb.convertTo(float_input, CV_32FC3, 1.0 / 127.5, -1.0);

        // Step 5: inference
        if (interpreter->Invoke() != kTfLiteOk) {
            LOGE("Inference failed");
            return;
        }

        // Step 6: read the output (assumed to be [1, 1000] class probabilities)
        float *output = interpreter->typed_output_tensor<float>(0);
        std::vector<ClassProb> top5;
        extract_topk(output, 1000, 5, top5);  // defined in section 6.1
        // ...
    }

private:
    tflite::Interpreter *interpreter;  // not owned
};
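The crop and normalization steps of the pipeline reduce to a little arithmetic that is worth checking in isolation. A minimal sketch, assuming hypothetical helper names center_square and normalize_u8, that handles both landscape and portrait frames:

```cpp
#include <algorithm>
#include <cassert>

struct CropRect {
    int x, y, width, height;
};

// Largest centered square that fits inside a width x height frame.
CropRect center_square(int width, int height) {
    assert(width > 0 && height > 0);
    const int side = std::min(width, height);
    return {(width - side) / 2, (height - side) / 2, side, side};
}

// Map an 8-bit channel value to [-1, 1], the input range assumed for the
// float MobileNetV2 model in this article.
float normalize_u8(int value) {
    assert(value >= 0 && value <= 255);
    return static_cast<float>(value) / 127.5f - 1.0f;
}
```

For a 1920×1080 frame the crop starts at x = 420 and spans 1080 pixels in each direction; for a 1080×1920 portrait frame the offset moves to y = 420 instead.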

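6. Post-processing and Top-K Extraction

6.1 Fast Top-K Extraction

For a 1000-class classifier a full sort is unnecessary; only the K largest values are needed:

#include <queue>
#include <vector>

struct ClassProb {
    int idx;
    float prob;
    bool operator<(const ClassProb &other) const {
        return prob > other.prob;  // inverted on purpose: makes priority_queue a min-heap
    }
};

void extract_topk(const float *probs, int num_classes, int k,
                  std::vector<ClassProb> &results) {
    std::priority_queue<ClassProb> min_heap;  // top() is the smallest kept value

    for (int i = 0; i < num_classes; i++) {
        if (static_cast<int>(min_heap.size()) < k) {
            min_heap.push({i, probs[i]});
        } else if (k > 0 && probs[i] > min_heap.top().prob) {
            min_heap.pop();
            min_heap.push({i, probs[i]});
        }
    }

    // The heap holds min(k, num_classes) entries; pop them into descending order
    results.resize(min_heap.size());
    for (int i = static_cast<int>(results.size()) - 1; i >= 0; i--) {
        results[i] = min_heap.top();
        min_heap.pop();
    }
}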
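When only the class indices are needed, std::partial_sort over an index array is a shorter alternative to the heap. This is a different technique from the one above, sketched here under the assumption that copying num_classes indices per call is acceptable; topk_indices is an illustrative name:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Indices of the k largest scores, highest score first.
std::vector<int> topk_indices(const float *scores, int num_classes, int k) {
    assert(scores != nullptr && num_classes >= 0 && k >= 0);
    k = std::min(k, num_classes);
    std::vector<int> idx(num_classes);
    std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
    // Only the first k positions end up sorted; the rest stay unordered.
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [scores](int a, int b) { return scores[a] > scores[b]; });
    idx.resize(k);
    return idx;
}
```

For 1000 classes and K = 5 either approach is cheap compared with the inference itself, so the choice is mostly about readability.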

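6.2 Decoding Quantized Model Output

For an INT8-quantized model the output tensor is also INT8 and has to be dequantized to float:

// The quantization parameters come from the output tensor itself:
// scale and zero_point are stored in TfLiteTensor::params
const TfLiteTensor *output_tensor = interpreter->output_tensor(0);
float scale       = output_tensor->params.scale;
int   zero_point  = output_tensor->params.zero_point;

// INT8 → float dequantization for element i
int8_t quant_value = output_tensor->data.int8[i];
float real_value = scale * (static_cast<int>(quant_value) - zero_point);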
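The same affine mapping applies to the whole tensor, and its inverse is what a fully quantized model needs on the input side. A self-contained sketch with hypothetical helper names (dequantize_int8, quantize_int8) that does not depend on TF Lite headers:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Affine dequantization: real = scale * (q - zero_point).
std::vector<float> dequantize_int8(const int8_t *q, size_t count,
                                   float scale, int zero_point) {
    assert(q != nullptr || count == 0);
    std::vector<float> out(count);
    for (size_t i = 0; i < count; ++i) {
        out[i] = scale * static_cast<float>(static_cast<int>(q[i]) - zero_point);
    }
    return out;
}

// Inverse mapping for inputs: q = round(real / scale) + zero_point, saturated to int8.
int8_t quantize_int8(float real, float scale, int zero_point) {
    assert(scale > 0.0f);
    const long q = std::lround(real / scale) + zero_point;
    return static_cast<int8_t>(std::max(-128L, std::min(127L, q)));
}
```

With scale = 1/256 and zero_point = -128, a common parameterization for a softmax output, the INT8 range [-128, 127] maps onto [0, 255/256].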

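7. Complete JNI Bridge

#include <jni.h>
#include <android/asset_manager_jni.h>
#include <memory>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/delegates/gpu/delegate.h"
#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"

static std::unique_ptr<tflite::Interpreter> g_interpreter;
static std::unique_ptr<tflite::FlatBufferModel> g_model;
static std::unique_ptr<tflite::StatefulNnApiDelegate> g_nnapi_delegate;
static TfLiteDelegate *g_gpu_delegate = nullptr;
static AAsset *g_model_asset = nullptr;  // backs the model buffer; must stay open

// Tear down in dependency order: interpreter, delegates, model, asset
static void release_all() {
    g_interpreter.reset();
    if (g_gpu_delegate) {
        TfLiteGpuDelegateV2Delete(g_gpu_delegate);
        g_gpu_delegate = nullptr;
    }
    g_nnapi_delegate.reset();
    g_model.reset();
    if (g_model_asset) {
        AAsset_close(g_model_asset);
        g_model_asset = nullptr;
    }
}

extern "C" JNIEXPORT jboolean JNICALL
Java_com_example_recognition_NativeRecognizer_initModel(
    JNIEnv *env, jobject thiz, jobject assetManager,
    jstring model_path, jboolean use_gpu, jboolean use_nnapi, jint num_threads) {

    release_all();  // allow re-initialization

    AAssetManager *mgr = AAssetManager_fromJava(env, assetManager);
    const char *path = env->GetStringUTFChars(model_path, nullptr);
    if (!path) return JNI_FALSE;

    // Load the model
    g_model_asset = AAssetManager_open(mgr, path, AASSET_MODE_BUFFER);
    env->ReleaseStringUTFChars(model_path, path);
    if (!g_model_asset) return JNI_FALSE;

    // BuildFromBuffer does not copy, so the asset stays open while the model lives
    const void *model_data = AAsset_getBuffer(g_model_asset);
    off_t model_size = AAsset_getLength(g_model_asset);
    if (model_data) {
        g_model = tflite::FlatBufferModel::BuildFromBuffer(
            static_cast<const char *>(model_data), model_size);
    }
    if (!g_model) { release_all(); return JNI_FALSE; }

    // Build the interpreter
    tflite::ops::builtin::BuiltinOpResolver resolver;
    tflite::InterpreterBuilder builder(*g_model, resolver);
    if (builder(&g_interpreter) != kTfLiteOk || !g_interpreter) {
        release_all();
        return JNI_FALSE;
    }
    g_interpreter->SetNumThreads(num_threads);

    // Attach at most one delegate
    TfLiteStatus status = kTfLiteOk;
    if (use_gpu) {
        TfLiteGpuDelegateOptionsV2 opts = TfLiteGpuDelegateOptionsV2Default();
        g_gpu_delegate = TfLiteGpuDelegateV2Create(&opts);
        status = g_interpreter->ModifyGraphWithDelegate(g_gpu_delegate);
    } else if (use_nnapi) {
        tflite::StatefulNnApiDelegate::Options opts;
        opts.allow_fp16 = true;
        g_nnapi_delegate = std::make_unique<tflite::StatefulNnApiDelegate>(opts);
        status = g_interpreter->ModifyGraphWithDelegate(g_nnapi_delegate.get());
    }

    if (status != kTfLiteOk || g_interpreter->AllocateTensors() != kTfLiteOk) {
        release_all();
        return JNI_FALSE;
    }
    return JNI_TRUE;
}

extern "C" JNIEXPORT jfloatArray JNICALL
Java_com_example_recognition_NativeRecognizer_runInference(
    JNIEnv *env, jobject thiz, jfloatArray input_data) {

    if (!g_interpreter) return nullptr;

    // Validate the input length against the tensor size before copying
    TfLiteTensor *input = g_interpreter->input_tensor(0);
    jsize len = env->GetArrayLength(input_data);
    if (input->type != kTfLiteFloat32 ||
        static_cast<size_t>(len) * sizeof(float) != input->bytes) {
        return nullptr;
    }
    // Copy straight into the input tensor
    env->GetFloatArrayRegion(input_data, 0, len, input->data.f);

    // Inference
    if (g_interpreter->Invoke() != kTfLiteOk) return nullptr;

    // Fetch the output (float model assumed)
    TfLiteTensor *output = g_interpreter->output_tensor(0);
    if (output->type != kTfLiteFloat32) return nullptr;
    jsize output_size = static_cast<jsize>(output->bytes / sizeof(float));

    jfloatArray result = env->NewFloatArray(output_size);
    if (result) {
        env->SetFloatArrayRegion(result, 0, output_size, output->data.f);
    }
    return result;
}

extern "C" JNIEXPORT void JNICALL
Java_com_example_recognition_NativeRecognizer_release(JNIEnv *, jobject) {
    release_all();
}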

8. Performance Benchmarks and Optimization

8.1 Delegate Benchmark Harness

#include <algorithm>
#include <chrono>
#include <numeric>
#include <string>
#include <vector>

struct BenchmarkResult {
    std::string delegate_name;
    double avg_ms;
    double min_ms;
    double max_ms;
    int    warmup_runs;
    int    bench_runs;
};

BenchmarkResult benchmark_delegate(const char *name, tflite::Interpreter *interp,
                                    int warmup, int bench) {
    std::vector<double> times;
    times.reserve(bench > 0 ? bench : 0);

    // Warm-up (absorbs one-time costs such as GPU shader compilation)
    for (int i = 0; i < warmup; i++) {
        interp->Invoke();
    }

    // Timed runs; steady_clock is monotonic
    for (int i = 0; i < bench; i++) {
        auto start = std::chrono::steady_clock::now();
        interp->Invoke();
        auto end = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(end - start).count();
        times.push_back(ms);
    }
    if (times.empty()) return {name, 0.0, 0.0, 0.0, warmup, bench};

    // Statistics
    double sum = std::accumulate(times.begin(), times.end(), 0.0);
    double avg = sum / times.size();
    double min = *std::min_element(times.begin(), times.end());
    double max = *std::max_element(times.begin(), times.end());

    return {name, avg, min, max, warmup, bench};
}
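Averages hide outliers such as thermal throttling or a scheduler hiccup, so a median is a useful extra statistic. The sketch below separates the statistics from the timing loop so it can run without TF Lite; LatencyStats, summarize, and time_runs are hypothetical names, and any callable can stand in for interp->Invoke():

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

struct LatencyStats {
    double avg_ms, min_ms, max_ms, median_ms;
};

// Summarize a non-empty list of latencies (milliseconds).
LatencyStats summarize(std::vector<double> times_ms) {
    assert(!times_ms.empty());
    std::sort(times_ms.begin(), times_ms.end());
    const size_t n = times_ms.size();
    const double sum = std::accumulate(times_ms.begin(), times_ms.end(), 0.0);
    const double median = (n % 2 == 1)
        ? times_ms[n / 2]
        : 0.5 * (times_ms[n / 2 - 1] + times_ms[n / 2]);
    return {sum / n, times_ms.front(), times_ms.back(), median};
}

// Time `runs` calls of fn after `warmup` untimed calls.
template <typename Fn>
LatencyStats time_runs(Fn &&fn, int warmup, int runs) {
    assert(warmup >= 0 && runs > 0);
    for (int i = 0; i < warmup; ++i) fn();
    std::vector<double> times;
    times.reserve(runs);
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto end = std::chrono::steady_clock::now();
        times.push_back(
            std::chrono::duration<double, std::milli>(end - start).count());
    }
    return summarize(std::move(times));
}
```

In the article's setting the callable would be a lambda wrapping the interpreter call, for example `time_runs([&] { interp->Invoke(); }, 10, 100)`.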

8.2 Float32 vs INT8 Quantized Models

// Quantized-model inference
// INT8 quantization shrinks the model about 4× and typically speeds up inference 2 to 4×
// Accuracy loss is usually < 1% (top-1 accuracy)

// Benchmark results (MobileNetV2, Snapdragon 888)
// ┌──────────────────┬────────────┬───────────────────┬───────────┐
// │ Precision        │ Model size │ Latency/inference │ Top-1 acc │
// ├──────────────────┼────────────┼───────────────────┼───────────┤
// │ FP32             │ 14.0 MB    │ 8.2 ms            │ 71.88%    │
// │ INT8 (full quant)│ 3.6 MB     │ 2.9 ms            │ 71.42%    │
// │ FP16 (GPU)       │ 14.0 MB    │ 4.3 ms            │ 71.88%    │
// └──────────────────┴────────────┴───────────────────┴───────────┘
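
The size column can be sanity-checked with simple arithmetic, assuming MobileNetV2 has roughly 3.5 million parameters (an approximate figure that is not stated in the table) and ignoring graph metadata and quantization parameters. The ratios are taken from the table's own numbers:

```latex
\begin{aligned}
\text{FP32 size} &\approx 3.5 \times 10^{6} \times 4\ \text{bytes} = 14.0\ \text{MB} \\
\text{INT8 size} &\approx 3.5 \times 10^{6} \times 1\ \text{byte} = 3.5\ \text{MB} \\
\text{size ratio} &= 14.0 / 3.6 \approx 3.9 \\
\text{latency ratio} &= 8.2 / 2.9 \approx 2.8
\end{aligned}
```

Both ratios sit inside the "about 4×" and "2 to 4×" ranges quoted in the comments.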

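8.3 Camera Frame-Rate Optimization

In real-time camera inference the frame rate is set by the slowest stage. Optimization options:

Bottleneck	Optimization
YUV→RGB conversion	libyuv with NEON / run grayscale inference directly on the Y plane
Model inference	INT8 quantization / GPU delegate / a smaller model (MobileNetV3-Small)
Memory copies	Zero-copy access to the ImageReader buffers
GC jitter and allocation churn	Preallocate every cv::Mat / reuse buffers through a BufferPool

Frame-skipping strategy: when per-frame inference is not required, run the model once every 3 to 5 frames and use a tracking algorithm (optical flow, Kalman filter) to interpolate results for the frames in between.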
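The scheduling half of that strategy is a small counter. A minimal sketch, assuming a hypothetical InferenceScheduler class; the tracker that fills in the skipped frames is not shown:

```cpp
#include <cassert>

// Hypothetical scheduling helper: run the model on every Nth frame and let a
// tracker (not shown) carry the last result across the frames in between.
class InferenceScheduler {
public:
    explicit InferenceScheduler(int interval) : interval_(interval) {
        assert(interval >= 1);
    }

    // Call once per camera frame; returns true when this frame should be inferred.
    bool should_infer() {
        const bool run = (frame_index_ % interval_) == 0;
        ++frame_index_;
        return run;
    }

private:
    int interval_;
    long long frame_index_ = 0;
};
```

With an interval of 3, frames 0, 3, 6, and so on go through the model, which cuts inference load to a third at the cost of up to two frames of result latency.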

9. Face Detection with OpenCV CascadeClassifier

cv::CascadeClassifier face_cascade;  // loaded beforehand, see section 4.2
cv::Mat gray;  // reused across frames to avoid per-frame allocation

void detect_faces(cv::Mat &rgba_frame, std::vector<cv::Rect> &faces) {
    faces.clear();
    if (face_cascade.empty()) return;  // cascade not loaded

    // Convert to grayscale (gray is reallocated only when the frame size changes)
    cv::cvtColor(rgba_frame, gray, cv::COLOR_RGBA2GRAY);

    // Histogram equalization (improves detection in low light)
    cv::equalizeHist(gray, gray);

    // Multi-scale detection
    face_cascade.detectMultiScale(
        gray,
        faces,
        1.1,    // scaleFactor: each pyramid step scales by 1.1×
        3,      // minNeighbors: at least 3 neighboring windows must agree
        cv::CASCADE_SCALE_IMAGE,
        cv::Size(30, 30)  // minimum detection size
    );
}

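10. Conclusion

This article built a complete NDK image-recognition pipeline, from Camera2 frame capture through YUV→RGB conversion and model inference to result post-processing. The key architectural decisions:

Load models with MappedByteBuffer or mmap() rather than reading the whole file into heap memory.
Choose the inference backend by hardware: XNNPACK as the general-purpose, stable default; GPU for convolution-heavy models; NNAPI where energy efficiency matters most.
Quantized (INT8) models keep accuracy loss manageable (usually < 1%) while cutting inference latency by 2 to 4× and model size by about 4×, which makes them the first choice for mobile deployment.
Do YUV→RGB conversion with libyuv or hand-written NEON code; never process pixels one by one in the Java layer.
Reuse cv::Mat buffers and TF Lite tensor memory through a BufferPool in the real-time path, which avoids repeated malloc/free calls and the allocator churn they cause.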

References

TensorFlow Lite Android Guide: https://www.tensorflow.org/lite/android

OpenCV DNN Module: https://docs.opencv.org/4.x/d2/d58/tutorial_table_of_content_dnn.html

libyuv: https://chromium.googlesource.com/libyuv/libyuv/

NNAPI Delegate: https://www.tensorflow.org/lite/android/delegates/nnapi

GPU Delegate: https://www.tensorflow.org/lite/android/delegates/gpu