How to Port CV/ML Models to Rockchip NPU for Faster Face Recognition

In 2024 the 3DiVi Face SDK team faced a new challenge: one of our partners decided to build an access control system (ACS) on a single-board computer from Forlinx.

To meet the strict time constraints for face recognition, we ported our models to the NPU. Long story short: it worked! NPUs turned out to be a solid way to run heavy processing on an edge device.

How Face Recognition Works: A Quick Recap

Skipping the business details, here’s what our partner needed: detecting faces in a video stream and verifying them. To better understand what that means, let’s take a closer look at the basic face recognition pipeline.

Face Detection:

The detection module identifies a face in an image. Most face detectors in production today rely on convolutional neural networks (CNNs).

Key Point Detection:

Next, key points (like the eyes and nose) are located on the face. This is also done by a neural network—sometimes a separate one (called a "Face Fitter") or the same one used for detection. In our case, we use a dedicated CNN.

Face Alignment:

Using the detected key points, the face is aligned to a frontal position, a necessary step in biometric template generation.
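To make the alignment step concrete, here is a minimal sketch of one common approach: fit a similarity transform (rotation, uniform scale, translation) that maps the detected key points onto a fixed reference layout. This is not the 3DiVi Face SDK's implementation. The reference coordinates, the 112x112 crop size, and the function names are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical reference positions (left eye, right eye, nose tip)
# inside a 112x112 aligned crop. Illustrative values, not the SDK's.
REFERENCE_POINTS = np.array([[38.0, 52.0], [74.0, 52.0], [56.0, 72.0]])


def estimate_similarity(src, dst):
    """Least-squares similarity transform (Umeyama's method) mapping src to dst.

    Returns a 2x3 matrix M such that dst ~= M[:, :2] @ p + M[:, 2] for each point p.
    """
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.ones(2)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0

    rotation = U @ np.diag(d) @ Vt
    src_var = (src_c ** 2).sum() / len(src)
    scale = (S * d).sum() / src_var
    translation = dst_mean - scale * rotation @ src_mean
    return np.hstack([scale * rotation, translation[:, None]])


def apply_transform(M, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    points = np.asarray(points, dtype=np.float64)
    return points @ M[:, :2].T + M[:, 2]


# Example: key points from a slightly rotated, larger face (made-up values).
detected = np.array([[120.0, 140.0], [190.0, 150.0], [152.0, 185.0]])
M = estimate_similarity(detected, REFERENCE_POINTS)
aligned_points = apply_transform(M, detected)
```

The resulting 2x3 matrix has the shape that OpenCV's cv2.warpAffine expects, so the same M could be used to warp the image itself into the aligned crop.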

Template Extraction:

Finally, another neural network extracts the biometric template from the aligned face crop.

Speed and Hardware Constraints

In most scenarios, we can ignore the time spent on image preprocessing, postprocessing of neural network outputs, and comparing two biometric templates: these take just a few milliseconds. The bottleneck? Neural network inference time.
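To see why comparing two templates is cheap, consider a sketch where a template is a fixed-length float vector and the match score is the cosine similarity. This is an assumption for illustration: the article does not say which metric the SDK uses, and the 512-element size and 0.5 threshold below are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two template vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def is_match(a, b, threshold=0.5):
    """Verification decision. The threshold is a hypothetical value;
    in practice it is tuned on a validation set for a target error rate."""
    return cosine_similarity(a, b) >= threshold


# Two made-up 512-element templates standing in for real extractor output.
rng = np.random.default_rng(0)
enrolled = rng.normal(size=512).astype(np.float32)
probe = enrolled + 0.1 * rng.normal(size=512).astype(np.float32)
score = cosine_similarity(enrolled, probe)
```

A single comparison is one dot product and two norms over a few hundred numbers, which is negligible next to a CNN forward pass.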

In our case, there were three neural networks:

Face Detector
Face Fitter
Face Template Extractor

And we were working under these time constraints:

Combined time for Face Detector and Face Fitter: ≤40 ms.
Template extraction and comparison of two templates: ≤500 ms.
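A simple way to check a pipeline stage against budgets like these is to time it over several runs after a short warm-up. The sketch below uses only the Python standard library. The stage names and the helper functions are hypothetical, and the callable passed in would wrap whatever inference call your runtime provides.

```python
import statistics
import time

# Budgets from the requirements above, in milliseconds.
BUDGETS_MS = {
    "detect_and_fit": 40.0,
    "extract_and_compare": 500.0,
}


def measure_latency_ms(fn, warmup=3, runs=20):
    """Call fn repeatedly and return (median, worst) latency in milliseconds.

    Warm-up calls are excluded so one-time costs (model loading, cache
    fills) do not distort the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), max(samples)


def within_budget(stage, latency_ms):
    """True if the measured latency fits the budget for the given stage."""
    return latency_ms <= BUDGETS_MS[stage]


# Example with a stand-in workload instead of a real model.
median_ms, worst_ms = measure_latency_ms(lambda: sum(range(10_000)))
ok = within_budget("detect_and_fit", worst_ms)
```

Checking the worst case as well as the median matters for an access control system, where an occasional slow frame is still a visible delay at the door.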

Sounds manageable, right? But then we looked at our hardware: OK3568-C. Not exactly ideal for heavy processing.

We selected specific models for the detector, fitter, and template extractor from the 3DiVi Face SDK and measured their inference times on the board's CPU. As expected, the results shown in the table below didn’t meet the stated time constraints.

After that, we moved on to inference on the NPU.

Rockchip NPU Inference

Rockchip NPU inference can run in two modes:

Default Mode: Models are converted from Float32 to Float16. This leads to minimal (often negligible) accuracy loss.

Quantized Mode: Models are converted from Float32 to Int8. This significantly speeds up inference but can result in noticeable accuracy drops.
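As a rough illustration of how the two modes are selected, here is a sketch that assumes Rockchip's RKNN-Toolkit2 Python package on the host PC, a model exported to ONNX, and the rk3568 target (we assume that is the SoC on the OK3568-C). The article does not describe our actual conversion scripts, so treat the function name, file paths, and defaults below as hypothetical.

```python
def convert_to_rknn(onnx_path, rknn_path, quantize=False, calibration_list=None):
    """Convert an ONNX model to .rknn in Float16 (default) or Int8 mode.

    calibration_list is a text file with one image path per line; the
    toolkit uses these images to choose Int8 quantization parameters.
    """
    if quantize and calibration_list is None:
        raise ValueError("Int8 quantization needs a calibration image list")

    # Imported lazily so this module still loads on machines without the toolkit.
    from rknn.api import RKNN

    rknn = RKNN()
    try:
        # Assumption: the target SoC is the RK3568.
        rknn.config(target_platform="rk3568")
        if rknn.load_onnx(model=onnx_path) != 0:
            raise RuntimeError("failed to load the ONNX model")
        # do_quantization=False keeps the model in floating point (Float16
        # on the NPU); True quantizes it to Int8 using the calibration images.
        if rknn.build(do_quantization=quantize, dataset=calibration_list) != 0:
            raise RuntimeError("failed to build the RKNN model")
        if rknn.export_rknn(rknn_path) != 0:
            raise RuntimeError("failed to export the RKNN model")
    finally:
        rknn.release()
```

For example, convert_to_rknn("detector.onnx", "detector_int8.rknn", quantize=True, calibration_list="calibration.txt") would produce the quantized variant, where all three file names are placeholders. The calibration images should resemble real input (here, face crops), since they determine the Int8 value ranges.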

During the experiments, we obtained the following time measurements:

But what about accuracy? Yes, there was a slight dip. However, after some fine-tuning, the Int8-quantized model was accurate enough for production when evaluated on a standard dataset (LFW).
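To get a feel for why Int8 costs more accuracy than Float16, the toy sketch below rounds the same random values both ways and compares the average error. It uses simple symmetric per-tensor quantization, which is a simplification for illustration and not necessarily the scheme the Rockchip toolchain applies. The data is synthetic.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor quantization: map [-max|x|, +max|x|] to [-127, 127]."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)  # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map Int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale


# Synthetic values standing in for a layer's weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)

# Round-trip through Float16 and through Int8, then compare to the original.
fp16_error = float(np.abs(weights - weights.astype(np.float16).astype(np.float32)).mean())
q, scale = quantize_int8(weights)
int8_error = float(np.abs(weights - dequantize(q, scale)).mean())

print(f"mean abs error  Float16: {fp16_error:.6f}  Int8: {int8_error:.6f}")
```

Int8 has only 255 levels across the whole value range, so its rounding error is noticeably larger than Float16's. That gap is the reason the quantized model needed extra tuning before it was usable.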

Final Thoughts

Porting CV/ML models to an NPU proved to be an effective way to accelerate inference. The minor accuracy drop—acceptable for access control systems—was worth it to meet the recognition speed requirements.

If your face recognition project needs NPU inference, don’t hesitate to reach out. I'm always ready to help!