
Rhombus Product Insight – Face Detection: How Our Security Camera System Captures the Best Face

November 08, 2018

Efficient business operation and employee well-being start with smarter technology. Here at Rhombus, we are dedicated to bringing powerful artificial intelligence features, such as facial recognition and people analytics, to organizations that want truly intelligent video security. In this blog post, I want to share some insight into how one of our most popular features works. Let’s dive deeper into facial recognition and what we do to make it more enjoyable for our customers to use.

Facial recognition is a feature that sets us apart from existing systems in the market. As part of our facial recognition system, we store thumbnails of the faces that each camera identifies so that you can search through them or be alerted when appropriate.


For each analyzed frame of a video clip, our facial recognition algorithm first uses a convolutional neural network (CNN) to detect all of the faces in view. For example, imagine we run this on a thirty-second clip of three people walking down a hallway toward a Rhombus camera. Analyzing four frames per second, our system generates hundreds of facial detections in this brief clip. Here’s a formula to demonstrate:

3 people * 4 fps * 30 secs = 360 faces
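To make the arithmetic concrete, here is a quick sketch of that calculation. It assumes every person’s face is visible in each analyzed frame, and uses the 4 fps analysis rate from the formula above; the function name is illustrative, not part of our actual system:

```python
def detection_count(num_people: int, analysis_fps: int, clip_seconds: int) -> int:
    """Back-of-the-envelope total of face detections produced over a clip,
    assuming every person's face is visible in every analyzed frame."""
    return num_people * analysis_fps * clip_seconds

# Three people, analyzed at 4 fps, over a 30-second clip:
print(detection_count(3, 4, 30))  # 360
```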

This presents two challenges:

  1. It is inefficient to show you each of these faces when doing a search.
  2. Running our facial recognition algorithm on each detection is too computationally heavy to issue prompt alerts.

One solution is to arbitrarily pick a small collection of faces from the list of detected faces and continue the analysis from there. As you can imagine, this is unreliable: a random sample can easily consist of blurry or turned-away faces, or miss a person entirely.

Let’s break this down and talk about how our system determines the best face for each person in a clip so that you don’t have to scrub through the video yourself to get that perfect snapshot of a person’s face.

Whenever we watch a video, we intuitively track faces between frames: if we look at two frames of a video, we can recognize a particular face even though its position may have changed. A computer, however, processes a single image at a time, so a face detection algorithm can detect a face in two consecutive frames and have no idea whether the two faces belong to the same person.

To remedy this, we’ve implemented a multi-person tracking algorithm, which groups the detected faces into “people” across video frames. For each face in a frame, we either associate it with a person from the previous frame or mark it as a new person, using a combination of logical predictions and neural network feature extraction. Once this process is complete, if there are three people in a clip, we have three groups of faces to analyze, each of which we know corresponds to a different person.
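Our production tracker combines logical predictions with neural network feature extraction, but the core idea of associating detections across frames can be illustrated with a much simpler stand-in: greedy matching on bounding-box overlap (intersection over union, or IoU). Everything below, including the box format, the threshold, and the function names, is a hypothetical sketch rather than our actual implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def assign_tracks(frames, iou_threshold=0.3):
    """Group per-frame face boxes into person tracks by greedy IoU matching.

    frames: list of frames, each a list of face boxes.
    Returns, per frame, a list of (box, track_id) pairs.
    """
    tracks = {}    # track_id -> most recent box for that person
    next_id = 0
    labels = []
    for boxes in frames:
        frame_labels = []
        claimed = set()  # tracks already matched in this frame
        for box in boxes:
            # Match to the best-overlapping unclaimed existing track.
            best_id, best_score = None, iou_threshold
            for tid, last_box in tracks.items():
                if tid in claimed:
                    continue
                score = iou(box, last_box)
                if score > best_score:
                    best_id, best_score = tid, score
            if best_id is None:  # no good match: start a new person
                best_id = next_id
                next_id += 1
            tracks[best_id] = box
            claimed.add(best_id)
            frame_labels.append((box, best_id))
        labels.append(frame_labels)
    return labels
```

A real tracker also has to handle occlusions, people leaving and re-entering the frame, and appearance changes, which is where learned appearance features earn their keep.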

Next, we still need to choose the “best” face for each person in the clip. Our goal is to find the most straight-on face captured by the camera. Yet again (can you tell we like applying deep learning algorithms?), we apply another neural network to find the facial landmarks of each face, then use that data to fit a 3D human face model to the image and compute the yaw, pitch, and roll of the face from the camera’s perspective. Choosing the face with the smallest yaw and pitch gives the best face shot for that person.
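Once a yaw and pitch have been estimated for every detection (in practice by fitting a 3D face model to the detected landmarks), the final selection step is straightforward. The sketch below assumes the pose angles are already given; the `FaceDetection` fields and `best_face` helper are illustrative names, not our actual code:

```python
from dataclasses import dataclass

@dataclass
class FaceDetection:
    """One detected face with its estimated head pose in degrees.
    The yaw/pitch values are assumed to come from an upstream
    landmark-based pose estimator."""
    thumbnail_id: str
    yaw: float    # left/right rotation; 0 = facing the camera
    pitch: float  # up/down rotation;   0 = level with the camera

def best_face(detections):
    """Pick the most head-on face: smallest combined yaw and pitch."""
    return min(detections, key=lambda d: abs(d.yaw) + abs(d.pitch))

faces = [
    FaceDetection("frame_012", yaw=35.0, pitch=10.0),
    FaceDetection("frame_047", yaw=4.0, pitch=2.5),
    FaceDetection("frame_090", yaw=-20.0, pitch=15.0),
]
print(best_face(faces).thumbnail_id)  # frame_047
```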

This process allows us to narrow hundreds of faces per clip down to the single best face for each person. Not only does this enable prompt face alerts, but it also ensures that only the ideal face shot of a person is shown in the console for a simple, enjoyable, and more accurate searching experience.


I hope you enjoyed learning more about Rhombus and what we do to ensure that our solution performs at the highest level for modern organizations. If you have any questions, or would like to learn more, you can reach us at any time!