

MediaPipe, There Is Nothing Wrong with Falling for Optimium

Sewoong Moh

April 4, 2024

Hello. This is Sewoong Moh from the Optimium team, which is developing an AI inference optimization engine.

In this post, we would like to introduce MediaPipe, an open-source project released by Google that makes it quick and easy to implement on-device AI and build a variety of services. With MediaPipe, you can easily deploy mobile applications that perform tasks such as face recognition.

Shall we take a closer look at MediaPipe?

What is MediaPipe?

MediaPipe is a machine learning framework developed and released by Google to make it easy and fast to implement on-device AI. With a simple installation and just a few lines of code, AI models can be applied effortlessly. MediaPipe also scales across diverse environments, including web apps, mobile (Android, iOS), desktop, edge devices, and IoT devices, thanks to its unified standards. This is an important feature in the on-device AI market with its wide variety of hardware, unlike the large-scale AI market, where NVIDIA GPUs are the norm. Finally, being open source allows anyone to contribute to the code and modify it to suit their needs.

MediaPipe consists of MediaPipe Tasks, which make up the deployment pipeline, and MediaPipe Models, a collection of pre-trained models covering a range of AI tasks.

MediaPipe Tasks

MediaPipe Tasks is the interface users encounter when deploying AI models. Typically, an AI task does not consist of a single model. For example, locating the positions of the eyes and nose involves the following steps: 1) detecting the face, 2) cropping the facial region, and 3) recognizing facial landmarks to find the eyes and nose. In many cases, multiple AI models are involved, along with pre-processing and post-processing. MediaPipe abstracts and pipelines this logic internally, allowing users to perform the desired task with just a few calls to the exposed APIs, as the sketch below illustrates.
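To make this concrete, here is a minimal sketch using the Python Tasks API. The model bundle path (face_landmarker.task) and the image file are assumptions; bundles can be downloaded from the MediaPipe solutions page.

```python
# pip install mediapipe
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Assumed local files: a face landmarker bundle downloaded from the
# MediaPipe solutions page, plus any test image containing a face.
options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="face_landmarker.task"))

with vision.FaceLandmarker.create_from_options(options) as landmarker:
    image = mp.Image.create_from_file("photo.jpg")
    result = landmarker.detect(image)

# A single detect() call ran the whole pipeline: face detection, cropping,
# and landmark regression. Landmarks are normalized (x, y, z) coordinates.
for landmark in result.face_landmarks[0][:5]:
    print(f"x={landmark.x:.3f}, y={landmark.y:.3f}, z={landmark.z:.3f}")
```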

Moreover, it supports multiple platforms and languages, such as Android, web/JavaScript, and Python, so it can be deployed on a wide range of hardware. It also provides APIs for training models tailored to user data.

MediaPipe Models

MediaPipe provides many pre-trained AI models that are ready to use. As of March 2024, it offers pre-trained solutions for various tasks in the fields of Computer Vision, NLP, and Audio. Some notable examples are listed below, and the full list can be found at https://developers.google.com/mediapipe/solutions/examples.

  • Pose Landmark Detection

  • Iris Landmark Detection

  • Face Mesh Detection

  • Selfie Segmentation
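As an example, Pose Landmark Detection, the model used in the demo later in this post, can be run frame by frame on a video. The following is a minimal sketch; the model bundle path and the video file are assumptions.

```python
# pip install mediapipe opencv-python
import cv2
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Assumed local files: a pose landmarker bundle from the MediaPipe
# solutions page and a video to analyze.
options = vision.PoseLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="pose_landmarker.task"),
    running_mode=vision.RunningMode.VIDEO)

cap = cv2.VideoCapture("input.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0

with vision.PoseLandmarker.create_from_options(options) as landmarker:
    frame_index = 0
    while cap.isOpened():
        ok, frame_bgr = cap.read()
        if not ok:
            break
        # MediaPipe expects SRGB frames; OpenCV decodes to BGR.
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        image = mp.Image(image_format=mp.ImageFormat.SRGB, data=frame_rgb)
        # VIDEO mode requires a monotonically increasing timestamp.
        timestamp_ms = int(frame_index * 1000 / fps)
        result = landmarker.detect_for_video(image, timestamp_ms)
        frame_index += 1

cap.release()
```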


Well done, MediaPipe 👏

MediaPipe is being used in various industries such as education, security, AR/VR, and more, thanks to its multiple advantages.

1. Easy Deployment

  • With just a few lines of code, AI models can be deployed, making it suitable for prototyping and easy to apply to various hardware and platforms, thereby reducing development time.

2. High Performance

  • Despite being open source, it offers excellent performance, making it suitable not only for prototyping but also for commercial products. For example, its models maintain high performance across diverse ethnicities, genders, and other demographics, avoiding the data imbalance issues commonly found in open-source models. (Data imbalance refers to models trained on biased datasets performing well only in specific situations and failing to generalize to broader contexts.) Real-world cases of MediaPipe models being used commercially across industries are easy to find.

3. License

  • One of the reasons for MediaPipe’s popularity is that, despite being open source, it is distributed under the Apache License 2.0. This means there is no obligation to disclose source code, derivative works can be patented, and commercial use is unrestricted.

💡 <Apache License>
Alongside the Apache License, the other prominent license in the open-source community is the GPL. The GPL allows internal use, but requires full source code disclosure when software is distributed externally, so GPL-licensed open-source projects can be difficult to use for commercial purposes.

4. Customization

  • While it is advantageous to use pre-trained models with proven performance, users sometimes need models tailored to their specific data. For instance, when detecting human pose landmarks, a pre-trained model might represent the pelvic region with two points, while a user’s data labels it with three. To handle such cases, MediaPipe provides a convenient API called MediaPipe Model Maker, sketched below. (There are limitations, however, such as the inability to modify the model structure.)
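Model Maker’s capabilities vary by task; as a representative example, retraining an image classifier on a custom folder of images looks roughly like this. The dataset path, split ratios, and model choice are all assumptions.

```python
# pip install mediapipe-model-maker
from mediapipe_model_maker import image_classifier

# Assumed layout: my_images/<class_name>/*.jpg
data = image_classifier.Dataset.from_folder("my_images")
train_data, rest = data.split(0.8)
validation_data, test_data = rest.split(0.5)

options = image_classifier.ImageClassifierOptions(
    supported_model=image_classifier.SupportedModels.MOBILENET_V2,
    hparams=image_classifier.HParams(export_dir="exported_model"))

# Fine-tunes the chosen backbone on the user's data.
model = image_classifier.ImageClassifier.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options)

loss, accuracy = model.evaluate(test_data)
model.export_model()  # writes a .tflite model under exported_model/
```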


Now, let’s test the limits of speed (feat. Optimium)

Despite its high versatility, MediaPipe’s applicability can be limited when latency does not meet user requirements. Because MediaPipe is typically deployed in on-device AI scenarios, it may struggle to satisfy latency requirements owing to insufficient hardware performance or limits on thread utilization caused by issues such as overheating. In fact, many publicly available MediaPipe demos run at less than 30 FPS.

However, with Optimium, we can achieve real-time performance. You can verify this through the sample demo provided below.

The left image shows the result of running the Pose Landmark Detection model in MediaPipe with TFLite (XNNPACK) as the backend, while the right image shows the same model running with Optimium as the backend. As you can see, Optimium achieves significantly higher FPS than TFLite (XNNPACK), roughly 1.5 to 2.0 times faster. TFLite does not meet real-time requirements and suffers from frame drops, so the detected pose landmarks fall out of alignment with the person in the video. In contrast, Optimium meets real-time requirements and tracks the person’s movements accurately and without interruption. Such results make a significant difference in applications where real-time performance is critical, such as autonomous driving, driver gaze tracking, and drowsiness analysis, where even one or two dropped frames can be fatal.

For those who want to explore Optimium’s performance in more diverse environments, please refer to the following link.

👉 https://perf.enerzai.com

Optimium is currently in beta testing, and if you wish to achieve the same effects in your current services/research models, please apply for beta testing through the following link ✨

👉 https://wft8y29gq1z.typeform.com/to/fp059MY5

We’ll see you in the next post as we embark on another exploration toward optimization.

Life is too short, you need Optimium
