6 Things You Can Do With The New Raspberry Pi AI Kit

Raspberry Pi has just released an AI Kit designed to work with the Raspberry Pi 5. Previous models of these tiny computers would have struggled to keep up with the processing demands of the tasks this kit was built for, but where the Raspberry Pi 4 made do with a 1.5 GHz (later 1.8 GHz) quad-core Cortex-A72, the new Raspberry Pi 5 is powered by a 2.4 GHz quad-core Arm Cortex-A76 CPU.

The Raspberry Pi AI Kit bundles together a Raspberry Pi M.2 HAT+ and a Hailo AI acceleration module. The M.2 HAT+ is essentially a PCIe-to-M.2 adapter. It's an adapter board that the company says "enables you to connect M.2 peripherals such as NVMe drives and other PCIe accessories to Raspberry Pi 5's PCIe interface." Among other uses, this is essential for connecting NVMe SSDs, which offer higher read and write speeds, as well as significantly higher capacity, than the microSD cards the Pi normally relies on. The Hailo AI acceleration module is the real heart of the kit, though. It contains a neural processing unit (NPU), a dedicated chip specifically designed for handling AI processing tasks, which can reportedly perform 13 tera-operations per second (TOPS).

But what can Raspberry Pis actually do with this kit? While finished community projects have yet to appear, the demo applications already show the technologies that will power them taking shape.

Object detection

Raspberry Pi and Hailo have shared a few of the demo applications that makers will be able to use with the kit. 

The first feature they showcased was object detection, and the kit already includes machine learning functions that let AI programs identify objects within an image and sort them into categories. This is how AI can distinguish between different subjects. Object detection is the program's way of identifying what is in an image and where that object is located within it. It does this by locating the region of pixels that the object occupies, noting its coordinates relative to the rest of the image, and matching patterns that coincide with other objects tagged with the same keyword. A dog, for instance, will probably be quadrupedal, with a long snout, pointy or floppy ears, fur, and a tail. By identifying these details, among others, the AI learns to recognize what is in an image.

This kind of operation is essential for several different kinds of image editing software. A program will need to identify a subject in order to remove a background, imitate a depth-of-field blur, or perform other subject-isolating tasks. It's also how programs learn to identify subjects and associate them with words for text-to-image generation.
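
The Hailo demos use the company's own runtime and precompiled networks, but the underlying technique can be sketched with a general-purpose framework. Below is a minimal illustration using torchvision's pretrained Faster R-CNN rather than the Hailo pipeline; the filename is a placeholder.

```python
# Minimal object-detection sketch using torchvision's pretrained Faster R-CNN.
# This stands in for the Hailo demo pipeline; "photo.jpg" is a placeholder.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()  # inference mode

image = Image.open("photo.jpg").convert("RGB")
with torch.no_grad():
    detections = model([to_tensor(image)])[0]

# Each detection is a bounding box, a COCO class index, and a confidence score.
for box, label, score in zip(detections["boxes"], detections["labels"],
                             detections["scores"]):
    if score.item() > 0.8:  # keep only confident detections
        print(f"class {label.item()} at {[round(v) for v in box.tolist()]} "
              f"({score.item():.2f})")
```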

Image segmentation and enhancement

The list also includes demos for image segmentation and image enhancement programs. The first is another essential capability for image identification: AI uses image segmentation to partition an image into smaller, more manageable clusters of pixels. This helps with object detection while also speeding up image processing by letting the system focus on one segment at a time. According to IBM, these techniques range from simple analysis, such as identifying the shape and size of things, to more advanced, algorithm-based analysis that identifies color, brightness, and the finer details of an object's boundaries.

Image segmentation is essential to image enhancement. Once the system has analyzed the features of each pixel in an image, a model trained on large datasets can generate new pixels that estimate what should fill the space. Raspberry Pi says that its demo "performs object detection and segments the object by drawing a colour mask on the viewfinder image." This can be used to create more accurate and realistic images, remove blemishes, or enhance a photo's lighting and colors. Hailo lists separate programs for super-resolution and low-light enhancement as well, further expanding the possibilities.
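
In the same spirit as that demo, here is a rough sketch of detection-plus-mask segmentation, using torchvision's pretrained Mask R-CNN in place of the Hailo runtime; "frame.jpg" and the green tint are placeholders.

```python
# Detect objects and draw a color mask over each one, roughly mirroring
# the Raspberry Pi demo described above. Uses torchvision's Mask R-CNN;
# "frame.jpg" is a placeholder input.
import numpy as np
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("frame.jpg").convert("RGB")
with torch.no_grad():
    out = model([to_tensor(image)])[0]

frame = np.asarray(image, dtype=np.float32)
for mask, score in zip(out["masks"], out["scores"]):
    if score.item() > 0.8:
        # mask is a 1xHxW map of per-pixel probabilities; threshold it
        # and blend a green tint over the pixels that belong to the object
        m = mask[0].numpy() > 0.5
        frame[m] = frame[m] * 0.5 + np.array([0.0, 128.0, 0.0])

Image.fromarray(frame.astype(np.uint8)).save("masked.jpg")
```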

Pose estimation

To fully understand the composition of an image, AI programs don't just need to identify what is in the image; they also need to identify how those objects are positioned. This is where pose estimation comes into play.

Pose estimation is the process of identifying and tracking 'key semantic points.' So what does that mean? It's basically finding anchor points that can be used to identify and track the shape and movement of objects. For human and animal models, this might include joints like shoulders, elbows, wrists, and knuckles. Pose estimation usually only has to measure a handful of key points when it comes to humans. We only have so many joints, and the rest of our bodies are fairly rigid, so we are mostly made up of straight segments whose lengths stay fixed as we move. The Raspberry Pi demo performs human pose estimation on 17 different points throughout the body, then draws lines between them. This is much simpler in still images, where these points are stationary, but becomes more complex in video, where the points must be tracked from frame to frame.
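
Conveniently, torchvision ships a pretrained Keypoint R-CNN that predicts the same 17 COCO keypoints the demo describes. A minimal sketch, standing in for the Hailo version, with "person.jpg" as a placeholder:

```python
# 17-point human pose estimation with torchvision's Keypoint R-CNN
# (trained on COCO's 17 keypoints: nose, eyes, ears, shoulders, elbows,
# wrists, hips, knees, ankles). "person.jpg" is a placeholder input.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("person.jpg").convert("RGB")
with torch.no_grad():
    out = model([to_tensor(image)])[0]

# One (x, y, visibility) triple per keypoint, 17 per detected person
for person, score in zip(out["keypoints"], out["scores"]):
    if score.item() > 0.8:
        for x, y, visible in person.tolist():
            print(f"({x:.0f}, {y:.0f}) visible={bool(visible)}")
```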

Pose estimation is essential for machine learning programs to gather data on images and videos, but generative AI software may also use this when it constructs new images and videos. Programs like OpenPose and DeepCut already do this, but they are more general-use computer programs and aren't specialized to work with something like the Raspberry Pi AI Kit.

Face detection and recognition

Pose estimation is good for identifying heads, shoulders, knees, and toes, but faces are even more complicated. Most of us have two eyes, two ears, and a nose, but the subtle variances in these common features are what help us identify one another. An extra wrinkle for AI is that our faces move quite a bit, so the software also needs to be able to recognize different expressions and tag them accordingly.

Face detection is typically specialized to focus on eyes, noses, mouths, and chins. Some AIs track finer details, like the position of our mouths or the size of our eyes, to help determine our emotions.
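
Long before neural accelerators, OpenCV's bundled Haar cascades handled the detection half of this task, and they still make a handy illustration of its inputs and outputs. A minimal sketch ("group.jpg" is a placeholder):

```python
# Classic face detection with OpenCV's bundled Haar cascade. The Hailo
# demos use neural networks instead, but the task is the same: find the
# rectangle around each face. "group.jpg" is a placeholder input.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

image = cv2.imread("group.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Returns one (x, y, width, height) rectangle per detected face
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("faces.jpg", image)
```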

This facial recognition technology for the Raspberry Pi AI Kit will be valuable for security software, entertainment, social media, and other use cases. It can also bolster image generation technologies, helping fill in backgrounds after removing objects from an image, for example.

Depth estimation

Another aspect of AI-based identification is the software's ability to locate where an object resides within the space of an image. Objects in our three-dimensional world have length, width, and depth. Images (whether moving or still) need to represent both the depth of the subject itself and the distance of the subject from the lens. The art world has worked with perspective and depth of field for centuries, and AI is trained to identify and replicate some of these techniques.

Depth estimation assigns a distance value to the pixels in an image, gauging how far the objects they represent are from the camera. There are two kinds of depth estimation: monocular and stereo. Older methods of determining depth involved finding matching points across multiple views and performing some complex geometry to work out where everything sits and, thus, how big it is. Newer monocular techniques use AI to estimate depth from a single image, effectively inferring perspective on their own.
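
The monocular variety can be sketched with MiDaS, a freely available depth-estimation model on PyTorch Hub (it requires the timm package); this stands in for the Hailo-optimized network, and "scene.jpg" is a placeholder.

```python
# Monocular depth estimation with the small MiDaS model from PyTorch Hub.
# Requires the timm package. "scene.jpg" is a placeholder input.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
batch = transforms.small_transform(img)  # resize and normalize for the model

with torch.no_grad():
    # The network predicts relative inverse depth per pixel:
    # larger values mean closer to the camera.
    depth = midas(batch).squeeze().numpy()

print(depth.shape, depth.min(), depth.max())
```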

These capabilities are also extremely useful in the field of robotics, where training machines to react to their surroundings relies on them being able to analyze depth.

Semantic segmentation

One of the most important aspects of machine learning is its ability to catalog the objects that it identifies. This is done through a process called semantic segmentation, a computer vision task that labels groups of pixels according to the object they belong to. The computer effectively lays a class-labeled mask over each region of the image where it detects an object. So if you have an image of a football player standing at the 50-yard line, this process would identify the football player, the field, the bleachers, and the sky in the background all as separate objects.
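
A minimal sketch of that per-pixel labeling, using torchvision's pretrained DeepLabV3 (which covers the 20 Pascal VOC categories) rather than the Hailo runtime; "stadium.jpg" is a placeholder:

```python
# Semantic segmentation: assign every pixel a class label, as in the
# football example above. Uses torchvision's pretrained DeepLabV3;
# "stadium.jpg" is a placeholder input.
import torch
import torchvision
from torchvision.transforms.functional import normalize, to_tensor
from PIL import Image

model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("stadium.jpg").convert("RGB"))
image = normalize(image, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

with torch.no_grad():
    logits = model(image.unsqueeze(0))["out"][0]  # one score map per class

# argmax over the class dimension gives the per-pixel label map:
# every pixel now carries an integer class id (person, background, etc.)
labels = logits.argmax(dim=0)
print("classes present:", labels.unique().tolist())
```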

Teaching an AI to do this takes extensive training on pre-labeled datasets: images that already carry labeling information the computer can study and learn from. Ultimately, the ability to translate back and forth between visual objects and linguistic descriptions of them is at the core of machine learning.