This LIDAR Smart Speaker Imagines Alexa With Eyes
LIDAR may be best known right now for helping power autonomous cars (and infuriating Elon Musk), but the same technology could improve how we interact with smart speakers, a team of Intel-backed researchers suggests. Their SurfaceSight project explores how much more useful IoT devices could be if they understood what's around them, including recognizing objects and hands.
The goal was to give existing smart speakers, and the applications they run, some situational awareness. By stacking an Amazon Echo or Google Home Mini on top of a compact LIDAR sensor, researchers Gierad Laput and Chris Harrison of Carnegie Mellon University demonstrated how the devices could infer what was nearby from its shape and movement. They'll present their findings at ACM CHI 2019 today.
LIDAR uses lasers for range-finding, effectively bouncing non-visible light off objects and then building up a point cloud map based on the time it takes for that light to be reflected back. While it's most commonly associated with autonomous car projects, where being able to create a real-time plan of the surrounding area is useful for avoiding traffic or pedestrians, it's also widely used in robotics, in UAVs, and in other applications.
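The range-finding arithmetic is simple enough to sketch. The following is a minimal illustration, assuming a planar (2D) scanner that reports each reading as an angle plus a round-trip pulse time; many low-cost units actually report distances directly, so treat the function names and input format here as hypothetical. The light travels out and back, so the one-way distance is half the round-trip time multiplied by the speed of light, and each (angle, distance) pair then becomes one point in the cloud.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def tof_to_distance(round_trip_s: float) -> float:
    """Convert a round-trip pulse time (seconds) into a one-way range (metres)."""
    # The pulse travels to the object and back, so halve the total path.
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def scan_to_points(scan):
    """Turn hypothetical (angle_deg, round_trip_s) readings into 2D (x, y) points in metres."""
    points = []
    for angle_deg, round_trip_s in scan:
        r = tof_to_distance(round_trip_s)
        theta = math.radians(angle_deg)
        # Polar-to-Cartesian conversion around the sensor at the origin.
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points
```

For scale: a 10-nanosecond round trip, `tof_to_distance(10e-9)`, corresponds to an object roughly 1.5 m away, which is why the timing electronics, not the maths, are the hard part.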

Importantly, it's also moving into the realm of relative affordability. While high-range and high-accuracy LIDAR for automotive applications is still relatively expensive – something manufacturers are looking to change with new production processes – smaller, more affordable sensors are available. SurfaceSight, for example, relies on a sub-$100 unit, and the researchers speculate that the broader availability of solid-state LIDAR will only reduce that further.
For SurfaceSight, the applications are varied. One possibility is using fingers and hands for gesture input; alternatively, a smart speaker could notice when a smartphone is placed on the table nearby and automatically interpret that as the user intending to stream music.
Since SurfaceSight can also estimate which way a person is facing, it can prioritize command recognition when the user is facing the speaker. That, the researchers suggest, could help in situations where voice commands are hard to hear over background audio. Gestures can also be restricted to defined boundary areas, and those areas can themselves be set up using hand gestures.
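Restricting gestures to a boundary area boils down to asking whether a detected hand point lies inside a polygon on the table. The sketch below uses the standard ray-casting point-in-polygon test; it is an illustration of the idea, not SurfaceSight's own code, and `point_in_polygon` and `filter_gestures` are hypothetical names. It assumes hand positions and zone corners share the same 2D coordinate frame as the LIDAR scan.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: toggle 'inside' each time a ray going +x from p crosses an edge."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can be crossed
        # (this also guarantees y1 != y2, so the division below is safe).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_gestures(hand_points: List[Point], zone: List[Point]) -> List[Point]:
    """Keep only hand points that fall inside the user-defined zone."""
    return [p for p in hand_points if point_in_polygon(p, zone)]
```

A zone drawn with a finger would simply be recorded as the list of corner points passed in as `zone`; anything the hand does outside it is ignored.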
The plane of recognition needn't be horizontal, either. In another demo, SurfaceSight could track movement against a wall, with the LIDAR sensor integrated into a smart thermostat. The system could recognize taps, swipes, and circular motions against the wall, effectively turning the surface into an extended control pad. Think along the lines of Google Soli, but on a larger scale.
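Telling taps, swipes, and circles apart can be approximated from a tracked fingertip trajectory. The following is a deliberately crude heuristic, not the researchers' recognizer: it assumes the scan plane runs along the wall with x horizontal and y vertical, and the `TAP_MAX_TRAVEL_M` and `CLOSED_LOOP_RATIO` thresholds are hypothetical values chosen for illustration. A tap barely moves; a circle travels a long way but ends near where it started; anything else is a swipe in its dominant direction.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

TAP_MAX_TRAVEL_M = 0.02   # hypothetical: total movement under 2 cm counts as a tap
CLOSED_LOOP_RATIO = 0.25  # hypothetical: net displacement < 25% of path length means a loop


def classify_stroke(track: List[Point]) -> str:
    """Label a fingertip trajectory as 'tap', 'circle', or a swipe direction."""
    if not track:
        raise ValueError("empty track")
    # Total distance travelled along the trajectory.
    path = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
    if path < TAP_MAX_TRAVEL_M:
        return "tap"
    dx = track[-1][0] - track[0][0]
    dy = track[-1][1] - track[0][1]
    # Long path but little net movement: the finger came back round.
    if math.hypot(dx, dy) < CLOSED_LOOP_RATIO * path:
        return "circle"
    # Otherwise the dominant axis of displacement decides the swipe direction.
    if abs(dx) >= abs(dy):
        return "swipe_right" if dx > 0 else "swipe_left"
    return "swipe_up" if dy > 0 else "swipe_down"
```

A production system would more likely learn these classes from labelled examples, but the heuristic shows why a 2D point stream is enough to separate the basic gestures.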
Where SurfaceSight really gets interesting is in how it uses LIDAR to recognize objects. The team trained the system on different kitchen objects, like scales and measuring cups, as well as workshop items such as tools. A multi-step recipe could use the LIDAR to track which step is being completed and advance automatically. Alternatively, motion could be linked with spoken requests to lend further context, like shaking a measuring cup while simultaneously asking "how many ounces in this?"
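To give a feel for how a cluster of LIDAR points can become an object label, here is a minimal nearest-centroid classifier over two footprint features, the bounding-box width and depth of each point cluster. This is a swapped-in, intentionally simple technique for illustration, not the classifier the researchers used, and all function names are hypothetical. It also assumes the points have already been segmented into per-object clusters. Because a planar scanner only sees the side of an object facing it, a real system would need richer, viewpoint-robust features.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Feature = Tuple[float, float]


def footprint_features(cluster: List[Point]) -> Feature:
    """Describe a cluster of 2D LIDAR points by its bounding-box width and depth (metres)."""
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return (max(xs) - min(xs), max(ys) - min(ys))


def train_centroids(examples: Dict[str, List[List[Point]]]) -> Dict[str, Feature]:
    """Average the features of each labelled object's example clusters."""
    centroids = {}
    for label, clusters in examples.items():
        feats = [footprint_features(c) for c in clusters]
        centroids[label] = (
            sum(f[0] for f in feats) / len(feats),
            sum(f[1] for f in feats) / len(feats),
        )
    return centroids


def classify(cluster: List[Point], centroids: Dict[str, Feature]) -> str:
    """Return the label whose centroid is closest in feature space."""
    f = footprint_features(cluster)
    return min(centroids, key=lambda label: math.dist(f, centroids[label]))
```

Note how two objects with identical footprints would map to the same point in feature space, so this kind of classifier cannot tell them apart.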

LIDAR does have its downsides, of course. For a start, there's the occlusion question: the sensor relies on line of sight. Different objects with the same profile could also confuse SurfaceSight. The researchers suggest that a camera, or even reflective barcodes, could be used to differentiate between them, with the smart speaker also warning users to declutter the area around it if they want the system to operate effectively.
What's more, although LIDAR can be highly accurate, it also has privacy advantages over, say, camera-based computer vision systems. A LIDAR sensor couldn't tell different people apart or be used to capture photos inside the home, for instance.
It's fair to say that smart speakers have become commodity devices, with Amazon and Google racing each other to the lowest price. Both companies have bet on voice as the primary method of interaction, however, and that has come at the expense of other modalities. Baking in LIDAR might not be the only way to solve that, but there's no denying that a home hub-style device could be a lot more useful if it knew what you were doing, not just what you were telling it.
