AI-Driven Robotic Assembly System Builds Objects Based on Verbal Input

CAMBRIDGE, MA—Engineers at the Massachusetts Institute of Technology (MIT) have developed an AI-driven robotic assembly system that lets users design and build simple, multicomponent objects by describing them in words. The system uses a generative AI model to build a 3D representation of an object’s geometry based on the user’s prompt. Then, a second generative AI model reasons about the desired object and figures out where different components should go, according to the object’s function and geometry.

The engineers used this end-to-end system to fabricate furniture, including chairs and shelves, from two types of premade components. The components can be disassembled and reassembled at will, reducing the amount of waste generated through the fabrication process.

“Sooner or later, we want to be able to communicate and talk to a robot and AI system the same way we talk to each other to make things together,” says Alex Kyaw, a graduate student at MIT studying architecture, electrical engineering and computer science. “Our system is a first step toward enabling that future.”

While generative AI models are good at generating 3D representations, known as meshes, from text prompts, Kyaw notes that most do not produce uniform representations of an object’s geometry with the component-level details needed for robotic assembly. Separating these meshes into components is challenging for a model because the way components are assigned depends on the geometry and functionality of the object and its parts.

Kyaw and his colleagues tackled these challenges using a vision-language model (VLM), a powerful generative AI model that has been pretrained to understand images and text. They task the VLM with figuring out how two types of prefabricated parts, structural components and panel components, should fit together to form an object.

“There are many ways we can put panels on a physical object, but the robot needs to see the geometry and reason over that geometry to make a decision about it,” explains Kyaw. “By serving as both the eyes and brain of the robot, the VLM enables the robot to do this.”

A user prompts the system with text, perhaps by typing “make me a chair,” and gives it an AI-generated image of a chair to start. Then, the VLM reasons about the chair and determines where panel components go on top of structural components, based on the functionality of many example objects it has seen before. For instance, the model can determine that the seat and backrest should have panels to provide surfaces for someone to sit on and lean against. It outputs this information as text, such as “seat” or “backrest.”

Each surface of the chair is then labeled with numbers, and the information is fed back to the VLM. Then, the VLM chooses the labels that correspond to the geometric parts of the chair that should receive panels on the 3D mesh to complete the design.
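The labeling step described above can be pictured with a minimal Python sketch. It is not MIT’s implementation: the `Surface` type, the example chair data and the geometric rules inside `stand_in_vlm_select` are all hypothetical, and those simple rules stand in for the actual VLM query only so the example runs on its own. The point is the data flow: number every surface, then map functional parts such as “seat” and “backrest” back to those numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    label: int                           # numeric label assigned to the mesh surface
    normal: tuple[float, float, float]   # outward unit normal (x, y, z), z pointing up
    center_height: float                 # height of the surface centroid, in meters


def label_surfaces(geometry):
    """Assign sequential numeric labels 0, 1, 2, ... to (normal, height) pairs."""
    return [Surface(i, normal, height) for i, (normal, height) in enumerate(geometry)]


def stand_in_vlm_select(surfaces, functional_parts):
    """Hypothetical stand-in for the VLM query.

    Maps functional part names to numeric surface labels with simple
    geometric rules; the system in the article asks a VLM instead.
    """
    chosen = set()
    for part in functional_parts:
        if part == "seat":
            # Upward-facing surface at roughly seat height.
            matches = [s for s in surfaces
                       if s.normal[2] > 0.9 and 0.3 <= s.center_height <= 0.6]
        elif part == "backrest":
            # Near-vertical surface above seat height.
            matches = [s for s in surfaces
                       if abs(s.normal[2]) < 0.3 and s.center_height > 0.6]
        else:
            matches = []  # unknown part names select nothing
        chosen.update(s.label for s in matches)
    return sorted(chosen)


# Hypothetical chair geometry: (outward normal, centroid height in meters).
CHAIR_GEOMETRY = [
    ((0.0, 0.0, 1.0), 0.45),   # top of seat
    ((0.0, 0.0, -1.0), 0.43),  # underside of seat
    ((0.0, 1.0, 0.0), 0.80),   # front face of backrest
    ((1.0, 0.0, 0.0), 0.20),   # side of a leg
    ((0.0, 0.0, 1.0), 1.00),   # top edge of backrest frame
]

if __name__ == "__main__":
    surfaces = label_surfaces(CHAIR_GEOMETRY)
    print(stand_in_vlm_select(surfaces, ["seat", "backrest"]))
```

Keeping the numbering separate from the selection mirrors the two passes the article describes: the VLM first names functional parts in text, and only then picks the numbered surfaces that should receive panels.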

In the future, the MIT engineers hope to enhance their system to handle more complex and nuanced user prompts, such as a request for a table made of glass and metal. In addition, they want to incorporate additional prefabricated components, such as gears, hinges or other moving parts, so objects could have more functionality.

Austin has been senior editor for ASSEMBLY Magazine since September 1999. He has more than 25 years of b-to-b publishing experience and has written about a wide variety of manufacturing and engineering topics. Austin is a native of Chicago and a graduate of the University of Michigan.