GenComUI | Yate Ge

Motivation

In service robotics, verbal communication allows non-expert users to naturally express their needs. However, speech-based robot programming remains limited by ambiguity, speech recognition errors, and difficulty describing spatial movements verbally. While visual aids are widely used in human-to-human communication to enhance comprehension, their potential in human-robot interaction remains underexplored. GenComUI investigates how dynamically generated visual aids can serve as a communication medium to bridge the gap between natural language instructions and robot understanding.

Overview of GenComUI: generative visual aids support verbal task communication between humans and robots.

Formative Study

To inform the design of GenComUI, we conducted an observational study examining how humans use external visual tools (paper maps and pens) to assist verbal communication in spatial tasks. The study revealed three key design considerations: (DC1) visual aids should provide timely and progressive feedback during communication; (DC2) visual aids should interpret and externalize user task intentions; (DC3) visual aids should integrate with speech to improve clarity and intuitiveness; (DC4) the system should support iterative task refinement through visual confirmation.

Formative study: observing how humans use visual tools to support verbal spatial task communication.

GenComUI System

Based on the design considerations, we developed GenComUI, an LLM-based robot end-user development system that incorporates generative visual aids. Implemented on the Temi V2 robot, the system consists of four core modules:

Voice Interaction Module: Handles bidirectional voice communication via speech-to-text and text-to-speech conversion.
User Intention Understanding Module: Analyzes input and dialogue context to track task specifications and plan responses, including visual aid generation.
Generative Visual Aids Module: Generates visual interface elements and animations on spatial maps according to the communication context.
Task Program Synthesis and Deployment Module: Generates and deploys executable robot code with built-in testing for iterative refinement.

GenComUI system architecture: (A) Voice Interaction, (B) User Intention Understanding, (C) Generative Visual Aids, (D) Task Program Synthesis and Deployment.

The communication process in GenComUI: (A) multi-turn task specification with progressive visual feedback, (B) visual confirmation of complete task flow, (C) task testing and deployment.

User Study

We conducted a within-subjects user study with 20 participants (12 female, 8 male, aged 18–60) comparing GenComUI with a voice-only baseline system. Participants completed tasks in a simulated office environment using both systems in counterbalanced order, combining think-aloud protocols with semi-structured interviews.

The results showed that generative visual aids enhanced communication quality by providing immediate feedback and supporting iterative refinement of task specifications. While visual aids did not significantly reduce task completion time, they promoted more natural and effective human-robot communication, particularly in complex tasks involving spatial movements and multi-step logic.

Experimental setup: (Left) paper floor plan for baseline condition, (Right) comparison between baseline system and GenComUI.