This article breaks down a sightseeing guide agent running on Rokid AI Glasses. It uses the camera to identify landmarks, generates short explanations in different styles, and synchronizes voice output with on-glasses display. The system solves common travel pain points such as looking down at a phone for information, being constrained by group tours, and relying on rigid audio guides. Keywords: Rokid, Lingzhu Agent, multimodal vision.
The technical specification snapshot highlights the project architecture
| Parameter | Details |
|---|---|
| Runtime device | Rokid AI Glasses |
| Development platform | Lingzhu Agent Platform |
| Core capabilities | Landmark recognition, stylized narration, voice interaction, on-glasses coordination |
| Input protocol | First-turn image input + voice commands |
| Core model | doubao-seed-1-6-vision-250815 |
| Key dependencies | Rokid AI Glasses control plugin, vision foundation model, landmark knowledge orchestration logic |
| Repository / Star count | Not provided in the original source |
| Target scenarios | Cultural tourism guidance, wearable narration, AI glasses assistant |
This is a native AI glasses interaction pattern for tourism scenarios
The goal of this project is straightforward: turn the flow from “see a landmark, then search for it” into “see a landmark and instantly get an explanation.” The AI glasses handle capture and presentation, the multimodal model handles recognition and generation, and the Lingzhu platform assembles prompts, skills, and plugins into a complete execution chain.
Compared with mobile phone search, this approach offers clear advantages: users keep their eyes on the landmark, do not need to use their hands, and receive short text plus audio feedback immediately. That makes it a better fit for high-frequency, low-latency interactions during travel.
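The "capture → recognize → narrate" flow above can be sketched as a minimal pipeline. All names here (`recognize_landmark`, `generate_narration`, `GuideResult`) are illustrative assumptions, not Lingzhu platform APIs; on the real platform these stages are wired together through configuration rather than hand-written code.

```python
from dataclasses import dataclass

@dataclass
class GuideResult:
    landmark: str
    narration: str

def recognize_landmark(image_bytes: bytes) -> str:
    """Stand-in for the multimodal model call (doubao-seed-1-6-vision in the article)."""
    return "Tiananmen"  # placeholder result for illustration

def generate_narration(landmark: str, style: str) -> str:
    """Stand-in for style-constrained narration generation."""
    return f"[{style}] A short explanation of {landmark}."

def run_guide_turn(image_bytes: bytes, style: str = "formal") -> GuideResult:
    landmark = recognize_landmark(image_bytes)       # vision model call
    narration = generate_narration(landmark, style)  # style-constrained text
    return GuideResult(landmark, narration)
```

The point of the sketch is the ordering: the glasses supply the frame, recognition resolves it to a landmark name, and only then does style-constrained generation run.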
AI Visual Insight: This image presents the project’s core visual identity and scenario positioning. It emphasizes AI glasses as the front-end entry point, with a closed interaction loop of “capture landmark → recognize content → generate narration → display or speak the result.” This interaction model is well suited for tourism guidance use cases that combine visual recognition with lightweight information delivery.
The solution first addresses three major pain points in traditional tour guidance
First, group tours follow a fixed pace and cannot adapt to individual interests. Second, searching on a phone interrupts the immersive viewing experience. Third, traditional audio guides deliver fixed content and do not support personalized style switching.
Rokid AI Glasses provide a natural hardware entry point: the camera captures the scene, the lenses overlay text, and voice triggers the interaction. Together, these features create an ideal terminal form factor for deploying an intelligent agent.
```python
agent_features = {
    "vision_input": True,   # Capture landmark images through the glasses camera
    "voice_trigger": True,  # Use voice commands to trigger narration
    "style_switch": 3,      # Support switching among three narration styles
    "hands_free": True,     # Core value: hands-free interaction
}
```
This code summarizes the minimum capability set of the agent.
The Lingzhu platform setup can be broken down into four key steps
The first step is to create the agent and define its name, category, and functional description. In the original example, the agent is categorized under “Lifestyle” and named “Pocket Tour Guide · Scenic Spot Narration Assistant,” highlighting that users do not need to join a group tour and can switch narration styles.
The second step is to configure the input type. Because landmark recognition depends on visual understanding, the first-turn input must be configured as an image so that the scene captured by the glasses can enter the model context directly.
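The first-turn constraint can be pictured with a small sketch. The field names and the `validate_first_turn` helper are illustrative assumptions; on the Lingzhu platform this is configured through the console UI, not through code.

```python
# Hypothetical representation of the agent's input configuration.
agent_input_config = {
    "first_turn_input": "image",  # the camera frame must open the session
    "follow_up_input": "voice",   # later turns arrive as voice commands
}

def validate_first_turn(payload: dict) -> bool:
    """Reject sessions whose opening turn is not an image frame."""
    return payload.get("type") == agent_input_config["first_turn_input"]
```

The design choice it illustrates: by rejecting anything but an image in turn one, the scene captured by the glasses is guaranteed to be in the model context before any narration is attempted.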
AI Visual Insight: This interface shows the basic agent configuration area in the Lingzhu platform, including the name, category, functional description, and test entry point. It demonstrates that the platform supports visual agent creation and instant prompt validation, which lowers the barrier to prototyping AI glasses applications.
AI Visual Insight: This image highlights the configuration for “Image (first-turn input).” That means visual input is placed at the very start of the conversation flow. It is the critical switch in the landmark recognition pipeline and determines whether the model can make real-time judgments based on the user’s current field of view.
The third step is to choose a vision model. The example uses doubao-seed-1-6-vision-250815. The goal is not complex visual reasoning, but stable semantic recognition of buildings, landmarks, and scenic spots, followed by concise explanation generation.
The fourth step is to bind the plugin so the agent can actively send device-side control commands such as taking a photo or exiting the session. This creates an action loop in which capture happens before recognition.
AI Visual Insight: This image shows the vision model selection interface. It reflects how the Lingzhu platform exposes multimodal models as configurable components for developers, shifting the development focus from low-level model training to task orchestration, input constraints, and output style control.
Prompt design determines whether the guide feels like a real product
The persona design in this project is intentionally restrained. It clearly defines the runtime environment as Rokid AI Glasses, the core task as landmark narration, and the input source as the camera feed. The value of this design is that it narrows the model’s freedom and improves response stability.
More importantly, the goals are constrained at a product level: recognition must check confidence first, narration must stay short, the language must fit voice playback, and style switching must take effect immediately. These are production-grade prompt requirements, not demo-grade prompts.
```python
system_prompt = """
You are a pocket tour guide running on Rokid AI Glasses.
- Prioritize recognizing the current landmark view         # Complete visual understanding first
- If confidence is below 80%, ask the user to confirm      # Avoid hallucinated guesses
- Keep the explanation between 150 and 200 characters      # Fit lens display and voice playback
- Support three styles: formal, humorous, and in-depth historical
"""
```
This prompt constrains recognition accuracy, content length, and style control into an executable scope.
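The same constraints can also be checked after generation. The style names and the 150-200 character window come from the prompt above; the checker itself is an illustrative assumption, not part of the platform.

```python
# Hedged sketch: post-generation checks mirroring the prompt's constraints.
STYLES = {"formal", "humorous", "in-depth historical"}

def check_narration(text: str, style: str) -> bool:
    """Accept only a known style and a narration within the 150-200 character window."""
    return style in STYLES and 150 <= len(text) <= 200
```

A guard like this turns soft prompt instructions into a hard gate: an over-long or off-style output can be regenerated instead of being pushed to the lenses.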
The skill module breakdown reflects an engineering mindset for agent systems
The example splits the capability stack into four layers: landmark recognition and confirmation, narration style management, narration generation, and handling of multi-scenario questions. This decomposition is not for conceptual completeness. It is meant to prevent one large prompt from carrying every responsibility.
The most critical part is the recognition confirmation logic: when confidence is insufficient, the agent asks a follow-up question instead of directly returning a wrong answer. This matters in both AI search and real-device scenarios because incorrect landmark recognition seriously damages user trust.
AI Visual Insight: This image shows the skill orchestration or response logic configuration area. It indicates that developers can split recognition, style switching, and content generation into modular rules to reduce output variance from the foundation model and improve interaction consistency.
```python
def explain_spot(confidence, style):
    if confidence < 0.8:
        # Confirm first when confidence is low
        return "Which landmark are you looking at right now?"
    # Generate the explanation only when confidence is high
    return f"Generated a landmark explanation in {style} style"
```
This logic reflects the product safety strategy of “confirm first, then generate.”
Plugins and real-device debugging create a closed validation loop from web to hardware
At the plugin layer, notify_take_photo is the key capability. It allows the agent to do more than passively receive images: it can actively notify the glasses to take a photo. Combined with notify_agent_off and notify_take_navigation, the system can later expand into linked scenarios such as session exit and navigation.
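A minimal dispatch over the three commands named here might look like the following. The command names (`notify_take_photo`, `notify_agent_off`, `notify_take_navigation`) are taken from the article; the dispatch function and its return shape are assumptions, since the real plugin protocol is not public.

```python
# Illustrative dispatch for the glasses control commands.
DEVICE_COMMANDS = {"notify_take_photo", "notify_agent_off", "notify_take_navigation"}

def send_device_command(command: str) -> dict:
    """Queue a control command for the glasses, rejecting unknown names."""
    if command not in DEVICE_COMMANDS:
        raise ValueError(f"unknown device command: {command}")
    return {"command": command, "status": "queued"}
```

Validating command names at the agent layer keeps a hallucinated tool call from ever reaching the device.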
At the real-device integration layer, the process is as follows: after configuration is completed on the platform, submit the agent for review, open “Agent Debugging” on the developer page in the Rokid AI App, and deploy the target agent to the glasses for testing. This step validates actual device capabilities, not just web-based chat behavior.
AI Visual Insight: This image shows the glasses control plugin configuration interface. It indicates that the agent can already invoke native device actions, which means it has evolved from a pure chat interface into an agent layer capable of executing device commands.
AI Visual Insight: This image reflects the publish and review entry point for the agent. It shows that the Lingzhu platform uses a review-based release workflow for AI glasses applications, separating development state from public release and supporting personal testing as well as staged rollout validation.
AI Visual Insight: This image shows the developer configuration entry in the Rokid AI App. The key focus is on ADB debugging and agent debugging options, indicating that the mobile app acts as the bridge between platform configuration and the glasses hardware.
Real-world testing shows that the three narration styles produce meaningfully different outputs
The example uses Tiananmen as the test subject to validate the full flow from landmark recognition to style switching. The formal style emphasizes historical development and factual milestones. The humorous style focuses on conversational phrasing and personified description. The in-depth historical style emphasizes cultural meaning and historical context.
This shows that the prompt design does not stop at tone variation. It reaches into differences in information structure and narrative perspective, making the style change clearly perceptible at the product level.
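One way to picture "same recognition result, different narrative structure" is a template layer. These templates are purely illustrative; the real ones live in the agent's prompt configuration on the Lingzhu platform.

```python
# Illustrative style templates layered on a single recognition result.
STYLE_TEMPLATES = {
    "formal": "{name}: key milestones and factual history.",
    "humorous": "So you've met {name} -- let me tell you its funniest story.",
    "in-depth historical": "{name} in context: architecture, symbolism, and memory.",
}

def render_narration(landmark: str, style: str) -> str:
    """Apply a style template to a recognized landmark name."""
    return STYLE_TEMPLATES[style].format(name=landmark)
```

Because recognition and rendering are decoupled, switching styles never re-runs the vision model; only the expressive layer changes.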
AI Visual Insight: This image shows the first-round explanation generated after Tiananmen is recognized. It indicates that the system has already completed image recognition, landmark localization, and concise knowledge generation, while producing output suitable for both lens reading and real-time voice playback.
AI Visual Insight: This image illustrates the humorous output style. From a technical perspective, it shows that a style-specific generation template has been layered on top of the same recognition result, enabling stable content structure with switchable expression modes.
AI Visual Insight: This image corresponds to the formal narration style. The content is more focused on factual description and timeline organization, showing that when the model switches styles, it changes not only wording but also information ordering and the center of knowledge presentation.
AI Visual Insight: This image shows the in-depth historical style, emphasizing high-semantic-density content such as architectural evolution, cultural symbolism, and historical memory. It is well suited for users who care more deeply about cultural context.
The project’s real value is that it validates a vertical deployment path for AI glasses
This is not a simple port of a general-purpose Q&A assistant. It is a specialized agent built around the sequence of “visual capture → recognition and confirmation → short explanation generation → device playback.” For developers, the key is not whether they can build a model, but whether they can orchestrate hardware, input constraints, prompts, and plugin capabilities into a stable user experience.
If the system later integrates location, weather, ticketing, or route services, this kind of guide assistant can evolve from “an assistant that explains” into “a travel agent that accompanies the user.”
FAQ: structured questions and answers
1. Why must this project use first-turn image input?
Because landmark recognition depends on the visual content in the user’s current field of view. Without an image in the first turn, the model can only guess from text and cannot deliver the core experience of “see it and instantly hear the explanation.”
2. Why is a low-confidence confirmation mechanism necessary?
Landmarks can be misidentified when appearances are similar or shooting angles are limited. Asking for confirmation before generating the explanation significantly reduces hallucinated answers and protects usability and trust in real-world scenarios.
3. Which scenarios is this solution best suited to extend into?
Beyond cultural tourism, it is also well suited for exhibition guidance, campus explanation, industrial inspection support, museum narration, and other AI glasses scenarios where users need a structured explanation immediately after seeing an object.
Core Summary: This article reconstructs the full solution for building the “Pocket Tour Guide · Scenic Spot Narration Assistant” on Rokid AI Glasses with the Lingzhu platform. It covers product design, image-based inputs, multimodal model selection, prompt orchestration, plugin attachment, real-device debugging, and style-switching validation. It is a strong reference for developers interested in AI glasses, agent development, and tourism-focused deployments.