[AI Readability Summary] This HarmonyOS 6 API 23 spatial music workstation combines Face AR micro-expression input, Body AR skeletal gestures, the AudioKit processing chain, and an immersive ArkUI interface to solve the fragmented interaction model and limited expressive control of traditional DAWs. Keywords: HarmonyOS 6, Face AR, Body AR.
Technical Specifications Snapshot
| Parameter | Details |
|---|---|
| Target Platform | HarmonyOS 6 (API 23) PC |
| Core Language | ArkTS / ETS |
| AR Capabilities | Face AR, Body AR |
| UI Framework | ArkUI + HDS |
| Audio Capability | AudioKit |
| Key Dependencies | @hms.core.ar.arengine, @kit.AudioKit, @kit.UIDesignKit |
| License | Original article marked as CC 4.0 BY-SA |
This project redefines human input for music creation
Traditional DAWs rely on a mouse, keyboard, and MIDI controllers. The problem is not a lack of features but the long chain between emotion and expression: creators first feel something, then translate that feeling into parameter operations, and much of the original intuition is lost along the way.
This approach turns the face and body into input devices. Face AR maps micro-expressions to timbre parameters. Body AR maps gestures, position, and posture to notes, velocity, octave, and sustain state, creating a closed-loop spatial composition workflow.
Figure: Immersive concept for the main application interface. The center area highlights the music workstation layout and emphasizes the layered relationship between AR perception, a light-reactive background, and floating interaction panels in a large-screen PC scenario, suited to real-time performance feedback and spectrum-driven lighting effects.
The architecture is split into four layers for parallel development
The first layer is the AR perception layer, which captures facial BlendShapes and human skeletal keypoints. The second layer is the music mapping layer, which computes the rules that convert expressions into timbre changes and posture into performance events. The third layer is the audio engine layer, which handles synthesizers, effects, and sequencing. The fourth layer is the ArkUI interaction layer, which renders light-reactive visuals and floating panels.
The value of this layered design is low coupling. You can replace the audio engine independently, or develop the UI first with mock data before connecting the real AR data stream.
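To make the decoupling concrete, here is a minimal sketch of what the layer contracts could look like. All interface and type names below are illustrative assumptions, not the original project's API.

// Hypothetical layer contracts; every name here is an assumption for illustration
interface PerformanceEvent {
  note: number;     // MIDI note number
  velocity: number; // 0..127
}

interface ARPerceptionLayer {
  onFaceFrame(cb: (blendShapes: Map<string, number>) => void): void;
  onBodyFrame(cb: (keypoints: Map<string, [number, number]>) => void): void;
}

interface MusicMappingLayer {
  mapExpression(blendShapes: Map<string, number>): ToneParameters; // ToneParameters is defined below
  mapPosture(keypoints: Map<string, [number, number]>): PerformanceEvent[];
}

interface AudioEngineLayer {
  applyTone(params: ToneParameters): void;
  schedule(events: PerformanceEvent[]): void;
}

With contracts like these, the UI layer can be fed mock frames during development and switched to the live AR stream later. The parameter model these layers exchange is the ToneParameters structure below.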
export interface ToneParameters {
  reverbMix: number;        // Reverb amount
  highFreqBoost: number;    // High-frequency boost
  distortionAmount: number; // Distortion amount
  filterCutoff: number;     // Filter cutoff frequency (Hz)
}

// Expression parameters act as an intermediate layer for timbre control
const tone: ToneParameters = {
  reverbMix: 0.2,
  highFreqBoost: 0,
  distortionAmount: 0,
  filterCutoff: 20000 // Fully open by default
};
This code defines the core parameter model required for expression-to-timbre mapping.
The Face AR mapping engine is responsible for emotion quantization
The most important part of the original implementation is mapping BlendShape readings into stable and usable audio parameters. For example, a smile increases reverb, raised eyebrows boost high frequencies, a frown adds distortion, surprise triggers an arpeggio, and an open mouth lowers the filter cutoff frequency.
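As a sketch of what that rule stage can look like, the snippet below writes expression readings into target timbre parameters. The BlendShape key names and scaling constants are assumptions for illustration, not the original implementation.

// Hypothetical expression-to-target mapping; BlendShape names and constants are assumptions
private updateTargetParams(blendShapes: Map<string, number>): void {
  const smile = blendShapes.get('mouthSmile') ?? 0;      // Readings are assumed to be 0..1
  const browRaise = blendShapes.get('browInnerUp') ?? 0;
  const frown = blendShapes.get('browDown') ?? 0;
  const jawOpen = blendShapes.get('jawOpen') ?? 0;

  this.targetParams.reverbMix = 0.2 + smile * 0.6;          // Smile widens the reverb
  this.targetParams.highFreqBoost = browRaise;              // Raised eyebrows boost highs
  this.targetParams.distortionAmount = frown * 0.8;         // Frown adds grit
  this.targetParams.filterCutoff = 20000 - jawOpen * 18000; // Open mouth lowers the cutoff (Hz)
}

These are target values only; the smoothing pass described next is what actually reaches the audio chain.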
The real engineering challenge here is not the if statements. It is smoothing. If you write instantaneous expression values directly into audio parameters, the timbre will jitter noticeably. That is why the implementation introduces a smoothingFactor, progressively moving the current value toward the target value to reduce abrupt changes.
private readonly smoothingFactor: number = 0.1; // Fraction of the remaining gap closed per update

private smoothParameters(): void {
  Object.keys(this.currentParams).forEach(key => {
    const current = (this.currentParams as any)[key]; // Current parameter value
    const target = (this.targetParams as any)[key];   // Target from the latest expression frame
    // Exponential smoothing: step the current value a fraction of the way toward the target
    (this.currentParams as any)[key] = current + (target - current) * this.smoothingFactor;
  });
}
This code makes expression-driven timbre changes sound more natural and prevents parameter jumps.
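In practice, the smoothing has to run continuously. Here is a minimal sketch of a driving loop, assuming a roughly 60 Hz update interval; applyToAudioChain is a hypothetical hand-off to the audio engine, not a real API.

// Hypothetical update loop; the 16 ms interval and applyToAudioChain are assumptions
private startSmoothingLoop(): void {
  setInterval(() => {
    this.smoothParameters();                    // Step current values toward targets
    this.applyToAudioChain(this.currentParams); // Hypothetical hand-off to the audio engine
  }, 16); // Roughly once per frame at 60 fps
}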
Body AR upgrades both hands from pointers into instruments
The Body AR design assigns clear roles to the left and right hands. The left hand controls chord root notes and chord types. The right hand controls melodic pitch and modulation. The distance between both hands controls octave switching, while leaning forward triggers sustain or mode changes.
This mapping design is better suited to spatial interaction than tapping virtual piano keys with one hand, because it draws on the body's natural depth, horizontal position, and inter-hand distance cues without requiring extra UI buttons.
// Euclidean distance between the wrists, in normalized image coordinates
const handDistance = Math.sqrt(
  Math.pow(leftWrist.x - rightWrist.x, 2) +
  Math.pow(leftWrist.y - rightWrist.y, 2)
);
// Distances between the two thresholds leave the octave unchanged;
// a production version would also debounce to avoid retriggering every frame
if (handDistance < 0.12) {
  this.currentState.octave--; // Hands close together: lower the octave
} else if (handDistance > 0.45) {
  this.currentState.octave++; // Hands spread apart: raise the octave
}
This code shows how to map the distance between both hands to octave switching control.
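For the right hand's melodic role, a complementary sketch maps vertical wrist position onto scale degrees. The C major scale table and the normalized coordinate convention are assumptions for illustration.

// Hypothetical right-hand pitch mapping; the scale table is an illustrative assumption
const C_MAJOR: number[] = [0, 2, 4, 5, 7, 9, 11]; // Semitone offsets within one octave

function wristToMidiNote(wristY: number, octave: number): number {
  // wristY is assumed normalized to 0..1 with 0 at the top of the frame
  const degree = Math.min(6, Math.max(0, Math.floor((1 - wristY) * 7))); // Higher hand, higher degree
  return 60 + octave * 12 + C_MAJOR[degree]; // 60 is middle C
}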
The interface is not a decorative layer but a second expression surface for audio state
The spectrum-driven title bar and floating mixing panel in the original design are especially representative. The title bar uses FFT or simulated spectrum data to switch its primary color and pulse intensity, creating a global atmosphere of sound visualization. The floating panel handles the mixer, expression mapping, and effects chain configuration.
The advantage of this design is that even without looking at a parameter panel, users can still understand the current musical state through color, glow, border, and shadow intensity.
this.dominantColor = this.SPECTRUM_COLORS[dominantBand]; // Select the dominant frequency-band color
this.pulseIntensity = 0.3 + maxEnergy * 0.5; // Adjust pulse intensity based on energy
This code completes the core mapping from spectrum energy to UI lighting effects.
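For context, here is a sketch of how dominantBand and maxEnergy could be derived from a spectrum frame. The band layout, the normalization of bin magnitudes, and the size of the SPECTRUM_COLORS palette are assumptions.

// Hypothetical spectrum reduction; band count and normalization are assumptions
private analyzeSpectrum(spectrum: number[]): void {
  const bandCount = this.SPECTRUM_COLORS.length;
  const bandSize = Math.floor(spectrum.length / bandCount);
  if (bandSize === 0) { return; } // Not enough bins for the band layout
  let dominantBand = 0;
  let maxEnergy = 0;
  for (let band = 0; band < bandCount; band++) {
    let energy = 0;
    for (let bin = band * bandSize; bin < (band + 1) * bandSize; bin++) {
      energy += spectrum[bin]; // Accumulate bin magnitudes in this band
    }
    energy /= bandSize; // Average magnitude, assumed normalized to 0..1
    if (energy > maxEnergy) {
      maxEnergy = energy;
      dominantBand = band;
    }
  }
  this.dominantColor = this.SPECTRUM_COLORS[dominantBand]; // Dominant band sets the hue
  this.pulseIntensity = 0.3 + maxEnergy * 0.5;             // Energy sets the pulse strength
}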
Environment setup must cover both dependencies and permissions
Project dependencies are concentrated in four capability categories: AR, audio, UI, and sensors. At minimum, the app requires camera, microphone, and network permissions. Otherwise, AR tracking and audio capture will not work correctly.
{
  "dependencies": {
    "@hms.core.ar.arengine": "^6.1.0",
    "@kit.UIDesignKit": "^6.0.0",
    "@kit.AudioKit": "^6.0.0"
  }
}
This configuration declares the key dependencies required to build the AR music workstation.
{
  "module": {
    "requestPermissions": [
      { "name": "ohos.permission.CAMERA" },
      { "name": "ohos.permission.MICROPHONE" },
      { "name": "ohos.permission.INTERNET" }
    ]
  }
}
This configuration ensures the app can legally access the camera, microphone, and network.
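Declaring the permissions is only half the story: CAMERA and MICROPHONE are user_grant permissions and must also be requested at runtime. Below is a minimal sketch using the standard abilityAccessCtrl flow, with error handling omitted for brevity.

// Runtime request for user_grant permissions via the standard AbilityKit APIs
import { abilityAccessCtrl, Permissions, common } from '@kit.AbilityKit';

const PERMISSIONS: Permissions[] = [
  'ohos.permission.CAMERA',
  'ohos.permission.MICROPHONE'
];

async function requestAppPermissions(context: common.UIAbilityContext): Promise<boolean> {
  const atManager = abilityAccessCtrl.createAtManager();
  const result = await atManager.requestPermissionsFromUser(context, PERMISSIONS);
  // authResults holds 0 for each granted permission and -1 for each denied one
  return result.authResults.every((status: number) => status === 0);
}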
This solution is best suited to three expansion directions
The first direction is AI-assisted composition, combining expression curves, posture trajectories, and harmony rules to generate accompaniment in real time. The second direction is distributed collaboration, allowing multiple HarmonyOS devices to join the same shared performance session. The third direction is AR glasses integration, projecting virtual keyboards and tracks into real physical space.
For production deployment, validate three things first: AR tracking stability, low-latency audio, and parameter mapping interpretability. The first two determine usability. The last one determines whether the creative experience truly feels like playing an instrument.
FAQ
1. Why must Face AR parameters be smoothed before directly controlling audio effects?
Because facial micro-expressions contain high-frequency jitter. If you write them directly into reverb, filter, or distortion parameters, they will cause obvious jumps and noise-like artifacts. Smoothing interpolation makes the sound more continuous and closer to the behavior of a real performance controller.
2. Why does the Body AR gesture mapping use a left-hand-for-chords and right-hand-for-melody split?
Because it aligns with common keyboard and arranging habits. The left hand provides the harmonic structure, while the right hand handles melodic expression. This lowers the learning curve and keeps spatial movement aligned with traditional musical thinking.
3. What capabilities should be added first to move this project from prototype to a releasable product?
Prioritize real AR data integration, an AudioKit low-latency playback pipeline, exception tracking and recovery mechanisms, and calibration panels for expression and gesture sensitivity. Without these, the interface may look complete, but the experience will be difficult to stabilize in production.
Core Summary: This article reconstructs an AR-based music creation solution built on HarmonyOS 6 (API 23), systematically breaking down the engineering implementation of Face AR expression-to-timbre mapping, Body AR gesture performance, spectrum-driven light-reactive UI, and floating mixing panels. It is especially useful for developers who want to build spatial interaction music applications on HarmonyOS PC.