Imad-Eddine NACIRI
← Back to Projects

Real-Time Human Pose Imitation on NAO Robot

Real-time system that detects human pose with MediaPipe, estimates 3D joint angles through geometric analysis, and streams them to a NAO robot via NAOqi for upper-body imitation — plus a synchronized human↔robot dataset for imitation learning.

4/1/2025
NAO RobotMediaPipePose EstimationNAOqiComputer VisionHuman-Robot InteractionDatasetPythonRobotics
NAO humanoid robot imitating human upper-body motions in real time

Overview

This project builds a real-time human pose imitation system for the NAO humanoid robot. A webcam feeds MediaPipe Pose, geometric reasoning extracts 3D joint angles for the shoulders, elbows, and head, and those angles are streamed over a TCP socket to a NAOqi-side process that replays them on the robot — frame by frame — using setAngles and angleInterpolationWithSpeed.

On top of the live behavior, the same pipeline logs a synchronized human↔robot dataset (45,120+ entries) where each detected gesture is paired with both the commanded and the measured NAO joint angles — opening the door to imitation learning and motion-transfer research.

Why NAO?

NAO is a 58 cm humanoid with 25 DoF distributed across head, arms, hands, legs, and pelvis. Each joint is reachable through the NAOqi API (ALMotion), which exposes both instantaneous (setAngles) and time-interpolated (angleInterpolationWithSpeed) commands. Mathematically, the kinematic chain is described in the Denavit–Hartenberg convention; the forward kinematics map joint vectors θ to Cartesian end-effector poses, and inverse kinematics solve the reverse mapping under joint and balance constraints.

NAO humanoid robot used in our experiments NAO joint structure and DH coordinate frames

System Architecture

The pipeline is a clean sender ↔ receiver split — vision lives on one side, robot motion on the other, and they only exchange small JSON messages over TCP. This decoupling keeps the Python 3 / MediaPipe world separate from the legacy Python 2.7 NAOqi runtime.

SENDER · Python 3.10 Webcam Capture OpenCV MediaPipe Pose 33 landmarks Joint Angle Calc geometric, 3D Gesture Detectors Clap / Think / YAAAY… Stabilizer + JSON RECEIVER · Python 2.7 + NAOqi TCP Socket listener Dispatcher cmd → handler ALMotion setAngles / interp Sensor Logger actual joint angles NAO Robot

Sender (Python 3.10)

  • MediaPipe Pose extracts 33 body landmarks per frame (head, torso, arms, legs).
  • A library of gesture detectors (clapping.py, flipKick.py, thinking.py, …) classifies poses from geometric conditions on the landmarks.
  • A stabilizer layer (HOLD_FRAMES, HOLD_TIME, COOLDOWN) suppresses jitter.
  • Each stable gesture is shipped over TCP as a JSON payload: cmd, run_id, event_id, plus four landmark variants (raw / normalized / visibility / trusted).

Receiver (Python 2.7 + NAOqi)

  • A socket listener accepts each command and routes it to a motion handler.
  • ALMotion executes either an instantaneous setAngles or a blocking angleInterpolationWithSpeed, depending on the gesture.
  • Every commanded target is paired with the actual sensor reading and written to disk.

Pose Estimation with MediaPipe

MediaPipe gives us 33 3D landmarks per frame. The detection layer uses three building blocks built on top of them:

Euclidean distance — used for “wrists close together” (claps, hand-on-head, etc.):

d_ij = √( (x_i − x_j)² + (y_i − y_j)² + (z_i − z_j)² )

Vertical ordering — “wrist above elbow” tells you the arm is raised:

y_wrist  <  y_elbow

Joint angle from three landmarks — the elbow / knee / shoulder bend:

θ = arccos(  (p_shoulder − p_elbow) · (p_wrist − p_elbow)
           / (‖p_shoulder − p_elbow‖ · ‖p_wrist − p_elbow‖) )

Because everything is relative (vectors and ratios), the detectors are invariant to user height and camera distance.

MediaPipe Pose 33-landmark human skeleton detection

Gesture Vocabulary

The system ships with a curated set of recognizable gestures, each defined by simple, human-interpretable geometric conditions:

GestureConditionNAO Behavior
CLAPWrists within ε at the same vertical level, double-contact in 2 sAlternating ElbowRoll for applause
FLIPKICKLeft knee above hip + left wrist above shoulderLHipPitch, LKneePitch, head tilt
THINKRight wrist near head + left arm ≈ 90° + legs straightRight hand to head + slow head nod
YAAAYBoth wrists above elbows + above shoulders + knees bentBoth arms up, knees bend
STOPArms crossed at chest heightArms crossed (blocking)
TAP FIVE (L/R)One arm raised verticallySide-specific arm raise, speed 0.15
GRABBINGTorso lean forward + hand closeHip pitch + close hand + reset
LOOK FARHand near face + head turnedHand shading face + head yaw
Predefined NAO robot gestures used in the system

Stabilization

Real-time landmark detection is noisy, so several layers smooth the signal before it can fire a robot motion:

  1. Hold frames — a gesture must persist for ≥ N frames (typ. 4) before being confirmed.
  2. Hold time — minimum stable duration (≥ 0.35 s) to filter transients.
  3. Cooldown — once fired, a gesture is suppressed for C seconds (CLAP uses 3 s) to avoid double-triggering.
  4. Trusted landmarks — points with MediaPipe visibility below a threshold are zeroed out.
  5. Exponential smoothingL̄_i(t) = α·L_trusted(t) + (1 − α)·L̄_i(t − 1) (typ. α = 0.2).

Together these turn a flickering raw stream into clean gesture commands.

Synchronized Dataset

A central goal of the project is the dataset, not just the live show. Every gesture event is logged on both sides of the socket, indexed by a run_id (per-session) and event_id (per-gesture):

Sender Log (JSONL) landmarks_raw / norm visibility / trusted Receiver Log (JSONL) commanded angles actual angles Merge on run_id + event_id Paired Dataset 45,120 entries human ↔ robot
Sender / receiver data workflow with synchronized identifiers

Normalization & filtering (made the data usable)

Before merging, landmarks are normalized by a torso reference point and a scale factor (shoulder width):

L̃_i(t) = ( L_i(t) − L_ref(t) ) / ‖ L_scale(t) − L_ref(t) ‖_2

This way a tall and a short person yield comparable values, and changes in camera distance are absorbed automatically. Low-visibility landmarks are zeroed via:

L_trusted(t) = L̃_i(t)   if v_i(t) ≥ τ
              0          otherwise

Results

The merged log contains 45,120 paired records, covering FLIPKICK, CLAP (multi-phase, blocking), STOP, YAAAY, TAP FIVE (L/R), and GRABBING.

Execution fidelity (commanded vs measured joint angle)

Gesture / phaseMethodMean |err| (rad)Max |err| (rad)
CLAP rep 1 phase AangleInterpolation (blocking)0.038 (≈ 2.2°)0.100 (≈ 5.7°)
CLAP rep 1 phase BangleInterpolation (blocking)0.014 (≈ 0.8°)0.023
FLIPKICK keyframesetAngles (non-blocking, mid-motion)0.670 (≈ 38°)2.178

Two clear takeaways:

  • Blocking interpolation gives sub-degree to low-degree fidelity when the motion has time to settle — exactly what we want for a clean dataset.
  • Non-blocking snapshots inflate “error” because the robot is sampled mid-trajectory — useful for studying transient dynamics, not for steady-state accuracy.

Recognition

  • CLAP and YAAAY were consistently recognized across users.
  • More subtle gestures (THINK, LOOK FAR) occasionally suffered from partial occlusions, but the visibility-based filtering and HOLD/COOLDOWN windows kept false positives low.
YAAAY gesture executed by NAO NAO performing a gesture during a live session

Challenges & Solutions

  • False positives / overlapping gestures — solved with multi-frame validation, visibility thresholds, and per-gesture suppression rules.
  • Clap detection — naive wrist proximity was too sensitive; replaced with a double-contact-within-2s rule plus a 3 s cooldown.
  • Sender/receiver synchronization — solved by stamping every payload with run_id + event_id, which makes log merging trivial.
  • Latency vs fidelity — blocking calls are accurate but slow; non-blocking is responsive but noisy. The dataset records both method and speed so downstream analyses can pick a side.

Tech Stack

  • Python 3.10 (sender) / Python 2.7 (receiver, required by NAOqi)
  • MediaPipe Pose — body landmark detection
  • NAOqi SDKALMotion (setAngles, angleInterpolationWithSpeed)
  • OpenCV + NumPy — frame capture and vector math
  • TCP sockets + JSON — inter-process communication
  • NAO V6 humanoid

Watch the Imitation

Future Work

  • Larger gesture vocabulary — waving, pointing, handshakes, multi-step interactions
  • Learning-based detection — replace the heuristic detectors with a model trained on our own dataset
  • Latency-aware control — predictive blends of blocking + non-blocking actuation
  • Cross-robot transfer — port the dataset and pipeline to other humanoids (Pepper, Reachy)
  • HRI studies — measure not only execution accuracy but also user trust and intuitiveness

Contributors

  • Imad-Eddine NACIRI
  • Oussama Errouji

Takeaway

Real-time imitation is the easy half — the dataset is the contribution. Pairing MediaPipe landmarks with measured NAO joint feedback turns a teleop demo into a reusable benchmark for imitation learning, motion transfer, and gesture-grounded HRI.