Vision-Based 3D Object Coordinate Estimation and Tracking

ROS 2 framework for real-time object detection, monocular 3D localization using camera geometry, and PID-based reactive tracking — validated in Gazebo simulation and on a real differential-drive robot.

10/1/2025

ROS2 · Computer Vision · OpenCV · Hough Transform · 3D Tracking · Gazebo · Monocular Depth · PID Control · AI · Robotics

Overview

This work presents a ROS 2 framework for real-time spherical object detection, monocular 3D localization, and reactive tracking. Departing from the usual blob-detection approach, it uses the Hough Circle Transform for robust 2D detection, then applies pure camera geometry to recover 3D coordinates from a single RGB image — no depth sensor, no stereo rig, no learned depth network.

The architecture is deliberately decoupled: detection, 3D estimation, and motion control each run as independent ROS 2 nodes. Tracking can be reactive on 2D data while a parallel consumer uses the 3D output for downstream tasks (logging, AR overlays, manipulator targeting).

📄 Published as a preprint on TechRxiv — Vision-Based 3D Object Coordinate Estimation and Tracking using Hough Transform and Modular 3D Estimation in ROS (2025).

Contributions

Hough Circle Transform as a robust, parameter-driven alternative to color-only blob detection.
Decoupled architecture — independent 3D estimation and 2D tracking processes for parallel use.
Comprehensive Gazebo validation — illumination sweeps, range tests, multi-ball scenarios.

System Architecture

The pipeline is split into three ROS 2 nodes that communicate purely through topics — every node can be replaced, restarted, or relocated to another machine without touching the others.

2D Detection — Hough Circle Transform

detect_ball consumes /camera/image_raw and runs a four-stage pipeline:

Pre-processing — Gaussian blur for noise reduction, then Adaptive Histogram Equalization to lift local contrast.
HSV color masking — convert to HSV, threshold on the target hue range. HSV is much more lighting-stable than RGB.
Hough Circle Transform — gradient-based Hough vote in (a, b, r) accumulator space, picking peaks where edge pixels agree on a circle:
```
(x − a)² + (y − b)² = r²
```
Filters: minimum / maximum radius (drop irrelevant detections) and accumulator threshold (only confident votes).
Output — publish normalized pixel center (u, v) and pixel radius r to /detected_ball.
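
To make the accumulator vote concrete, here is a toy pure-Python sketch of the idea. It is not the node's implementation, which per the tech stack relies on OpenCV's Hough Circle Transform. The function names, the angular sampling `n_angles`, the `min_votes` threshold, and the synthetic edge points are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def hough_circle_vote(edge_points, r_min, r_max, n_angles=72, min_votes=20):
    """Return the best (a, b, r, votes) circle hypothesis, or None.

    Each edge pixel (x, y) votes for every centre (a, b) that would place it
    on a circle of radius r, i.e. (x - a)^2 + (y - b)^2 = r^2.
    """
    accumulator = Counter()
    for x, y in edge_points:
        for r in range(r_min, r_max + 1):  # radius filter: only r_min..r_max
            for k in range(n_angles):
                phi = 2.0 * math.pi * k / n_angles
                a = round(x - r * math.cos(phi))
                b = round(y - r * math.sin(phi))
                accumulator[(a, b, r)] += 1
    if not accumulator:
        return None
    (a, b, r), votes = accumulator.most_common(1)[0]
    # Accumulator threshold: reject peaks that too few edge pixels agree on.
    return (a, b, r, votes) if votes >= min_votes else None


def synthetic_circle(cx, cy, radius, n_points=36):
    """Hypothetical test input: integer edge pixels on a circle outline."""
    return [
        (round(cx + radius * math.cos(2.0 * math.pi * i / n_points)),
         round(cy + radius * math.sin(2.0 * math.pi * i / n_points)))
        for i in range(n_points)
    ]


best = hough_circle_vote(synthetic_circle(40, 30, 10), r_min=8, r_max=12)
print(best)
```

The brute-force triple loop is only practical for tiny inputs. Gradient-based variants cut the cost by voting only along each edge pixel's gradient direction instead of over all angles.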

3D Position Estimation — Pure Camera Geometry

detect_ball_3d subscribes to /detected_ball and recovers 3D coordinates (x₃ᴅ, y₃ᴅ, z₃ᴅ) from the camera intrinsics — no depth sensor, no learned model.

Depth from apparent size

If the ball’s true diameter is 2·r_real and its apparent angular size at distance d is θ_ball, with z2d denoting the ball’s apparent diameter as a fraction of the image width:

θ_ball = z2d · h_fov         (image-plane angular extent)
d      = r_real / tan(θ_ball / 2)

Half of the subtended angle spans the true radius, which is why r_real is paired with θ_ball / 2. Once the camera’s horizontal field of view (h_fov) is calibrated, this tangent relation gives a stable mapping from apparent size to metric distance d.
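
As a worked illustration, the sketch below evaluates the two formulas for a pair of apparent sizes. It assumes z2d is the ball's apparent diameter as a fraction of the image width; the function name and the numeric values for `H_FOV` and `R_REAL` are hypothetical and not taken from the paper.

```python
import math


def depth_from_apparent_size(z2d, h_fov, r_real):
    """Distance to the ball from its normalised apparent diameter.

    z2d    : apparent diameter as a fraction of the image width, in (0, 1]
    h_fov  : horizontal field of view in radians
    r_real : true ball radius in metres
    """
    if not 0.0 < z2d <= 1.0:
        raise ValueError("z2d must be in (0, 1]")
    theta_ball = z2d * h_fov  # angular extent of the ball
    return r_real / math.tan(theta_ball / 2.0)


# Hypothetical camera and ball parameters, for illustration only.
H_FOV = math.radians(60.0)
R_REAL = 0.035

near = depth_from_apparent_size(0.20, H_FOV, R_REAL)
far = depth_from_apparent_size(0.10, H_FOV, R_REAL)
print(f"near: {near:.3f} m, far: {far:.3f} m")
```

For small angles tan(θ/2) ≈ θ/2, so halving the apparent size roughly doubles the estimated distance.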

Vertical decomposition

θ_y = y2d · v_fov / 2
y3d = d · sin(θ_y)
d'  = d · cos(θ_y)        // distance projected onto the horizontal plane

Horizontal decomposition

θ_x = x2d · h_fov / 2
x3d = d' · sin(θ_x)
z3d = d' · cos(θ_x)

The result is a clean (x₃ᴅ, y₃ᴅ, z₃ᴅ) in the camera frame, ready for downstream consumers — manipulators, AR overlays, dataset logging.
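
The two decompositions above translate almost line for line into code. This is a minimal sketch, assuming x2d and y2d are normalized offsets in [−1, 1] with 0 at the image centre and that the fields of view are given in radians; the function name and the example values are hypothetical.

```python
import math


def camera_frame_position(x2d, y2d, d, h_fov, v_fov):
    """Split the range d into (x3d, y3d, z3d) in the camera frame.

    x2d, y2d     : normalised image offsets in [-1, 1] (0 = image centre)
    d            : range to the ball in metres
    h_fov, v_fov : horizontal / vertical field of view in radians
    """
    # Vertical decomposition.
    theta_y = y2d * v_fov / 2.0
    y3d = d * math.sin(theta_y)
    d_proj = d * math.cos(theta_y)  # d' in the text: range in the horizontal plane

    # Horizontal decomposition.
    theta_x = x2d * h_fov / 2.0
    x3d = d_proj * math.sin(theta_x)
    z3d = d_proj * math.cos(theta_x)
    return x3d, y3d, z3d


# Hypothetical inputs, for illustration only.
x3d, y3d, z3d = camera_frame_position(
    x2d=0.4, y2d=-0.2, d=1.5,
    h_fov=math.radians(60.0), v_fov=math.radians(45.0),
)
print(f"x={x3d:.3f} m, y={y3d:.3f} m, z={z3d:.3f} m")
```

A useful sanity check is that the decomposition preserves the range: x3d² + y3d² + z3d² equals d², and a ball at the image centre maps to (0, 0, d).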

Reactive Tracking Control

follow_ball closes the loop on the 2D detection. It uses two simple controllers and a search behavior:

Angular control — proportional controller on the normalized horizontal offset x ∈ [−1, 1]:

ω = −Kp · x

Linear control — forward velocity gated by apparent size (the bigger the ball, the closer it is):

v = vf   while   r < r_max
v = 0    otherwise

Search behavior — if no detection arrives within t_max, the robot enters search mode at fixed angular velocity Ωs until reacquisition.
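
The three behaviors can be combined into a single control step. The sketch below is one possible arrangement, not the follow_ball source: the function signature, the parameter values, and the choice to hold forward velocity at zero while searching are assumptions.

```python
def follow_ball_step(x, r, age, *, kp, v_forward, r_max, t_max, omega_search):
    """Return (v, omega) for one control cycle.

    x   : normalised horizontal offset of the ball in [-1, 1]
    r   : apparent size of the ball (same units as r_max)
    age : seconds since the last detection arrived
    """
    if age > t_max:
        # Search behaviour: rotate in place until the ball is reacquired.
        # Zero forward speed during search is an assumption of this sketch.
        return 0.0, omega_search

    omega = -kp * x                        # angular control: omega = -Kp * x
    v = v_forward if r < r_max else 0.0    # linear control: stop once close
    return v, omega


# Hypothetical gains and thresholds, for illustration only.
PARAMS = dict(kp=0.7, v_forward=0.1, r_max=0.1, t_max=1.0, omega_search=0.5)

print(follow_ball_step(x=0.5, r=0.05, age=0.1, **PARAMS))   # ball off-centre and far
print(follow_ball_step(x=0.0, r=0.20, age=0.1, **PARAMS))   # ball centred and close
print(follow_ball_step(x=0.0, r=0.05, age=2.0, **PARAMS))   # detection is stale
```

In a ROS 2 node the returned pair would typically populate the linear.x and angular.z fields of a velocity command.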

Exponential filtering smooths offset and radius before they enter the controllers:

x̂ₜ = α · x̂ₜ₋₁ + (1 − α) · xₜ

This keeps the robot from chattering on noisy detections while still reacting quickly to genuine motion.
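
A minimal sketch of the filter, assuming the first measurement seeds the state (the source does not say how the filter is initialized). The class name is hypothetical.

```python
class ExponentialFilter:
    """First-order smoothing: x_hat_t = alpha * x_hat_{t-1} + (1 - alpha) * x_t."""

    def __init__(self, alpha):
        if not 0.0 <= alpha < 1.0:
            raise ValueError("alpha must be in [0, 1)")
        self.alpha = alpha
        self.value = None  # no estimate until the first measurement arrives

    def update(self, measurement):
        if self.value is None:
            # Assumption of this sketch: seed the state with the first sample.
            self.value = measurement
        else:
            self.value = self.alpha * self.value + (1.0 - self.alpha) * measurement
        return self.value


offset_filter = ExponentialFilter(alpha=0.5)
smoothed = [offset_filter.update(x) for x in [0.0, 1.0, 1.0, 1.0]]
print(smoothed)
```

A larger α weights the previous estimate more heavily, giving smoother but slower output; a smaller α tracks new measurements more closely at the cost of passing more noise through.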

Validation in Gazebo

Detection robustness vs illumination

The simulated scene’s main light source was swept across diffuse RGB values. Approximate lux levels were computed as:

Lux ≈ ((R + G + B) / 3) · L_ref       (L_ref = 10,000 lx for white light)

Diffuse RGB	Approx. Lux	Detected
(0.8, 0.8, 0.8)	8000 lx	✅
(0.5, 0.5, 0.5)	5000 lx	✅
(0.1, 0.1, 0.1)	1000 lx	✅
(0.05, 0.05, 0.05)	500 lx	✅

Detection held at every level tested, down to ~500 lx, which is comparable to typical office lighting.
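
The lux column in the table follows directly from the approximation formula. The short sketch below reproduces it; the function name is hypothetical, and L_ref = 10,000 lx is the reference value stated with the formula.

```python
L_REF = 10_000.0  # reference illuminance for white light, as stated above


def approx_lux(diffuse_rgb, l_ref=L_REF):
    """Approximate lux from a diffuse RGB triple with components in [0, 1]."""
    r, g, b = diffuse_rgb
    return (r + g + b) / 3.0 * l_ref


levels = [
    (0.8, 0.8, 0.8),
    (0.5, 0.5, 0.5),
    (0.1, 0.1, 0.1),
    (0.05, 0.05, 0.05),
]
for rgb in levels:
    print(rgb, f"{approx_lux(rgb):.0f} lx")
```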

Operational range

Bound	Distance	Reason
Minimum	10 cm	Camera FOV / lens distortion
Maximum	3.5 m	Camera resolution + Hough radius threshold

This range covers the bulk of indoor robot interaction tasks: following, pick-and-place, reactive games.

Sample 3D output

For the configuration shown in the paper:

x = 0.15 m
y = −0.34 m
z = 0.045 m

Velocity profiles during a chase show initial rotational alignment, then a steady forward approach, with smooth distance reduction and minimal lateral drift in the trajectory plot.

Tech Stack

ROS 2 — node graph, topics, lifecycle
OpenCV — Hough Circle Transform, HSV thresholding, CLAHE
Gazebo — physics + camera simulation
Python / C++ — node implementations
Differential-drive base — same platform as the SLAM project

Applications

Automated ball collection — sports stadiums, training centers
Industrial sorting — 3D coordinates feed a manipulator that classifies parts by size and position
Augmented reality — overlay graphics on real-world ball games
Agricultural robotics — size-filtered Hough detection for fruit-picking

Full Paper

The complete preprint, including all equations, figures, and references, is available on TechRxiv.

Future Work

Real-world validation beyond the Gazebo benchmarks — outdoor lighting, motion blur, partial occlusions
Multi-object tracking with persistent identities (Hungarian assignment + Kalman filtering)
Polymorphic detection — extend Hough to ellipses / generalized shapes for non-spherical targets
Sensor fusion — combine the monocular depth estimate with sparse LiDAR returns for confidence-weighted localization

Contributors

Imad-Eddine NACIRI
Oussama Errouji
Jade Bousliman

Takeaway

You don’t always need a depth sensor. With clean camera intrinsics, a robust 2D detector, and a decoupled architecture, monocular geometry is enough to put an object in 3D space — and ROS 2 turns that into a reusable building block for everything from reactive control to manipulation.