Zubnet AILearnWiki › Object Detection

Object Detection

YOLO, Bounding Box Detection
Identifying and localizing objects in images or video by drawing bounding boxes around them and classifying what each box contains. "There's a car at position (x1,y1,x2,y2) and a person at (x3,y3,x4,y4)." Unlike image classification (which says what's in the image), object detection says what's in the image and where — enabling counting, tracking, and spatial reasoning.
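The "what and where" distinction can be made concrete with a minimal sketch (the `Detection` record and the sample values are illustrative, not any particular library's output format): each detection carries a class label, a box, and a confidence score, which is what makes counting possible.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Detection:
    label: str                # what the box contains
    x1: float; y1: float      # top-left corner, in pixels
    x2: float; y2: float      # bottom-right corner, in pixels
    score: float              # model confidence in [0, 1]

# Hypothetical detector output for one frame
detections = [
    Detection("car",    10, 20, 110,  90, 0.92),
    Detection("person", 200, 40, 240, 160, 0.88),
    Detection("car",    300, 25, 420,  95, 0.81),
]

# Counting per class — something a whole-image classifier cannot do,
# since it would only report that cars and a person appear somewhere.
counts = Counter(d.label for d in detections)
print(counts["car"])  # 2
```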

Why it matters

Object detection is the technology behind self-driving cars (detecting pedestrians, vehicles, signs), security cameras (person detection), retail analytics (counting shoppers), manufacturing quality control (detecting defects), and augmented reality (placing virtual objects relative to real ones). It's one of the most commercially deployed computer vision capabilities.

Deep Dive

The YOLO (You Only Look Once) family is the most popular real-time object detection architecture. YOLO divides the image into a grid, predicts bounding boxes and class probabilities for each grid cell in a single forward pass, and filters overlapping detections. YOLOv8 and YOLO-World achieve real-time detection (30+ FPS) with high accuracy on consumer hardware. The alternative, two-stage detectors like Faster R-CNN, first propose candidate regions and then classify each one; they are typically more accurate but slower.
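The "filters overlapping detections" step is usually non-maximum suppression (NMS), driven by intersection-over-union (IoU) between boxes. A minimal sketch of the standard greedy variant (the boxes and scores below are made-up inputs, not output from any specific model):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: visit boxes from highest score
    down, keeping each box only if it overlaps no kept box by more
    than `thresh`. Returns indices of the surviving boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one object, plus one distant object
boxes = [(10, 10, 110, 110), (15, 12, 112, 108), (300, 40, 360, 200)]
scores = [0.9, 0.8, 0.95]
print(nms(boxes, scores))  # [2, 0] — the weaker duplicate is suppressed
```

Production detectors run the same idea per class and on the GPU, but the logic is this.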

Beyond Bounding Boxes

Bounding boxes are rectangles — they approximate object location but include background. Instance segmentation (Mask R-CNN, SAM) produces pixel-level masks for each object. Panoptic segmentation labels every pixel as either a specific object instance or a background class. Keypoint detection identifies specific points on objects (joints on a human body for pose estimation). Each adds precision at the cost of compute.
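How much background a tight box still contains can be quantified with a toy calculation (illustrative, not from the article): for a disk-shaped object, even the tightest box is the square around it, so roughly 1 − π/4 ≈ 21% of the box pixels are background that a pixel mask would exclude.

```python
# A disk of radius r; its tightest bounding box is a 2r x 2r square.
# Count pixels inside the disk versus pixels inside the box.
r = 50
cx = cy = r
mask_pixels = sum(
    1
    for y in range(2 * r)
    for x in range(2 * r)
    if (x - cx) ** 2 + (y - cy) ** 2 <= r * r
)
box_pixels = (2 * r) ** 2
background = 1 - mask_pixels / box_pixels
print(f"about {background:.0%} of the box is background")
```

Irregular shapes (people with outstretched arms, thin elongated objects) fare far worse, which is why segmentation matters when the box approximation isn't good enough.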

Zero-Shot Detection

Traditional object detectors only find objects from their training categories. Zero-shot detectors (Grounding DINO, OWL-ViT, YOLO-World) can find any object described in natural language: "find all coffee cups" works even if the model never trained on coffee cups. This is possible because these models combine vision and language understanding, matching text descriptions to image regions. It's transformative for applications where the objects of interest change frequently.
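The matching of text to image regions boils down to comparing embeddings in a shared space: the model scores each candidate region by its similarity to the query's text embedding. A toy sketch of that mechanism — the vectors here are hand-made stand-ins, not embeddings from Grounding DINO, OWL-ViT, or any real model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Stand-ins for what a vision-language model would produce: one
# embedding for the text query, one per candidate image region.
text_query = [0.9, 0.1, 0.2]            # embedding for "coffee cup"
regions = {
    "region_a": [0.88, 0.15, 0.25],     # region that resembles a cup
    "region_b": [0.05, 0.95, 0.10],     # region that resembles a laptop
}

threshold = 0.8  # similarity above which a region counts as a match
matches = [name for name, emb in regions.items()
           if cosine(text_query, emb) > threshold]
print(matches)  # ['region_a']
```

Because the query side is just text, swapping "coffee cups" for any other phrase requires no retraining — which is the property that makes zero-shot detection useful when the objects of interest change frequently.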
