The YOLO (You Only Look Once) family is the most popular real-time object detection architecture. YOLO divides the image into a grid, predicts bounding boxes and class probabilities for each grid cell in a single forward pass, and then filters overlapping detections with non-maximum suppression (NMS). YOLOv8 and YOLO-World achieve real-time detection (30+ FPS) with high accuracy on consumer hardware. Two-stage detectors such as Faster R-CNN are the main alternative: typically more accurate, but slower.
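To make the single-pass workflow concrete, here is a minimal sketch using the ultralytics package to run YOLOv8 on one image and print the detections that survive filtering. The weights name, image path, and confidence/IoU thresholds are illustrative assumptions, not values from this text.

```python
# Minimal single-image detection with YOLOv8 via the ultralytics package.
# Assumes `pip install ultralytics`; "street.jpg" is a placeholder path.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano variant; weights download on first use
results = model("street.jpg", conf=0.25, iou=0.45)  # conf/iou control detection filtering

for box in results[0].boxes:
    cls_name = results[0].names[int(box.cls)]   # class label for this detection
    x1, y1, x2, y2 = box.xyxy[0].tolist()       # corner coordinates in pixels
    print(f"{cls_name}: {box.conf.item():.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

On a typical consumer GPU this runs comfortably above real-time rates; the same call works on video streams by passing a video path or camera index.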
Bounding boxes are rectangles: they approximate an object's location but inevitably include background pixels. Instance segmentation (Mask R-CNN, SAM) produces a pixel-level mask for each object. Panoptic segmentation labels every pixel as either a specific object instance or a background class. Keypoint detection identifies specific points on objects (e.g., joints on the human body for pose estimation). Each adds precision at the cost of more compute.
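The box-versus-mask gap is easy to quantify. The NumPy-only sketch below uses a synthetic circular mask (no model involved) to show how much background even the tightest bounding box admits compared with a pixel-level mask.

```python
# Toy illustration of box vs. mask precision: given an instance mask, the
# tightest bounding box still contains background pixels. Synthetic circular
# "object" only; no detector or segmenter is run here.
import numpy as np

h, w = 200, 200
yy, xx = np.mgrid[0:h, 0:w]
mask = (yy - 100) ** 2 + (xx - 100) ** 2 < 60 ** 2  # fake object: a filled circle

ys, xs = np.where(mask)
x1, y1, x2, y2 = xs.min(), ys.min(), xs.max() + 1, ys.max() + 1  # tight box around the mask

box_area = (x2 - x1) * (y2 - y1)
object_area = int(mask.sum())
print(f"box area: {box_area}, object pixels: {object_area}")
print(f"background inside box: {1 - object_area / box_area:.0%}")  # roughly 20% for this circle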
Traditional object detectors can only find objects from the categories they were trained on. Zero-shot (open-vocabulary) detectors such as Grounding DINO, OWL-ViT, and YOLO-World can find any object described in natural language: "find all coffee cups" works even if the model was never trained on coffee cups. This works because these models combine vision and language understanding, matching text descriptions to image regions. It's transformative for applications where the objects of interest change frequently.
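A hedged sketch of how this looks in practice, using YOLO-World through the ultralytics package: the prompt classes are free-form text, and the weights filename and image path below are placeholder assumptions that may differ in your setup.

```python
# Open-vocabulary detection with YOLO-World via the ultralytics package.
# "kitchen.jpg" is a placeholder path; the weights file name may vary by version.
from ultralytics import YOLO

model = YOLO("yolov8s-world.pt")             # open-vocabulary YOLO-World weights
model.set_classes(["coffee cup", "laptop"])  # describe targets in natural language

results = model.predict("kitchen.jpg", conf=0.3)
for box in results[0].boxes:
    print(results[0].names[int(box.cls)], f"{box.conf.item():.2f}")
```

Swapping the prompt list is all it takes to retarget the detector, which is exactly why these models suit applications whose objects of interest change frequently.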