
Object Detection

YOLO, Bounding Box Detection
Identifying and localizing objects in images or video by drawing bounding boxes around them and classifying what each box contains: "There is a car at position (x1,y1,x2,y2) and a person at (x3,y3,x4,y4)." Unlike image classification (which says what is in an image), object detection says what is in the image and where it is, enabling counting, tracking, and spatial reasoning.
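The (x1,y1,x2,y2) box format above comes with a standard overlap measure, intersection over union (IoU), which underpins both detector evaluation and the filtering step discussed later. A minimal sketch in Python (the box coordinates are illustrative, not from any real detector):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    # Coordinates of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

car = (10, 10, 50, 30)     # hypothetical detection
person = (40, 20, 60, 60)  # hypothetical detection
print(round(iou(car, person), 3))  # small overlap between the two boxes
```

An IoU of 1.0 means identical boxes; 0.0 means no overlap. Benchmarks such as COCO count a detection as correct only when its IoU with a ground-truth box exceeds a threshold (commonly 0.5).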

Why It Matters

Object detection is the technology behind self-driving cars (detecting pedestrians, vehicles, signs), security cameras (person detection), retail analytics (counting shoppers), manufacturing quality control (detecting defects), and augmented reality (placing virtual objects relative to real ones). It is one of the most commercially deployed computer vision capabilities.

Deep Dive

The YOLO (You Only Look Once) family is the most popular real-time object detection architecture. YOLO divides the image into a grid, predicts bounding boxes and class probabilities for each grid cell in a single forward pass, and filters overlapping detections. YOLOv8 and YOLO-World achieve real-time detection (30+ FPS) with high accuracy on consumer hardware. Two-stage detectors (such as Faster R-CNN) are the main alternative: generally more accurate, but slower.
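The "filters overlapping detections" step is non-maximum suppression (NMS): keep the highest-scoring box, discard any remaining box that overlaps it beyond an IoU threshold, and repeat. A self-contained sketch of the greedy algorithm (a conceptual illustration, not the actual YOLO implementation; all boxes and scores are made up):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Three overlapping detections of the same car, plus one distinct person (toy numbers)
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 9, 49, 51), (100, 100, 140, 160)]
scores = [0.9, 0.75, 0.6, 0.8]
print(nms(boxes, scores))  # [0, 3]: the best car box and the person survive
```

In practice detectors run NMS per class, so an overlapping car box and person box never suppress each other.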

Beyond Bounding Boxes

Bounding boxes are rectangles — they approximate object location but include background. Instance segmentation (Mask R-CNN, SAM) produces pixel-level masks for each object. Panoptic segmentation labels every pixel as either a specific object instance or a background class. Keypoint detection identifies specific points on objects (joints on a human body for pose estimation). Each adds precision at the cost of compute.

Zero-Shot Detection

Traditional object detectors only find objects from their training categories. Zero-shot detectors (Grounding DINO, OWL-ViT, YOLO-World) can find any object described in natural language: "find all coffee cups" works even if the model never trained on coffee cups. This is possible because these models combine vision and language understanding, matching text descriptions to image regions. It's transformative for applications where the objects of interest change frequently.
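Conceptually, these models embed text queries and candidate image regions into a shared vector space and match them by similarity. A toy sketch of that matching step with made-up 3-dimensional embeddings (real models like OWL-ViT learn high-dimensional vectors from text and image encoders; every number and region name here is illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings; a real model produces these from its encoders.
query_vec = [0.9, 0.1, 0.2]          # text encoder output for "coffee cup"
regions = {
    "region_1": [0.88, 0.15, 0.25],  # image region that resembles a cup
    "region_2": [0.1, 0.9, 0.3],     # image region of something else
}

for name, vec in regions.items():
    print(name, round(cosine(query_vec, vec), 3))
# region_1 scores far higher, so it is returned as a "coffee cup" detection
```

Because matching happens in this shared space rather than against a fixed class list, swapping in a new query string ("find all fire extinguishers") requires no retraining.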

Related Concepts
