Fundamentals

Clustering

K-Means, DBSCAN, Cluster Analysis
An unsupervised learning task that groups similar data points without predefined labels. Given customer purchase data, clustering can discover distinct customer segments (bargain hunters, luxury buyers, occasional shoppers). K-means is the most common algorithm: choose K clusters, assign each point to the nearest cluster center, and iteratively refine the centers.

Why It Matters

Clustering is among the most common unsupervised learning tasks and shows up everywhere: customer segmentation, document grouping, anomaly detection (outliers that do not fit any cluster), image compression (grouping similar pixels), and data exploration (what natural groups exist in my data?). It is often the first step in understanding a new dataset.

Deep Dive

K-means works by: (1) randomly initializing K cluster centers, (2) assigning each data point to the nearest center, (3) moving each center to the mean of its assigned points, (4) repeating steps 2–3 until the assignments stop changing. The result depends on the random initialization, so the algorithm is usually run several times and the best run is kept. The main challenge is choosing K. The "elbow method" (plot the within-cluster sum of squared distances against K and look for the bend) and silhouette scores are common heuristics, but the right number of clusters often requires domain knowledge.
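The four steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names (`kmeans`, `squared_distance`), the `seed` parameter, and the toy `points` data are all assumptions made for the example. The function also returns the within-cluster sum of squared distances, which is the quantity an elbow plot would track.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iters=100, seed=0):
    """Toy K-means. Returns (labels, centers, inertia)."""
    rng = random.Random(seed)
    # Step 1: pick K distinct data points as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared distances (the elbow-method "loss").
    inertia = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, inertia

# Hypothetical toy data: two loose groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# Elbow-style scan: the loss shrinks as K grows; look for where the drop levels off.
for k in range(1, 5):
    _, _, inertia = kmeans(points, k)
    print(k, round(inertia, 3))
```

In practice a library implementation such as scikit-learn's `KMeans` is the usual choice; the sketch only makes the loop structure visible.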

Beyond K-Means

DBSCAN discovers clusters of arbitrary shapes (K-means assumes spherical clusters) and automatically identifies outliers as noise points. Hierarchical clustering builds a tree of nested clusters that you can cut at any level. Gaussian Mixture Models (GMMs) model clusters as probability distributions, allowing soft assignments (a point can partially belong to multiple clusters). Each method has strengths for different data geometries and use cases.
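To make the density-based idea concrete, here is a simplified DBSCAN sketch in plain Python. The names `dbscan`, `region_query`, `eps`, and `min_pts` are chosen for this example, the neighbor search is a brute-force scan, and `min_pts` is assumed to count the point itself. The data is a hypothetical toy set with two dense groups and one isolated point.

```python
import math

NOISE = -1  # label used for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Toy DBSCAN. Returns one label per point; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

# Hypothetical toy data: two dense squares and one far-away outlier.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 50.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note that the number of clusters is never specified: it falls out of `eps` and `min_pts`, and the isolated point is reported as noise rather than forced into a cluster.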

Clustering with Embeddings

Combining embeddings with clustering is powerful for text analysis. Embed a collection of documents using a sentence embedding model, then cluster the embeddings. Each cluster represents a semantic group: a topic, theme, or category that emerges from the data. Common uses include organizing support tickets by topic, discovering themes in survey responses, grouping similar products, and topic modeling (a modern alternative to LDA). The clusters can then be labeled by asking an LLM to summarize what each cluster is about.
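The embed-then-cluster workflow can be sketched as follows. Everything here is illustrative: the three-dimensional `embeddings` are hand-written stand-ins for the output of a real sentence embedding model, and the grouping step is a simple greedy cosine-similarity threshold rule (an assumption chosen to keep the example short), not K-means or any particular library's method.

```python
import math
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def greedy_cluster(embeddings, threshold):
    """Assign each vector to the first cluster whose seed vector is similar
    enough; otherwise start a new cluster seeded by that vector."""
    seeds = []   # one representative vector per cluster
    labels = []
    for vec in embeddings:
        for cluster_id, seed in enumerate(seeds):
            if cosine_similarity(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels

# Hypothetical support tickets with made-up embeddings (a real model would
# produce vectors with hundreds of dimensions).
docs = [
    "Refund not received",
    "Where is my refund?",
    "App crashes on login",
    "Login screen freezes",
    "How do I change my plan?",
]
embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.0, 0.8, 0.2],
    [0.1, 0.1, 0.9],
]

labels = greedy_cluster(embeddings, threshold=0.8)

# Collect the documents of each cluster; each group could then be passed
# to an LLM with a prompt asking for a short topic label.
groups = defaultdict(list)
for doc, label in zip(docs, labels):
    groups[label].append(doc)

for label, members in groups.items():
    print(label, members)
```

With real embeddings, the greedy rule would typically be swapped for K-means, DBSCAN, or hierarchical clustering, depending on whether the number of topics is known in advance.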
