Basics

Clustering

K-Means, DBSCAN, Cluster Analysis
An unsupervised learning task that groups similar data points together without predefined labels. Given customer purchase data, clustering might reveal distinct customer segments (bargain hunters, luxury buyers, occasional shoppers). K-means is the most common algorithm: pick K clusters, assign each point to the nearest cluster center, and iteratively refine the centers.

Why It Matters

Clustering is the most common unsupervised learning task and shows up everywhere: customer segmentation, document grouping, anomaly detection (outliers that belong to no cluster), image compression (grouping similar pixels), and data exploration (what natural groups exist in my data?). It is often the first step in understanding a new dataset.

Deep Dive

K-means works by: (1) randomly initializing K cluster centers, (2) assigning each data point to the nearest center, (3) moving each center to the mean of its assigned points, (4) repeating steps 2–3 until convergence. The main challenge: choosing K. The "elbow method" (plot loss vs. K and find the bend) and silhouette scores are common heuristics, but the right number of clusters often requires domain knowledge.
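A minimal NumPy sketch of steps (1)–(4) and the elbow heuristic; the toy data, K values, and convergence check below are illustrative assumptions, not part of this entry:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: illustrative sketch of steps (1)-(4) above."""
    rng = np.random.default_rng(seed)
    # (1) randomly pick K data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (2) assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) move each center to the mean of its assigned points
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # (4) stop once the centers no longer move (convergence)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: three 2D blobs
X = np.vstack([
    np.random.default_rng(1).normal(loc=c, scale=0.3, size=(50, 2))
    for c in [(0, 0), (3, 3), (0, 4)]
])
labels, centers = kmeans(X, k=3)

# Elbow method: print within-cluster sum of squares vs. K and look for the bend
for k in range(1, 7):
    lab, cen = kmeans(X, k)
    wcss = sum(((X[lab == j] - cen[j]) ** 2).sum() for j in range(k))
    print(k, round(wcss, 1))
```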

Beyond K-Means

DBSCAN discovers clusters of arbitrary shapes (K-means assumes spherical clusters) and automatically identifies outliers as noise points. Hierarchical clustering builds a tree of nested clusters that you can cut at any level. Gaussian Mixture Models (GMMs) model clusters as probability distributions, allowing soft assignments (a point can partially belong to multiple clusters). Each method has strengths for different data geometries and use cases.
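A short scikit-learn sketch contrasting the three methods on non-spherical toy data; the dataset, eps, and cluster counts are illustrative choices, not recommendations from this entry:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Two interleaving half-moons: a shape K-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# DBSCAN: density-based, finds arbitrary shapes; label -1 marks noise/outliers
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Hierarchical (agglomerative) clustering: "cut the tree" at 2 clusters
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# GMM: soft assignments -- each point gets a probability per cluster
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft = gmm.predict_proba(X)  # shape (300, 2), rows sum to 1

print("DBSCAN labels found:", set(db_labels))
print("First point's GMM membership probabilities:", soft[0].round(3))
```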

Clustering with Embeddings

Combining embeddings with clustering is powerful for text analysis. Embed a collection of documents using a sentence embedding model, then cluster the embeddings. Each cluster represents a semantic group — topics, themes, or categories that emerge from the data. This is used for: organizing support tickets by topic, discovering themes in survey responses, grouping similar products, and topic modeling (a modern alternative to LDA). The clusters can then be labeled by asking an LLM to summarize what each cluster is about.
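A sketch of the embed-then-cluster workflow, assuming the sentence-transformers package; the model name, example documents, and K below are hypothetical placeholders:

```python
from collections import defaultdict
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "How do I reset my password?",
    "Password reset link is not arriving",
    "My invoice shows the wrong amount",
    "I was charged twice this month",
    "The app crashes when I open settings",
    "Settings page freezes on startup",
]

# Embed each document into a dense vector that captures its meaning
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
embeddings = model.encode(docs)

# Cluster the embeddings; each cluster is a semantic group (topic/theme)
labels = KMeans(n_clusters=3, random_state=0, n_init="auto").fit_predict(embeddings)

clusters = defaultdict(list)
for doc, label in zip(docs, labels):
    clusters[label].append(doc)

for label, members in clusters.items():
    print(f"Cluster {label}: {members}")
    # Each cluster's members could now be passed to an LLM to generate a topic label
```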
