Basics

Clustering

K-Means, DBSCAN, Cluster Analysis
An unsupervised learning task that groups similar data points together without predefined labels. Given customer purchase data, clustering might reveal distinct customer segments (bargain hunters, luxury buyers, occasional shoppers). K-means is the most common algorithm: choose K clusters, assign each point to its nearest cluster center, and iteratively refine the centers.

Why It Matters

Clustering is the most common unsupervised learning task, and it shows up everywhere: customer segmentation, document grouping, anomaly detection (outliers that belong to no cluster), image compression (grouping similar pixels), and data exploration (what natural groups exist in my data?). It is often the first step in understanding a new dataset.

Deep Dive

K-means works by: (1) randomly initializing K cluster centers, (2) assigning each data point to the nearest center, (3) moving each center to the mean of its assigned points, (4) repeating steps 2–3 until convergence. The main challenge: choosing K. The "elbow method" (plot loss vs. K and find the bend) and silhouette scores are common heuristics, but the right number of clusters often requires domain knowledge.
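The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the `kmeans` function and its parameters are names chosen for this example, and real code would typically use `sklearn.cluster.KMeans` with k-means++ initialization.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: (1) init, (2) assign, (3) update, (4) repeat."""
    rng = np.random.default_rng(seed)
    # (1) pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (2) assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) move each center to the mean of its assigned points
        #     (keep a center in place if its cluster went empty)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # (4) stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated Gaussian blobs recover their true grouping
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
```

To apply the elbow method, you would run this for K = 1, 2, 3, … on the same data, record the within-cluster sum of squared distances for each K, and look for the K where the curve bends.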

Beyond K-Means

DBSCAN discovers clusters of arbitrary shapes (K-means assumes spherical clusters) and automatically identifies outliers as noise points. Hierarchical clustering builds a tree of nested clusters that you can cut at any level. Gaussian Mixture Models (GMMs) model clusters as probability distributions, allowing soft assignments (a point can partially belong to multiple clusters). Each method has strengths for different data geometries and use cases.
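To make the density-based idea concrete, here is a minimal DBSCAN sketch in NumPy (function and parameter names are chosen for this example; real code would use `sklearn.cluster.DBSCAN`). A point with at least `min_pts` neighbors within radius `eps` is a core point; clusters grow outward from core points, and anything unreachable from a core point is labeled noise (-1):

```python
from collections import deque

import numpy as np

def dbscan(X, eps=0.5, min_pts=4):
    """Minimal DBSCAN sketch: grow clusters from core points; -1 = noise."""
    n = len(X)
    labels = np.full(n, -1)            # -1 means noise (or not yet clustered)
    visited = np.zeros(n, dtype=bool)
    # full pairwise distance matrix (fine for small n)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = np.flatnonzero(dists[i] <= eps)
        if len(neighbors) < min_pts:
            continue                    # not a core point; stays noise for now
        # breadth-first expansion of a new cluster from this core point
        labels[i] = cluster
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster     # border or core point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            j_neighbors = np.flatnonzero(dists[j] <= eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also core: keep expanding
        cluster += 1
    return labels

# Two dense clumps plus one isolated point: two clusters, one noise point
base = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([base, base + 5, [[10.0, 10.0]]])
labels = dbscan(X, eps=0.5, min_pts=4)
```

Note that the number of clusters is never specified: it emerges from the density parameters `eps` and `min_pts`, which is exactly what distinguishes DBSCAN from K-means.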

Clustering with Embeddings

Combining embeddings with clustering is powerful for text analysis. Embed a collection of documents using a sentence embedding model, then cluster the embeddings. Each cluster represents a semantic group — topics, themes, or categories that emerge from the data. This is used for: organizing support tickets by topic, discovering themes in survey responses, grouping similar products, and topic modeling (a modern alternative to LDA). The clusters can then be labeled by asking an LLM to summarize what each cluster is about.
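A sketch of the embed-then-cluster pipeline described above. The `toy_embed` function below is a stand-in (L2-normalized bag-of-words counts) so the example is self-contained; in a real pipeline you would replace it with a sentence-embedding model (e.g. from the sentence-transformers library) and cluster with a library K-means. `cluster_texts` and the sample tickets are invented for this illustration, not a standard API:

```python
import numpy as np

def toy_embed(texts):
    """Stand-in 'embedding': L2-normalized bag-of-words count vectors.
    A real pipeline would call a sentence-embedding model here."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w in t.lower().split():
            vecs[row, index[w]] += 1
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def cluster_texts(texts, k, n_iters=50):
    """Embed each text, then run K-means on the embedding vectors."""
    X = toy_embed(texts)
    # greedy farthest-point initialization (a deterministic k-means++ stand-in)
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels

# Hypothetical support tickets: two about passwords, two about billing
tickets = [
    "password reset not working",
    "forgot my password help",
    "billing charged twice this month",
    "refund for duplicate billing charge",
]
labels = cluster_texts(tickets, k=2)
```

Each resulting cluster can then be handed to an LLM with a prompt like "summarize what these texts have in common" to produce a human-readable label for the group.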
