聚類相關矩陣

February 19, 2015

我有一個相關矩陣，它說明每個項目如何與另一個項目相關。因此，對於 N 個項目，我已經有一個 N*N 相關矩陣。使用這個相關矩陣，我如何對 M 個 bin 中的 N 個項目進行聚類，以便我可以說第 k 個 bin 中的 Nk 個項目的行為相同。請幫幫我。所有項目值都是分類的。

謝謝。如果您需要更多信息，請告訴我。我需要一個 Python 解決方案，但任何有助於推動我滿足要求的幫助都將是一個很大的幫助。

看起來像是塊建模的工作。谷歌的“塊建模”和前幾個點擊很有幫助。

假設我們有一個協方差矩陣，其中 N=100，實際上有 5 個簇：初始協方差矩陣

塊建模試圖做的是找到行的順序，以便集群變得明顯為“塊”：優化的協方差矩陣階

下面是一個代碼示例，它執行基本的貪心搜索來完成此操作。對於您的 250-300 個變量來說，這可能太慢了，但這是一個開始。看看你是否可以跟隨評論：

import numpy as np
from matplotlib import pyplot as plt

# This generates 100 variables that could possibly be assigned to 5 clusters
n_variables = 100
n_clusters = 5
n_samples = 1000

# To keep this example simple, each cluster will have a fixed size
cluster_size = n_variables // n_clusters

# Assign each variable to a cluster
belongs_to_cluster = np.repeat(range(n_clusters), cluster_size)
np.random.shuffle(belongs_to_cluster)

# This latent data is used to make variables that belong
# to the same cluster correlated.
latent = np.random.randn(n_clusters, n_samples)

variables = []
for i in range(n_variables):
   variables.append(
       np.random.randn(n_samples) + latent[belongs_to_cluster[i], :]
   )

variables = np.array(variables)

C = np.cov(variables)

def score(C):
   '''
Function to assign a score to an ordered covariance matrix.
High correlations within a cluster improve the score.
High correlations between clusters decease the score.
'''
   score = 0
   for cluster in range(n_clusters):
       inside_cluster = np.arange(cluster_size) + cluster * cluster_size
       outside_cluster = np.setdiff1d(range(n_variables), inside_cluster)

       # Belonging to the same cluster
       score += np.sum(C[inside_cluster, :][:, inside_cluster])

       # Belonging to different clusters
       score -= np.sum(C[inside_cluster, :][:, outside_cluster])
       score -= np.sum(C[outside_cluster, :][:, inside_cluster])

   return score


initial_C = C
initial_score = score(C)
initial_ordering = np.arange(n_variables)

plt.figure()
plt.imshow(C, interpolation='nearest')
plt.title('Initial C')
print 'Initial ordering:', initial_ordering
print 'Initial covariance matrix score:', initial_score

# Pretty dumb greedy optimization algorithm that continuously
# swaps rows to improve the score
def swap_rows(C, var1, var2):
   '''
Function to swap two rows in a covariance matrix,
updating the appropriate columns as well.
'''
   D = C.copy()
   D[var2, :] = C[var1, :]
   D[var1, :] = C[var2, :]

   E = D.copy()
   E[:, var2] = D[:, var1]
   E[:, var1] = D[:, var2]

   return E

current_C = C
current_ordering = initial_ordering
current_score = initial_score

max_iter = 1000
for i in range(max_iter):
   # Find the best row swap to make
   best_C = current_C
   best_ordering = current_ordering
   best_score = current_score
   for row1 in range(n_variables):
       for row2 in range(n_variables):
           if row1 == row2:
               continue
           option_ordering = best_ordering.copy()
           option_ordering[row1] = best_ordering[row2]
           option_ordering[row2] = best_ordering[row1]
           option_C = swap_rows(best_C, row1, row2)
           option_score = score(option_C)

           if option_score > best_score:
               best_C = option_C
               best_ordering = option_ordering
               best_score = option_score

   if best_score > current_score:
       # Perform the best row swap
       current_C = best_C
       current_ordering = best_ordering
       current_score = best_score
   else:
       # No row swap found that improves the solution, we're done
       break

# Output the result
plt.figure()
plt.imshow(current_C, interpolation='nearest')
plt.title('Best C')
print 'Best ordering:', current_ordering
print 'Best score:', current_score
print
print 'Cluster [variables assigned to this cluster]'
print '------------------------------------------------'
for cluster in range(n_clusters):
   print 'Cluster %02d %s' % (cluster + 1, current_ordering[cluster*cluster_size:(cluster+1)*cluster_size])

引用自：https://stats.stackexchange.com/questions/138325

comments powered by Disqus

聚類相關矩陣

相關問答

當特徵值很小時，獲得準確的特徵向量

確定兩個變量是線性還是非線性甚至不相關的最佳編程方法是什麼

在國際象棋數據上訓練神經網絡

為什麼 scikit-learn SVM 解決不了兩個同心圓？

為什麼R對卷積有不同的定義？

當 PCA 不產生降維時，這意味著什麼？