Topic prediction using latent Dirichlet allocation

April 7, 2011

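I ran LDA on a corpus of documents and found some topics. The output of my code is two matrices of probabilities: one of document-topic probabilities and the other of word-topic probabilities. But I don't actually know how to use these results to predict the topic of a new document. I am using Gibbs sampling. Does anyone know how to do this? Thanks.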

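I'd try "folding in". This means taking one new document, adding it to the corpus, and then running Gibbs sampling only on the words in that new document, keeping the topic assignments of the old documents fixed. This usually converges fast (maybe 5-10-20 iterations), and you don't need to sample your old corpus, so it also runs fast. At the end you will have a topic assignment for every word in the new document. This gives you the distribution of topics in that document.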

In your Gibbs sampler, you probably have something similar to the following code:
// This will initialize the matrices of counts, N_tw (topic-word matrix) and N_dt (document-topic matrix)
for doc = 1 to N_Documents
   for token = 1 to N_Tokens_In_Document
      Assign current token to a random topic, updating the count matrices
   end
end

// This will do the Gibbs sampling
for doc = 1 to N_Documents
   for token = 1 to N_Tokens_In_Document
      Compute probability of current token being assigned to each topic
      Sample a topic from this distribution
      Assign the token to the new topic, updating the count matrices
   end
end
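For anyone who wants something runnable, here is a minimal Python sketch of the two loops above. It is not the asker's code: it assumes documents are lists of integer word ids, symmetric priors `alpha` and `beta` (values chosen arbitrarily here), and the usual collapsed Gibbs update, in which a token's current assignment is removed from the counts before its new topic is sampled. The names `init_counts` and `gibbs_sweep` are made up for illustration, and the sweep is simply repeated a fixed number of times.

```python
import numpy as np

def init_counts(docs, n_topics, vocab_size, rng):
    """Assign every token to a random topic and build the count matrices."""
    n_tw = np.zeros((n_topics, vocab_size), dtype=np.int64)  # topic-word counts
    n_dt = np.zeros((len(docs), n_topics), dtype=np.int64)   # document-topic counts
    z = []                                                   # topic of each token
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        for w, t in zip(doc, z_d):
            n_tw[t, w] += 1
            n_dt[d, t] += 1
        z.append(z_d)
    return n_tw, n_dt, z

def gibbs_sweep(docs, z, n_tw, n_dt, alpha, beta, rng):
    """One pass of collapsed Gibbs sampling over every token in the corpus."""
    n_topics, vocab_size = n_tw.shape
    n_t = n_tw.sum(axis=1)  # total tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # Take the token out of the counts before resampling it.
            n_tw[t, w] -= 1
            n_dt[d, t] -= 1
            n_t[t] -= 1
            # Unnormalised probability of each topic for this token.
            p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + vocab_size * beta)
            t = rng.choice(n_topics, p=p / p.sum())
            # Put the token back under its newly sampled topic.
            n_tw[t, w] += 1
            n_dt[d, t] += 1
            n_t[t] += 1
            z[d][i] = t

# Toy corpus: three documents over a six-word vocabulary (hypothetical data).
rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1], [3, 4, 3, 5], [0, 2, 4, 5]]
n_tw, n_dt, z = init_counts(docs, n_topics=2, vocab_size=6, rng=rng)
for _ in range(50):
    gibbs_sweep(docs, z, n_tw, n_dt, alpha=0.1, beta=0.01, rng=rng)
```

The count matrices are updated in place, so after the loop `n_tw` and `n_dt` hold the counts implied by the final topic assignments in `z`.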
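Folding-in is the same, except that you start with the existing matrices, add the new document's tokens to them, and do the sampling only for the new tokens. I.e.: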
Start with the N_tw and N_dt matrices from the previous step

// This will update the count matrices for folding-in
for token = 1 to N_Tokens_In_New_Document
  Assign current token to a random topic, updating the count matrices
end

// This will do the folding-in by Gibbs sampling
for token = 1 to N_Tokens_In_New_Document
  Compute probability of current token being assigned to each topic
  Sample a topic from this distribution
  Assign the token to the new topic, updating the count matrices
end
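A rough Python version of the folding-in step might look like the sketch below. The assumptions are mine: `n_tw` is the trained topic-word count matrix, the new document is a list of integer word ids that are all in the training vocabulary, the priors are symmetric, and `fold_in` is a hypothetical name. The trained counts are copied so the model itself is not modified, and the new document's topic counts play the role of its row in N_dt.

```python
import numpy as np

def fold_in(new_doc, n_tw, alpha, beta, n_iters=20, seed=0):
    """Sample topics for the tokens of new_doc only; old assignments stay fixed."""
    rng = np.random.default_rng(seed)
    n_tw = n_tw.copy()                          # leave the trained counts untouched
    n_topics, vocab_size = n_tw.shape
    n_t = n_tw.sum(axis=1)                      # total tokens per topic
    n_new = np.zeros(n_topics, dtype=np.int64)  # topic counts of the new document

    # Add the new tokens to the counts with random initial topics.
    z = rng.integers(n_topics, size=len(new_doc))
    for w, t in zip(new_doc, z):
        n_tw[t, w] += 1
        n_t[t] += 1
        n_new[t] += 1

    # Gibbs sampling restricted to the new document's tokens.
    for _ in range(n_iters):
        for i, w in enumerate(new_doc):
            t = z[i]
            n_tw[t, w] -= 1
            n_t[t] -= 1
            n_new[t] -= 1
            p = (n_new + alpha) * (n_tw[:, w] + beta) / (n_t + vocab_size * beta)
            t = rng.choice(n_topics, p=p / p.sum())
            n_tw[t, w] += 1
            n_t[t] += 1
            n_new[t] += 1
            z[i] = t

    # Smoothed topic distribution of the new document.
    theta = (n_new + alpha) / (len(new_doc) + n_topics * alpha)
    return z, theta

# Hypothetical trained counts: topic 0 favours words 0-2, topic 1 words 3-5.
trained = np.array([[500, 400, 300, 0, 0, 0],
                    [0, 0, 0, 450, 350, 400]])
z, theta = fold_in([0, 1, 1, 2], trained, alpha=0.1, beta=0.01)
```

Here `z` holds one topic per token of the new document and `theta` is the resulting topic distribution for that document. Averaging `theta` over the last few iterations instead of using only the final state would give a less noisy estimate, at the cost of a little more bookkeeping.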
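If you're using standard LDA, it's unlikely that an entire document was generated by a single topic, so I'm not sure how useful it is to compute the probability of the document under one topic. But if you still want to do it, it's easy. From the two matrices you get, you can compute $p^i_w$, the probability of word $w$ in topic $i$. Take your new document; suppose the $j$'th word is $w_j$. The words are independent given the topic, so the probability is just

$$\prod_j p^i_{w_j}$$

(Note that you will probably need to compute it in log space.)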
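To illustrate the log-space point, here is a small sketch. It assumes you have a topic-word probability matrix, called `phi` here, with one row per topic and each row summing to 1; the name `log_prob_under_topic` and the toy numbers are made up for the example.

```python
import numpy as np

def log_prob_under_topic(doc, phi, topic):
    """Log-probability of a document (list of word ids) under a single topic."""
    # Summing logs replaces multiplying many small probabilities.
    return float(np.sum(np.log(phi[topic, doc])))

# Hypothetical topic-word probabilities: 2 topics, 4 words.
phi = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.1, 0.1, 0.7]])

short_doc = [0, 0, 1]
lp_short = log_prob_under_topic(short_doc, phi, topic=0)

# A long document: the direct product underflows to zero, the log sum does not.
long_doc = [1] * 2000
direct = np.prod(phi[0, long_doc])
lp_long = log_prob_under_topic(long_doc, phi, topic=0)
```

For the short document the two approaches agree, but for the long one the direct product is too small to represent as a float, while the log-probability stays finite and can still be compared across topics.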

Quoted from: https://stats.stackexchange.com/questions/9315
