Measures of separability

December 7, 2013

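I have a (binary) classification problem where test accuracy (measured on single data points) improves significantly after merging single training data points that can be traced to the same source into aggregates. By merging I mean adding and averaging feature vectors. I am using an SVM classifier with a linear kernel. (The training data is noisy, since it is semi-automatically generated.)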

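So there is the odd situation that making the training and test data more "different" (aggregates vs. single data points) improves performance. (Also, for a simple linear regression, I would guess that adding feature vectors together should make no difference.)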

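I am trying to find out what the reason might be. One hypothesis is that separability is critical for a large-margin classifier such as an SVM, and that aggregating reduces the overlap between classes caused by noise.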
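This hypothesis can be checked on synthetic data. The following is a minimal sketch, not the actual data from the question: it assumes Gaussian noise around two class centers, a group size `k` chosen for illustration, and two hypothetical helpers, `fisher.ratio` and `average.groups`. Averaging `k` independent noisy points shrinks the noise variance by roughly a factor of `k`, so a simple one-dimensional separation score should go up.

```r
# Synthetic illustration (assumed data): two noisy classes in one dimension
set.seed(1)
n <- 1000   # points per class
k <- 10     # points averaged into each aggregate (assumed group size)
class.a <- rnorm(n, mean = 0, sd = 2)
class.b <- rnorm(n, mean = 1, sd = 2)

# hypothetical helper: squared mean difference over the summed variances
fisher.ratio <- function(a, b) (mean(a) - mean(b))^2 / (var(a) + var(b))

# hypothetical helper: average consecutive groups of k points
average.groups <- function(v, k) colMeans(matrix(v, nrow = k))

single <- fisher.ratio(class.a, class.b)
merged <- fisher.ratio(average.groups(class.a, k), average.groups(class.b, k))
print(c(single = single, merged = merged))
```

If the real data behaves similarly, the score for the aggregates should be clearly larger than the score for the single points.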

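It would be good to have a measure that indicates the amount of "overlap" or "separability". What are the standard measures here? I plan to plot the first two PCA dimensions to see what is going on, but some "credible number" in addition would be good.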
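A plot of the first two PCA dimensions could be sketched as follows. The data here is a synthetic stand-in (two classes, five features), and `features` and `labels` are placeholder names rather than anything from the real problem; `prcomp` is the base R PCA routine.

```r
# Synthetic stand-in data (assumed): 2 classes, 100 points each, 5 features
set.seed(1)
features <- rbind(matrix(rnorm(100 * 5, mean = 0), ncol = 5),
                  matrix(rnorm(100 * 5, mean = 1), ncol = 5))
labels <- factor(rep(c("a", "b"), each = 100))

# PCA on centered and scaled features; keep the first two component scores
pca <- prcomp(features, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# scatter plot of PC1 against PC2, one color per class
plot(scores, col = as.integer(labels), pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(labels), col = 1:2, pch = 19)
```

Drawing the single points and the aggregates in two such plots side by side would show whether the class clouds overlap less after merging.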

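The most common separability measures are based on how much the class distributions overlap (probabilistic measures). There are several of these: the Jeffries-Matusita distance, the Bhattacharyya distance and the transformed divergence. Descriptions are easy to find with a web search, and the measures are quite straightforward to implement.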
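For reference, these measures have closed forms when each class is assumed to be Gaussian. The notation is introduced here for illustration: class $i$ has mean $\mu_i$ and covariance $\Sigma_i$, and $\Sigma$ is the average of the two covariances. Some references define the Jeffries-Matusita distance as the square root of the expression shown, so the scaling should be checked against the source being followed.

```latex
\Sigma = \tfrac{1}{2}\,(\Sigma_1 + \Sigma_2)

% Bhattacharyya distance
B = \tfrac{1}{8}\,(\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2)
    + \tfrac{1}{2}\,\ln\frac{\det \Sigma}{\sqrt{\det \Sigma_1 \,\det \Sigma_2}}

% Jeffries-Matusita distance, bounded between 0 and 2
JM = 2\,\bigl(1 - e^{-B}\bigr)

% divergence
D = \tfrac{1}{2}\,\operatorname{tr}\!\bigl[(\Sigma_1 - \Sigma_2)(\Sigma_2^{-1} - \Sigma_1^{-1})\bigr]
    + \tfrac{1}{2}\,\operatorname{tr}\!\bigl[(\Sigma_1^{-1} + \Sigma_2^{-1})(\mu_1 - \mu_2)(\mu_1 - \mu_2)^{\top}\bigr]

% transformed divergence, bounded between 0 and 2
TD = 2\,\bigl(1 - e^{-D/8}\bigr)
```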

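There are also measures based on the behavior of nearest neighbors. The separability index basically looks at the proportion of neighbors that overlap, that is, how often a point's nearest neighbor belongs to a different class. The hypothesis margin looks at the distance from an object to its nearest neighbor of the same class (the near-hit) and to its nearest neighbor of the opposing class (the near-miss), and then builds a measure by summing over all objects.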
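Both nearest-neighbor measures can be sketched in a few lines of R. The function names `separability.index` and `hypothesis.margin` are hypothetical, Euclidean distance is an assumption, and the per-point margin is taken as half the difference between the near-miss and near-hit distances, averaged rather than summed so that the value does not grow with the sample size.

```r
# hypothetical helper: fraction of points whose nearest neighbor shares their label
separability.index <- function(X, labels) {
  D <- as.matrix(dist(X))   # pairwise Euclidean distances
  diag(D) <- Inf            # a point is not its own neighbor
  nearest <- apply(D, 1, which.min)
  mean(labels[nearest] == labels)
}

# hypothetical helper: mean of 0.5 * (near-miss distance - near-hit distance)
hypothesis.margin <- function(X, labels) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf            # excludes the point itself from its near-hit
  margins <- sapply(seq_len(nrow(D)), function(i) {
    same <- labels == labels[i]
    near.hit  <- min(D[i, same])    # closest point of the same class
    near.miss <- min(D[i, !same])   # closest point of the other class
    0.5 * (near.miss - near.hit)
  })
  mean(margins)
}

# toy data (assumed): two well-separated one-dimensional clusters
X <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
labels <- c("a", "a", "a", "b", "b", "b")
separability.index(X, labels)
hypothesis.margin(X, labels)
```

A separability index near 1 and a clearly positive margin indicate little overlap; an index near 0.5 for two balanced classes suggests the classes are heavily mixed.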

And then you also have things like class scatter matrices and collective entropy.
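As one example of a scatter-matrix criterion, the trace of the inverse within-class scatter times the between-class scatter can be computed as below. This is a sketch under assumptions: `scatter.criterion` is a hypothetical name, and this particular criterion is only one of several that are built from scatter matrices.

```r
# hypothetical helper: trace(S_W^{-1} S_B) from within- and between-class scatter
scatter.criterion <- function(X, labels) {
  X <- as.matrix(X)
  overall.mean <- colMeans(X)
  d <- ncol(X)
  S.w <- matrix(0, d, d)   # within-class scatter
  S.b <- matrix(0, d, d)   # between-class scatter
  for (cl in unique(labels)) {
    X.c <- X[labels == cl, , drop = FALSE]
    mean.c <- colMeans(X.c)
    centered <- sweep(X.c, 2, mean.c)                 # subtract the class mean
    S.w <- S.w + crossprod(centered)                  # sum of (x - m_c)(x - m_c)^T
    mean.shift <- mean.c - overall.mean
    S.b <- S.b + nrow(X.c) * tcrossprod(mean.shift)   # n_c (m_c - m)(m_c - m)^T
  }
  sum(diag(solve(S.w) %*% S.b))
}

# toy data (assumed): two well-separated one-dimensional clusters
X <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
labels <- c("a", "a", "a", "b", "b", "b")
scatter.criterion(X, labels)
```

Larger values mean the class means are far apart relative to the spread inside each class.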

Edit

Probabilistic separability measures in R

separability.measures <- function(Vector.1, Vector.2) {
  # convert inputs to matrices (rows = observations, columns = variables)
  Matrix.1 <- as.matrix(Vector.1)
  Matrix.2 <- as.matrix(Vector.2)
  # per-variable means of each class
  mean.Matrix.1 <- colMeans(Matrix.1)
  mean.Matrix.2 <- colMeans(Matrix.2)
  # difference of means
  mean.difference <- mean.Matrix.1 - mean.Matrix.2
  # covariances of the supplied matrices
  cv.Matrix.1 <- cov(Matrix.1)
  cv.Matrix.2 <- cov(Matrix.2)
  # matrix inverses: solve() inverts, whereas ^(-1) would act elementwise
  inv.cv.1 <- solve(cv.Matrix.1)
  inv.cv.2 <- solve(cv.Matrix.2)
  # half-sum of the covariances, "p"
  p <- (cv.Matrix.1 + cv.Matrix.2) / 2
  # --%<------------------------------------------------------------------------
  # Bhattacharyya distance (%*% is the matrix product, * would be elementwise)
  bh.distance <- as.numeric(
    0.125 * t(mean.difference) %*% solve(p) %*% mean.difference +
      0.5 * log(det(p) / sqrt(det(cv.Matrix.1) * det(cv.Matrix.2)))
  )
  # --%<------------------------------------------------------------------------
  # Jeffries-Matusita distance; this formula is bounded between 0 and 2.0
  jm.distance <- 2 * (1 - exp(-bh.distance))
  # also found in the literature:
  # jm.distance <- 1000 * sqrt(2 * (1 - exp(-bh.distance)))
  # that variant is bounded between 0 and about 1414.0
  # --%<------------------------------------------------------------------------
  # divergence
  # trace (the sum of the diagonal elements) of a square matrix
  trace.of.matrix <- function(SquareMatrix) sum(diag(SquareMatrix))
  # term 1
  divergence.term.1 <- 1/2 * trace.of.matrix(
    (cv.Matrix.1 - cv.Matrix.2) %*% (inv.cv.2 - inv.cv.1)
  )
  # term 2
  divergence.term.2 <- 1/2 * trace.of.matrix(
    (inv.cv.1 + inv.cv.2) %*% mean.difference %*% t(mean.difference)
  )
  divergence <- divergence.term.1 + divergence.term.2
  # --%<------------------------------------------------------------------------
  # transformed divergence, bounded between 0 and 2.0
  transformed.divergence <- 2 * (1 - exp(-(divergence / 8)))
  indices <- data.frame(
    jm = jm.distance, bh = bh.distance,
    div = divergence, tdiv = transformed.divergence
  )
  return(indices)
}

And some silly reproducible examples:

##### EXAMPLE 1
# two samples
sample.1 <- c (1362, 1411, 1457, 1735, 1621, 1621, 1791, 1863, 1863, 1838)
sample.2 <- c (1362, 1411, 1457, 10030, 1621, 1621, 1791, 1863, 1863, 1838)

# separability between these two samples
separability.measures ( sample.1 , sample.2 )

##### EXAMPLE 2
# parameters for a normal distribution
meen <- 0.2
sdevn <- 2
set.seed(42) # fixed seed so the example is reproducible
n <- 5000
# two samples from two normal distributions
# (rnorm draws samples; dnorm would return density values, not samples)
normal1 <- rnorm(n, mean = 0, sd = 1) # standard normal
normal2 <- rnorm(n, mean = meen, sd = sdevn) # normal with the parameters selected above

# separability between these two normal distributions
separability.measures ( normal1 , normal2 )

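Note that these measures, as used here, compare only two classes on one variable at a time, and they sometimes come with assumptions (such as the classes following a normal distribution), so you should read up on them before relying on them. They might still suit your needs.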

Source: https://stats.stackexchange.com/questions/78849
