How does a random kitchen sink work?
Last year, at NIPS 2017, Ali Rahimi and Ben Recht won the Test of Time Award for their paper “Random Features for Large-Scale Kernel Machines”, in which they introduced random features, later codified as the random kitchen sinks algorithm. As part of publicising their paper, they showed that their model could be implemented in 5 lines of MATLAB.
% Approximates Gaussian Process regression
% with Gaussian kernel of variance gamma^2
% lambda: regularization parameter
% dataset: X is dxN, y is 1xN
% test: xtest is dx1
% D: dimensionality of random feature

% training
w = randn(D, d);
b = 2 * pi * rand(D, 1);
Z = cos(gamma * w * X + b * ones(1, N));
alpha = (lambda * eye(D) + Z * Z') \ (Z * y');

% testing
ztest = alpha' * cos(gamma * w * xtest + b);
How the above algorithm learns anything is unclear to me. How does a random kitchen sink work? How does it approximate Gaussian processes and support vector machines?
Edit
Rewatching Rahimi’s talk, I realised the term random kitchen sinks is not introduced in the paper for which they won the award, but rather at the end of a trilogy of papers beginning with “Random Features for Large-Scale Kernel Machines”. The other papers are:

- “Uniform Approximation of Functions with Random Bases” (Rahimi & Recht, 2008)
- “Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning” (Rahimi & Recht, NIPS 2008)

I think the code snippet introduced above is a specialisation of Algorithm 1 in the last paper.
Random kitchen sinks (or random Fourier features) and related methods don’t perform inference themselves; rather, they aim to reduce the computational bottleneck of kernel-based inference methods.
Kernel methods are great in many settings, but they usually rely on the manipulation of large matrices, for example solving linear systems of equations and finding matrix determinants. If the matrix is $N \times N$ then naively these computations generally cost $\mathcal{O}(N^3)$, which limits the methods to problems with only a few thousand observations. The most popular way around this bottleneck tends to be low-rank methods (although other approaches exist, such as Kronecker-based methods, H-matrices and Bayesian committee machines, to name but a few).
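To put rough numbers on the saving (a back-of-the-envelope count for kernel ridge regression / GP regression with $N$ observations and $D$ random features, $D \ll N$):

$$\text{exact:}\;\; \mathcal{O}(N^3)\ \text{time},\ \mathcal{O}(N^2)\ \text{memory} \qquad \text{vs.} \qquad \text{random features:}\;\; \mathcal{O}(N D^2 + D^3)\ \text{time},\ \mathcal{O}(N D)\ \text{memory},$$

since the random-feature solve only involves the $D \times D$ matrix $\lambda I + Z Z^\top$ from the snippet rather than the $N \times N$ Gram matrix.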
Random Fourier features (Rahimi & Recht 2007) create low-rank approximations of shift-invariant kernels by sampling only a random subset of the kernel’s Fourier components. Since each of these components is itself shift invariant, that property is preserved, but now an explicit finite-dimensional reproducing kernel Hilbert space is formed, spanned by the sampled Fourier components. The once infinite-dimensional RKHS is approximated by this degenerate kernel.
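Roughly, the construction works as follows (this is the standard random Fourier feature identity, with $p(\omega)$ the kernel’s normalised spectral density, $\omega_j \sim p$ and $b_j \sim U[0, 2\pi]$):

$$k(x - y) \;=\; \mathbb{E}_{\omega \sim p}\!\left[\cos\!\big(\omega^\top (x - y)\big)\right] \;=\; \mathbb{E}_{\omega, b}\!\left[2 \cos(\omega^\top x + b) \cos(\omega^\top y + b)\right] \;\approx\; \frac{2}{D} \sum_{j=1}^{D} \cos(\omega_j^\top x + b_j) \cos(\omega_j^\top y + b_j) \;=\; z(x)^\top z(y),$$

with $z(x) = \sqrt{2/D}\,\big(\cos(\omega_1^\top x + b_1), \dots, \cos(\omega_D^\top x + b_D)\big)^\top$. (The 5-line snippet drops the $\sqrt{2/D}$ scaling, which simply gets absorbed into $\lambda$ and $\alpha$.)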
Notes on the code snippet: there are a few details brushed over in the 5 lines. The most important is that the Fourier transform of a Gaussian is again a Gaussian, just with the variance inverted, which is why the frequencies are drawn with randn and then scaled by gamma. The vector alpha is then only a sub-procedure used to compute ztest. Essentially the normal kernel prediction looks like

$$f(x_*) \;=\; k_*^\top (K + \lambda I)^{-1} y \;\approx\; z(x_*)^\top (\lambda I + Z Z^\top)^{-1} Z y \;=\; z(x_*)^\top \alpha$$

(using $K \approx Z^\top Z$, $k_* \approx Z^\top z(x_*)$ and the identity $Z (Z^\top Z + \lambda I)^{-1} = (\lambda I + Z Z^\top)^{-1} Z$), where $z(x_*)$ is the evaluated random Fourier feature vector.
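To make the connection concrete, here is a minimal sketch in the same MATLAB/Octave style as the snippet above, comparing the exact Gaussian-kernel ridge prediction with the random-feature one. The toy data and the names sigma, W, f_exact and f_rff are mine, and I parameterise the kernel by a lengthscale sigma (so the frequencies are randn / sigma, whereas the snippet multiplies randn by gamma):

% Toy problem: d inputs, N observations, D random features (illustrative values)
d = 2; N = 500; D = 2000; sigma = 1; lambda = 1e-2;
X = randn(d, N);                          % training inputs, d x N
y = sin(X(1, :) .* X(2, :))';             % training targets, N x 1
xtest = randn(d, 1);                      % a single test input

% Exact kernel ridge regression (GP posterior mean): O(N^3)
sq = sum(X.^2, 1);                        % squared norms, 1 x N
K = exp(-(repmat(sq', 1, N) + repmat(sq, N, 1) - 2 * (X' * X)) / (2 * sigma^2));
ktest = exp(-sum((X - repmat(xtest, 1, N)).^2, 1)' / (2 * sigma^2));
f_exact = ktest' * ((K + lambda * eye(N)) \ y);

% Random Fourier features: frequencies from the Gaussian kernel's spectral density
W = randn(D, d) / sigma;                  % omega_j ~ N(0, sigma^-2 I)
b = 2 * pi * rand(D, 1);                  % b_j ~ U[0, 2*pi]
Z = sqrt(2 / D) * cos(W * X + b * ones(1, N));   % D x N, so Z' * Z ~ K
alpha = (lambda * eye(D) + Z * Z') \ (Z * y);    % D x D solve instead of N x N
f_rff = alpha' * (sqrt(2 / D) * cos(W * xtest + b));

disp([f_exact, f_rff])                    % the two predictions should be close for large D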
Side comment: should you use it? The answer isn’t a clear yes; it depends entirely on what you are modelling. The Fourier-space construction relies on shift invariance, so it is not appropriate for non-stationary (non-shift-invariant) kernels. The authors never claimed it would work in that setting, but if you are just starting out in the area the nuance isn’t always obvious.