Efficient Vertical Federated Learning Method for Ridge Regression of Large-Scale Samples via Least-Squares Solution

IEEE Transactions on Emerging Topics in Computing 2022 · Jianping Cai, Ximeng Liu, Zhiyong Yu, Kun Guo, Jiayin Li

Integrating data from multiple parties to achieve cross-institutional machine learning is an important trend in the Industry 4.0 era. However, the privacy risks of sharing data pose a significant challenge to data integration. To integrate data without sharing it and to meet the modeling needs of large-scale samples, we propose two vertical federated learning algorithms for ridge regression via the least-squares solution, for two-party and multi-party scenarios, respectively. Compared with state-of-the-art algorithms, our algorithms need only one round of calculation for the optimization instead of iteration. Furthermore, our algorithms can effectively handle large-scale samples because the number of cryptographic operations they perform is independent of the number of samples. Through our proposed matrix secure agent computing theory and $\delta$-data indistinguishability theory, we provide quantitative theoretical guarantees for the security of our algorithms. Our algorithms satisfy complete data indistinguishability under the “semi-honest” assumption and quantitative security under the “malicious” assumption. The experiments show that our proposed algorithm takes only about 400 seconds to handle up to 9.6 million samples, while state-of-the-art algorithms take close to 1000 seconds per 1000 samples, which demonstrates the advantage of our algorithms in handling large-scale samples.
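
To make the "one round instead of iteration" idea concrete, the following is a minimal plaintext sketch of the standard closed-form ridge solution $w = (X^\top X + \lambda I)^{-1} X^\top y$ when the feature columns are split between two parties. It is not the authors' secure protocol: it contains no cryptography, and the function names, party split, and toy data are all hypothetical. It only illustrates that the quantities entering the final solve are Gram blocks whose size depends on the number of features, not on the number of samples, which is one plausible reading of why the cost of the protected part can stay independent of the sample count.

```python
# Illustrative sketch only: plaintext arithmetic, no cryptography.
# All names (gram_block, solve, ridge_vertical) and the toy data are
# hypothetical and do not come from the paper.

def gram_block(xa, xb):
    """Return xa^T xb for two row-major matrices with the same number of rows."""
    cols_a, cols_b = len(xa[0]), len(xb[0])
    return [[sum(ra[i] * rb[j] for ra, rb in zip(xa, xb)) for j in range(cols_b)]
            for i in range(cols_a)]

def solve(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def ridge_vertical(xa, xb, y, lam):
    """Closed-form ridge weights for features split column-wise into xa and xb."""
    ycol = [[v] for v in y]
    aa, ab, bb = gram_block(xa, xa), gram_block(xa, xb), gram_block(xb, xb)
    ba = [list(r) for r in zip(*ab)]  # transpose of the cross block
    # Assemble the (da + db) x (da + db) Gram matrix; its size ignores n.
    gram = [ra + rb for ra, rb in zip(aa, ab)] + [ra + rb for ra, rb in zip(ba, bb)]
    for i in range(len(gram)):
        gram[i][i] += lam  # ridge penalty on the diagonal
    rhs = ([r[0] for r in gram_block(xa, ycol)]
           + [r[0] for r in gram_block(xb, ycol)])
    return solve(gram, rhs)  # a single linear solve, no iterations

# Toy data: party A holds two features, party B holds one.
XA = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
XB = [[1.0], [2.0], [0.0], [1.0]]
# Labels generated as y = 2*a0 - 1*a1 + 0.5*b0 (an assumption for the demo).
Y = [2.5, 0.0, 1.0, 3.5]

if __name__ == "__main__":
    print(ridge_vertical(XA, XB, Y, 0.0))
    print(ridge_vertical(XA, XB, Y, 1.0))
```

In this sketch only the cross block `ab` and the right-hand side mix data from both parties; in a real vertical federated setting those are the terms that would have to be computed under some protection mechanism, which this example does not attempt.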
