Privacy-Preserving Machine Learning Models for Sensitive Customer Data in Insurance Systems
Abstract
The insurance industry is exploring machine learning (ML) models that leverage large volumes of customer data for real-time business decisions. This data, however, is extremely sensitive. From a design perspective, the utility of data attributes must be carefully balanced against privacy guarantees, particularly when sensitive customer data is involved. Privacy risks can be mitigated by using techniques that reduce and control the amount of sensitive information exposed during the training and use of ML models. A wide spectrum of privacy-preserving machine learning solutions has been developed. These are grounded in a comprehensive view of data protection impact assessments (DPIAs) under privacy laws and regulations, consolidating the specific requirements for both personally identifiable information (PII) and protected health information (PHI). For sufficiently large datasets, fair ML solutions that satisfy both differential privacy (DP) and DPIA requirements can be obtained without compromising model performance. Notably, certain ML tasks, such as risk scoring and underwriting, can be performed on data very close to its source while preserving DP compliance for protected attributes. Risk scoring and underwriting are performed under the control of a single institution, while fraud detection and claims management apply an anomaly-detection-based architecture. For sensitive attributes such as health data, disparities in training-data volume can be addressed by transferring knowledge through privacy-preserving federated learning. Sensitive attributes with low entropy are avoided at prediction time to mitigate the associated disclosure risk. For such features, privacy- and risk-evaluation techniques such as k-anonymity and ℓ-diversity are embedded into the data-governance step, ensuring that the data support controlled, risk-aware disclosures when exposed to third parties.
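As an illustrative sketch (not part of the article's own method), the differential-privacy guarantee mentioned in the abstract can be demonstrated with the classic Laplace mechanism on a count query. The dataset, predicate, and epsilon value below are hypothetical, chosen only to show how calibrated noise masks any individual's contribution.

```python
import math
import random

def dp_count(values, predicate, epsilon):
    """Return an epsilon-differentially-private count of records
    matching `predicate`. A count query has sensitivity 1, so the
    Laplace noise scale is b = 1 / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    scale = 1.0 / epsilon
    # Inverse-CDF sample from Laplace(0, scale).
    u = random.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Hypothetical example: count policyholders over 60 without revealing
# whether any specific individual is in that group.
ages = [34, 67, 45, 71, 29, 63, 58, 80]
noisy = dp_count(ages, lambda a: a > 60, epsilon=1.0)
```

Smaller epsilon values give stronger privacy at the cost of noisier answers; in an insurance setting the released statistic (e.g. a risk-segment count) stays useful in aggregate while individual membership is protected.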
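The federated-learning approach referenced for health data can be sketched, under simplifying assumptions, as size-weighted federated averaging (FedAvg-style): each institution trains locally and shares only model parameters, never raw records. The weights and record counts below are hypothetical.

```python
def fed_avg(client_weights, client_sizes):
    """Size-weighted average of per-client parameter vectors.
    Raw training data never leaves each institution; only the
    parameter vectors are exchanged."""
    total = sum(client_sizes)
    dims = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dims)
    ]

# Hypothetical: two insurers with very different data volumes, so the
# larger client contributes proportionally more to the global model.
w_a = [0.2, 0.4]   # trained on 1,000 records
w_b = [0.6, 0.0]   # trained on 3,000 records
w_global = fed_avg([w_a, w_b], [1000, 3000])  # approx. [0.5, 0.1]
```

This weighting is how the training-data volume disparity mentioned in the abstract is handled: institutions with little sensitive health data still benefit from knowledge learned elsewhere without ever receiving the other party's records.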
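The k-anonymity and ℓ-diversity checks embedded in the data-governance step can be computed directly from a tabular release candidate. The minimal sketch below uses a hypothetical claims table in which ZIP prefix and age band act as quasi-identifiers and `diagnosis` is the sensitive attribute.

```python
from collections import defaultdict

def privacy_metrics(records, quasi_ids, sensitive):
    """Return (k, ell): the minimum equivalence-class size over the
    quasi-identifiers (k-anonymity) and the minimum number of distinct
    sensitive values within any class (ℓ-diversity)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_ids)].append(r[sensitive])
    k = min(len(v) for v in groups.values())
    ell = min(len(set(v)) for v in groups.values())
    return k, ell

# Hypothetical release candidate for a third party.
records = [
    {"zip3": "021", "age_band": "60-69", "diagnosis": "A"},
    {"zip3": "021", "age_band": "60-69", "diagnosis": "B"},
    {"zip3": "450", "age_band": "30-39", "diagnosis": "A"},
    {"zip3": "450", "age_band": "30-39", "diagnosis": "A"},
]
k, ell = privacy_metrics(records, ["zip3", "age_band"], "diagnosis")
# k == 2, but ell == 1: the second class has only one diagnosis value,
# so an attacker who locates someone in it learns their diagnosis.
```

This illustrates why the abstract pairs the two metrics: a table can be k-anonymous yet still leak low-entropy sensitive attributes, which is exactly the disclosure risk ℓ-diversity is meant to catch before data reaches third parties.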