Published on Wed Jan 30 2019

Classifier Suites for Insider Threat Detection

David Noever


Abstract

Better methods to detect insider threats need new anticipatory analytics that capture risky behavior prior to data loss. In search of the best overall classifier, this work empirically scores 88 machine learning algorithms across 16 major families. We extract risk features from the large CERT dataset, which blends real network behavior with individual threat narratives. We discover the predictive importance of measuring employee sentiment. Among the major classifier families tested on CERT, the random forest algorithms offer the best choice, with different implementations scoring over 98% accuracy. In contrast to more obscure or black-box alternatives, random forests are ensembles of many decision trees and thus offer a deep but human-readable set of detection rules (>2000 rules). We rank performance by penalizing long execution times against higher median accuracies under cross-fold validation. We treat the relative rarity of threats as a case of low signal-to-noise (<0.02% malicious to benign activities), and then train on both under-sampled and over-sampled data that is statistically balanced to identify nefarious actors.
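
As a concrete illustration of the evaluation the abstract describes, the sketch below rebalances a rare positive class by random under- or over-sampling, scores two stand-in classifier families with stratified cross-fold validation, and penalizes long fit times against median accuracy. The synthetic features, the 0.2% positive rate (milder than the paper's <0.02% so the toy stays tractable), and the specific time-penalty weighting are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch of the ranking loop: rebalance, cross-validate, penalize runtime.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic stand-in for extracted CERT risk features (rare positive class).
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.998, 0.002], random_state=0)

def rebalance(X_tr, y_tr, strategy="over"):
    """Randomly over- or under-sample the training split to a 50/50 balance."""
    pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
    if strategy == "over":
        pos = resample(pos, n_samples=len(neg), replace=True, random_state=0)
    else:
        neg = resample(neg, n_samples=len(pos), replace=False, random_state=0)
    X_bal = np.vstack([pos, neg])
    y_bal = np.array([1] * len(pos) + [0] * len(neg))
    return X_bal, y_bal

# Two stand-ins for the 16 classifier families compared in the paper.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    accs, fit_times = [], []
    for train_idx, test_idx in cv.split(X, y):
        X_bal, y_bal = rebalance(X[train_idx], y[train_idx], strategy="over")
        start = time.time()
        model.fit(X_bal, y_bal)
        fit_times.append(time.time() - start)
        accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    # Illustrative ranking score: median accuracy minus a small runtime penalty.
    score = np.median(accs) - 0.01 * np.log1p(np.median(fit_times))
    print(f"{name}: median accuracy={np.median(accs):.4f}, "
          f"median fit time={np.median(fit_times):.2f}s, penalized score={score:.4f}")

# Each root-to-leaf path in the fitted forest is one human-readable rule,
# so the rule count is the total number of leaves across the trees.
rf = models["random_forest"]
print("random_forest rules (total leaves):",
      sum(tree.get_n_leaves() for tree in rf.estimators_))
```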

Sat May 18 2019
Machine Learning
Disentangled Attribution Curves for Interpreting Random Forests and Boosted Trees
Disentangled Attribution Curves (DAC) is a method for interpreting tree ensembles. For a given variable, or group of variables, DAC plots the importance of the variable(s) as their values change. DAC is shown to outperform competing methods in the…
Fri Jan 13 2017
Machine Learning
What Can I Do Now? Guiding Users in a World of Automated Decisions
More and more of the processes governing our lives rely, at least in part, on an automated decision step. Here we present a simple idea which gives some of the power back to the applicant. It is based on a formalization reminiscent of methods used for evasion attacks.
Mon Jul 28 2014
Machine Learning
Understanding Random Forests: From Theory to Practice
Data analysis and machine learning have become an integral part of the modern scientific methodology. Yet caution is needed to avoid using machine learning as a black-box tool and to instead treat it as a methodology. In particular, the use of these algorithms should ideally rest on a reasonable understanding of their mechanisms and limitations.
Fri Oct 25 2019
NLP
Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity
The detection of online cyberbullying has seen an increase in societal importance, popularity in research, and available open data. However, access restrictions on high-quality data limit the applicability of state-of-the-art techniques.
Tue Feb 09 2021
Machine Learning
Classifier Calibration: with implications to threat scores in cybersecurity
Calibration is important for two reasons: first, it provides a meaningful score, that is, the posterior probability; second, it puts the scores of different classifiers on the same scale for comparable interpretation. This paper explores the calibration of a classifier output score in binary classification problems.
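
A minimal sketch of what such calibration looks like in practice, assuming scikit-learn's CalibratedClassifierCV with Platt (sigmoid) scaling wrapped around a random-forest threat classifier; the dataset and parameters are illustrative, not taken from the cited paper.

```python
# Hedged sketch: Platt (sigmoid) calibration of a random-forest threat score
# so that predict_proba approximates a posterior probability.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

raw = RandomForestClassifier(n_estimators=100, random_state=0)
calibrated = CalibratedClassifierCV(raw, method="sigmoid", cv=5)  # Platt scaling
calibrated.fit(X_train, y_train)

# Calibrated scores can be read as posterior threat probabilities and
# compared across different classifier families on the same scale.
print(calibrated.predict_proba(X_test[:5])[:, 1])
```
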
Sat Mar 10 2018
Machine Learning
Influence of the Event Rate on Discrimination Abilities of Bankruptcy Prediction Models
In bankruptcy prediction the proportion of events is very low, and the data are often oversampled to counter this imbalance. In this paper, we study the influence of the event rate on the discrimination abilities of bankruptcy prediction models. Results show that the Bayesian network is the least sensitive to the event rate.