Batea uses machine learning to find valuable device information

Batea is an open source tool for network-scanning penetration testers

When penetration testers scan a new network, they often analyze Nmap reports to find devices that represent potential security threats and need further investigation.

But enterprise networks can comprise thousands of devices, which makes threat discovery akin to trawling riverbeds for gold nuggets.

Recognizing this analogy with the precious metals industry, researchers at Delve Labs have developed Batea, an open source tool that leverages machine learning to find valuable information in network device data.

‘Heavier, shinier targets’

Batea gets its name from the tool used by gold prospectors to separate gold nuggets from dust and mud.

“It’s easy to make the parallel between gold mining and penetration testing, or even malicious network intrusion,” Serge Olivier Paquette, research lead at Delve Labs, told The Daily Swig.

“When trying to infiltrate a network, one has to separate muddy, uninteresting devices to focus attention on the heavier and shiny targets early on in the process.

“When security experts manage to send large scale port scans or vulnerability assessments in complex, enterprise networks, they end up sifting through huge amounts of information using mainly their experience and intuition.”

Experienced pen testers who have reviewed large volumes of data over their careers quickly spot red flags, such as a Linux server among a range of Windows workstations; a machine with multiple HTTP servers on seemingly arbitrary ports; an unusual hostname scheme; or a list of exposed services that indicates a machine’s administrative purpose.

Novices have a harder time finding relevant information, and the task gets even more difficult as the number of devices in the network grows.

“This is where we saw an opportunity to use machine learning, to replace or augment learned intuition,” Paquette said.

“What if we could automate this process of identifying what should stand out in such an assessment? What if we could automate the filtering of gold nuggets from dust?”

How does Batea work?

Batea takes an XML version of an Nmap report and applies a series of transformations to create a matrix of numerical features, with one row per device. These features include the number of open ports, the complexity of the hostname, and the octets of the IP address.
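To make the feature-matrix idea concrete, here is a minimal Python sketch that turns a small Nmap XML report into one numerical row per device. It is not Batea's own code: the sample report, the function names, the use of character entropy as a stand-in for hostname complexity, and the choice of the last IP octet are all assumptions made for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-written report in Nmap's XML layout (illustrative data only).
SAMPLE_NMAP_XML = """<nmaprun>
  <host>
    <address addr="10.0.0.5" addrtype="ipv4"/>
    <hostnames><hostname name="ws-0005" type="PTR"/></hostnames>
    <ports>
      <port protocol="tcp" portid="135"><state state="open"/></port>
      <port protocol="tcp" portid="445"><state state="open"/></port>
      <port protocol="tcp" portid="3389"><state state="closed"/></port>
    </ports>
  </host>
  <host>
    <address addr="10.0.0.200" addrtype="ipv4"/>
    <hostnames/>
    <ports>
      <port protocol="tcp" portid="22"><state state="open"/></port>
    </ports>
  </host>
</nmaprun>"""


def hostname_entropy(name):
    """Shannon entropy in bits per character; 0.0 for an empty name.

    Used here as an assumed proxy for 'hostname complexity'.
    """
    if not name:
        return 0.0
    total = len(name)
    counts = Counter(name)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(xml_text):
    """Return one [open_ports, hostname_entropy, last_ip_octet] row per IPv4 host."""
    rows = []
    root = ET.fromstring(xml_text)
    for host in root.iter("host"):
        address = host.find("address[@addrtype='ipv4']")
        if address is None:
            continue  # skip hosts without an IPv4 address
        ip = address.get("addr")

        hostname_el = host.find("hostnames/hostname")
        hostname = hostname_el.get("name") if hostname_el is not None else ""

        # Count only ports whose reported state is "open".
        open_ports = 0
        for port in host.iter("port"):
            state = port.find("state")
            if state is not None and state.get("state") == "open":
                open_ports += 1

        rows.append([open_ports, hostname_entropy(hostname), int(ip.split(".")[-1])])
    return rows


features = extract_features(SAMPLE_NMAP_XML)
for row in features:
    print(row)
```

Each row is a plain list of numbers, which is the shape an anomaly detection model expects as input.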

It then uses Isolation Forest, an unsupervised machine learning algorithm suitable for anomaly detection, to find the ‘gold nuggets’ – the outstanding assets in the network.
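The ranking step can be sketched with the Isolation Forest implementation in scikit-learn. This is an illustration under stated assumptions, not Batea's actual pipeline: the feature columns, the synthetic 60-device network, and the parameter values are invented for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic feature matrix (illustrative only): one row per device with
# [open_ports, hostname_entropy, last_ip_octet].
n_typical = 59
typical = np.column_stack([
    rng.integers(2, 5, size=n_typical),      # 2 to 4 open ports
    rng.normal(2.0, 0.1, size=n_typical),    # similar hostname schemes
    np.arange(10, 10 + n_typical),           # consecutive addresses
])
unusual = np.array([[25, 3.8, 250]])         # many ports, odd name, far-off address
features = np.vstack([typical, unusual])

# Unsupervised fit: no labels are provided, only the feature rows.
model = IsolationForest(n_estimators=200, random_state=0)
model.fit(features)

# score_samples returns lower values for points that are easier to isolate,
# so sorting in ascending order puts the most anomalous devices first.
scores = model.score_samples(features)
ranking = np.argsort(scores)

for idx in ranking[:3]:
    print(f"device {idx}: score={scores[idx]:.3f} features={features[idx]}")
```

The algorithm isolates points with random splits, and devices that need fewer splits to separate from the rest receive lower scores. In this toy network, the device that differs on every feature comes out at the top of the ranking.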

The advantage of Isolation Forest is that it doesn’t need a mountain of data to make credible predictions, and is applicable to both small and large networks.

“We have seen very accurate rankings on networks with as few as 40 devices, but it gets better as you increase the number of devices,” explained Paquette. “We typically recommend using datasets of more than 50 devices.”

Pen testers can train the Batea machine learning model from scratch on new network data, or use a model that has been pre-trained on various networks.

“What we observe in practice is that many enterprise environments are in fact ‘typical’ environments, and this is what Batea aims to capture. By training on many different environments – just like an experienced pen tester does – the model gets iteratively better and better at separating the baseline from the exceptional.”

In fact, Delve has launched a test page for Batea, where pen testers can upload their Nmap reports and have them perused for gold nuggets without the need to install the tool. Delve will, in turn, use their data to further train Batea’s machine learning model.

How effective is Batea?

Evaluating the accuracy of unsupervised machine learning models is challenging because, unlike in supervised learning, there is no ground truth or labeled data to compare the results against.

Nonetheless, Batea has received good feedback from the security community, according to Delve, because it uses a simple model to tackle a frustrating problem. Paquette called it “one of the few examples of free, actionable machine learning models that can actually simplify the work of pen testers in the wild”.

In the coming months, the Canada-based Delve team is planning to add several features to enhance Batea, including integrations with tools other than Nmap and the ability to map external data to devices.

They will also be working on dimensionality reduction, a process that trims redundant or uninformative features from the data so that machine learning models can remain accurate with considerably fewer training examples.
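Delve has not said which technique it plans to use. As one common example, the sketch below applies principal component analysis (PCA) from scikit-learn to a synthetic matrix in which three of five features are near-duplicates of the other two. Note that PCA combines features into fewer new ones rather than dropping columns outright; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_devices = 60


def small_noise():
    """Tiny random jitter so the redundant columns are not exact copies."""
    return rng.normal(scale=0.01, size=n_devices)


# Two informative features plus three that are almost linear combinations of them.
a = rng.normal(size=n_devices)
b = rng.normal(size=n_devices)
features = np.column_stack([
    a,
    b,
    a + b + small_noise(),
    2 * a + small_noise(),
    -b + small_noise(),
])

# Standardize first so no feature dominates purely because of its scale.
scaled = StandardScaler().fit_transform(features)

# Project the five columns onto two principal components.
pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)
explained = float(pca.explained_variance_ratio_.sum())

print(reduced.shape)
print(f"variance retained by 2 components: {explained:.4f}")
```

Because the extra columns carry almost no independent information, two components retain nearly all of the variance, and a downstream model has fewer dimensions to learn from the same number of devices.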

