Cambridge University boffins apply natural language processing to sort out the slang on HackForums
UPDATED Computer scientists and linguists from Cambridge University have combined to apply natural language processing (NLP) techniques to pick out trends in discussions on underground cybercrime forums.
These underground forums and chatrooms typically feature a great deal of general discussion, as well as attempts to sell illicit software and other items, or offer hacking tutorials. Posts are often full of domain-specific lexicon, misspellings, slang, jargon, and acronyms.
Standard NLP approaches are tuned for more organized, edited and collated content such as news articles and Wikipedia entries. Conventional approaches fall apart once faced with talk of ‘fullz’, ‘warez’, ‘rats’, ‘sploits’, and other terms that pepper English-language cybercrime forums.
A team of researchers led by Jack Hughes of the University of Cambridge’s Computer Laboratory and linguist Seth Aycock, also of the University of Cambridge, were however able to develop a technique to pick out trends from years of posts to an English-language underground hacking forum – specifically the popular HackForums site.
The team's statistical approach, based on a technique called ‘weighted log-odds ratio’, was said to achieve better results than ‘term-frequency inverse-document-frequency’ (TF-IDF), another NLP-based method.
The researchers tested their technique by looking at HackForums posts referencing the spread of the WannaCry ransomware in 2017, and a second set of posts contained in a subforum called ‘Monetizing Techniques’.
The Bayesian statistical analysis approach taken by the researchers and the NLP techniques they applied are informed by earlier research into making sense of “noisy text data”.
“Detecting trending topics on noisy social media data is not a new problem for information retrieval and NLP,” the University of Cambridge team explains.
“However, we believe our application of an existing statistical method onto a longitudinal dataset provides a novel lightweight approach to detecting trending terms, which returns terms of more relevance than TF-IDF, and remains computationally less expensive than topic modelling such as LDA.”
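For readers curious about what a weighted log-odds calculation can look like, here is a minimal Python sketch. It is not the researchers' code: it assumes one widely used formulation, a log-odds ratio smoothed with a Dirichlet prior and scaled by its estimated variance, and the article does not confirm which variant the team used. The function name `weighted_log_odds`, the `prior_scale` parameter, and the idea of comparing a recent window of posts against earlier posts are all illustrative assumptions.

```python
import math
from collections import Counter


def weighted_log_odds(target_tokens, background_tokens, prior_scale=1.0):
    """Score each term by how over-represented it is in target vs background.

    Assumed formulation (not taken from the paper): log-odds ratio with a
    Dirichlet prior proportional to the pooled counts, divided by the
    square root of its approximate variance (a z-score).
    Expects at least two distinct terms across both token lists.
    """
    target = Counter(target_tokens)
    background = Counter(background_tokens)
    pooled = target + background

    n_target = sum(target.values())
    n_background = sum(background.values())
    alpha_total = prior_scale * sum(pooled.values())

    scores = {}
    for term, pooled_count in pooled.items():
        alpha = prior_scale * pooled_count  # prior pseudo-count for this term
        y_t = target[term]       # Counter returns 0 for unseen terms
        y_b = background[term]

        # Smoothed log-odds of the term in each corpus.
        odds_t = (y_t + alpha) / (n_target + alpha_total - y_t - alpha)
        odds_b = (y_b + alpha) / (n_background + alpha_total - y_b - alpha)
        delta = math.log(odds_t) - math.log(odds_b)

        # Approximate variance of the difference, used to down-weight rare terms.
        variance = 1.0 / (y_t + alpha) + 1.0 / (y_b + alpha)
        scores[term] = delta / math.sqrt(variance)
    return scores


# Hypothetical toy data: a "recent" window of posts versus earlier posts.
recent = "wannacry ransomware wannacry exploit".split()
earlier = "exploit rat exploit fullz ransomware".split()

scores = weighted_log_odds(recent, earlier)
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:12s} {score:+.3f}")
```

Terms with the highest positive scores are those most over-represented in the recent window. Because the calculation needs only word counts, it is far cheaper than fitting a topic model, which fits the researchers' description of the approach as lightweight.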
Applying the technique may have practical applications in “identifying what may be of interest to security researchers” more quickly and efficiently, according to the researchers.
However, the team acknowledge that since much cybercrime discussion takes place on Russian-language forums, more work is needed to see if the technique lends itself to wider application.
“Many cyber-crime forums are not English-speaking, which can add complexity into analysis,” the team acknowledged.
A paper (PDF) on the research has been accepted at the 2020 Workshop on Noisy User-Generated Text.
Aycock told The Daily Swig that the same methodology could "certainly be applied to foreign language cybercrime forums". "The method is statistical so is language-independent; however, further research would have to be done to assess the technique’s performance on foreign language data, especially if it’s mixed with text in other languages," he said.
One of the main applications of the technique is to aid research on cybercrime forums, he added.
Aycock explained: "The Hackforums data and the CrimeBB dataset are used in a lot of cybercrime research, so our work will help researchers to track trending terms in current and historic data; for example, studying how the more dangerous malware has trended on the forums previously might then mean researchers, or cybersecurity professionals, can find similar dangerous malware earlier and track it (in the researcher’s case) or act against it (in the cybersecurity professional’s case)."
This story has been updated to add comment from Seth Aycock.