Stopping Cyberattacks with AI
Math Plus Machine Learning Quickly Identifies and Deflects Malicious Traffic
The war in Ukraine has caused a dramatic rise in distributed denial-of-service (DDoS) cyberattacks worldwide, digital offensives that can bring down websites by overwhelming the targeted server with a deluge of internet traffic. Millions occur every year, and both their number and their size are rising. Approximately a third of website downtime is due to DDoS attacks.
“DDoS cyberattacks are meant to generate chaos, disrupt institutions, and of course cause financial losses,” says Michał Karpowicz, a computer scientist and the director of research at NASK, Polish National Research Institute for Cybersecurity and AI. “And they are very common because they are relatively easy to generate.”
In a typical DDoS attack, a culprit uses many computers and online devices infected with specially designed malware. These devices can include gadgets on the Internet of Things (IoT)—kitchen appliances, security cameras, and thermostats—which are proliferating every day. Currently, the IoT contains more than 10 billion devices, a massive army awaiting recruitment. Attacks are typically one of two types, or a combination. In volumetric attacks, infected devices flood the targeted network with extreme traffic volumes at once, overloading its capabilities to serve regular users. In an application attack, they make requests that require the server to perform extensive computations, also overloading it.
Karpowicz describes the problem using a familiar scenario. Imagine you’re driving to the bank. There’s little traffic and everything is going smoothly until you reach the final intersection. Suddenly, everything jams up and you must wait. That’s a volumetric attack. Now imagine you have finally arrived at the bank and are standing in line for the teller. You notice that the customer in front of you has an unusual issue that requires the attention of every bank employee. Again, you must wait. That’s an application attack.
Malware can hide the source of the requests, making attacks hard to shut down. Services also offer attacks for hire, making them widely accessible. Such attacks have been launched by disgruntled employees, activists, market competitors, and nations. Karpowicz notes that during pandemic lockdowns, NASK detected attacks launched from student networks at the beginning of lectures to halt online tests.
But recently Karpowicz has developed mathematical methods to detect such malicious requests and halt or redirect them. “When you are faced with a problem that is rooted in reality,” he says, “it gives you a lot of inspiration.”
Fingerprints on the Fly
Information on the internet, including requests sent to websites, consists of data packets. Each packet has a header describing its size, source, and destination. Typical defenses look at those headers to see if they’re on a ban-list. The defenses then redirect bad packets to harmless destinations.
The problem is that they must have this list beforehand, and attackers frequently change headers to escape detection. Karpowicz’s solution instead detects network traffic patterns in real time.
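The limitation of a ban-list can be seen in a minimal Python sketch. The packet fields, the addresses (taken from documentation-only ranges), and the function name are hypothetical and chosen purely for illustration; they do not describe any real defense product.

```python
# Hypothetical ban-list of source addresses that must be known in advance.
BAN_LIST = {"203.0.113.7", "203.0.113.8"}

def filter_by_ban_list(packets, ban_list):
    """Split packets into (allowed, redirected) by exact source-address match."""
    allowed, redirected = [], []
    for pkt in packets:
        (redirected if pkt["src"] in ban_list else allowed).append(pkt)
    return allowed, redirected

packets = [
    {"src": "203.0.113.7", "dst": "198.51.100.1", "size": 64},    # known bad source
    {"src": "192.0.2.10", "dst": "198.51.100.1", "size": 512},    # ordinary user
    {"src": "203.0.113.99", "dst": "198.51.100.1", "size": 64},   # same attack, changed header
]

allowed, redirected = filter_by_ban_list(packets, BAN_LIST)
```

Only the first packet is redirected. The third, which differs from it only in its source address, passes straight through, which is why a static list struggles against attackers who keep changing headers.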
Imagine a series of vehicle traffic cameras at city intersections. If you look at a random sampling of video clips, you start to notice a sudden buildup of blue convertibles at a certain location. Suspicious. You can then look closely at the cars and learn more about them and potentially pull more of them over.
The method takes network data and translates it into signal data. The task then is to separate different signal sources. Karpowicz likens that to the cocktail party problem, where you’re at a party surrounded by conversations and must single out the words of the person speaking to you.
Here, he uses linear algebra. The signals, or packet headers, fill matrices. In a matrix, each row represents packets of a certain type over time, defined by packet size or source or some other factor.
In the vehicle traffic metaphor, you might have rows for blue vehicles, minivans, and cars from California. The values filling the cells in a given matrix indicate something that could be relevant to identifying an attack. In one matrix, each cell might contain the number of flags, and in another it might indicate bits per second. Karpowicz then deciphers which combinations of attributes are especially high in these metrics. Not just blue cars or convertibles, but blue convertibles.
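The idea of rows per packet type, columns per time step, and a search for attribute combinations that stand out can be sketched in a few lines of Python. This is not meta-factorization: it swaps in a simple burst ratio (latest count against the earlier average) as the scoring rule, and the attribute names, the synthetic traffic, and every function name are assumptions made for illustration.

```python
from itertools import combinations

def count_matrix(packets, keys, n_bins):
    """Rows: each observed value combination of `keys`. Columns: time bins. Cells: packet counts."""
    rows = {}
    for pkt in packets:
        label = tuple(pkt[k] for k in keys)
        rows.setdefault(label, [0] * n_bins)[pkt["bin"]] += 1
    return rows

def burst_score(counts):
    """Last-bin count relative to the earlier average (add-one smoothing avoids dividing by zero)."""
    earlier = counts[:-1]
    baseline = sum(earlier) / len(earlier)
    return (counts[-1] + 1) / (baseline + 1)

def most_suspicious(packets, attributes, n_bins):
    """Score every single attribute and every attribute combination; return the top scorer."""
    best = None
    for r in range(1, len(attributes) + 1):
        for keys in combinations(attributes, r):
            for label, counts in count_matrix(packets, keys, n_bins).items():
                score = burst_score(counts)
                if best is None or score > best[2]:
                    best = (keys, label, score)
    return best

# Synthetic traffic: steady background in bins 0-3, plus a burst of small UDP packets in bin 3.
background = [("TCP", 512)] * 3 + [("UDP", 512), ("TCP", 64)]
packets = [{"proto": p, "size": s, "bin": b} for b in range(4) for p, s in background]
packets += [{"proto": "UDP", "size": 64, "bin": 3} for _ in range(20)]

keys, label, score = most_suspicious(packets, ["proto", "size"], n_bins=4)
```

In this toy data, "UDP" alone and "size 64" alone each score 11, because ordinary traffic also contains UDP packets and small packets. The combination ("UDP", 64) scores 21, since nothing in the background matches both. That is the blue-convertible effect in miniature.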
Once he’s found those combinations of features, he translates them back to the networking world, generating rules for firewalls. In the vehicle traffic metaphor, if he now knows that blue cars of a certain make and model are causing the problem, he creates camera filters that spot those cars immediately. He calls his linear algebra–based method meta-factorization. A paper on meta-factorization is under review.
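Translating a detected combination back into a filter might look like the following Python sketch. The rule format here is a plain predicate function rather than any real firewall syntax, and the attribute names and values are hypothetical.

```python
def make_rule(keys, values):
    """Build a predicate that matches packets whose `keys` all equal `values`."""
    wanted = dict(zip(keys, values))

    def matches(pkt):
        return all(pkt.get(k) == v for k, v in wanted.items())

    return matches

# Suppose detection singled out small UDP packets (an assumed result, for illustration).
rule = make_rule(("proto", "size"), ("UDP", 64))

packets = [
    {"proto": "UDP", "size": 64},    # matches both attributes: redirected
    {"proto": "UDP", "size": 512},   # UDP but not small: passes
    {"proto": "TCP", "size": 64},    # small but not UDP: passes
]

redirected = [p for p in packets if rule(p)]
passed = [p for p in packets if not rule(p)]
```

Because the rule requires the full combination, traffic that shares only one attribute with the attack is left alone.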
Karpowicz implements the calculations in MATLAB®. “Linear algebra algorithms are the fastest known algorithms in science,” he says, “making this method highly efficient.”
Detecting malicious packets is only part of the problem; defenders must also control them. Karpowicz says a solution came to him by accident. About a decade ago, he was working on a project to increase the energy efficiency of network devices. He wanted to predict traffic volume and redirect packet streams to routes that could handle them. While discussing the project with a colleague working on cybersecurity, the colleague said it might be used to fight DDoS attacks. “We were playing with the idea,” Karpowicz says, “and bit by bit, a new technology was being developed.”
In a paper published last year in the European Journal of Control, Karpowicz describes the result, a method called adaptive tuning. Typical network traffic control systems don’t receive feedback on their performance. They defend against attacks by stopping all the traffic somehow related to the attack. That is often too much, as it may also stop the legitimate traffic. Karpowicz suggests using traffic controllers that incorporate feedback so they know if they’re redirecting the right packets.
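The difference feedback makes can be illustrated with a toy closed-loop filter in Python. This is a plain proportional update, not the adaptive tuning method from the paper. The suspicion scores, the gain, and the assumption that the controller can measure how much legitimate traffic it dropped are all hypothetical.

```python
def apply_filter(scores, threshold):
    """Return the scores that pass (are NOT redirected) at this threshold."""
    return [s for s in scores if s < threshold]

def feedback_step(threshold, legit, attack, gain=0.5):
    """One closed-loop update: raise the threshold if legitimate traffic is
    being dropped, lower it if attack traffic is getting through."""
    legit_dropped = 1 - len(apply_filter(legit, threshold)) / len(legit)
    attack_passed = len(apply_filter(attack, threshold)) / len(attack)
    new_threshold = threshold + gain * (legit_dropped - attack_passed)
    return min(1.0, max(0.0, new_threshold)), legit_dropped, attack_passed

# Hypothetical per-packet suspicion scores in [0, 1].
legit = [0.1, 0.2, 0.3, 0.4, 0.5]
attack = [0.7, 0.8, 0.9, 0.95]

threshold = 0.25  # deliberately too aggressive: blocks most legitimate traffic
history = []
for _ in range(5):
    threshold, legit_dropped, attack_passed = feedback_step(threshold, legit, attack)
    history.append((legit_dropped, attack_passed))
```

The starting threshold drops 60 percent of legitimate traffic. After one correction it settles between the two groups, where no legitimate packets are dropped and no attack packets pass. A controller without feedback would have stayed at the overly aggressive setting.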
“Cybersecurity provided me with great scientific challenges, and in particular mathematical challenges,” he says. The theory and practicality fed off each other. “That’s what kept me going. I had this hunch that there’s something to it and there’s a problem that suggests solutions.” Together, the detection and control systems form the basis of a service offered by NASK called FLDX, for which NASK now holds a patent. It can detect attacks within 5 seconds and start to mitigate them in 10 seconds.
NASK deploys FLDX across Poland, protecting nationwide networks using a distributed cluster of virtual machines. “What makes the entire solution unique is that it’s not only a cybersecurity system, but it’s also a research platform,” Karpowicz says. “It allows you to program your own algorithms for detecting and suppressing attacks in MATLAB. It allows you to access data that we collect in MATLAB, and you can use all the benefits of this technology to use data processing, signal processing, machine learning, artificial intelligence (AI), based on traffic samples that we provide.”
Both the detection and adaptive tuning algorithms use machine learning, specifically a kind called semi-supervised learning, which doesn’t require much hand-labeled data for training. “We encode a little bit of expert knowledge in the learning algorithms and then let them do their job,” Karpowicz says. “And their job is to discover what is going on in the world.”
They find statistical regularities in large amounts of data, like spotting clusters of a certain type of car. “It is machine learning at its best,” he says. “We don’t need to begin by collecting huge data sets in order to tune the system before it becomes operational. It is ready to work the moment it is installed.”
The detector uses a “voting” system to identify anomalous data flows. It also optimizes matrices by solving sets of equations. Even if network behavior is nonlinear (where some factors have disproportionate effects), linear equations can capture the bulk of the patterns.
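The article does not describe how the voting works, so the following Python sketch is only one plausible reading: several simple detectors each cast a vote, and a flow is flagged when enough of them agree. The detectors, thresholds, and flow statistics are all hypothetical.

```python
def vote(flow, detectors, min_votes):
    """Flag a flow when at least `min_votes` detectors consider it anomalous."""
    votes = sum(1 for detect in detectors if detect(flow))
    return votes >= min_votes

# Three hypothetical detectors, each looking at one statistic of a flow.
detectors = [
    lambda f: f["packets_per_s"] > 10_000,     # unusually high packet rate
    lambda f: f["mean_size"] < 100,            # unusually small packets
    lambda f: f["distinct_sources"] > 1_000,   # unusually many senders
]

flows = {
    "web":    {"packets_per_s": 800,    "mean_size": 700,  "distinct_sources": 120},
    "backup": {"packets_per_s": 15_000, "mean_size": 1400, "distinct_sources": 3},
    "flood":  {"packets_per_s": 90_000, "mean_size": 64,   "distinct_sources": 5_000},
}

flagged = [name for name, f in flows.items() if vote(f, detectors, min_votes=2)]
```

The busy but legitimate backup flow trips only the rate detector and is not flagged, while the flood trips all three. Requiring agreement among detectors is one way to avoid stopping legitimate traffic that merely looks unusual on a single measure.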
Researchers and engineers can use MATLAB to access NASK’s data set, and because many technical universities already use it, “that’s something that is very convenient.” NASK’s data set is unique because it consists of traffic data from all across the country and has a high rate of data sampling.
NASK’s research looks at traffic dynamics, target vulnerabilities, and attack sources, frequency, and methods. And they use FLDX to protect public and commercial clients, including schools and vaccination enrollment services. Earning the trust of his cybersecurity colleagues has been “a great surprise,” Karpowicz says. “It takes time for scientists to become respected in the cybersecurity engineering department. You have to show that you understand what they are doing and that you do something useful.” But with FLDX, “you get access to this playground where you can actually make things happen. And that’s something very special about NASK. We have a very short path from the laboratory to the technology and deployment.”
Karpowicz is now collaborating with other institutions, including MIT and the University of Technology Sydney. “I want NASK to be known worldwide for cybersecurity and AI,” he says. It’s on its way.