TeMIA-NT Architecture
TeMIA-NT (ThrEat Monitoring and Intelligent data Analytics of Network Traffic) is composed of three main modules: capture, distributed processing, and visualization.
The capture module performs data acquisition. Packets are either sniffed directly from the network or mirrored from a network switch, in both cases using the libpcap library. The capture module then converts the captured packets into network flows for further analysis. A network flow is an abstraction that characterizes a network connection. Usually, a flow is a sequence of packets that share an identical set of packet header fields, for instance the same source IP address and destination IP address. A well-known definition of a network flow uses the quintuple of TCP/IP packet header fields: source IP address, destination IP address, source port, destination port, and protocol. We use a Python application based on flowtbag to abstract packets into flows. To characterize a flow, flowtbag considers a time window and computes 46 features. The resulting 46-feature flow record is then published to a producer/consumer service provided by Apache Kafka. Zeek (formerly Bro) can also be used to capture network events.
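The packet-to-flow abstraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the flowtbag implementation: the `Packet` record, the `group_into_flows` and `flow_features` helpers, the 60-second idle timeout, the unidirectional flow key, and the three summary features are all hypothetical choices made for the example, whereas the actual tool computes 46 features per flow.

```python
from collections import namedtuple

# Hypothetical packet record; the field names are assumptions for this sketch.
Packet = namedtuple("Packet", "ts src_ip dst_ip src_port dst_port proto size")


def group_into_flows(packets, idle_timeout=60.0):
    """Group packets that share the same 5-tuple into flows.

    A packet arriving more than `idle_timeout` seconds after the previous
    packet of the same 5-tuple starts a new flow (assumed windowing rule).
    """
    active = {}    # 5-tuple -> flow currently being built
    finished = []  # flows closed by the idle timeout
    for pkt in sorted(packets, key=lambda p: p.ts):
        key = (pkt.src_ip, pkt.dst_ip, pkt.src_port, pkt.dst_port, pkt.proto)
        flow = active.get(key)
        if flow is not None and pkt.ts - flow["last_ts"] > idle_timeout:
            finished.append(flow)
            flow = None
        if flow is None:
            flow = {"key": key, "first_ts": pkt.ts, "last_ts": pkt.ts,
                    "packets": 0, "bytes": 0}
            active[key] = flow
        flow["last_ts"] = pkt.ts
        flow["packets"] += 1
        flow["bytes"] += pkt.size
    return finished + list(active.values())


def flow_features(flow):
    """Return a tiny feature record (3 features, not the 46 of flowtbag)."""
    return {
        "duration": flow["last_ts"] - flow["first_ts"],
        "total_packets": flow["packets"],
        "total_bytes": flow["bytes"],
    }


packets = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 100),
    Packet(0.5, "10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 200),
    Packet(1.0, "10.0.0.3", "10.0.0.2", 40001, 53, "udp", 60),
    Packet(100.0, "10.0.0.1", "10.0.0.2", 40000, 80, "tcp", 50),
]
flows = group_into_flows(packets)
features = [flow_features(f) for f in flows]
```

In this toy input, the last packet shares its 5-tuple with the first two but arrives after the idle timeout, so it opens a separate flow. Each feature record is a plain dictionary, so it could be serialized (for example as JSON) before being handed to a message broker.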
The distributed processing module is the core of the TeMIA-NT tool: it classifies flows as malicious or benign using machine learning (ML) algorithms. The module is built on the Apache Spark framework, which provides the scalability, flexibility, intelligence, and speed of the data analytics. Spark processes the data in a cluster following the master/slave model, in which the slaves can expand and reduce their resources, making the system scalable. TeMIA-NT supports different machine learning algorithms, such as decision tree, random tree, naive Bayes, and support vector machine (SVM), among others. Once a flow arrives at the distributed processing module, a feature selection algorithm can be applied to select the most important features for threat classification. Another possibility is to apply dimensionality reduction, such as the popular principal component analysis (PCA). All algorithms are tuned by grid search over their hyperparameters. Automatic feature extraction based on the node2vec algorithm is currently under development. During processing, the resulting metadata is also enriched with additional information, such as the geographical location of the analyzed IP addresses.
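The grid hyperparameter tuning mentioned above can be illustrated with a small, self-contained Python sketch. It is not the Spark-based code of TeMIA-NT: the k-nearest-neighbors classifier is a stand-in chosen only because it fits in a few lines, and the helper names (`knn_predict`, `accuracy`, `grid_search`), the toy two-feature data, and the grid values are all assumptions made for the example. The label 1 stands for a malicious flow and 0 for a benign one.

```python
from itertools import product


def knn_predict(train, point, k, metric):
    """Classify `point` by majority vote among its k nearest training rows."""
    def dist(a, b):
        if metric == "manhattan":
            return sum(abs(x - y) for x, y in zip(a, b))
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5  # euclidean

    nearest = sorted(train, key=lambda row: dist(row[0], point))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > len(nearest) else 0


def accuracy(train, valid, k, metric):
    """Fraction of validation rows classified correctly."""
    hits = sum(knn_predict(train, x, k, metric) == y for x, y in valid)
    return hits / len(valid)


def grid_search(train, valid, grid):
    """Evaluate every combination in `grid`; keep the best validation score."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = accuracy(train, valid, **params)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


# Toy (feature_1, feature_2) rows; the fourth row is a deliberately
# mislabeled point that misleads the k=1 setting.
train = [
    ((0.10, 0.20), 0), ((0.20, 0.10), 0), ((0.15, 0.15), 0),
    ((0.13, 0.17), 1),
    ((0.90, 0.80), 1), ((0.80, 0.90), 1), ((0.85, 0.85), 1),
]
valid = [((0.12, 0.18), 0), ((0.88, 0.82), 1)]
grid = {"k": [1, 3], "metric": ["euclidean", "manhattan"]}

best_params, best_score = grid_search(train, valid, grid)
```

The search is exhaustive: every combination of the listed values is scored on held-out data, and the best-scoring combination is retained. In a Spark deployment the same idea would be delegated to the framework's own tuning utilities rather than hand-written loops.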
The last module in the sequence is the visualization module. TeMIA-NT uses Elasticsearch and Kibana to store and visualize the results. The distributed processing module sends the computed results to Elasticsearch, which provides a fast search and storage service. The user interface, which runs in the Kibana environment, retrieves the results from Elasticsearch through queries and presents them to the user.
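As an illustration of how classified flows might be handed to Elasticsearch, the sketch below builds a request body in the newline-delimited JSON format expected by the Elasticsearch bulk API. The index name `temia-flows`, the document fields, and the `to_bulk_payload` helper are hypothetical; the text does not specify how TeMIA-NT actually structures or transmits its results.

```python
import json


def to_bulk_payload(results, index_name):
    """Serialize result documents as an Elasticsearch bulk NDJSON body.

    Each document is preceded by an action line naming the target index.
    """
    lines = []
    for doc in results:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc, sort_keys=True))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


# Hypothetical classification results produced by the processing module.
results = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "label": "malicious"},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "label": "benign"},
]
payload = to_bulk_payload(results, "temia-flows")
```

The resulting string could be sent in an HTTP POST to the `_bulk` endpoint with the `Content-Type: application/x-ndjson` header, after which the documents become searchable and can be explored from Kibana.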