Week 2 — Data, Features, and Classical Machine Learning for Security Analytics
Overview
This week focuses on the practical core of many applied AI systems in cybersecurity:
- preparing cyber data;
- selecting and engineering features;
- applying classical machine-learning methods;
- evaluating results in ways that make operational sense.
Students often rush to deep learning or GenAI because these appear modern and powerful. In practice, many real security tasks are still solved effectively with careful preprocessing, feature engineering, and well-understood classical models.
The central message of this week is:
A simple model on well-prepared data is often more valuable than a sophisticated model on badly understood data.
Learning Outcomes
By the end of this week, students should be able to:
- explain the role of preprocessing and feature engineering in cyber datasets;
- distinguish between supervised learning, unsupervised learning, and anomaly detection in security contexts;
- apply basic classical machine-learning models to a cybersecurity dataset;
- evaluate model results using metrics suitable for security problems;
- identify common pitfalls such as class imbalance, leakage, overfitting, and unrealistic datasets.
1. Why data preparation matters
Cybersecurity data is rarely ready for modelling in raw form. Real datasets are often:
- incomplete;
- imbalanced;
- noisy;
- duplicated;
- inconsistently labelled;
- temporally messy;
- context-dependent.
A model trained on poorly prepared data may learn irrelevant patterns, produce unstable results, or appear strong in testing but fail in deployment.
Example
Suppose a malicious URL dataset contains a feature such as “source list name”, where one feed contains mainly malicious entries and another mainly benign entries. A model may learn to recognise the feed rather than any meaningful URL behaviour. This is not intelligence; it is accidental shortcut learning, and it will not transfer to URLs that arrive without that feed label.
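One hedged way to spot this kind of shortcut before training is to check how well a single column predicts the label on its own. The sketch below uses a hypothetical toy list of (source list name, label) pairs; the feed names, the helper label_purity_by_value, and the 0.9 cut-off are assumptions made for illustration, not part of any particular dataset or library.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (source_list_name, label). Illustrative values only.
records = [
    ("feed_a", "malicious"), ("feed_a", "malicious"), ("feed_a", "malicious"),
    ("feed_a", "benign"),
    ("feed_b", "benign"), ("feed_b", "benign"), ("feed_b", "benign"),
    ("feed_b", "benign"),
]

def label_purity_by_value(rows):
    """For each feature value, return the share of rows carrying its most common label."""
    counts = defaultdict(Counter)
    for value, label in rows:
        counts[value][label] += 1
    return {
        value: max(c.values()) / sum(c.values())
        for value, c in counts.items()
    }

purity = label_purity_by_value(records)

# Assumed cut-off: values whose rows are at least 90% one label get flagged for review.
suspicious = {value: p for value, p in purity.items() if p >= 0.9}

print(purity)
print(suspicious)
```

A column whose values each map almost entirely to one label deserves a closer look: it may be a genuine signal, or it may be an artefact of how the data was collected.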
2. Typical steps in cyber data preprocessing
2.1 Cleaning
This may include:
- removing duplicates;
- fixing malformed entries;
- handling missing values;
- normalising inconsistent categorical values;
- converting timestamps into usable formats.
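The cleaning steps above can be sketched with the Python standard library alone. The rows, field names (ts, proto, bytes), and the timestamp format are assumptions for illustration. Dropping rows with unparseable timestamps is one possible policy, not the only reasonable one; in some settings repairing or quarantining them is preferable.

```python
from datetime import datetime, timezone

# Hypothetical raw log rows; field names and values are assumptions for illustration.
raw_rows = [
    {"ts": "2024-03-01T10:15:00Z", "proto": "TCP", "bytes": "1200"},
    {"ts": "2024-03-01T10:15:00Z", "proto": "TCP", "bytes": "1200"},  # exact duplicate
    {"ts": "2024-03-01T10:16:30Z", "proto": " tcp ", "bytes": ""},    # messy category, missing value
    {"ts": "not-a-timestamp", "proto": "UDP", "bytes": "300"},        # malformed entry
]

def clean(rows):
    seen, cleaned = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # remove exact duplicates
            continue
        seen.add(key)
        try:
            # Convert the timestamp string into a timezone-aware datetime (UTC assumed).
            ts = datetime.strptime(row["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # assumed policy: drop rows whose timestamp cannot be parsed
        proto = row["proto"].strip().lower()                 # normalise categorical values
        size = int(row["bytes"]) if row["bytes"] else None   # keep missing values explicit
        cleaned.append({"ts": ts, "proto": proto, "bytes": size})
    return cleaned

cleaned_rows = clean(raw_rows)
for r in cleaned_rows:
    print(r)
```

Keeping a missing value as None, rather than silently replacing it with zero, leaves the imputation decision visible for a later step.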
2.2 Selection
Some columns may be irrelevant, redundant, or dangerous because they leak label information.
For example, if a field is created only after analyst confirmation, it will not be available at prediction time. Keeping it in training would produce unrealistically optimistic performance estimates.
2.3 Transformation
Examples include:
- one-hot encoding of categorical variables;
- scaling or normalisation of numerical features;
- tokenisation of text;
- aggregation of event sequences;
- temporal summarisation over fixed windows.
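Two of these transformations, one-hot encoding and min-max scaling, are small enough to write by hand. This is a minimal sketch on made-up values; the helper names one_hot and min_max_scale are hypothetical. In a real pipeline the category list and the minimum and maximum would normally be learned from the training set only and then reused on validation and test data.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector over the sorted set of categories."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature columns for four network flows.
protocols = ["tcp", "udp", "tcp", "icmp"]
durations = [2.0, 10.0, 6.0, 4.0]

categories, encoded = one_hot(protocols)
scaled = min_max_scale(durations)

print(categories)  # ['icmp', 'tcp', 'udp']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
print(scaled)      # [0.0, 1.0, 0.5, 0.25]
```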
2.4 Splitting the data
Students should separate data into training, validation, and test sets carefully.
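In cyber contexts, random splits are not always sufficient: they can place events from the same campaign, or events that occurred after the test period, in the training set. A time-based split may be more realistic because it better reflects deployment conditions, where the model only ever sees the past.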
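A time-based split can be sketched in a few lines: sort by timestamp, then cut. The events, the 60/20/20 proportions, and the function name time_based_split are assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical events as (timestamp, label); values are illustrative only.
events = [
    (datetime(2024, 1, 5), 0), (datetime(2024, 3, 9), 1),
    (datetime(2024, 2, 1), 0), (datetime(2024, 4, 2), 0),
    (datetime(2024, 1, 20), 1), (datetime(2024, 3, 28), 0),
    (datetime(2024, 2, 14), 0), (datetime(2024, 4, 17), 1),
    (datetime(2024, 1, 11), 0), (datetime(2024, 3, 15), 0),
]

def time_based_split(rows, train_frac=0.6, val_frac=0.2):
    """Sort by time, then cut: the oldest rows train, the newest rows test."""
    ordered = sorted(rows, key=lambda r: r[0])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

train, val, test = time_based_split(events)

print(len(train), len(val), len(test))   # 6 2 2
print(train[-1][0] < val[0][0] < test[0][0])  # True: no training row is newer than validation or test
```

With this layout, every evaluation row is strictly later than every training row, which mirrors how a deployed model would meet new traffic.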
3. Feature engineering in cybersecurity
A feature is a measurable property used by the model. Feature engineering is the process of designing or selecting features that help distinguish meaningful patterns.
3.1 Why features matter
The model only sees the world through the features it is given. If the features are weak, noisy, or misleading, the model will also be weak, noisy, or misleading.
3.2 Examples of cybersecurity features
Network-based features
- packet count;
- byte count;
- flow duration;
- source/destination port;
- protocol;
- number of failed connections;
- ratio of incoming to outgoing traffic.
Authentication features
- login time;
- geolocation;
- device novelty;
- failed login count;
- impossible travel indicators;
- privilege level.
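As an illustration of how one of these features might be derived, the sketch below counts failed logins per user per hour, which is also an example of temporal summarisation over a fixed window. The event tuples, user names, and the one-hour window are assumptions, not a recommendation for any specific system.

```python
from collections import Counter
from datetime import datetime

# Hypothetical authentication events: (user, timestamp, success). Illustrative values only.
auth_events = [
    ("alice", datetime(2024, 5, 1, 9, 2), False),
    ("alice", datetime(2024, 5, 1, 9, 7), False),
    ("alice", datetime(2024, 5, 1, 9, 48), True),
    ("bob", datetime(2024, 5, 1, 9, 30), False),
    ("alice", datetime(2024, 5, 1, 10, 5), False),
]

def failed_logins_per_hour(events):
    """Count failed logins per (user, start of the hour in which they occurred)."""
    counts = Counter()
    for user, ts, success in events:
        if not success:
            window_start = ts.replace(minute=0, second=0, microsecond=0)
            counts[(user, window_start)] += 1
    return counts

features = failed_logins_per_hour(auth_events)
for (user, window_start), n in sorted(features.items()):
    print(user, window_start.isoformat(), n)
```

Fixed hourly buckets are simple but can split a burst of failures across two windows; a sliding window is a common alternative when that matters.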
Email and URL features
- domain age;
- character distribution;
- number of subdomains;
- URL length;
- attachment type;
- mismatch between displayed and actual link.
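Some of the URL features listed above are purely lexical and can be computed from the string itself. The following is a rough sketch using urllib.parse from the standard library; the feature names, the example URL, and the subdomain heuristic are assumptions. In particular, counting labels beyond the last two ignores multi-part suffixes such as co.uk, so a production system would need a proper public-suffix list. Domain age is not shown because it requires an external lookup.

```python
from urllib.parse import urlparse

def url_features(url):
    """Extract a few simple lexical features from a URL string."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        # Rough proxy: labels beyond the last two (ignores suffixes like co.uk).
        "subdomain_count": max(len(labels) - 2, 0),
        "digit_ratio": digits / len(url) if url else 0.0,
        "has_at_symbol": "@" in url,
    }

# Hypothetical example URL.
example = url_features("http://login.secure.example.com/update?id=42")
print(example)
```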
Host-based features
- process tree depth;
- unusual parent-child process pairs;
- registry modifications;
- frequency of script execution;
- file entropy indicators.
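File entropy is usually computed as Shannon entropy over byte frequencies, measured in bits per byte. The sketch below is a minimal version; the function name shannon_entropy and the two sample inputs are chosen for illustration. High entropy can indicate packed or encrypted content, but ordinary compressed files also score high, so the value is an indicator rather than a verdict.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for empty input, at most 8.0."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

low = shannon_entropy(b"AAAAAAAA")          # a single repeated byte value
high = shannon_entropy(bytes(range(256)))   # every byte value exactly once

print(low, high)
```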
3.3 Good feature engineering principles
Good features are usually:
- available at the time of decision;
- interpretable enough to reason about;
- relevant to the threat model;
- not direct leaks of the answer;
- stable enough to be useful beyond one narrow dataset.
4. Supervised learning in security analytics
Supervised learning uses labelled examples.
Typical security tasks
- phishing vs benign email;
- malicious vs benign URL;
- attack vs normal flow;
- malware family classification;
- suspicious vs routine alert.
Common classical models
- logistic regression;
- decision trees;
- random forests;
- support vector machines;
- k-nearest neighbours;
- gradient boosting methods.
When supervised learning works well
It works best when:
- labels are reasonably reliable;
- the threat pattern is learnable from the data;
- the training set is representative enough;
- the organisation understands the cost of different errors.
Weaknesses
Supervised learning may struggle when:
- labels are scarce or inconsistent;
- attacker behaviour shifts;
- rare classes are underrepresented;
- models learn artefacts rather than security-relevant structure.
5. Unsupervised learning and anomaly detection
Many cyber problems do not have good labels. In these cases, unsupervised or semi-supervised approaches may be attractive.
5.1 Unsupervised learning
Unsupervised methods look for structure without labelled targets.
Examples:
- clustering similar alerts;
- grouping related behaviours;
- identifying outlier activity patterns.
5.2 Anomaly detection
Anomaly detection asks whether a data point is unusual relative to a baseline.
Examples:
- unusual login time for a user;
- abnormal traffic volume from a device;
- rare process behaviour on an endpoint.
Important caution
Anomalous does not necessarily mean malicious.
A system upgrade, a holiday period, or a new service rollout may look anomalous. This is why anomaly detection can produce many false positives in security environments.
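A very simple baseline-relative detector is a z-score test: flag a value when it lies more than a chosen number of standard deviations from the historical mean. The traffic figures and the threshold of 3.0 below are assumptions for illustration; real deployments often need per-entity baselines and seasonality handling that this sketch omits.

```python
from statistics import mean, stdev

# Hypothetical daily outbound traffic (MB) for one device; values are illustrative.
baseline = [100, 104, 98, 101, 97, 103, 99, 102]

def z_score(value, history):
    """How many sample standard deviations `value` lies from the mean of `history`."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0
    return (value - mu) / sigma

def is_anomalous(value, history, threshold=3.0):
    """Assumed rule: anomalous when the absolute z-score exceeds the threshold."""
    return abs(z_score(value, history)) > threshold

print(is_anomalous(105, baseline))  # False
print(is_anomalous(160, baseline))  # True
```

Note that the detector would flag a 160 MB day whether it came from data exfiltration or from a scheduled software update, which is exactly the caution stated above.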
6. Class imbalance and why accuracy is not enough
Many cyber datasets are heavily imbalanced. Malicious cases may be far rarer than benign ones.
Example
Suppose 99% of events are benign and 1% are malicious.
A model that labels everything as benign gets 99% accuracy, but it is useless.
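The arithmetic behind this example can be checked directly. The sketch assumes a hypothetical evaluation set of 10,000 events in the 99% / 1% proportion described above, with 1 meaning malicious and 0 meaning benign.

```python
# Hypothetical evaluation set matching the 99% / 1% example: 9,900 benign, 100 malicious.
y_true = [0] * 9900 + [1] * 100
y_pred = [0] * len(y_true)  # a "model" that labels everything as benign

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

Accuracy is 99%, yet not one of the 100 malicious events is detected.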
Better metrics
Precision
Of the alerts flagged as malicious, how many are actually malicious?
High precision reduces analyst fatigue.
Recall
Of the actual malicious cases, how many did the model catch?
High recall reduces missed attacks.
F1-score
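The harmonic mean of precision and recall. It is high only when both are high, so it penalises a model that achieves one at the expense of the other.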
ROC-AUC
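Useful for comparing how well models rank malicious cases above benign ones, but it can look misleadingly strong in highly imbalanced settings, because the false-positive rate is measured against the very large benign class.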
PR-AUC
Often more informative than ROC-AUC when the positive class is rare.
Confusion matrix
Shows true positives, true negatives, false positives, and false negatives directly.
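Precision, recall, and F1 all follow from the four confusion-matrix counts. The sketch below computes them by hand on a small made-up set of labels (1 = malicious, 0 = benign); the helper names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with 1 = malicious and 0 = benign."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    """Assumed convention: a metric with a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels and predictions for ten events.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(tp, fp, fn, tn)         # 3 1 1 5
print(precision, recall, f1)  # 0.75 0.75 0.75
```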
Security interpretation
A model is not good simply because a metric is high. It is good if its pattern of errors is acceptable for the operational setting.
7. Data leakage
Data leakage occurs when information that would not be available in the true decision context enters training or evaluation.
Common forms of leakage
- using future information in training features;
- including analyst-confirmed fields unavailable at prediction time;
- random splits that let near-duplicate events appear in both train and test;
- deriving features from the label itself.
Leakage is one of the main reasons published or classroom results may appear unrealistically strong.
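One hedged way to reduce the near-duplicate problem is to assign whole groups of related rows to the same split. The sketch below hashes a group key so that every row sharing it lands on the same side. Using the domain as the group key, the 20% test fraction, and the function name assign_split are all assumptions for illustration; the right grouping depends on the dataset.

```python
import hashlib

def assign_split(group_key: str, test_fraction: float = 0.2) -> str:
    """Deterministically send every row with the same group key to the same split."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical near-duplicate URLs that differ only in a query parameter,
# grouped here by domain (the choice of grouping key is an assumption).
rows = [
    ("evil.example", "http://evil.example/a?id=1"),
    ("evil.example", "http://evil.example/a?id=2"),
    ("shop.example", "http://shop.example/cart"),
]

splits = {url: assign_split(domain) for domain, url in rows}
for url, split in splits.items():
    print(split, url)
```

Because the assignment depends only on the group key, the two near-duplicate URLs can never end up on opposite sides of the train/test boundary.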
Rule of thumb
Always ask:
Would this information actually exist at the moment the system must make the decision?
If the answer is no, it should not be part of the predictive feature set.
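This rule of thumb can be made mechanical by recording, for each field, when it becomes available and filtering on that record. The field names and the two availability stages below are hypothetical; the catalogue itself has to come from someone who knows how the data is produced.

```python
# Hypothetical column catalogue: when each field becomes available. Names are illustrative.
FIELD_AVAILABILITY = {
    "src_bytes": "at_event",
    "dst_port": "at_event",
    "analyst_verdict": "after_review",
    "ticket_closed_at": "after_review",
}

def prediction_time_features(row, availability=FIELD_AVAILABILITY):
    """Keep only fields known to exist at decision time; uncatalogued fields are dropped."""
    return {k: v for k, v in row.items() if availability.get(k) == "at_event"}

row = {
    "src_bytes": 512,
    "dst_port": 443,
    "analyst_verdict": "malicious",
    "ticket_closed_at": "2024-06-02",
}
safe_row = prediction_time_features(row)
print(safe_row)  # {'src_bytes': 512, 'dst_port': 443}
```

Dropping fields that are missing from the catalogue is a deliberately cautious default: a new column must be classified before it can be used as a feature.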
8. Overfitting and underfitting
Underfitting
The model is too simple or the features are too weak to capture useful structure.
Overfitting
The model learns peculiarities of the training data instead of generalisable patterns.
In cybersecurity, overfitting is especially dangerous because attackers and environments change. A model that memorises one dataset may collapse in a live setting.
Practical warning signs
- strong training performance but weak test performance;
- unstable behaviour across splits;
- dependence on a few suspicious features;
- performance collapse on newer data.
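The first warning sign can be demonstrated with a deliberately overfitted toy model that simply memorises its training rows. The class MemorisingModel and the tiny feature tuples are hypothetical constructions for illustration, not a real detection method.

```python
class MemorisingModel:
    """Deliberately overfitted: recalls training rows exactly, predicts benign (0) otherwise."""

    def fit(self, X, y):
        self.table = {tuple(x): label for x, label in zip(X, y)}
        return self

    def predict(self, X):
        return [self.table.get(tuple(x), 0) for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical (feature_a, port) rows; 1 = attack, 0 = benign.
X_train = [(1, 80), (2, 443), (9, 4444), (8, 31337)]
y_train = [0, 0, 1, 1]
X_test = [(1, 443), (9, 31337), (7, 4444), (2, 80)]
y_test = [0, 1, 1, 0]

model = MemorisingModel().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
test_acc = accuracy(y_test, model.predict(X_test))

print(train_acc, test_acc)   # 1.0 0.5
print(train_acc - test_acc)  # 0.5
```

Perfect training accuracy alongside chance-level test accuracy is the signature of memorisation rather than learning.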
9. Case study: intrusion detection with classical ML
Consider a simplified network intrusion dataset.
Objective
Classify network flows as benign or attack-related.
Candidate features
- protocol type;
- connection duration;
- source bytes;
- destination bytes;
- failed connection count;
- flag indicators.
Possible models
- logistic regression;
- decision tree;
- random forest.
Questions for evaluation
- Which model gives the highest recall?
- Which model gives the lowest false-positive rate?
- Which model is easiest to explain?
- Would the results hold on future traffic?
- Are any features likely to reflect artefacts of the dataset rather than the attack itself?
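The tension between the first two questions can be seen even without a trained model. The sketch below uses a hand-written threshold rule on a single feature (failed connection count) as a stand-in for a classifier, and evaluates it at two thresholds. The flows and thresholds are invented for illustration and say nothing about how logistic regression, a decision tree, or a random forest would perform on real data.

```python
# Hypothetical flows: (failed_connection_count, is_attack). Values are illustrative only.
flows = [
    (0, 0), (1, 0), (0, 0), (2, 0), (5, 0), (1, 0),
    (4, 1), (7, 1), (9, 1), (2, 1),
]

def evaluate_threshold(rows, threshold):
    """Flag a flow as attack when its failed-connection count reaches `threshold`."""
    tp = sum(1 for x, y in rows if y == 1 and x >= threshold)
    fn = sum(1 for x, y in rows if y == 1 and x < threshold)
    fp = sum(1 for x, y in rows if y == 0 and x >= threshold)
    tn = sum(1 for x, y in rows if y == 0 and x < threshold)
    return {"recall": tp / (tp + fn), "false_positive_rate": fp / (fp + tn)}

strict = evaluate_threshold(flows, threshold=6)
loose = evaluate_threshold(flows, threshold=2)

print(strict)  # recall 0.5, false-positive rate 0.0
print(loose)   # recall 1.0, false-positive rate about 0.33
```

The stricter rule raises no false alarms but misses half the attacks; the looser rule catches every attack but flags a third of benign flows. Which is preferable depends on the operational cost of each kind of error.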
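The first two questions can be answered from a test set alone; the remaining three require judgement about explainability, future traffic, and dataset artefacts. That gap is the difference between classroom experimentation and operational reasoning.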
10. Practical workflow for students
A sensible machine-learning workflow for a cyber task is:
- define the decision problem clearly;
- inspect the dataset;
- clean and transform the data;
- split data carefully;
- establish a baseline;
- train one or two classical models;
- evaluate with suitable metrics;
- interpret errors;
- reflect on realism and limitations.
This is a better workflow than immediately searching for the most advanced model.
11. Lab guidance
Suggested lab theme
Comparing classical models on a cybersecurity dataset
Suggested tasks
- load a dataset for phishing, malicious URLs, or intrusion detection;
- examine class balance and feature types;
- preprocess the data;
- train at least two classical models;
- compare precision, recall, F1-score, and confusion matrices;
- identify at least one limitation of the dataset and at least one likely deployment risk.
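As a possible starting point for the first two tasks, the sketch below loads a small CSV with the standard library and reports the class balance. The inline CSV text stands in for a real dataset file, and its column names and values are assumptions; students would replace it with the dataset chosen for the lab.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV content standing in for a real dataset file; columns are assumptions.
csv_text = """url_length,has_at_symbol,label
54,0,benign
112,1,phishing
38,0,benign
61,0,benign
47,0,benign
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
class_counts = Counter(row["label"] for row in rows)
minority_share = min(class_counts.values()) / len(rows)

print(class_counts)
print(minority_share)  # 0.2
```

Checking the minority share before any modelling makes it clear from the outset whether accuracy alone could be a misleading metric for the dataset.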
Suggested extension
Ask students to compare:
- raw accuracy;
- operational usefulness;
- interpretability.
This helps shift thinking from benchmark culture to realistic evaluation.
12. Discussion questions
- When is a simpler model preferable to a more complex one in cybersecurity?
- Why can anomaly detection become noisy in real environments?
- Which is worse in a phishing filter: false positives or false negatives?
- Why is data leakage so easy to introduce in cyber datasets?
- How should model evaluation change if the environment evolves rapidly?
13. Key terms
- Preprocessing
- Feature Engineering
- Supervised Learning
- Unsupervised Learning
- Anomaly Detection
- Class Imbalance
- Precision
- Recall
- F1-score
- PR-AUC
- Data Leakage
- Overfitting
- Underfitting
14. Week summary
This week showed that much of applied AI for cybersecurity depends on disciplined data work and careful evaluation.
Students should now understand:
- why cyber data needs thoughtful preprocessing;
- how features shape what the model can learn;
- how classical machine learning can support real security tasks;
- why imbalance, leakage, and weak metrics can distort results;
- why operational interpretation matters as much as model performance.
The next week moves to deep learning and generative AI, where modern capabilities increase, but so do complexity, risk, and the need for critical judgement.