LogSentinel

Hybrid Network Anomaly Detection System

End-to-end network behavior monitoring from flow CSV ingestion to SOC-ready anomaly alerts. This project uses a static dataset and batch pipeline execution for reproducible analysis.

  • Window Size: 10,000 flows
  • Baseline Source: Monday profile
  • Detection Core: Z-score + Anomaly Index

Pipeline Flow

The pipeline runs sequentially; each stage writes artifacts that the next stage consumes.

1. Preprocess

Reads Data/raw/*.csv, normalizes column names, enforces required schema, converts numeric fields, and removes invalid rows.

Data/raw -> Data/processed/cleaned_*.csv
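The preprocessing step could be sketched as follows; this is a minimal illustration, not the project's actual `preprocessing.py` (the function name and the column subset are assumptions):

```python
import pandas as pd

# Illustrative subset of the required schema (full list in "Input Schema" below).
REQUIRED_COLUMNS = ["Source IP", "Destination IP", "Flow Bytes/s"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize headers, enforce schema, cast numerics, drop invalid rows."""
    df = df.rename(columns=lambda c: c.strip())  # normalize column names
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # The real pipeline skips such files; raising keeps the sketch simple.
        raise ValueError(f"missing required columns: {missing}")
    numeric_cols = ["Flow Bytes/s"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    return df.dropna(subset=numeric_cols).reset_index(drop=True)
```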
2. Windowing

Splits each cleaned file into fixed windows of 10,000 flows. For each window it computes volume, diversity, and rate features.

Data/processed -> Data/windowed/windowed_*.csv
3. Baseline Modeling

Uses Monday windowed traffic as reference behavior and computes mean/std per feature for later deviation scoring.

Data/windowed -> Models/baseline_model.json
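Baseline modeling reduces to a per-feature mean/std summary. A minimal sketch (function name, feature subset, and JSON layout are assumptions):

```python
import json
import pandas as pd

FEATURES = ["total_bytes", "unique_destination_ports"]  # illustrative subset

def build_baseline(windowed: pd.DataFrame, path: str) -> dict:
    """Compute per-feature mean/std from Monday windows and persist as JSON."""
    model = {
        f: {"mean": float(windowed[f].mean()), "std": float(windowed[f].std(ddof=0))}
        for f in FEATURES
    }
    with open(path, "w") as fh:
        json.dump(model, fh, indent=2)
    return model
```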
4. Monitoring + Inference

Computes z-score profile per window, anomaly index, severity, likely pattern, confidence, and SOC action hint.

Data/windowed + Models/baseline_model.json -> Models/alerts_output.csv + terminal alerts

How Windowing Works

Windowing transforms row-level flows into behavior snapshots. Each snapshot is one analytic unit.

num_windows = total_rows // 10000

Each window produces features such as:

  • total_bytes, total_packets
  • unique_source_ips, unique_destination_ips
  • unique_destination_ports
  • avg_flow_bytes_per_sec, avg_flow_packets_per_sec
  • avg_packets_per_flow
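The windowing step above could look roughly like this; a sketch only, assuming the schema column names listed later in "Input Schema" (a few features are omitted for brevity):

```python
import pandas as pd

WINDOW_SIZE = 10_000

def window_features(df: pd.DataFrame, size: int = WINDOW_SIZE) -> pd.DataFrame:
    """Slice flows into fixed-size windows; the trailing partial window is dropped."""
    rows = []
    num_windows = len(df) // size  # num_windows = total_rows // size
    for i in range(num_windows):
        w = df.iloc[i * size:(i + 1) * size]
        rows.append({
            "window_id": i,
            "total_bytes": w["Total Length of Fwd Packets"].sum()
                           + w["Total Length of Bwd Packets"].sum(),
            "total_packets": w["Total Fwd Packets"].sum()
                             + w["Total Backward Packets"].sum(),
            "unique_source_ips": w["Source IP"].nunique(),
            "unique_destination_ips": w["Destination IP"].nunique(),
            "unique_destination_ports": w["Destination Port"].nunique(),
            "avg_flow_bytes_per_sec": w["Flow Bytes/s"].mean(),
        })
    return pd.DataFrame(rows)
```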

Z-Score and Anomaly Index

The z-score measures how far a window metric deviates from baseline behavior.

z = (value - mean) / std

Anomaly Index summarizes all metric deviations for that window:

anomaly_index = sum(abs(z_i))
Anomaly Index   Severity
< 3             LOW
3 to < 6        MEDIUM
6 to < 10       HIGH
≥ 10            CRITICAL
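Putting the two formulas and the severity buckets together (a sketch; the baseline dict shape mirrors the mean/std model described above and is an assumption):

```python
def score_window(values: dict, baseline: dict) -> tuple:
    """Z-score each metric against the baseline, sum |z| into the anomaly
    index, and map the index to a severity bucket."""
    z = {}
    for feature, stats in baseline.items():
        std = stats["std"] or 1e-9  # guard against zero variance in the baseline
        z[feature] = (values[feature] - stats["mean"]) / std
    anomaly_index = sum(abs(v) for v in z.values())
    if anomaly_index < 3:
        severity = "LOW"
    elif anomaly_index < 6:
        severity = "MEDIUM"
    elif anomaly_index < 10:
        severity = "HIGH"
    else:
        severity = "CRITICAL"
    return z, anomaly_index, severity
```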

Static Dataset Assumption (Important)

This implementation is currently designed for a static, offline dataset. It processes stored CSV files, not live packets or streaming telemetry. Results are deterministic for the same input files.

  • Batch execution
  • No live stream ingestion
  • Reproducible runs
  • Good for experiments and demos

For production SOC use, the next step would be converting this pipeline to incremental or streaming processing.

Download + Place Data

  1. Download the "MachineLearningCSV" archive from the official CIC-IDS2017 page.
  2. Extract all CSVs into Data/raw/ (eight files: Monday-Friday variants).
  3. Leave Data/processed, Data/windowed, and Models empty; the pipeline fills them.

Pipeline Outputs

  • Data/processed/cleaned_*.csv — schema-validated, numeric-cast flows.
  • Data/windowed/windowed_*.csv — 10k-flow windows with volume/diversity/rate features.
  • Models/baseline_model.json — mean/std per feature from Monday.
  • Models/alerts_output.csv — SOC alerts for non-Monday traffic (includes IsolationForest fields).
  • Models/isolation_forest_model.pkl — IsolationForest model trained on Monday windows.
Example terminal alert:

LogSentinel Behavioral Anomaly Alert
File          : Friday-...PortScan...
Window ID     : 19
Severity      : HIGH
Likely Pattern: Possible PortScan
Confidence    : High
Anomaly Index : 8.74
Top 3 Indicators:
  1) unique_destination_ips   z=-6.66
  2) unique_destination_ports z=+3.21
  3) avg_flow_packets_per_sec z=-2.94

Detection Logic

Alert trigger

  • ≥ 2 metrics with |z| ≥ 3, or
  • Any single metric with |z| ≥ 5.

Anomaly Index = sum(|z|) across metrics; severity buckets: <3 LOW, 3-<6 MEDIUM, 6-<10 HIGH, ≥10 CRITICAL.
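The trigger rule above is a small predicate over the per-metric z-scores; a minimal sketch (function name is an assumption):

```python
def should_alert(z_scores: dict) -> bool:
    """Trigger when >= 2 metrics have |z| >= 3, or any single metric has |z| >= 5."""
    strong = [abs(z) for z in z_scores.values() if abs(z) >= 3]
    return len(strong) >= 2 or any(abs(z) >= 5 for z in z_scores.values())
```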

Pattern heuristic

  • PortScan: very low unique_destination_ips, high unique_destination_ports.
  • DDoS: very low destination IP diversity plus high packet volume/rate deviation.
  • Infiltration: destination diversity drops while packet rate spikes.
  • Web Attack: high unique_source_ips with elevated packet rate.
  • Unknown: review top deviating metrics.
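The heuristic maps a window's deviation profile to a coarse label. One plausible encoding of the rules above (the z-threshold of 3 and the branch order are assumptions; the real implementation may differ):

```python
def likely_pattern(z: dict) -> str:
    """Map the z-score profile to a coarse attack label; thresholds illustrative."""
    low_dst_ips = z.get("unique_destination_ips", 0.0) <= -3
    high_dst_ports = z.get("unique_destination_ports", 0.0) >= 3
    high_volume = z.get("total_packets", 0.0) >= 3
    high_rate = z.get("avg_flow_packets_per_sec", 0.0) >= 3
    high_src_ips = z.get("unique_source_ips", 0.0) >= 3
    if low_dst_ips and high_dst_ports:
        return "Possible PortScan"
    if low_dst_ips and high_volume:
        return "Possible DDoS"
    if low_dst_ips and high_rate:
        return "Possible Infiltration"
    if high_src_ips and high_rate:
        return "Possible Web Attack"
    return "Unknown"
```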

IsolationForest runs alongside Z-score: model anomalies are shown as "IsolationForest: Anomaly" with a score, and the final decision is CONFIRMED, SUSPICIOUS, or NORMAL based on agreement between the two detectors.
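The agreement logic could be sketched as below, assuming scikit-learn is available (the alerts mention an IsolationForest model trained on Monday windows); the verdict function and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def combined_verdict(zscore_alert: bool, if_pred: int) -> str:
    """CONFIRMED when both detectors agree, SUSPICIOUS when only one fires."""
    if_anomaly = if_pred == -1  # sklearn convention: -1 = anomaly, 1 = normal
    if zscore_alert and if_anomaly:
        return "CONFIRMED"
    if zscore_alert or if_anomaly:
        return "SUSPICIOUS"
    return "NORMAL"

# Fit on baseline (Monday) window features, then score a new window.
rng = np.random.default_rng(0)
monday = rng.normal(0, 1, size=(200, 4))  # stand-in for baseline window features
model = IsolationForest(random_state=0).fit(monday)
pred = model.predict([[12.0, 12.0, 12.0, 12.0]])[0]  # far outside the baseline
```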

Why Z-Score + IsolationForest

Z-score gives transparent, per-metric deviations that an analyst can read directly, while IsolationForest can flag multivariate anomalies that no single metric exposes. Requiring agreement between the two (the CONFIRMED / SUSPICIOUS / NORMAL decision above) reduces false positives from either detector alone.

Tuning Cheatsheet

Window size

  • Lower (e.g., 5,000) = finer time granularity, more windows, more compute.
  • Higher (e.g., 20,000) = smoother signals, fewer windows, less compute.
  • Keep consistent with your memory budget; recompute baseline after changes.

Severity thresholds

  • Current: LOW <3, MEDIUM 3-<6, HIGH 6-<10, CRITICAL ≥10.
  • More sensitive: lower all cutoffs (e.g., 2 / 4 / 8).
  • Less noisy: raise cutoffs (e.g., 4 / 8 / 12).
  • Only rerun --step monitor after threshold edits.

Pattern Signals (why you see a label)

  • PortScan. Signals: low unique_destination_ips, high unique_destination_ports. Action: investigate top source IPs for horizontal scans; block repeat probes.
  • DDoS. Signals: low destination diversity plus high packet volume/rate deviation. Action: validate volumetric flood; rate-limit and apply upstream filtering.
  • Infiltration. Signals: destination diversity drops while packet rate spikes. Action: correlate with endpoint telemetry and unusual outbound sessions.
  • Web Attack. Signals: high unique_source_ips with elevated packet rate. Action: review web server logs for bursty patterns and exploit signatures.
  • Unknown. Signals: does not match heuristics; check top deviating metrics. Action: pivot on indicators and correlate with firewall/DNS/endpoint logs.

Input Schema (required columns)

Column                        Type     Notes
Source IP                     string   Exact name required
Destination IP                string   Exact name required
Source Port                   numeric  Coerced to number
Destination Port              numeric  Coerced to number
Protocol                      numeric  Coerced to number
Flow Duration                 numeric  Coerced to number
Total Fwd Packets             numeric  Coerced to number
Total Backward Packets        numeric  Coerced to number
Total Length of Fwd Packets   numeric  Coerced to number
Total Length of Bwd Packets   numeric  Coerced to number
Flow Bytes/s                  numeric  Coerced to number
Flow Packets/s                numeric  Coerced to number
Label                         string   Used to compute attack ratio

Column names are case-sensitive. Files missing any required column are skipped during preprocessing.

Configuration Cheatsheet

  • WINDOW_SIZE (default 10,000 flows)
  • ANOMALY_INDEX_THRESHOLDS (LOW/MEDIUM/HIGH/CRITICAL cutoffs)
  • RAW_DIR, PROCESSED_DIR, WINDOWED_DIR, MODELS_DIR
  • MONDAY_WINDOWED_FILE (baseline source)
Edit these in logsentinel/config.py, then rerun the affected stage (windowing, baseline, or monitor). Keep window size aligned with your data volume and memory budget.
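One possible shape for logsentinel/config.py, built from the names in the cheatsheet above (the exact values, dict layout, and the Monday filename are assumptions):

```python
from pathlib import Path

WINDOW_SIZE = 10_000  # flows per window

# Lower bound of each severity bucket; anything below MEDIUM is LOW.
ANOMALY_INDEX_THRESHOLDS = {
    "MEDIUM": 3,
    "HIGH": 6,
    "CRITICAL": 10,
}

RAW_DIR = Path("Data/raw")
PROCESSED_DIR = Path("Data/processed")
WINDOWED_DIR = Path("Data/windowed")
MODELS_DIR = Path("Models")

# Baseline source; illustrative filename.
MONDAY_WINDOWED_FILE = WINDOWED_DIR / "windowed_Monday.csv"
```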

Project Structure

.
|-- main.py
|-- logsentinel/
|   |-- config.py
|   |-- preprocessing.py
|   |-- windowing.py
|   |-- baseline.py
|   |-- monitor.py
|   `-- pipeline.py
|-- Data/
|   |-- raw/        # static input CSV files
|   |-- processed/  # generated cleaned outputs
|   `-- windowed/   # generated window metrics
`-- Models/         # generated baseline + alert outputs

Run Commands

When and Why to Run Each Command

  • pip install -r requirements.txt
    When: first project setup or on a new machine.
    Why: installs required Python libraries (`pandas`, `numpy`).
  • python main.py
    When: normal end-to-end execution.
    Why: runs all stages in order: preprocess, window, baseline, monitor.
  • python main.py --step preprocess
    When: after changing/adding raw CSV files.
    Why: regenerates cleaned datasets in Data/processed.
  • python main.py --step window
    When: after preprocessing or window-size/config updates.
    Why: builds window-level behavioral features in Data/windowed.
  • python main.py --step baseline
    When: baseline reference behavior should be refreshed.
    Why: recomputes the mean/std model from Monday windowed data.
  • python main.py --step monitor
    When: after the baseline is available and windowed files exist.
    Why: scores deviations, generates alerts, and writes SOC output.
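The `--step` dispatch in main.py could plausibly look like this (the function names are assumptions; only the CLI flag and stage names come from the table above):

```python
import argparse

STAGES = ["preprocess", "window", "baseline", "monitor"]

def run(step=None):
    """Return the stages to execute: one named step, or all four in order."""
    if step is None:
        return STAGES
    if step not in STAGES:
        raise SystemExit(f"unknown step: {step}")
    return [step]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="LogSentinel pipeline")
    parser.add_argument("--step", choices=STAGES, default=None)
    return parser.parse_args(argv)
```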

Limitations & Future Ideas

Current Limits

  • Baseline Drift: Static Monday baseline can age as behavior changes.
  • Slow Attack Evasion: Low-and-slow attacks may stay below statistical thresholds.
  • Baseline Poisoning Risk: Gradual manipulation could shift the baseline and dull sensitivity.
  • Limited Model Complexity: IsolationForest helps, but complex patterns may still be missed.
  • Batch Processing Only: Works on CSV files, not real-time streams.
  • Enterprise Gap: SIEMs rely on rules/signatures; behavioral layers are often limited/abstracted.

Future Ideas

  • Adaptive Baseline: Continuously update to follow behavior drift.
  • Real-Time Processing: Add streaming/ingest path beyond batch CSV.
  • Advanced ML Models: Explore Autoencoders or LSTMs for richer patterns.
  • Drift Detection: Detect stale baselines and trigger safe retraining.
  • Explainability: Add feature importance to show why alerts fire.
  • SIEM Integration: Integrate with Splunk/OpenSearch for production workflows.
  • Improved Classification: Replace heuristic attack labels with ML-based classifiers.

Support and Links

For questions, open an issue with the stage, command, and a short excerpt of the error.