LogSentinel

Hybrid Network Anomaly Detection System

End-to-end network behavior monitoring from flow CSV ingestion to SOC-ready anomaly alerts. This project uses a static dataset and batch pipeline execution for reproducible analysis.

  • Window Size: 10,000 flows
  • Baseline Source: Monday profile
  • Detection Core: Z-score + Anomaly Index

Pipeline Flow

The pipeline runs sequentially; each stage writes artifacts that the next stage consumes.

1. Preprocess

Reads Data/raw/*.csv, normalizes column names, enforces required schema, converts numeric fields, and removes invalid rows.

Data/raw -> Data/processed/cleaned_*.csv
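The preprocessing step could be sketched as follows; this is a minimal illustration, not the project's actual `preprocessing.py` (the function name and the column subset are assumptions):

```python
import pandas as pd

# Illustrative subset of the required schema (full list in "Input Schema" below).
REQUIRED_COLUMNS = ["Source IP", "Destination IP", "Flow Bytes/s"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize headers, enforce schema, cast numerics, drop invalid rows."""
    df = df.rename(columns=lambda c: c.strip())  # normalize column names
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        # The real pipeline skips such files; raising keeps the sketch simple.
        raise ValueError(f"missing required columns: {missing}")
    numeric_cols = ["Flow Bytes/s"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    return df.dropna(subset=numeric_cols).reset_index(drop=True)
```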
2. Windowing

Splits each cleaned file into fixed windows of 10,000 flows. For each window it computes volume, diversity, and rate features.

Data/processed -> Data/windowed/windowed_*.csv
3. Baseline Modeling

Uses Monday windowed traffic as reference behavior and computes mean/std per feature for later deviation scoring.

Data/windowed -> Models/baseline_model.json
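Baseline modeling reduces to a per-feature mean/std summary. A minimal sketch (function name, feature subset, and JSON layout are assumptions):

```python
import json
import pandas as pd

FEATURES = ["total_bytes", "unique_destination_ports"]  # illustrative subset

def build_baseline(windowed: pd.DataFrame, path: str) -> dict:
    """Compute per-feature mean/std from Monday windows and persist as JSON."""
    model = {
        f: {"mean": float(windowed[f].mean()), "std": float(windowed[f].std(ddof=0))}
        for f in FEATURES
    }
    with open(path, "w") as fh:
        json.dump(model, fh, indent=2)
    return model
```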
4. Monitoring + Inference

Computes z-score profile per window, anomaly index, severity, likely pattern, confidence, and SOC action hint.

Data/windowed + Models/baseline_model.json -> Models/alerts_output.csv + terminal alerts

How Windowing Works

Windowing transforms row-level flows into behavior snapshots. Each snapshot is one analytic unit.

num_windows = total_rows // 10000

Each window produces features such as:

  • total_bytes, total_packets
  • unique_source_ips, unique_destination_ips
  • unique_destination_ports
  • avg_flow_bytes_per_sec, avg_flow_packets_per_sec
  • avg_packets_per_flow
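The windowing step above could look roughly like this; a sketch only, assuming the schema column names listed later in "Input Schema" (a few features are omitted for brevity):

```python
import pandas as pd

WINDOW_SIZE = 10_000

def window_features(df: pd.DataFrame, size: int = WINDOW_SIZE) -> pd.DataFrame:
    """Slice flows into fixed-size windows; the trailing partial window is dropped."""
    rows = []
    num_windows = len(df) // size  # num_windows = total_rows // size
    for i in range(num_windows):
        w = df.iloc[i * size:(i + 1) * size]
        rows.append({
            "window_id": i,
            "total_bytes": w["Total Length of Fwd Packets"].sum()
                           + w["Total Length of Bwd Packets"].sum(),
            "total_packets": w["Total Fwd Packets"].sum()
                             + w["Total Backward Packets"].sum(),
            "unique_source_ips": w["Source IP"].nunique(),
            "unique_destination_ips": w["Destination IP"].nunique(),
            "unique_destination_ports": w["Destination Port"].nunique(),
            "avg_flow_bytes_per_sec": w["Flow Bytes/s"].mean(),
        })
    return pd.DataFrame(rows)
```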

Z-Score and Anomaly Index

The z-score measures how far a window metric deviates from baseline behavior.

z = (value - mean) / std

Anomaly Index summarizes all metric deviations for that window:

anomaly_index = sum(abs(z_i))
Anomaly Index   Severity
< 3             LOW
3 to < 6        MEDIUM
6 to < 10       HIGH
≥ 10            CRITICAL
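Putting the two formulas and the severity buckets together (a sketch; the baseline dict shape mirrors the mean/std model described above and is an assumption):

```python
def score_window(values: dict, baseline: dict) -> tuple:
    """Z-score each metric against the baseline, sum |z| into the anomaly
    index, and map the index to a severity bucket."""
    z = {}
    for feature, stats in baseline.items():
        std = stats["std"] or 1e-9  # guard against zero variance in the baseline
        z[feature] = (values[feature] - stats["mean"]) / std
    anomaly_index = sum(abs(v) for v in z.values())
    if anomaly_index < 3:
        severity = "LOW"
    elif anomaly_index < 6:
        severity = "MEDIUM"
    elif anomaly_index < 10:
        severity = "HIGH"
    else:
        severity = "CRITICAL"
    return z, anomaly_index, severity
```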

Static Dataset Assumption (Important)

This implementation is currently designed for a static, offline dataset. It processes stored CSV files, not live packets or streaming telemetry. Results are deterministic for the same input files.

  • Batch execution
  • No live stream ingestion
  • Reproducible runs
  • Good for experiments and demos

For production SOC use, the next step would be converting this pipeline to incremental or streaming processing.

Download + Place Data

  1. Download the "MachineLearningCSV" archive from the official CIC-IDS2017 page.
  2. Extract all CSVs into Data/raw/ (eight files: Monday-Friday variants).
  3. Leave Data/processed, Data/windowed, and Models empty; the pipeline fills them.

Pipeline Outputs

  • Data/processed/cleaned_*.csv — schema-validated, numeric-cast flows.
  • Data/windowed/windowed_*.csv — 10k-flow windows with volume/diversity/rate features.
  • Models/baseline_model.json — mean/std per feature from Monday.
  • Models/alerts_output.csv — SOC alerts for non-Monday traffic (includes IsolationForest fields).
  • Models/isolation_forest_model.pkl — IsolationForest model trained on Monday windows.
Example terminal alert:

LogSentinel Behavioral Anomaly Alert
File          : Friday-...PortScan...
Window ID     : 19
Severity      : HIGH
Likely Pattern: Possible PortScan
Confidence    : High
Anomaly Index : 8.74
Top 3 Indicators:
  1) unique_destination_ips   z=-6.66
  2) unique_destination_ports z=+3.21
  3) avg_flow_packets_per_sec z=-2.94

Detection Logic

Alert trigger

  • ≥ 2 metrics with |z| ≥ 3, or
  • Any single metric with |z| ≥ 5.

Anomaly Index = sum(|z|) across metrics; severity buckets: <3 LOW, 3-<6 MEDIUM, 6-<10 HIGH, ≥10 CRITICAL.
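The trigger rule above is a small predicate over the per-metric z-scores; a minimal sketch (function name is an assumption):

```python
def should_alert(z_scores: dict) -> bool:
    """Trigger when >= 2 metrics have |z| >= 3, or any single metric has |z| >= 5."""
    strong = [abs(z) for z in z_scores.values() if abs(z) >= 3]
    return len(strong) >= 2 or any(abs(z) >= 5 for z in z_scores.values())
```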

Pattern heuristic

  • PortScan: very low unique_destination_ips, high unique_destination_ports.
  • DDoS: very low destination IP diversity plus high packet volume/rate deviation.
  • Infiltration: destination diversity drops while packet rate spikes.
  • Web Attack: high unique_source_ips with elevated packet rate.
  • Unknown: review top deviating metrics.
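The heuristic maps a window's deviation profile to a coarse label. One plausible encoding of the rules above (the z-threshold of 3 and the branch order are assumptions; the real implementation may differ):

```python
def likely_pattern(z: dict) -> str:
    """Map the z-score profile to a coarse attack label; thresholds illustrative."""
    low_dst_ips = z.get("unique_destination_ips", 0.0) <= -3
    high_dst_ports = z.get("unique_destination_ports", 0.0) >= 3
    high_volume = z.get("total_packets", 0.0) >= 3
    high_rate = z.get("avg_flow_packets_per_sec", 0.0) >= 3
    high_src_ips = z.get("unique_source_ips", 0.0) >= 3
    if low_dst_ips and high_dst_ports:
        return "Possible PortScan"
    if low_dst_ips and high_volume:
        return "Possible DDoS"
    if low_dst_ips and high_rate:
        return "Possible Infiltration"
    if high_src_ips and high_rate:
        return "Possible Web Attack"
    return "Unknown"
```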

IsolationForest runs alongside Z-score: model anomalies are shown as "IsolationForest: Anomaly" with a score, and the final decision is CONFIRMED, SUSPICIOUS, or NORMAL based on agreement between the two detectors.
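The agreement logic could be sketched as below, assuming scikit-learn is available (the alerts mention an IsolationForest model trained on Monday windows); the verdict function and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def combined_verdict(zscore_alert: bool, if_pred: int) -> str:
    """CONFIRMED when both detectors agree, SUSPICIOUS when only one fires."""
    if_anomaly = if_pred == -1  # sklearn convention: -1 = anomaly, 1 = normal
    if zscore_alert and if_anomaly:
        return "CONFIRMED"
    if zscore_alert or if_anomaly:
        return "SUSPICIOUS"
    return "NORMAL"

# Fit on baseline (Monday) window features, then score a new window.
rng = np.random.default_rng(0)
monday = rng.normal(0, 1, size=(200, 4))  # stand-in for baseline window features
model = IsolationForest(random_state=0).fit(monday)
pred = model.predict([[12.0, 12.0, 12.0, 12.0]])[0]  # far outside the baseline
```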

Why Z-Score + IsolationForest

Z-score gives transparent, per-metric deviations that an analyst can read directly, while IsolationForest can flag multivariate anomalies that no single metric exposes. Requiring agreement between the two (the CONFIRMED / SUSPICIOUS / NORMAL decision above) reduces false positives from either detector alone.

Tuning Cheatsheet

Window size

  • Lower (e.g., 5,000) = finer time granularity, more windows, more compute.
  • Higher (e.g., 20,000) = smoother signals, fewer windows, less compute.
  • Keep consistent with your memory budget; recompute baseline after changes.

Severity thresholds

  • Current: LOW <3, MEDIUM 3-<6, HIGH 6-<10, CRITICAL ≥10.
  • More sensitive: lower all cutoffs (e.g., 2 / 4 / 8).
  • Less noisy: raise cutoffs (e.g., 4 / 8 / 12).
  • Only rerun --step monitor after threshold edits.

Pattern Signals (why you see a label)

  • PortScan. Signals: low unique_destination_ips, high unique_destination_ports. Action: investigate top source IPs for horizontal scans; block repeat probes.
  • DDoS. Signals: low destination diversity plus high packet volume/rate deviation. Action: validate volumetric flood; rate-limit and apply upstream filtering.
  • Infiltration. Signals: destination diversity drops while packet rate spikes. Action: correlate with endpoint telemetry and unusual outbound sessions.
  • Web Attack. Signals: high unique_source_ips with elevated packet rate. Action: review web server logs for bursty patterns and exploit signatures.
  • Unknown. Signals: does not match heuristics; check top deviating metrics. Action: pivot on indicators and correlate with firewall/DNS/endpoint logs.

Input Schema (required columns)

Column                        Type     Notes
Source IP                     string   Exact name required
Destination IP                string   Exact name required
Source Port                   numeric  Coerced to number
Destination Port              numeric  Coerced to number
Protocol                      numeric  Coerced to number
Flow Duration                 numeric  Coerced to number
Total Fwd Packets             numeric  Coerced to number
Total Backward Packets        numeric  Coerced to number
Total Length of Fwd Packets   numeric  Coerced to number
Total Length of Bwd Packets   numeric  Coerced to number
Flow Bytes/s                  numeric  Coerced to number
Flow Packets/s                numeric  Coerced to number
Label                         string   Used to compute attack ratio

Column names are case-sensitive. Files missing any required column are skipped during preprocessing.

Configuration Cheatsheet

  • WINDOW_SIZE (default 10,000 flows)
  • ANOMALY_INDEX_THRESHOLDS (LOW/MEDIUM/HIGH/CRITICAL cutoffs)
  • RAW_DIR, PROCESSED_DIR, WINDOWED_DIR, MODELS_DIR
  • MONDAY_WINDOWED_FILE (baseline source)
Edit these in logsentinel/config.py, then rerun the affected stage (windowing, baseline, or monitor). Keep window size aligned with your data volume and memory budget.
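One possible shape for logsentinel/config.py, built from the names in the cheatsheet above (the exact values, dict layout, and the Monday filename are assumptions):

```python
from pathlib import Path

WINDOW_SIZE = 10_000  # flows per window

# Lower bound of each severity bucket; anything below MEDIUM is LOW.
ANOMALY_INDEX_THRESHOLDS = {
    "MEDIUM": 3,
    "HIGH": 6,
    "CRITICAL": 10,
}

RAW_DIR = Path("Data/raw")
PROCESSED_DIR = Path("Data/processed")
WINDOWED_DIR = Path("Data/windowed")
MODELS_DIR = Path("Models")

# Baseline source; illustrative filename.
MONDAY_WINDOWED_FILE = WINDOWED_DIR / "windowed_Monday.csv"
```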

Project Structure

.
|-- main.py
|-- logsentinel/
|   |-- config.py
|   |-- preprocessing.py
|   |-- windowing.py
|   |-- baseline.py
|   |-- monitor.py
|   `-- pipeline.py
|-- Data/
|   |-- raw/        # static input CSV files
|   |-- processed/  # generated cleaned outputs
|   `-- windowed/   # generated window metrics
`-- Models/         # generated baseline + alert outputs

Run Commands

When and Why to Run Each Command

  • pip install -r requirements.txt
    When: first project setup or on a new machine.
    Why: installs required Python libraries (`pandas`, `numpy`).
  • python main.py
    When: normal end-to-end execution.
    Why: runs all stages in order: preprocess, window, baseline, monitor.
  • python main.py --step preprocess
    When: after changing/adding raw CSV files.
    Why: regenerates cleaned datasets in Data/processed.
  • python main.py --step window
    When: after preprocessing or window-size/config updates.
    Why: builds window-level behavioral features in Data/windowed.
  • python main.py --step baseline
    When: baseline reference behavior should be refreshed.
    Why: recomputes the mean/std model from Monday windowed data.
  • python main.py --step monitor
    When: after the baseline is available and windowed files exist.
    Why: scores deviations, generates alerts, and writes SOC output.
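The `--step` dispatch in main.py could plausibly look like this (the function names are assumptions; only the CLI flag and stage names come from the table above):

```python
import argparse

STAGES = ["preprocess", "window", "baseline", "monitor"]

def run(step=None):
    """Return the stages to execute: one named step, or all four in order."""
    if step is None:
        return STAGES
    if step not in STAGES:
        raise SystemExit(f"unknown step: {step}")
    return [step]

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="LogSentinel pipeline")
    parser.add_argument("--step", choices=STAGES, default=None)
    return parser.parse_args(argv)
```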

Limitations & Future Ideas

Current Limits

  • Baseline Drift: Static Monday baseline can age as behavior changes.
  • Slow Attack Evasion: Low-and-slow attacks may stay below statistical thresholds.
  • Baseline Poisoning Risk: Gradual manipulation could shift the baseline and dull sensitivity.
  • Limited Model Complexity: IsolationForest helps, but complex patterns may still be missed.
  • Batch Processing Only: Works on CSV files, not real-time streams.
  • Enterprise Gap: SIEMs rely on rules/signatures; behavioral layers are often limited/abstracted.

Future Ideas

  • Adaptive Baseline: Continuously update to follow behavior drift.
  • Real-Time Processing: Add streaming/ingest path beyond batch CSV.
  • Advanced ML Models: Explore Autoencoders or LSTMs for richer patterns.
  • Drift Detection: Detect stale baselines and trigger safe retraining.
  • Explainability: Add feature importance to show why alerts fire.
  • SIEM Integration: Integrate with Splunk/OpenSearch for production workflows.
  • Improved Classification: Replace heuristic attack labels with ML-based classifiers.

Support and Links

For questions, open an issue with the stage, command, and a short excerpt of the error.