# Hybrid Network Anomaly Detection System
End-to-end network behavior monitoring from flow CSV ingestion to SOC-ready anomaly alerts. This project uses a static dataset and batch pipeline execution for reproducible analysis.
The pipeline is sequential; each stage writes artifacts for the next stage.

1. **Preprocess** — reads Data/raw/*.csv, normalizes column names, enforces the required schema, converts numeric fields, and removes invalid rows.
   Data/raw -> Data/processed/cleaned_*.csv
2. **Window** — splits each cleaned file into fixed windows of 10,000 flows and computes volume, diversity, and rate features per window.
   Data/processed -> Data/windowed/windowed_*.csv
3. **Baseline** — uses Monday windowed traffic as reference behavior and computes mean/std per feature for later deviation scoring.
   Data/windowed -> Models/baseline_model.json
4. **Monitor** — computes a z-score profile per window, an anomaly index, severity, likely pattern, confidence, and a SOC action hint.
   Models/alerts_output.csv + terminal alerts
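The sequential stage ordering above can be sketched as a minimal driver. The function and variable names here are illustrative placeholders, not the actual `logsentinel` module API:

```python
# Minimal sequential-pipeline sketch; the stage callables are hypothetical
# stand-ins for the real preprocessing/windowing/baseline/monitor modules.
def run_pipeline(stages):
    """Run each stage in order; every stage writes artifacts the next one reads."""
    for name, fn in stages:
        print(f"[stage] {name}")
        fn()

stages = [
    ("preprocess", lambda: None),  # Data/raw -> Data/processed
    ("window",     lambda: None),  # Data/processed -> Data/windowed
    ("baseline",   lambda: None),  # Data/windowed -> Models/baseline_model.json
    ("monitor",    lambda: None),  # Models/alerts_output.csv + terminal alerts
]
run_pipeline(stages)
```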
Windowing transforms row-level flows into behavior snapshots. Each snapshot is one analytic unit.
Each window produces features such as:

- total_bytes
- total_packets
- unique_source_ips
- unique_destination_ips
- unique_destination_ports
- avg_flow_bytes_per_sec
- avg_flow_packets_per_sec
- avg_packets_per_flow

Z-score compares how far a window metric moves from baseline behavior.
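Computing these features per fixed-size window can be sketched with pandas. Input column names follow the required schema table below; the exact feature formulas are assumptions about the implementation, not a copy of `windowing.py`:

```python
import pandas as pd

def window_features(flows: pd.DataFrame, window_size: int = 10_000) -> pd.DataFrame:
    """Slice row-level flows into fixed windows and emit one feature row per window.
    Feature formulas are an illustrative sketch, not the project's exact code."""
    rows = []
    for start in range(0, len(flows), window_size):
        w = flows.iloc[start:start + window_size]
        packets = w["Total Fwd Packets"] + w["Total Backward Packets"]
        rows.append({
            "window_id": start // window_size,
            "total_bytes": (w["Total Length of Fwd Packets"]
                            + w["Total Length of Bwd Packets"]).sum(),
            "total_packets": packets.sum(),
            "unique_source_ips": w["Source IP"].nunique(),
            "unique_destination_ips": w["Destination IP"].nunique(),
            "unique_destination_ports": w["Destination Port"].nunique(),
            "avg_flow_bytes_per_sec": w["Flow Bytes/s"].mean(),
            "avg_flow_packets_per_sec": w["Flow Packets/s"].mean(),
            "avg_packets_per_flow": packets.mean(),
        })
    return pd.DataFrame(rows)
```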
Anomaly Index summarizes all metric deviations for that window:
| Anomaly Index | Severity |
|---|---|
| < 3 | LOW |
| 3 to < 6 | MEDIUM |
| 6 to < 10 | HIGH |
| ≥ 10 | CRITICAL |
This implementation is currently designed for a static, offline dataset. It processes stored CSV files, not live packets or streaming telemetry. Results are deterministic for the same input files.
For production SOC use, the next step would be converting this pipeline to incremental or streaming processing.
Repository data and generated artifacts:

- Data/raw/ (eight files: Monday-Friday variants).
- Data/processed, Data/windowed, and Models start empty; the pipeline fills them.
- Data/processed/cleaned_*.csv — schema-validated, numeric-cast flows.
- Data/windowed/windowed_*.csv — 10k-flow windows with volume/diversity/rate features.
- Models/baseline_model.json — mean/std per feature from Monday.
- Models/alerts_output.csv — SOC alerts for non-Monday traffic (includes IsolationForest fields).
- Models/isolation_forest_model.pkl — IsolationForest model trained on Monday windows.

Example terminal alert:

```
LogSentinel Behavioral Anomaly Alert
File          : Friday-...PortScan...
Window ID     : 19
Severity      : HIGH
Likely Pattern: Possible PortScan
Confidence    : High
Anomaly Index : 8.74
Top 3 Indicators:
  1) unique_destination_ips   z=-6.66
  2) unique_destination_ports z=+3.21
  3) avg_flow_packets_per_sec z=-2.94
```
Anomaly Index = sum(|z|) across metrics; severity buckets: <3 LOW, 3-<6 MEDIUM, 6-<10 HIGH, ≥10 CRITICAL.
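The scoring formula above can be sketched directly. The baseline layout used here ({feature: {"mean", "std"}}) is an assumption about how baseline_model.json is structured:

```python
def score_window(window: dict, baseline: dict) -> tuple[float, str]:
    """Anomaly Index = sum of |z| across metrics; severity from fixed buckets.
    The baseline dict layout ({feature: {"mean", "std"}}) is an assumption."""
    index = 0.0
    for feature, stats in baseline.items():
        std = stats["std"] or 1e-9          # guard against zero variance
        index += abs((window[feature] - stats["mean"]) / std)
    if index < 3:
        severity = "LOW"
    elif index < 6:
        severity = "MEDIUM"
    elif index < 10:
        severity = "HIGH"
    else:
        severity = "CRITICAL"
    return index, severity
```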
IsolationForest runs alongside the Z-score model: model anomalies are shown as "IsolationForest: Anomaly" with a score, and the final decision is CONFIRMED / SUSPICIOUS / NORMAL based on agreement between the two detectors.
Typical rerun sequences:

- `--step preprocess`, then `--step window`.
- `--step window`, then `--step baseline`, then `--step monitor`.
- `--step monitor` only.
- `--step baseline`, then `--step monitor`.
- `python main.py` (full pipeline).
- `--step monitor` after threshold edits.

| Pattern | Primary signals | Action hint |
|---|---|---|
| PortScan | Low unique_destination_ips, high unique_destination_ports | Investigate top source IPs for horizontal scans; block repeat probes. |
| DDoS | Low destination diversity plus high packet volume/rate deviation | Validate volumetric flood; rate-limit and apply upstream filtering. |
| Infiltration | Destination diversity drops while packet rate spikes | Correlate with endpoint telemetry and unusual outbound sessions. |
| Web Attack | High unique_source_ips with elevated packet rate | Review web server logs for bursty patterns and exploit signatures. |
| Unknown | Does not match heuristics; check top deviating metrics | Pivot on indicators and correlate with firewall/DNS/endpoint logs. |
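The heuristics in the table can be sketched as a small classifier over a window's z-score profile. The threshold of ±2 and the rule ordering are illustrative assumptions, not the project's tuned values:

```python
def likely_pattern(z: dict) -> str:
    """Heuristic pattern guess from a window's z-score profile.
    The +/-2 cutoffs and rule order are illustrative assumptions."""
    if z.get("unique_destination_ips", 0) < -2 and z.get("unique_destination_ports", 0) > 2:
        return "Possible PortScan"       # few targets, many ports probed
    if z.get("unique_destination_ips", 0) < -2 and z.get("total_packets", 0) > 2:
        return "Possible DDoS"           # few targets, volumetric spike
    if z.get("unique_source_ips", 0) > 2 and z.get("avg_flow_packets_per_sec", 0) > 2:
        return "Possible Web Attack"     # many sources, elevated packet rate
    return "Unknown"                     # pivot on top deviating metrics
```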
| Column | Type | Notes |
|---|---|---|
| Source IP | string | Exact name required |
| Destination IP | string | Exact name required |
| Source Port | numeric | Coerced to number |
| Destination Port | numeric | Coerced to number |
| Protocol | numeric | Coerced to number |
| Flow Duration | numeric | Coerced to number |
| Total Fwd Packets | numeric | Coerced to number |
| Total Backward Packets | numeric | Coerced to number |
| Total Length of Fwd Packets | numeric | Coerced to number |
| Total Length of Bwd Packets | numeric | Coerced to number |
| Flow Bytes/s | numeric | Coerced to number |
| Flow Packets/s | numeric | Coerced to number |
| Label | string | Used to compute attack ratio |
Column names are case-sensitive. Files missing any required column are skipped during preprocessing.
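The schema check and numeric coercion described above can be sketched with pandas. This is an illustrative sketch of the behaviour, not the project's `preprocessing.py`:

```python
from typing import Optional
import pandas as pd

# Required columns and how each is treated (mirrors the schema table).
REQUIRED = {
    "Source IP": "string", "Destination IP": "string", "Label": "string",
    "Source Port": "numeric", "Destination Port": "numeric", "Protocol": "numeric",
    "Flow Duration": "numeric", "Total Fwd Packets": "numeric",
    "Total Backward Packets": "numeric", "Total Length of Fwd Packets": "numeric",
    "Total Length of Bwd Packets": "numeric", "Flow Bytes/s": "numeric",
    "Flow Packets/s": "numeric",
}

def enforce_schema(df: pd.DataFrame) -> Optional[pd.DataFrame]:
    """Return a cleaned copy, or None if any required column is missing
    (mirrors the 'skip the file' behaviour described above)."""
    if not set(REQUIRED).issubset(df.columns):   # case-sensitive check
        return None
    df = df.copy()
    numeric = [c for c, kind in REQUIRED.items() if kind == "numeric"]
    df[numeric] = df[numeric].apply(pd.to_numeric, errors="coerce")
    return df.dropna(subset=numeric)             # drop rows with invalid values
```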
Configurable values in logsentinel/config.py:

- WINDOW_SIZE (default 10,000 flows)
- ANOMALY_INDEX_THRESHOLDS (LOW/MEDIUM/HIGH/CRITICAL cutoffs)
- RAW_DIR, PROCESSED_DIR, WINDOWED_DIR, MODELS_DIR
- MONDAY_WINDOWED_FILE (baseline source)

Edit logsentinel/config.py, then rerun the affected stage (windowing, baseline, or monitor). Keep window size aligned with your data volume and memory budget.
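A config module along these lines would carry the values above. The defaults mirror the documented ones; the threshold dict layout is an assumption:

```python
# Sketch of logsentinel/config.py; WINDOW_SIZE and the directory names are
# documented, the ANOMALY_INDEX_THRESHOLDS layout is an assumption.
from pathlib import Path

WINDOW_SIZE = 10_000                      # flows per window

ANOMALY_INDEX_THRESHOLDS = {              # lower bound of each severity bucket
    "LOW": 0, "MEDIUM": 3, "HIGH": 6, "CRITICAL": 10,
}

RAW_DIR = Path("Data/raw")
PROCESSED_DIR = Path("Data/processed")
WINDOWED_DIR = Path("Data/windowed")
MODELS_DIR = Path("Models")
MONDAY_WINDOWED_FILE = WINDOWED_DIR / "windowed_cleaned_Monday-WorkingHours.pcap_ISCX.csv"
```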
```
.
|-- main.py
|-- logsentinel/
|   |-- config.py
|   |-- preprocessing.py
|   |-- windowing.py
|   |-- baseline.py
|   |-- monitor.py
|   `-- pipeline.py
|-- Data/
|   |-- raw/        # static input CSV files
|   |-- processed/  # generated cleaned outputs
|   `-- windowed/   # generated window metrics
`-- Models/         # generated baseline + alert outputs
```
| Command | When to run | Why |
|---|---|---|
| `pip install -r requirements.txt` | First project setup or on a new machine. | Installs required Python libraries (`pandas`, `numpy`). |
| `python main.py` | Normal end-to-end execution. | Runs all stages in order: preprocess, window, baseline, monitor. |
| `python main.py --step preprocess` | After changing/adding raw CSV files. | Regenerates cleaned datasets in Data/processed. |
| `python main.py --step window` | After preprocessing or window-size/config updates. | Builds window-level behavioral features in Data/windowed. |
| `python main.py --step baseline` | When baseline reference behavior should be refreshed. | Recomputes mean/std model from Monday windowed data. |
| `python main.py --step monitor` | After baseline is available and windowed files exist. | Scores deviations, generates alerts, and writes SOC output. |
If the baseline step fails, check that Data/windowed/windowed_cleaned_Monday-WorkingHours.pcap_ISCX.csv exists, then rerun `--step baseline`. For questions, open an issue with the stage, command, and a short excerpt of the error.