How to Build a Real-Time Driver Fatigue and Distraction Detection System with Python, Flask, and MediaPipe

[AI Readability Summary] This localized driver risk detection system, built with Python, Flask, and MediaPipe, identifies dangerous behaviors such as eye closure, yawning, and head deviation in real time. Through a web interface, it triggers tiered alerts while avoiding the inefficiency of manual inspection, high false-positive rates, and privacy risks associated with cloud-based processing. Keywords: fatigue detection, distracted driving, MediaPipe.

The technical specification snapshot outlines the stack clearly

Primary Language: Python 3.8+
Web Framework: Flask 2.3.3
Vision Protocol / Input: USB camera video stream, local HTTP access
Core Vision Libraries: OpenCV 4.8.0.76, MediaPipe 0.10.14
Numerical Computing: NumPy 1.24.3, SciPy 1.11.2
Data Storage: SQLite
Frontend Stack: HTML + JS + Chart.js
Runtime Mode: Local deployment, browser access on port 5000

This system focuses on low-cost real-time perception of dangerous driver behaviors

The project targets intelligent transportation and driving safety scenarios. It uses a monocular camera and lightweight computer vision algorithms to perform local detection. Compared with cloud inference solutions, it is better suited to privacy-sensitive environments, unstable network conditions, or classroom demonstrations.

The system covers two common categories of risk at the same time: driver fatigue and distracted driving. Fatigue detection relies on the geometric relationships of eye and mouth landmarks, while distraction detection relies on head pose estimation. This design keeps computational cost low and makes real-time performance achievable on a standard PC.

The system screenshots show real-time overlays and behavior recognition results intuitively

Screenshot 1: The frontend monitoring interface captures the driver's facial region in real time. The system overlays status text, risk labels, and threshold evaluation results on the video frames, making it possible to verify that facial landmark extraction and alert logic run in sync.

Screenshot 2: The runtime human-computer interaction view includes the camera feed, a status panel, and control buttons, showing that the system offers more than inference alone: a complete operational workflow for starting monitoring, reviewing history, and adjusting parameters.

Screenshot 3: Algorithm output integrated with the application interface, including risk-level prompts, pose direction judgments, and fatigue event records. This indicates that the project has evolved from a simple model demo into a usable application prototype.

Fatigue detection depends on two core geometric metrics: EAR and MAR

EAR (Eye Aspect Ratio) is used to detect eye closure. At its core, it measures the ratio between the vertical and horizontal distances of the eye landmarks. When the eye closes, the vertical distance shrinks sharply, so EAR drops quickly. The system computes the average value across both eyes and combines it with a consecutive-frame counter to filter out transient false positives such as blinks.

MAR (Mouth Aspect Ratio) is used to detect yawning. Its logic mirrors EAR, but the observed target is mouth opening. Because speaking also opens the mouth, the system uses a longer consecutive-frame threshold and a lower alert level to keep false positives in check.

import numpy as np
from scipy.spatial import distance

def aspect_ratio(points):
    """Shared aspect-ratio computation for EAR (eye) and MAR (mouth).

    Expects six (x, y) landmarks: points[0] and points[3] are the
    horizontal extremes, points[1]/[5] and points[2]/[4] the vertical pairs.
    """
    # Sum of the two vertical landmark distances
    vertical = distance.euclidean(points[1], points[5]) + distance.euclidean(points[2], points[4])
    # Twice the horizontal distance between the corner landmarks
    horizontal = 2.0 * distance.euclidean(points[0], points[3])
    # Guard against degenerate input where the corner landmarks coincide
    return vertical / horizontal if horizontal else 0.0

This code abstracts a shared calculation path for EAR and MAR, making it easy to reuse for both eye and mouth detection.
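The consecutive-frame counter mentioned earlier can be sketched as follows. This is an illustrative implementation rather than the project's actual code; the threshold of 0.25 and the 15-frame window are typical starting values, not values taken from the repository.

```python
EAR_THRESHOLD = 0.25      # assumed typical value; tune per camera setup
EAR_CONSEC_FRAMES = 15    # assumed number of consecutive low-EAR frames

class EyeClosureMonitor:
    """Counts consecutive frames whose eye aspect ratio is below threshold."""

    def __init__(self, threshold=EAR_THRESHOLD, consec_frames=EAR_CONSEC_FRAMES):
        self.threshold = threshold
        self.consec_frames = consec_frames
        self.counter = 0

    def update(self, ear):
        """Return True once eye closure has persisted long enough to alert."""
        if ear < self.threshold:
            self.counter += 1
        else:
            # A single open-eye frame resets the counter, filtering blinks
            self.counter = 0
        return self.counter >= self.consec_frames
```

A blink lasts only a few frames and never reaches the counter limit, while sustained closure does, which is why the counter suppresses transient false positives.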

Distraction detection uses PnP-based head pose estimation to solve 3D orientation angles

The system selects stable landmarks such as the nose tip, chin, eye corners, and mouth corners from the 468 landmarks provided by MediaPipe Face Mesh. It builds a 2D-3D correspondence with a standard 3D face model, then passes the mapping to cv2.solvePnP() to solve the rotation vector and translation vector.

The final rotation result is converted into three Euler angles: Yaw, Pitch, and Roll. Yaw reflects left-right head turning, Pitch reflects looking down or up, and Roll reflects head tilt. If any angle exceeds the threshold for a sustained period, the system determines that the driver’s attention has deviated from the road ahead.

import cv2
import numpy as np

# 3D standard face model points (generic head model, arbitrary units)
model_points = np.array([
    (0.0, 0.0, 0.0),          # Nose tip
    (0.0, -330.0, -65.0),     # Chin
    (-225.0, 170.0, -135.0),  # Left eye corner
    (225.0, 170.0, -135.0),   # Right eye corner
    (-150.0, -150.0, -125.0), # Left mouth corner
    (150.0, -150.0, -125.0)   # Right mouth corner
], dtype=np.float64)

# image_points must be mapped from MediaPipe landmarks: the 2D pixel
# coordinates of the same six points, in the same order

# Approximate the camera intrinsics when no calibration data is available
frame_w, frame_h = 1280, 720               # video frame dimensions
focal_length = frame_w                     # common pinhole approximation
camera_matrix = np.array([
    [focal_length, 0.0, frame_w / 2],
    [0.0, focal_length, frame_h / 2],
    [0.0, 0.0, 1.0]
], dtype=np.float64)
dist_coeffs = np.zeros((4, 1))             # assume no lens distortion

success, rotation_vec, translation_vec = cv2.solvePnP(
    model_points, image_points, camera_matrix, dist_coeffs
)

This code performs head pose estimation and serves as the core entry point for distracted driving detection.
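The conversion from the solved rotation to Yaw, Pitch, and Roll is not shown above. A minimal sketch is given below, assuming the rotation vector is first expanded to a 3x3 matrix via cv2.Rodrigues(rotation_vec)[0] and using the common ZYX (yaw-pitch-roll) decomposition; only the matrix-to-angles step is shown so it runs with NumPy alone.

```python
import numpy as np

def rotation_matrix_to_euler(R):
    """Decompose a 3x3 rotation matrix into yaw, pitch, roll (degrees).

    Assumes the ZYX convention; in the full pipeline R would come from
    cv2.Rodrigues(rotation_vec)[0].
    """
    # Magnitude of the first column's XY projection; a value near zero
    # indicates gimbal lock (pitch near +/-90 degrees)
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
    if sy > 1e-6:
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = np.arctan2(R[1, 0], R[0, 0])
        roll = np.arctan2(R[2, 1], R[2, 2])
    else:
        # Degenerate case: yaw and roll are coupled, fix yaw at zero
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = 0.0
        roll = np.arctan2(-R[1, 2], R[1, 1])
    return np.degrees([yaw, pitch, roll])
```

If any of the three angles stays beyond its threshold for a sustained number of frames, the distraction detector concludes that attention has left the road ahead.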

The project uses an engineered architecture that combines layering with strategy-based design

At the algorithm layer, BaseDetector provides a unified processing template, while the fatigue detector and distraction detector implement feature extraction and rule evaluation separately. This reflects both the Strategy pattern and the Template Method pattern. When adding new behaviors such as smoking or phone use, you only need to extend the detector instead of rewriting the main processing flow.

At the service layer, VideoService manages the video stream, and AlertService handles alert persistence and statistical analysis. The two are decoupled through event notifications or callback mechanisms. The data layer uses SQLite, which is suitable for single-machine deployment and lightweight audit scenarios.
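The callback-based decoupling between VideoService and AlertService described above could be sketched like this; the class and method names are illustrative, not taken from the repository, and persistence is reduced to an in-memory list for brevity.

```python
class AlertService:
    """Receives alert events and persists them (here: an in-memory list)."""

    def __init__(self):
        self.history = []

    def on_alert(self, event):
        # In the real system this would write to SQLite and update statistics
        self.history.append(event)

class VideoService:
    """Drives the frame loop and notifies subscribers of alert events."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, event):
        # VideoService knows nothing about storage; it only fires callbacks
        for callback in self._subscribers:
            callback(event)
```

Wiring is a single call: `video.subscribe(alerts.on_alert)`. Either side can then be replaced or tested in isolation, which is the point of the decoupling.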

class BaseDetector:
    """Template method: subclasses fill in the feature and rule hooks."""

    def process(self, frame):
        # Preprocess the video frame and extract landmark features
        data = self.extract_features(frame)
        # Execute the detector-specific rule evaluation
        result = self.detect(data)
        # Return a standardized detection result
        return self.postprocess(result)

    def extract_features(self, frame):
        raise NotImplementedError

    def detect(self, data):
        raise NotImplementedError

    def postprocess(self, result):
        return result

This pseudocode illustrates the unified detector lifecycle and makes modular extension easier.
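As a hypothetical example of the extension path the text describes, a new behavior only needs to fill in the template hooks. The detector name, threshold value, and input format below are illustrative; a minimal BaseDetector is restated so the sketch is self-contained.

```python
class BaseDetector:
    """Minimal restatement of the shared lifecycle for this sketch."""

    def process(self, frame):
        data = self.extract_features(frame)
        result = self.detect(data)
        return self.postprocess(result)

    def postprocess(self, result):
        return result

class YawnDetector(BaseDetector):
    """Rule-based MAR detector plugged into the shared lifecycle."""

    MAR_THRESHOLD = 0.6  # assumed typical value, not a project default

    def extract_features(self, frame):
        # In the real system this would run MediaPipe and compute MAR;
        # here the frame is assumed to already carry a precomputed value
        return frame["mar"]

    def detect(self, mar):
        return {"behavior": "yawn", "alert": mar > self.MAR_THRESHOLD}
```

A smoking or phone-use detector would follow the same shape, possibly wrapping a YOLO model inside extract_features, without touching the main processing flow.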

The installation and startup path is designed to be beginner-friendly

The environment requirements are not demanding. The minimum recommended setup is an Intel i3 CPU, 4 GB of memory, and a 720p camera. For more stable landmark detection, Python 3.10, 8 GB of memory, and a 1080p camera are recommended.

Dependencies are concentrated around Flask, OpenCV, MediaPipe, NumPy, and SciPy. In regions with slower package access, you can switch to a mirror source to improve installation speed. By default, the system runs on local port 5000 and is accessed visually through a browser.

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

# Enter the source directory and start the service
cd src
python app.py

These commands complete environment isolation, dependency installation, and service startup, making the project easy to reproduce quickly.
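The browser view on port 5000 is typically served as an MJPEG stream over a multipart HTTP response. A minimal sketch follows; the route name and frame source are assumptions, not taken from the project, and the placeholder byte strings stand in for cv2.imencode output from real camera frames.

```python
from flask import Flask, Response

app = Flask(__name__)

def generate_frames(frames):
    """Yield JPEG-encoded frames in MJPEG multipart format."""
    for jpeg_bytes in frames:
        # Each part carries one frame; the browser replaces the previous one
        yield (b"--frame\r\n"
               b"Content-Type: image/jpeg\r\n\r\n" + jpeg_bytes + b"\r\n")

@app.route("/video_feed")
def video_feed():
    # Placeholder frame source; the real app would pull from VideoService
    frames = [b"fake-jpeg-1", b"fake-jpeg-2"]
    return Response(generate_frames(frames),
                    mimetype="multipart/x-mixed-replace; boundary=frame")
```

Starting the server with `app.run(port=5000)` then lets the frontend embed the stream as a plain `<img src="/video_feed">` tag.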

Runtime stability depends heavily on lighting, camera angle, and threshold configuration

Although this type of geometric rule-based algorithm is lightweight, it is sensitive to input quality. The camera should face the driver directly, ideally at a distance of 50 to 100 centimeters, while avoiding strong backlighting, facial occlusion, and interference from sunglasses. A mask will directly reduce the usefulness of MAR.

When false positives occur too often, tighten the thresholds: lower EAR_THRESHOLD (eye-closure alerts fire when EAR falls below it), raise MAR_THRESHOLD, or widen the head pose limits. When missed detections become obvious, tune them in the opposite direction. To address lag, prefer lowering the capture resolution before increasing the detection interval.
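A config.py holding these knobs might look like the sketch below. The names EAR_THRESHOLD and MAR_THRESHOLD appear in the text; every value and the remaining names are assumptions for illustration.

```python
# Detection thresholds (illustrative starting values, not project defaults)
EAR_THRESHOLD = 0.25        # alert when the average EAR falls below this
MAR_THRESHOLD = 0.6         # alert when MAR rises above this
HEAD_YAW_LIMIT = 30.0       # degrees of left-right turn tolerated
HEAD_PITCH_LIMIT = 20.0     # degrees of up-down deviation tolerated

# Consecutive-frame windows used to suppress transient false positives
EAR_CONSEC_FRAMES = 15
MAR_CONSEC_FRAMES = 30      # longer window because speaking opens the mouth

# Capture settings: lower the resolution first when the system lags
FRAME_WIDTH = 640
FRAME_HEIGHT = 480
```

Centralizing the values this way means threshold tuning never touches detector code.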

The file structure reflects a typical Flask application layout

Under the project root, src/ contains the application entry point, configuration, detectors, services, models, static assets, and template pages. data/ stores the SQLite database, logs/ stores system logs, and docs/ is used to archive architecture, requirements, and review documents.

The value of this organization is that responsibility boundaries remain clear: algorithms, pages, configuration, and storage stay independent of one another. For coursework, team collaboration, or later secondary development, this structure is much easier to maintain than placing everything into a single file.

src/
├── app.py
├── config.py
├── detectors/
├── services/
├── models/
├── static/
└── templates/

This directory summary shows the main skeleton of the system and makes it easier to locate code responsibilities quickly.

This solution works well as a capstone prototype and a local safety monitoring foundation

From an engineering perspective, this is not a heavy-model solution. It is a runnable, explainable, and extensible computer vision application framework. Its strengths include a low deployment barrier, strong interpretability, and a complete frontend-backend loop, making it suitable for teaching, prototype validation, and lightweight real-world scenarios.

If you continue evolving the project, consider adding behavior classification models, multimodal temporal analysis, model quantization for deployment, and edge-device adaptation. That path would gradually upgrade the system from a rule-driven implementation to an industrial-grade driver monitoring system that combines rules with learning-based methods.

FAQ

1. Why does this project use MediaPipe instead of a heavy deep learning detector?

MediaPipe Face Mesh outputs dense facial landmarks directly, which makes it well suited to geometric analysis tasks such as EAR, MAR, and head pose estimation. It is lightweight to deploy, fast at inference, and does not require large-scale training, which makes it a strong fit for local real-time detection and teaching projects.

2. Can this system be used directly in a real in-vehicle environment?

It can serve as a prototype, but direct commercial use still requires additional hardening. Real in-vehicle environments involve vibration, nighttime lighting, occlusion, multi-person interference, and hardware constraints. You would need stronger robustness testing, infrared illumination, temporal modeling, and embedded adaptation.

3. How can I extend it to detect new dangerous behaviors such as smoking or phone use?

A practical approach is to add a new detector under detectors/, inherit from BaseDetector, and implement feature extraction and decision logic. If you choose a learning-based method, you can package YOLO or a classification model as an independent strategy and then connect it to the unified alert service.

Core Summary: This article reconstructs a driver dangerous behavior detection system built with Python, Flask, OpenCV, and MediaPipe. It covers EAR/MAR-based fatigue recognition, PnP-based head pose analysis for distraction detection, layered architecture, deployment workflow, performance tuning, and privacy-aware local design, making it suitable for capstone projects, prototype validation, and localized intelligent transportation applications.