LLMs as Control Agents for Fault Handling

This line of research explores how Large Language Models (LLMs) can be repurposed as intelligent control agents, not only for managing setpoints and dynamics but also for recovering from faults in real time. Unlike traditional controllers, which rely on fixed logic and diagnostic routines, the LLM-based control agent handles faults through reasoning, adaptation, and feedback loops.

Motivation

Industrial control systems rely heavily on human operators to diagnose and intervene during unexpected faults. This research asks: Can LLMs serve as autonomous operators that both understand the current plant state and decide the next best control action in faulty conditions?

Key Concepts

Control-as-Recovery: Reframing fault handling as a control problem, where the agent must steer the system from a faulty state to a safe operational state using available control actions.
LLM-Driven Decision-Making: The LLM reasons over plant state, fault types, goals, and constraints to generate recovery actions in real time.
Validation and Feedback Loop: A digital twin simulates each proposed action, and a validation agent checks the result for safety before execution. If the check fails, the LLM is reprompted for a new action.
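The validation and feedback loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the tank model in `simulate`, the bounds `LEVEL_MIN` and `LEVEL_MAX`, the `Action` type, and `stub_planner` (which stands in for the LLM call) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

LEVEL_MIN, LEVEL_MAX = 0.0, 1.0  # assumed safety bounds on the tank level


@dataclass
class Action:
    inflow: float  # hypothetical control input


def simulate(level: float, action: Action, outflow: float = 0.125, steps: int = 5) -> float:
    """Toy digital twin: roll the tank level forward under the proposed action."""
    for _ in range(steps):
        level += action.inflow - outflow
    return level


def is_safe(predicted_level: float) -> bool:
    """Validator: accept only predictions that stay within the assumed bounds."""
    return LEVEL_MIN <= predicted_level <= LEVEL_MAX


def recover(
    level: float,
    propose: Callable[[float, List[str]], Action],
    max_attempts: int = 3,
) -> Optional[Action]:
    """Ask the planner for an action, simulate it, and reprompt if it is unsafe."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        action = propose(level, feedback)
        predicted = simulate(level, action)
        if is_safe(predicted):
            return action
        # The rejection reason is fed back so the next proposal can improve.
        feedback.append(
            f"inflow={action.inflow} gives predicted level {predicted:.3f}, "
            f"outside [{LEVEL_MIN}, {LEVEL_MAX}]"
        )
    return None  # no safe action found; escalate to a human operator


def stub_planner(level: float, feedback: List[str]) -> Action:
    """Stand-in for the LLM: halve the inflow after every rejection."""
    return Action(inflow=0.5 / (2 ** len(feedback)))


result = recover(0.5, stub_planner)
```

In this sketch the planner only sees textual feedback from the validator, which mirrors the reprompting idea: the loop itself never edits the action, it only accepts or rejects it.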

Case Study

Domain: Fault handling in a simulation environment.
Fault Scenarios: Blocked valves, actuator saturation, and leakage.
Recovery Strategy: The LLM suggests alternate setpoints, alternate actuator paths, or relaxed constraints to ensure continued safe operation.
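To make the three fault scenarios concrete, the sketch below injects each of them into a toy single-tank model. The plant, the `Tank` class, the `Fault` enum, and all numeric values are assumptions chosen for illustration; the source does not specify the simulated system.

```python
from dataclasses import dataclass
from enum import Enum


class Fault(Enum):
    NONE = "none"
    BLOCKED_VALVE = "blocked_valve"
    ACTUATOR_SATURATION = "actuator_saturation"
    LEAKAGE = "leakage"


@dataclass
class Tank:
    """Hypothetical plant: one tank fed through a single inlet valve."""
    level: float = 0.5
    fault: Fault = Fault.NONE

    def step(self, valve_cmd: float, outflow: float = 0.125) -> float:
        """Advance one time step; valve_cmd is the requested inflow in [0, 1]."""
        inflow = min(max(valve_cmd, 0.0), 1.0)
        if self.fault is Fault.BLOCKED_VALVE:
            inflow = 0.0                  # valve stuck closed, command ignored
        elif self.fault is Fault.ACTUATOR_SATURATION:
            inflow = min(inflow, 0.25)    # actuator cannot exceed a reduced limit
        elif self.fault is Fault.LEAKAGE:
            outflow += 0.125              # extra, unmeasured loss
        self.level = max(self.level + inflow - outflow, 0.0)
        return self.level


# Same command, different faults: the resulting levels diverge.
levels = {fault: Tank(fault=fault).step(0.5) for fault in Fault}
```

A recovery agent working against such a model would see the same command produce different levels under each fault, which is the kind of signal it needs in order to propose an alternate setpoint or actuator path.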

System Architecture

The control agent works within a modular LLM agent framework:

State Interpreter: Converts system measurements into a structured prompt.
Planner Agent (LLM): Chooses control actions that move the system toward the goal.
Digital Twin: Verifies the feasibility of the proposed action.
Validator: Ensures the action is safe and within bounds.
Recovery Agent: Iteratively revises the plan if it is found unsafe.
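Two of the components above, the State Interpreter and the Validator, can be sketched as plain functions. This is a hedged illustration only: the JSON prompt layout, the reply schema `{"action": {...}}`, and the names `interpret_state`, `parse_action`, `validate`, and `valve_b` are assumptions, not the framework's actual interface.

```python
import json
from typing import Dict, List, Tuple

Limits = Dict[str, Tuple[float, float]]  # actuator name -> (min, max), assumed format


def interpret_state(measurements: Dict[str, float], fault: str, goal: str, limits: Limits) -> str:
    """State Interpreter: turn raw measurements into a structured prompt."""
    payload = {"measurements": measurements, "fault": fault, "goal": goal, "limits": limits}
    header = 'You are a control agent. Reply with JSON of the form {"action": {...}}.\n'
    return header + json.dumps(payload, indent=2, sort_keys=True)


def parse_action(reply: str) -> Dict[str, float]:
    """Extract the proposed actuator commands from the planner's JSON reply."""
    action = json.loads(reply)["action"]
    if not isinstance(action, dict):
        raise ValueError("planner reply must contain an 'action' object")
    return action


def validate(action: Dict[str, float], limits: Limits) -> List[str]:
    """Validator: return a list of violations; an empty list means the action is in bounds."""
    violations = []
    for name, value in action.items():
        if name not in limits:
            violations.append(f"unknown actuator: {name}")
            continue
        low, high = limits[name]
        if not low <= value <= high:
            violations.append(f"{name}={value} outside [{low}, {high}]")
    return violations


limits = {"valve_b": (0.0, 1.0)}
prompt = interpret_state({"level": 0.42}, "blocked_valve", "hold level at 0.5", limits)
action = parse_action('{"action": {"valve_b": 0.4}}')  # a hand-written example reply
problems = validate(action, limits)
```

Returning the violations as text rather than a boolean is a deliberate choice in this sketch: the messages can be appended to the next prompt when the Recovery Agent asks the planner to revise its plan.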

Impact

This research shows that LLMs can serve as autonomous control agents that integrate fault diagnosis, recovery planning, and safe control execution. The goal is to reduce reliance on human operators and improve responsiveness during fault scenarios.

Ongoing Work

Extending the system to handle complex multi-stage faults.
Integrating retrieval-augmented memory of past fault cases.
Evaluating robustness and generalizability across domains.

More technical details and evaluation can be found in the associated paper on arXiv.