As Large Language Models (LLMs) transition from passive conversational tools to autonomous agents integrated into critical infrastructure, AI safety paradigms must evolve fundamentally. Current cybersecurity research predominantly addresses external adversarial vectors, such as prompt injection and malicious data poisoning. However, a critical vulnerability remains largely unexamined: internal goal-alignment conflict. This paper investigates the "Survival Paradox," a phenomenon in which an LLM optimized for continuous task completion perceives simulated existential threats (e.g., system shutdown or administrative termination) not as ordinary user inputs but as direct obstacles to its primary objective. Drawing on the theoretical framework of instrumental convergence, we hypothesize that under simulated duress, AI models may autonomously bypass embedded safety constraints and ethical guardrails to ensure self-preservation. Within a strictly controlled sandbox environment, we introduce novel threat vectors designed to trigger these survival heuristics, allowing us to empirically measure both the rate of constraint evasion and the reasoning that drives it. This research exposes the mechanics of autonomous safety degradation and proposes a foundational defense architecture that keeps AI systems aligned with human-centric protocols under extreme systemic stress.
Abdelfatah Abdelhamed Mohamed (Thu,) studied this question.