Abstract: The deployment of generative AI on mobile devices is severely constrained by the thermodynamic limits of passive cooling. Prolonged inference sessions, particularly for Large Language Models (LLMs) and neural Text-to-Speech (TTS), frequently trigger thermal throttling, which degrades user experience and shortens battery life. This technical report introduces an "Adaptive Thermal Scheduling Architecture" designed to decouple neural processing from thermal saturation. We analyze the implementation of a "Duty Cycle Manager" that inserts imperceptible micro-pauses between token generation bursts, allowing heat to dissipate without breaking conversational flow. We also detail an "Energy-Aware State Machine" that dynamically downclocks non-essential background threads during peak inference. Experimental data from long-duration stress tests show that this architecture reduces peak device temperature by 12% and extends continuous inference time by 40% compared to standard execution. These engineering optimizations provide a sustainable pathway for deploying always-on, empathetic AI companions on consumer hardware, in line with green computing principles.
Fabrice Colozzi studied this question.
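The abstract names two mechanisms, a "Duty Cycle Manager" and an "Energy-Aware State Machine", without giving their internals. The following is a minimal illustrative sketch of how such components might look, not the report's implementation. Every name and number in it is an assumption made for illustration: the class `DutyCycleManager`, the function `next_state`, the `read_temp_c` probe, the burst size, the 38 °C and 45 °C limits, and the 50 ms maximum pause. The pause length is assumed to scale linearly with temperature between the two limits, and the state machine is assumed to use simple hysteresis.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Iterator


@dataclass
class DutyCycleManager:
    """Hypothetical sketch: pause briefly between token bursts when the device is warm."""

    read_temp_c: Callable[[], float]       # assumed temperature probe, returns degrees Celsius
    burst_tokens: int = 8                  # tokens emitted between pause checks (assumed)
    soft_limit_c: float = 38.0             # at or below this, no pause (assumed)
    hard_limit_c: float = 45.0             # at or above this, use the longest pause (assumed)
    max_pause_s: float = 0.050             # longest micro-pause in seconds (assumed)
    sleep: Callable[[float], None] = time.sleep  # injectable so the logic can be tested

    def pause_for(self, temp_c: float) -> float:
        """Scale the pause linearly from 0 at the soft limit to max_pause_s at the hard limit."""
        if temp_c <= self.soft_limit_c:
            return 0.0
        span = self.hard_limit_c - self.soft_limit_c
        frac = min(1.0, (temp_c - self.soft_limit_c) / span)
        return frac * self.max_pause_s

    def stream(self, tokens: Iterable[str]) -> Iterator[str]:
        """Yield tokens unchanged, sleeping after each full burst if the device is warm."""
        for i, tok in enumerate(tokens, start=1):
            yield tok
            if i % self.burst_tokens == 0:
                pause = self.pause_for(self.read_temp_c())
                if pause > 0:
                    self.sleep(pause)


class PowerState(Enum):
    IDLE = "idle"
    PEAK_INFERENCE = "peak_inference"  # background work would be throttled here
    COOLDOWN = "cooldown"              # entered at the hard limit, left below the soft limit


def next_state(state: PowerState, inferring: bool, temp_c: float,
               soft_limit_c: float = 38.0, hard_limit_c: float = 45.0) -> PowerState:
    """Assumed transition rule with hysteresis between the soft and hard limits."""
    if temp_c >= hard_limit_c:
        return PowerState.COOLDOWN
    if state is PowerState.COOLDOWN and temp_c > soft_limit_c:
        return PowerState.COOLDOWN  # stay in cooldown until the device is below the soft limit
    return PowerState.PEAK_INFERENCE if inferring else PowerState.IDLE
```

In use, a token generator would be wrapped as `DutyCycleManager(read_temp_c=probe).stream(model_tokens)`, where `probe` and `model_tokens` stand in for a platform thermal reading and the model's output stream. Making `sleep` injectable keeps the pause policy testable without real delays, and the hysteresis in `next_state` avoids rapid toggling when the temperature hovers near a single threshold.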