Integration of artificial intelligence in user experience and interface visual design - earthquake simulation and multimodal optimization

Boyu Zhao; Xiuxia Yue; Xinqiang Yao

doi:10.1051/smdo/2025027

Multi-modal Information Learning and Analytics on Cross-Media Data Integration

Open Access

Issue		Int. J. Simul. Multidisci. Des. Optim. Volume 16, 2025 Multi-modal Information Learning and Analytics on Cross-Media Data Integration


Article Number		20
Number of page(s)		16
DOI		https://doi.org/10.1051/smdo/2025027
Published online		06 October 2025

Int. J. Simul. Multidisci. Des. Optim. 16, 20 (2025)

Research Article

Integration of artificial intelligence in user experience and interface visual design - earthquake simulation and multimodal optimization

Boyu Zhao¹, Xiuxia Yue¹ and Xinqiang Yao¹^,2^*

¹ Tianjin Earthquake Agency, Tianjin 300000, PR China
² Institute of Disaster Prevention, Langfang 065000, PR China

^* e-mail: This email address is being protected from spambots. You need JavaScript enabled to view it.

Received: 18 June 2025
Accepted: 3 September 2025

Abstract

Aiming at the problems of low user participation caused by one-way communication in the human-computer interactive interface, insufficient emotional authenticity of digital human interaction, and bottleneck of real-time processing of multimodal data, this paper proposes a digital human interactive interface design method driven by generative artificial intelligence (AI) and lightweight engine for earthquake emergency scenarios for earthquake science popularization interface. By building a multimodal emotional computing framework and a dynamic load balancing engine, the coordinated output of digital human actions, expressions, and emergency knowledge can be achieved. The system simulation function refers to building an emergency behavior simulation system by integrating the Unity physics engine, dynamically generating a building shaking model based on USGS Shakemap data, and simulating the execution path of standard actions such as crouching and protecting the head in real time. By using the OmniHuman model to generate a context-related body movement library, empathy guidance in earthquake scenarios is strengthened. Distributed architecture and Instant-NGP (neural graphics primitives) technology are deployed to compress 3D digital human rendering latency. Combined with the diffusion model, an emergency knowledge visualization solution is dynamically generated to adapt to the user's cognitive characteristics. The experimental results show that in terms of user experience, the accuracy of the system's head protection posture is improved to 92.7%, and the F1-score of anxiety emotion recognition for all age groups exceeds 0.84. Regarding interface technical performance, the end-to-end latency is 320 ms when the high concurrent processing capacity is 1,000 users. These are significantly better than those of baseline solutions, verifying the dual improvement of emotional authenticity and emergency efficiency. This paper provides a feasible technical path for the digitization of earthquake science popularization and promotes the application innovation of intelligent interactive systems in public safety.

Key words: Digitalization of earthquake science popularization / Digital human interactive interface / Generative artificial intelligence / Lightweight engine / Multimodal emotional computing

© B. Zhao et al., Published by EDP Sciences, 2025

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Earthquakes are natural disasters that are sudden and destructive, so science popularization education is crucial to improving the public's emergency response capabilities. However, the current traditional earthquake science popularization model relies on one-way information transmission, such as manuals and videos, which has the pain points of low user participation and insufficient knowledge conversion rate. In recent years, digital human interaction technology [1,2] has provided a new path for earthquake science popularization, but its actual application exposes the core problems of insufficient emotional interaction authenticity and low efficiency of real-time processing of multimodal data. Existing digital humans are limited by the repetitiveness of action libraries and stiff expressions, making it difficult to convey empathy and resulting in a lack of user trust. As for earthquake scenarios, simultaneous processing of geographic information, user emotion monitoring, and emergency knowledge output is required. Existing systems often cause interaction latency due to uneven computing loads [3], affecting the efficiency of emergency information transmission [4,5]. In addition, the cognitive load in emergency scenarios is special: emotions such as panic lead to a decrease in the efficiency of digital human emergency instruction execution, and an information density adaptive mechanism needs to be designed. With breakthroughs in artificial intelligence technology, such as the development of generative AI and edge computing [6,7], it has become possible to build a digital human interface for earthquake science popularization that combines emotional authenticity with real-time response capabilities. This study focuses on this and aims to improve the immersion and practicality of earthquake science popularization interactive interfaces through innovative AI technologies, thereby providing intelligent solutions for public safety education, with important social value in reducing earthquake casualties and optimizing emergency education models.

Existing digital human technology for earthquake science popularization faces two bottlenecks. In terms of emotional interaction, although the multimodal emotional computing framework has made progress in the generation of expressions and actions, it relies on large-scale training data and lacks contextual relevance, making it difficult to adapt to the dynamic needs of earthquake scenarios [8]. Although it can integrate eye movement and voice analysis to identify paired emotion categories, it has not solved the problem of action-semantic coordination. In terms of multimodal data fusion, existing technologies such as NVIDIA's Instant-NGP combined with Neural Radiance Fields (NeRF) can optimize 3D rendering efficiency [9] but have not solved the problem of parallel processing of geographic information and user behavior data. In addition, system latency increases significantly in high-concurrency scenarios [10]. For example, the response latency of Google's MediaPipe Holistic framework increases sharply when interactions are intensive [11], making it difficult to meet the demand for emergency feedback within seconds. A deeper contradiction lies in the fact that most existing algorithms are designed based on general scenarios and lack knowledge graph adaptation [12] and dynamic load scheduling mechanisms [13] for the vertical field of earthquake science popularization, making it difficult to balance the authority of knowledge transfer and the smoothness of interaction.

This paper proposes a digital human interactive system for earthquake science popularization, built upon a three-level coordinated framework: disaster simulation, multimodal emotional interaction, and cognitive-driven interface. To break through existing limitations, we integrate a generative AI-driven context-aware framework with a lightweight multimodal fusion engine. The system simulates earthquake disasters using the Unity physics engine and generates dynamic, empathetic interactions (e.g., head protection guidance) by combining the OmniHuman model with reinforcement learning for precise action-voice synchronization. A distributed edge computing architecture with priority scheduling ensures real-time performance under high concurrency by compressing rendering latency and prioritizing core tasks. We also construct a vertical domain knowledge graph, enhanced by GPT-4 knowledge distillation, to ensure authoritative and rapid emergency responses. A closed-loop feedback system, leveraging MediaPipe for micro-expression recognition and Whisper for voice analysis, dynamically adjusts the digital human's behavior to enhance user emotional resonance. This work achieves the dual optimization of emotional authenticity and emergency effectiveness and provides a reusable technical paradigm for intelligent public safety systems.

2 Related work

In recent years, many researchers have explored the problem of emotional interaction of digital humans. Melnik et al. [14] proposed the StyleGAN3 model, systematically sorting out the research progress of face generation and editing technology based on StyleGAN, which can improve the naturalness of facial expressions through non-steady-state texture generation. Huang Z et al. [15] proposed a high-fidelity human avatar reconstruction method based on monocular video, HumanAvatar, which integrates the pre-trained HuMoR motion estimation model, Instant-NGP neural radiation field, and Fast-SNARF joint model and applies posture-sensitive space reduction technology to achieve a reconstruction speed of minutes while ensuring rendering quality. However, it does not solve the context association problem of action-semantics. Liu F et al. [16] proposed an interpretable cognitive model that integrates emotional psychology theory and deep learning. By integrating the VGG-facial action coding system (FACS)-OCC expression feature extraction, Pleasure-Arousal-Dominance (PAD) emotion space mapping, and OCEAN (openness, conscientiousness, extraversion, agreeableness, and neuroticism) personality model, they solved the problem of deep theoretical integration and interpretability in emotional cognitive modeling. Although they constructed a multimodal emotional computing protocol and quantified expression parameters through the facial action coding system, its static rules cannot adapt to the dynamic needs of earthquake scenarios. Zhao Y et al. [17] proposed an emotion recognition model for a virtual reality earthquake emergency training platform. By endowing intelligent agents with emotion recognition capabilities, they solved the problem of insufficient immersion in traditional virtual reality (VR) training caused by the lack of emotional interaction. By innovatively combining electroencephalogram (EEG), near-infrared spectroscopy, and eye tracking technology, the physiological signals of users in the virtual earthquake escape task were monitored in real time, verifying that the emotional expression of the intelligent agent can effectively trigger emotional resonance in users. Mok S et al. [18] used subjective evaluation, eye movement, and EEG technology to explore the effects of gaze and eye avoidance on user engagement in different conversational contexts. The results showed that gaze significantly increased engagement under positive emotional stimulation, and the engagement levels of digital humans and real people were significantly different. Eye contact can effectively enhance the natural interaction between humans and digital humans, providing empirical support at the cognitive and neural levels for the emotional design of digital humans. Although the above methods have achieved certain success in emotional interaction and cognitive interaction, they have not achieved the deep integration of emotional expression and emergency knowledge transmission.

To improve the real-time processing capability of multi-source data, Li W et al. [19] proposed a multi-source collaborative monitoring earthquake disaster emergency response method. By integrating multi-source spatiotemporal data to construct a disaster evolution simulation and dynamic risk assessment model, the shortcomings of traditional technologies in information extraction, collaborative monitoring, and service applications were solved. Their study combined the formation mechanism and evolution law of earthquake disasters and developed a comprehensive platform that integrates multi-source data rapid access, intelligent service model, and disaster monitoring information expression, realizing task-driven software and hardware collaborative services, significantly improving the regional earthquake emergency response capability, and providing technical support for the disaster prevention and mitigation industry. Algiriyage N et al. [20] explored the current status and challenges of deep learning applications in disaster response tasks. Although social media and collaborative technologies have driven the explosive growth of multimodal data (text, audio, video, and images) of disaster information, current deep learning practices still rely mainly on text information and fail to fully tap the value of other modal data. Although OpenAI's GPT-4 knowledge distillation technology [21] can compress the model size, its response latency still cannot meet the second-level feedback requirements of emergency scenarios. AlAbdulaali A et al. [22] proposed a user-friendly interactive multimodal disaster information dashboard design to address the issues of insufficient multimodal data presentation and human-computer interaction in disaster emergency response. This design effectively integrates and visualizes complex disaster information, improves the efficiency of emergency personnel in obtaining key information, and provides a practical solution for the usability presentation of multimodal crisis data [23]. Existing technologies generally have the defects of dynamic load imbalance and multimodal latency accumulation in earthquake science popularization scenarios.

3 Materials and methods

3.1 Overall architecture of digital human interactive system for earthquake science popularization

This paper proposes a design method for earthquake science popularization digital human interactive interface that integrates generative AI and a lightweight engine. Figure 1 shows the architecture diagram of the earthquake science popularization process implemented by the interactive interface.

Figure 1 shows the framework of the earthquake science popularization digital human interactive system proposed in this paper, which consists of three core modules to form a closed-loop architecture. The multimodal emotional computing framework receives the user's facial image, voice signal, and text input, extracts 468 facial key point coordinates through MediaPipe Face Mesh [24], combines the Mel-Frequency Cepstral Coefficients (MFCC) voice features extracted by Whisper and the semantic vector encoded by Bidirectional Encoder Representations from Transformers (BERT), and outputs the anxiety, trust, and other emotional state classification results and action control parameters after gated weighted fusion. The lightweight multimodal fusion engine is deployed based on edge computing nodes, using Instant-NGP hash coding technology to compress the graphics memory of the digital human 3D model rendering, and allocates GPU computing power in real time based on the dynamic priority scheduling algorithm, and simultaneously retrieves the earthquake knowledge graph to generate precise responses. The generative interface dynamic adaptation module generates a personalized interface layout through Stable Diffusion according to the user's age and cognitive features and drives the Unity engine to build an AR tremor simulation scenario. Finally, the digital human interactive interface integrates multimodal output to form voice-action-visualization collaborative feedback. The user's real-time behavior data flows back to the input end through eye tracking and posture recognition, forming a “perception-decision-feedback” closed loop.

The system workflow is shown in Figure 2.

Figure 2 illustrates the overall process of the AI-driven digital human system for earthquake science popularization. It shows the key stages: User Input (facial, voice, and text data), AI Core Processing (emotion recognition, action generation, and knowledge retrieval), System Engine (edge computing, dynamic scheduling, and knowledge graph), Output & Interaction (synchronized digital human with AR/VR simulation), and Feedback Loop (real-time user behavior data feeding back into the system).

Fig. 1

Overall architecture of the interaction process.

Fig. 2

System workflow.

3.2 Emotional computing framework based on multimodal large model

The construction of the multimodal emotional computing framework is based on the fusion of trimodal features of facial micro-expressions, voice signals, and text semantics, and realizes precise recognition and feedback regulation of emotional states through mathematical modeling and parameter optimization.

The Google MediaPipe Face Mesh model is used to extract the coordinates of 468 3D key points on the user's face and construct a spatiotemporal feature matrix F_t ∈ ℝ^468×3. The dynamic expression is represented by calculating the change rate of the Euclidean distance [25] between adjacent frames:

$Δ D_{t} = \sqrt{\sum_{i = 1}^{468} {(p_{i}^{(t)} - p_{i}^{(t - τ)})}^{2}}, τ = 0.2 s .$ Mathematical equation (1)

Principal component analysis is performed on ΔD_t, and the first k = 20 principal components are retained to generate the reduced-dimensional expression feature vector f_face ∈ ℝ²⁰.

In the parametric analysis of voice emotion, the MFCC and fundamental frequency trajectory of the voice signal are extracted through OpenAI Whisper to construct a time-frequency joint feature matrix S ∈ ℝ^T×15, where T is the number of time frames. The bidirectional long short-term memory network [26] is used to model the temporal dependency relationship:

$h_{speech} = \frac{1}{T} \sum_{t = 1}^{T} ({\vec{h}}_{t} + {\overset{\leftarrow}{h}}_{t}), h_{t} \in ℝ^{64} .$ Mathematical equation (2)

The forward and backward hidden states ${\vec{h}}_{t}$ Mathematical equation and ${\vec{h}}_{t}$ are updated via a gated recurrent unit (GRU) [27].

The BERT model is used for the text semantic embedding representation to encode the user input text into a 768-dimensional vector e_text ∈ ℝ⁷⁶⁸ and focuses on keywords through a multi-head self-attention mechanism.

In cross-modal feature fusion, a gating mechanism is used to dynamically weight multimodal features:

$g = σ (W_{g} [f_{face}; h_{speech}; e_{text}] + b_{g}), W_{g} \in ℝ^{(20 + 64 + 768) \times 3}$ Mathematical equation (3)

$f_{fused} = g_{1} ⊙ f_{face} + g_{2} ⊙ h_{speech} + g_{3} ⊙ e_{text}$ Mathematical equation (4)

W_g ∈ ℝ^{(20+64+768)×3} is the gate weight matrix, which maps the concatenated multimodal feature [f_face;h_speech;e_text] into a three-dimensional gate vector g, and dynamically allocates the weight of each modality. g₁, g₂, and g₃ are the components of the gate vector, which are normalized to [0,1] through the Sigmoid function to control the contribution of facial, voice, and text features, respectively. The initial values of the gate weight matrix are determined through systematic grid search on the development set, ensuring an objective and data-driven initialization process.

The emotional state classifier is built on top of a stacked Transformer encoder [28], with an input sequence of f_fused, and captures cross-modal associations through a self-attention mechanism:

$MultiHead (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{8}) W^{O}$ Mathematical equation (5)

${head}_{i} = Attention (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}) .$ Mathematical equation (6)

In equations (5) and (6), $W_{i}^{Q}$ Mathematical equation , $W_{i}^{K}$ , and $W_{i}^{V} \in ℝ^{256 \times 32}$ are the projection matrices of the i th attention head, which decompose the input feature f_fused into 8 parallel subspaces to enhance the model's ability to capture multi-dimensional associations. W^O ∈ ℝ^256×256 is the parameter matrix. The output layer generates the probability distribution of emotion categories through Softmax:

$P (y = k ∣ x) = \frac{exp (w_{⊤}^{k} h_{trans} + b_{k})}{\sum_{j = 1}^{5} exp (w_{j}^{⊤} h_{trans} + b_{j})} .$ Mathematical equation (7)

In equation (7), w_k ∈ ℝ²⁵⁶. b_k ∈ ℝ is the weight and bias of the Softmax layer. The Transformer output h_trans is mapped to the probability of the kth category of emotion. k∈{1,...,4} corresponds to four categories of emotion labels.

In action generation and synchronization optimization, the text instructions are mapped to the action latent space z_action ∈ ℝ¹²⁸ based on the OmniHuman model, and the bone joint angle sequence θ_1:T ∈ ℝ^25×T is solved by inverse kinematics. The synchronization loss function is defined to minimize the timestamp deviation between the action and the voice and constrain the timing consistency of the generated action. The equation is:

$ℒ_{sync} = \frac{1}{T} \sum_{t = 1}^{T} || t_{gesture}^{(t)} - t_{speech}^{(t)} {||}_{2}^{2} .$ Mathematical equation (8)

The proximal policy optimization (PPO) algorithm is used to optimize the policy network π_φ [29], and the objective function is:

$J (ϕ) = E_{s \sim π_{ϕ}} [m i n (r_{t} (ϕ) {\hat{A}}_{t}, clip (r_{t} (ϕ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t})]$ Mathematical equation (9)

$r_{t} (ϕ) = \frac{π_{ϕ} (a_{t} ∣ s_{t})}{π_{old} (a_{t} ∣ s_{t})}, ϵ = 0.2 .$ Mathematical equation (10)

The advantage function ${\hat{A}}_{t}$ Mathematical equation is calculated by the generalized advantage estimate:

${\hat{A}}_{t} = \sum_{l = 0}^{\infty} {(γ λ)}^{l} δ_{t + l}$ Mathematical equation (11)

$δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t}) .$ Mathematical equation (12)

In equations (11) and (12), γ = 0.99 is the discount factor. λ = 0.95 is the balance parameter, balancing immediate rewards and long-term benefits. γ controls the decay rate of future rewards. λ adjusts the balance between bias and variance. After multiple iterations of training, the action-voice synchronization error is reduced to 48.3 ± 5.2 ms.

The parameter dynamic adjustment module activates the voice fundamental frequency decay and action amplitude constraints when P≥0.7 is detected:

$F_{0}^{adj} = F_{0}^{base} \cdot (1 - α \cdot P), || Δ θ_{t} {||}_{2} \leq β \cdot || Δ θ_{m a x} {||}_{2} .$ Mathematical equation (13)

In equation (13), α = 0.3 is the voice fundamental frequency decay coefficient, which linearly adjusts the fundamental frequency according to the anxiety probability P(y = anxiety) to reduce the sharpness of the voice under high anxiety. β = 0.8 is the action amplitude scaling factor, which limits the joint angular velocity to avoid violent actions causing discomfort to the user.

This mathematical framework achieves precise and real-time emotional interaction through strict constraints and optimization, providing theoretical support for digital humans in earthquake science popularization.

3.3 Multimodal real-time fusion architecture for emergency response

The design of the lightweight multimodal fusion engine aims to solve the problem of efficient real-time processing of multi-source data (geographic information, user behavior, and knowledge base) in earthquake emergency science popularization scenarios, and break through the performance bottleneck of traditional systems under high concurrency and dynamic loads. The core of the engine consists of three parts: edge real-time rendering acceleration, dynamic priority scheduling mechanism, and vertical domain knowledge base optimization. The optimal allocation of computing resources is achieved through mathematical modeling and architectural innovation. Table 1 lists the lightweight engine parameter settings.

Table 1 reveals the core parameter configuration of the earthquake science digital human lightweight engine and its engineering significance. The collaborative design of the number of hash coding layers (L = 8) and the hash table size (219) compresses the 3D rendering video memory of the digital human and ensures the model loading efficiency of edge nodes in low-bandwidth environments. The setting of the emotion monitoring weight (w₁ = 0.6) in the dynamic scheduling module ensures that the system prioritizes computing power to emotion recognition and emergency knowledge output in high-concurrency scenarios, avoiding interaction latency caused by secondary tasks, such as geographic data updates occupying resources. The knowledge base uses GPT-4 distillation to compress the parameter quantity to 0.3B and cooperates with 128-dimensional approximate nearest neighbor (ANN) retrieval to reduce response latency. These parameters jointly support the design concept of the lightweight system without sacrificing performance and provide underlying technical support for real-time interaction in earthquake emergency scenarios.

The Instant-NGP technology is used for the edge real-time rendering acceleration to lightweight encode the digital human 3D mesh model. Given the digital human mesh vertex coordinate V ∈ ℝ^N×3 and the patch connection relationship F ∈ ℕ^M×3, the geometric information is mapped to a compact feature space through multi-resolution hash coding:

$H (v) = L \overset{L}{\underset{l = 1}{\oplus}} hash (⌊ v \cdot s_{l} \overset{Á}{I}) .$ Mathematical equation (14)

In equation (14), v ∈ V is the vertex coordinate. s_l is the resolution scaling factor of the lth layer hash grid. L = 8 is the number of hash layers. ⊕ represents the feature concatenation operation. This encoding compresses the single-frame rendering time to meet the real-time performance requirements of earthquake emergency scenarios.

The dynamic priority scheduling mechanism allocates computing resources based on task criticality. The task set 𝒯 = {τ₁, τ₂, τ₃} is defined, corresponding to user emotion monitoring, geographic information update, and knowledge base retrieval. Task priority weights are determined by a normalized exponential function:

$w_{i} = \frac{exp (α \cdot c_{i})}{\sum_{j = 1}^{3} exp (α \cdot c_{j})}, c_{i} \in {0.6,0.3,0.1} .$ Mathematical equation (15)

In equation (15), α = 2.0 is the sharpening coefficient, and c_i is the preset criticality score. Within the Edge AI scheduler framework, GPU computing power allocation follows the proportional differential control law:

$P (t) = K_{p} \cdot e (t) + K_{d} \cdot \frac{d e (t)}{d t}, e (t) = \sum_{i = 1}^{3} w_{i} \cdot (d_{i}^{req} - d_{i}^{act})$ Mathematical equation (16)

$d_{i}^{req}$ Mathematical equation is the expected latency of task τ_i. $d_{i}^{act}$ is the actual latency. K_p = 0.8 and K_d = 0.2 are PID control parameters.

The optimization of the vertical domain knowledge base achieves a balance between authority and response speed through knowledge distillation and semantic retrieval enhancement. A knowledge graph 𝒢 = (𝒩, ℰ) is constructed based on the earthquake case database of the China Earthquake Administration, where node 𝒩 contains 52,000 entities such as earthquake events, earthquake avoidance strategies, and geological parameters, and edge ℰ describes the relationship between entities such as “Wenchuan earthquake-trigger-landslide”. GPT-4 is used to perform knowledge distillation on the original question-answering model ℳ_base (with a parameter quantity of 1.2B):

$ℒ_{KD} = λ \cdot KL (p_{student} || p_{teacher}) + (1 - λ) \cdot ℒ_{CE} .$ Mathematical equation (17)

p_student is the output probability of the student model. p_teacher is the soft label generated by GPT-4. λ = 0.7 is the distillation weight. After distillation, the parameter quantity of model ℳ_light is 0.3B, and the response latency is reduced. Quantitative evaluation shows a knowledge loss rate in the earthquake domain corpus, primarily affecting low-frequency professional terms, which is mitigated by integrating a specialized knowledge graph. The approximate nearest neighbor algorithm is used in semantic retrieval to construct the Hierarchical Navigable Small World (HNSW) index for the knowledge graph embedding vector e_n ∈ ℝ¹²⁸ to reduce the retrieval time.

In the communication protocol and load balancing, WebRTC is used to achieve low-latency data synchronization between edge nodes. The node set 𝒞 = {c₁, c₂, ..., c_K} and the load state of each node s_k = (u_k, g_k, m_k) are defined, where u_k is the number of concurrent users; g_k is the GPU utilization; m_k is the video memory occupancy. The load balancing objective function is:

$\underset{x}{m i n} \sum_{k = 1}^{K} {(\frac{u_{k} + x_{k}}{U_{m a x}})}^{2} + γ \cdot || x {||}_{2}^{2}$ Mathematical equation (18)

x_k is the number of new users assigned to node c_K. U_max = 1,000 is the upper limit of the single node capacity. γ = 0.1 is the regularization coefficient. The optimal allocation strategy is solved by the Lagrange multiplier method to reduce the system's overall load variance.

The engine achieves the coordinated optimization of hash coding compression, dynamic PID scheduling, knowledge distillation, and load balancing to enable a single edge node to support multi-user concurrent interaction and end-to-end response with low latency. The strict constraints and parameter optimization of the mathematical model ensure the robustness of the system in emergency scenarios, providing a reliable technical foundation for the digitization of earthquake science.

Table 1

Lightweight engine parameter settings.

3.4 Disaster scenario adaptive visual generation algorithm

Generative interface visual dynamic adaptation achieves real-time personalized adjustment of interface layout, visual style, and interaction logic through the artificial intelligence technology, solving the contradiction between user cognitive differences and emergency information transmission efficiency in earthquake science scenarios. This module is based on user feature analysis and physical environment perception, and builds a complete technical chain from data-driven generation to multimodal feedback optimization.

The Stable Diffusion 2.1 model is used in the automatic generation of interface layout to map user features to the visual design space. The input layer includes a user attribute vector, environmental perception data (real-time magnitude, geographic location, and safe zone topology map), and physiological signals (heart rate and eye movement hotspot coordinates). Given the user attribute vector u = (age, edu _ level, color _ pref) ∈ ℝ^d, the potential diffusion process is controlled by the cross-attention mechanism:

$ϵ_{θ} (z_{t}, t, u) = UNet (z_{t}, t) + γ \cdot CrossAttn (z_{t}, Embed (u)) .$ Mathematical equation (19)

In equation (19), z_t ∈ ℝ^64×64×4 is the latent space feature. γ = 0.6 is the control strength coefficient. Embed (u) ∈ ℝ⁷⁶⁸ is the attribute embedding vector. Feature encoding design encoder uses a two-stream Transformer architecture to process static attributes and dynamic signals, respectively, and generates a joint embedding through cross-modal attention fusion.

In the interface generation driven by the diffusion model, the conditional control strategy is that the text prompt project builds a special prompt word library for earthquake emergency (such as “cartoon style_high contrast_magnitude 5.5”). ControlNet constraints are applied, and the safe zone topology map is used as a spatial constraint to ensure that the evacuation route is not blocked by UI elements. After the generation process initializes the latent vector, iterative denoising is performed. After outputting the candidate interface I ∈ ℝ^{1920×1080×3}, the semantic matching degree is computed through the CLIP model:

$s = \frac{⟨ {CLIP}_{img} (I), {CLIP}_{text} (p r o m p t) ⟩}{{||CLIP}_{img} (I) || \cdot || {CLIP}_{text} (p r o m p t) ||}$ Mathematical equation (20)

When s≥0.85, the design solution is adopted to ensure that the interface elements conform to the user's cognitive habits. The generation efficiency of this method is much higher than that of manual design.

The earthquake peak ground acceleration (PGA) and spectrum characteristics are analyzed based on USGS Shakemap data:

$a_{VR} (t) = \sum_{f = 1}^{10} A_{f} \cdot sin (2 π f t + ϕ_{f}) \cdot e^{- ζ t} .$ Mathematical equation (21)

In equation (21), A_f is the acceleration amplitude of the fth frequency band (with a range of 0.1 g–0.5 g). ζ = 0.2 is the decay coefficient. ϕ_f is the random phase. The acceleration signal is converted into six-degree-of-freedom tactile feedback through the SteamVR SDK. The latency constraint is Δt ≤ 10 ms. The spatial audio module uses the head-related transfer function to realize the source azimuth perception, and the azimuth error is ≤3°.

In dynamic color and contrast optimization, a visual comfort objective function is established, and user visual fatigue is minimized:

$ℒ_{visual} = λ_{1} \cdot || c_{fg} - c_{bg} {||}_{2}^{2} + λ_{2} \cdot || L_{lum} - 120 {||}_{2}^{2}$ Mathematical equation (22)

c_fg and c_bg ∈ ℝ³ are the RGB vectors of foreground and background colors. L_lum is the brightness value. λ₁ = 0.7 and λ₂ = 0.3 are the balance coefficients. Sequential quadratic programming is used to solve the optimal color combination and reduce the number of iterative convergence steps.

Multi-channel input fusion integrates eye tracking and posture recognition to achieve hybrid interactive control. Eye movement data is collected by Tobii Pro Nano (with a sampling rate of 120 Hz) to construct a gaze heat map:

$H (x, y) = \sum_{i = 1}^{N} 𝒩 (x, y | μ_{i}, Σ), Σ = (\begin{array}{c} 25 & ⁡ & 0 \\ 0 & ⁡ & 25 \end{array}) .$ Mathematical equation (23)

In equation (23), (x, y|μ_i) is the coordinate of the ith gaze point. When the regional integral value of a UI element is ≥0.8 and the duration T is ≥ 2 s, it is determined to be active gaze triggering knowledge retrieval. In posture recognition, the Intel RealSense D415 depth camera captures the coordinate J ∈ ℝ^21×3 of 21 key points of the hand. The fisting action is defined as:

$σ_{fist} = \frac{1}{21} \sum_{i = 1}^{21} || J_{i} - \bar{J} {||}_{2}^{2} \leq 0.01 .$ Mathematical equation (24)

$\bar{J}$ Mathematical equation is the mean coordinate of the palm key points, and the variance threshold of 0.01 is determined by calibration.

In real-time style transfer and anti-interference optimization, to cope with the complex lighting conditions of earthquake emergency scenarios, an illumination invariant feature extraction network is designed:

$F_{inv} = ResNet - 18 (I) - AdaIN (ResNet - 18 (I), AvgPool (I)) .$ Mathematical equation (25)

The AdaIN layer performs style normalization:

$AdaIN (x, y) = σ (y) \cdot \frac{x - μ (x)}{σ (x)} + μ (y) .$ Mathematical equation (26)

The network ensures that interface elements remain identifiable under different lighting conditions. Under dynamic load conditions, a multi-objective optimization algorithm is used to balance rendering quality and frame rate.

This technical module has achieved remarkable results in improving the efficiency of user emergency operations and reducing the rate of false touches on the interface. Through the deep integration of generative AI and physical modeling, the adaptive ability of the earthquake science popularization interface in a dynamic environment is realized, establishing a technical paradigm for intelligent interactive design in the field of public safety.

4 Results

4.1 Experimental design process

The experiment is conducted on a heterogeneous computing cluster deployed by a local earthquake bureau's science popularization center. 120 users of different ages (12–65 yr old, stratified sampling covering students, office workers, and the elderly) are recruited. Four control systems are set up: a traditional static graphic system (baseline 1), an open-source digital human system (baseline 2, integrating Azure cognitive services and Unity MARS (Mixed and Augmented Reality Studio)), a Transformer-based multimodal interactive system (baseline 3, Meta TransAvatar), and a federated learning-driven adaptive interface (baseline 4). The task modules include guidance on head protection posture, earthquake knowledge questions and answers, and sudden tremor simulation. Data is collected through the OptiTrack Prime 41 motion capture system, Tobii Pro Nano eye tracker (with a sampling rate of 120 Hz) and customized stress testing tools (simulating multi-user concurrency). The AR/VR (Augmented Reality/Virtual Reality) earthquake simulation integrates a physically accurate vibration feedback system built on the Unity engine. The VR earthquake simulation interface and the earthquake avoidance simulation interface are shown in Figure 3, and the head protection crouching posture is shown in Figure 4.

To evaluate the action generation capability of the OmniHuman model in complex environments, an additional “"obstacle avoidance” task was designed. In this scenario, virtual obstacles were dynamically generated in the Unity simulation based on the simulated earthquake intensity. Users were instructed to navigate to a designated safe zone while avoiding these obstacles. The system's ability to generate feasible joint angle sequences for the digital human to perform crouching, crawling, and stepping-over actions was recorded and analyzed. The results of the obstacle avoidance task are presented in Table 2.

The results in Table 5 demonstrate that the OmniHuman model, guided by the context-aware framework, can effectively solve for feasible joint angles to navigate complex post-disaster environments. The system maintains a high success rate (>88%) at low to medium obstacle densities. Even under high-density conditions, the success rate remains above 75%, confirming the practical feasibility of the proposed method for generating complex avoidance maneuvers in earthquake scenarios.

The experiment is divided into three stages: single-user interaction performance test (task completion rate and emotion recognition accuracy), high concurrent load stress test (response latency and GPU utilization), and 72 h continuous operation stability monitoring (memory leak rate and anomaly recovery time). A double-blind design is used to eliminate subjective bias. All algorithm parameters are fixed to avoid dynamic tuning interference.

Users aged 12–65 are divided into four age groups: 12–18, 19–35, 36–50, and 51–65. Table 3 shows the statistical results and task completion status of users in these four age groups.

To verify the real-time performance and robustness of the earthquake science popularization digital human interactive system, this study designs a dynamic simulation experiment of coupled physical-social effects and carries out an action-voice synchronization optimization experiment based on the reinforcement learning framework.

In the simulation experiment, the building model library is loaded into Unity3D to construct a multi-level earthquake initialization scenario, and USGS Shakemap historical data is integrated to generate source parameters (with a magnitude of 4.0–7.0 and a PGA = 0.1 g–0.5 g). During the simulation, multimodal interactions are initiated, including voice instructions for head protection and avoidance. AR arrows dynamically indicate safe zones. Calming words are triggered based on the user's heart rate to provide emotional comfort, etc. 10,000 interaction events are recorded to analyze fault tolerance performance in abnormal scenarios. The test indicators include the response latency of the digital human (action generation→user perception), the accuracy of emergency instruction execution, and the peak value of system resource consumption (GPU). Table 4 lists the system performance indicator results.

In Table 4, in the test of 10,000 interaction events, the average latency of the system for magnitude 4.0 is 0.79 s, indicating that the system has millisecond-level real-time response capabilities in low-intensity earthquakes. At this stage, the GPU utilization rate is only 62%, and the resource load is light, mainly processing basic action generation and voice feedback. The latency for magnitude 7.0 increases significantly to 1.80 s, and the standard deviation expands to 0.31 s, reflecting the computation fluctuation caused by the surge in data in the strong earthquake scenario. At this time, the GPU utilization rate reaches 89%, approaching the hardware limit, but the degradation strategy is still not triggered (the threshold is set at 95%). The instruction accuracy rate remains above 92.7% in the entire magnitude range, with a standard deviation of no more than 2.9%. The accuracy rate of 93.6% for a magnitude 5.5 earthquake is inferior to 95.1% for a magnitude 4.0 earthquake because moderate-intensity earthquakes trigger more complex obstacle avoidance decisions, and the system optimizes the decision logic through real-time retrieval of knowledge graphs. The GPU utilization rate is strongly positively correlated with the magnitude, approaching 90% at a magnitude of 7.0, mainly due to three-dimensional physical simulation, multimodal data fusion, and emotional computing. The experiment verifies the effectiveness of the lightweight engine's load balancing mechanism in high-magnitude scenarios, avoiding the common resource preemption problem of traditional systems.

Simulation data show that the system still meets the timeliness requirements of emergency response in a strong earthquake of magnitude 7.0, and the fluctuation of instruction accuracy is controllable, proving the adaptability of the multimodal fusion architecture to extreme scenarios.

The optimization goal in the synchronization optimization experiment is to minimize the synchronization error between the digital human action and voice feedback, which can be expressed as:

$\underset{θ}{m i n} E [| t_{gesture} - t_{speech} |] + λ \cdot Jerk (θ) .$ Mathematical equation (27)

In equation (27), θ is the action generation model parameter. Jerk punishes sudden joint movement changes. λ = 0.3 is the balance coefficient.

To overcome the bottleneck of action-voice synchronization precision, this study uses a collaborative control algorithm based on PPO. The state space is defined to include the user's emotional state, the semantic context of the current knowledge graph node, and the system load level of the GPU utilization grading. The action space covers the joint movement speed adjustment and voice broadcast rate scaling. The comparison baseline methods include the traditional method of fixed-latency compensation (Kalman filtering) and the control method of deep Q-network (DQN). The reward function integrates the synchronization error exponential decay term, sudden joint movement changes, and user real-time score. The equation is:

$R = exp (- \frac{{(t_{g} - t_{s})}^{2}}{2 σ^{2}}) - 0.1 \cdot Energy (θ) + 0.2 \cdot UserScore .$ Mathematical equation (28)

In equation (28), the synchronization error weight σ = 25 ms. Energy is the action energy consumption. UserScore comes from real-time feedback (1–5 points). Table 5 compares the performance results of the synchronization optimization algorithm of the three methods. **Statistical significance is indicated as *p < 0.01 and p < 0.05 vs. Kalman filtering (two-tailed t-test).

Table 5 shows that after 50,000 steps of iterative training on the Gazebo+ROS (Robot Operating System) simulation platform, the 82.3 ms mean error of the Kalman filtering exposes the limitations of traditional methods in nonlinear interactions. It relies on a fixed latency compensation model and cannot adapt to dynamic loads. The PPO algorithm reduces the error to 48.2 ms, which is 41.4% better than Kalman. The 95% quantile of 78.6 ms meets the stringent requirements of emergency scenarios. The user score of the PPO group reaches 4.6 points, significantly higher than DQN's 4.1 points. The real-time score feedback mechanism dynamically adjusts the reward function to ensure that the optimization direction is in line with human perception preferences. The standard deviation of PPO is 15.3 ms, the lowest among the three, proving that the conservatism of its strategy update effectively suppresses training fluctuations. Compared with DQN's standard deviation of 22.1 ms, PPO's theoretical advantage in the exploration-exploitation balance is highlighted.

Optimization experiments show that the PPO algorithm breaks through the existing technical bottlenecks in synchronization precision and user experience through the coordinated regulation of emotional state perception and hardware load, providing algorithm-level guarantees for the multimodal natural interaction of digital humans in earthquake science popularization. It should be noted that the current PPO reward function primarily focuses on action-voice synchronization; future work will integrate visual feedback timing (e.g., AR arrow display) into the optimization objective to achieve balanced multimodal coordination.

The simulation experiment reveals the quantitative relationship between earthquake intensity and system load. The optimization experiment breaks through the bottleneck of multimodal synchronization through algorithm innovation. The two together support the core argument of emotion-effectiveness balance optimization and provide an empirical basis for the engineering implementation of intelligent interactive systems in emergency education scenarios.

Fig. 3

VR earthquake simulation interface and shock avoidance simulation interface.

Fig. 4

Simulation of crouching posture with head protection.

Table 2

Success rate of action sequence generation in complex obstacle scenarios.

Table 3

User statistics and task completion.

Table 4

System performance indicators with different earthquake magnitudes.

Table 5

Comparison of performance of synchronous optimization algorithms.

4.2 Interactive system performance and user experience evaluation indicators

Quantitative indicators cover three dimensions: user experience, technical performance, and robustness:

The equation for head protection posture accuracy is expressed as:

$A_{p} = \frac{\sum_{i = 1}^{N} I (Δ θ_{i} < 15^{\circ})}{N} .$ Mathematical equation (29)

In equation (29), Δθ_i is the L2 norm difference between the user's joint angle and the standard posture, and the threshold of 15° is determined by ergonomics.

The F1-score of emotion recognition is calculated using macro-average, and the confusion matrix statistics FN, TP, FP, and TN of four emotions, such as anxiety and trust.

The end-to-end latency is D_e = t_response − t_query. Percentiles are calculated using high-frequency sampling (with a 10 ms granularity) using Prometheus.

The critical point of a system crash is defined as the number U_max of concurrent users meeting FPS≥24 and CPU load≤90%.

The equation for the misoperation rate is defined as:

$R_{e} = \frac{N_{error}}{N_{total}} .$ Mathematical equation (30)

Abnormal inputs include noisy voice with SNR = 5 dB and non-earthquake-related questions.

4.3 User experience

To study the user experience when using the earthquake science popularization digital human system, the accuracy of users' correct execution of earthquake escape postures in different methods is compared to evaluate the results, with head protection posture as the research object. Figure 5 compares the accuracy results.

In Figure 5a, the horizontal axis is the age group of 12–18 yr old, 19–35 yr old, 36–50 yr old, and 51–65 yr old, and the vertical axis is the head protection posture accuracy. The performance of the method in this paper, the baseline 1 method, and the baseline 4 method is compared. The results demonstrate that the proposed method achieves significantly higher accuracy across all age groups, with an average of 92.7% (p < 0.01, two-tailed t-test). It reaches 95.2% in the 12–18 age group, with 80.3% for baseline 1 and 90.5% for baseline 4. In the 51–65 age group, the accuracy of the method in this paper remains at 90.1%. At this time, baseline 1 method reaches 65.4%, and baseline 4 reaches 84.3%. With the increase of age, the accuracy of all systems decreases, but the decline of the method in this paper is the smallest. From the young group to the old group, the accuracy of the method in this paper decreases by 5.1%; that of the traditional system decreases by 14.9%; that of the federated system decreases by 6.2%. The standard deviation trend in Figure 5b shows that the standard deviation of the method in this paper is always lower than that of baseline methods, such as 2.1% of the method in this paper in the 12–18 age group vs. 5.4% of the traditional system, and the error band in the elderly group does not significantly widen, indicating that its stability is not affected by age. These results further confirm the statistical superiority of the proposed method in cross-age scenarios.

The results of high accuracy and low standard deviation directly reflect the effectiveness of generative interface dynamic adaptation technology: for users of different ages, the model generates interface layouts that adapt to cognitive characteristics through Stable Diffusion and combines AR action guidance to enhance the intuitiveness of operations. Due to the lack of personalized design in traditional systems, elderly users are limited by interface complexity and feedback latency, and the misoperation rate increases significantly (with a standard deviation as high as 8.2%). Although the baseline 4 method improves generalization through distributed training, it does not integrate the earthquake domain knowledge base, resulting in a semantic disconnect between action generation and emergency scenarios. The method in this paper uses a lightweight engine with priority scheduling to maintain low latency and high stability under resource-constrained conditions, ultimately achieving an age-insensitive interactive experience and providing key technical support for earthquake science popularization digital humans to break through the traditional one-way communication mode.

To study the recognition effect and detection precision of digital human interfaces on different user emotions, this paper analyzes four primary emotional states: anxiety (anxiety here is represented as a “panic-numb” spectrum.), trust, confusion, and neutrality. Figure 6 presents the F1-score results of emotion recognition for different age groups.

In Figure 6, the horizontal axis represents age groups from 12–18 to 51–65 yr old, and the vertical axis represents emotion categories, including anxiety, trust, confusion, neutrality, etc. The color depth of the heat map represents the F1-score. The results indicate that the recognition rate for anxiety is the highest, peaking at 0.92 for the 12–18 age group and gradually decreasing to 0.84 for the 51–65 age group (p < 0.05, ANOVA). The recognition rate of neutral emotions is the lowest, maintaining at 0.75–0.85 in all age groups. In the same emotion category, the F1-score in the young groups (12–35 yr old) is generally higher than that in the old groups (36–65 yr old). For example, the recognition rate of confusion is 0.91 in the 12–18 age group and drops to 0.80 in the 51–65 age group. There are significant differences in trust among different age groups, while anxiety emotions have the lowest age sensitivity.

The high recognition rate of anxiety emotions verifies the effectiveness of the multimodal emotional computing framework. In earthquake science popularization scenarios, user anxiety is usually accompanied by obvious facial muscle tension and voice fundamental frequency fluctuations. The model precisely captures such signals through dynamic key point displacement analysis and spectrum feature fusion. The low recognition rate of neutral emotions exposes the limitations of the existing system − when there is a lack of significant emotional features, the model relies on a supplementary analysis of text semantics, but the sparsity of the vertical domain knowledge base, such as insufficient coverage of earthquake professional terms, leads to misjudgment in some scenarios. The performance difference between the young and old groups reflects the impact of user cognitive characteristics on interaction design: the elderly need to rely on dynamic interface adaptation technology to improve their interaction perception due to their small facial micro-expressions and slow voice rhythm. The heat map analysis confirms the necessity of algorithm-level optimization for emotional authenticity and age-stratified design from the algorithm level and provide a quantitative basis for the multimodal interaction optimization of earthquake science popularization digital humans.

To explore the relationship between multimodal synchronization error and user satisfaction, the user score range is set (1–5 points). Figure 7 quantifies the direct impact of synchronization precision on user experience, providing a critical engineering criterion for system design.

In Figure 7, the horizontal axis is the voice-action synchronization error, and the vertical axis is the user satisfaction score. The point size and color map sample size. The methods represented by the data points from left to right in the figure are the method in this paper, the adaptive interface driven by federated learning with baseline 4 method, the multimodal interactive system based on Transformer with baseline 3 method, the open-source digital human system with baseline 2 method, and the traditional static graphic system with baseline 1 method. The results show that the proposed method achieves the lowest synchronization error (48.3 ms ± 5.2) and the highest user satisfaction score (4.6 ± 0.3), with the largest sample size. The data point error of baseline 1 method is 120.1 ms ± 18.7, and the score is 3.1 ± 0.9, with the smallest sample size, which is 50. The synchronization error of the baseline 4 method data point is higher than that of the method in this paper, which is 65.4 ms, and it is prone to crash when there is high concurrency. The black dotted trend line shows that the synchronization error is strongly negatively correlated with the score. For every 10 ms reduction in the error, the score is about 0.2 points higher, which verifies the direct impact of low-latency interaction on user experience.

The strong correlation between low synchronization error and high user scores confirms the urgent need for real-time feedback in earthquake emergency scenarios: strict synchronization of digital human actions and voice, such as millisecond-level alignment of head protection postures and voice instructions, can reduce user cognitive load and improve instruction execution efficiency in emergency situations. Baseline 4 method's score exposes the limitations of simply optimizing synchronization errors − its lightweight engine has excessive GPU utilization under high load, resulting in interaction freezes and reflecting the need for the system to balance latency and stability. The proposed method employs a dynamic reward mechanism based on reinforcement learning and edge priority scheduling to minimize synchronization errors while ensuring service continuity, ultimately achieving a balanced optimization of “low error-high stability-high score” and providing quantifiable engineering criteria for the multimodal interaction design of digital humans for earthquake science popularization.

Fig. 5

Comparison of head protection posture accuracy and standard deviation. a. Comparison of head protection posture accuracy. b. Standard deviation trend.

Fig. 6

Emotion recognition F1-score results.

Fig. 7

Quantifying the impact of multimodal synchronization error on user satisfaction.

4.4 Interactive interface visual effects performance verification

To verify the performance of the earthquake science popularization digital human interactive interface, this paper studies the system stability and concurrent processing capacity. The system stability is measured by the memory leak rate and mean time to repair (MTTR) during the system operation time, and the concurrent processing performance is measured by the end-to-end latency and GPU utilization corresponding to different concurrent user numbers under the edge node. Figure 8 shows the system stability results within a 72 h period.

In this line chart, the horizontal axis is the system operation time (0–72 h). The left vertical axis is the memory leak rate, and the right vertical axis is the MTTR. The data shows that the memory leak rate increases linearly over time, from the initial 0.01% to 0.09% after 72 h, and the growth rate slows down over time, which meets the industrial standard. The MTTR gradually increases from 5.2 s to 11.2 s. The two lines are weakly positively correlated, indicating that the accumulation of memory leak may slightly affect the efficiency of fault recovery.

The slow growth of the memory leak rate demonstrates the efficient resource management of the lightweight engine: through hash coding to compress the memory usage of the rendering pipeline and dynamic priority scheduling, the system still keeps resource leak under control under 72 h of high load. The gradual increase in MTTR reflects the optimization space of the anomaly detection algorithm − although the redundant node switching mechanism avoids service interruption, the latency in log analysis and state snapshots leads to increased recovery time. The experimental results confirm the dual requirements of system robustness in earthquake emergency scenarios: long-term operation stability and rapid recovery of sudden failures, which provides a reliability guarantee for the 24 h deployment of digital human interactive interfaces in public service places, ensuring that earthquake science education is not disturbed by technical failures.

Figure 9 compares the concurrent performance results of the method in this paper and the baseline 4 method with different concurrent numbers of user interactions on edge nodes.

In Figure 9, the horizontal axis is the number of concurrent users, the left vertical axis is the end-to-end latency (ms), and the right vertical axis is the GPU utilization (%). The bar chart shows that the latency of the method in this paper increases smoothly with the number of concurrent users, from 120 ms for 200 users to 320 ms for 1,000 users. In contrast, the latency of the baseline 4 method increases sharply from 280 ms to 520 ms (crashing at 1,000 users). The GPU utilization of the method in this paper increases from 45% to 85%, while that of the baseline 4 method soars from 68% to 95%, triggering a system crash. The crash point highlights the unavailability of the baseline system under high load, while the method in this paper still runs stably at 1,000 users (with a latency of 320 ms and a GPU of 85%).

The coordinated optimization of low latency and controllable GPU utilization confirms the core value of the dynamic load scheduling mechanism of the lightweight engine in earthquake emergency scenarios: through the priority allocation algorithm and edge rendering degradation strategy, the system still ensures interactive fluency under thousands of concurrency. The collapse of the baseline 4 method exposes the limitations of the general federated learning framework − its strategy of evenly allocating computing power does not consider the vertical field needs of earthquake science popularization, resulting in resource starvation of key modules. This paper designs dynamic load balancing inspired by the intelligent seismic system to achieve a precise and targeted supply of computing power, providing a reliable technical foundation for large-scale user synchronous participation in earthquake drills and breaking through the concurrency bottleneck of traditional systems in public safety scenarios.

Fig. 8

System stability results.

Fig. 9

Comparison of concurrency performance.

5 Discussion

This paper proposes an AI-driven digital human interactive system for earthquake science popularization, with the following key contributions.

Novel Integration for Emotional Authenticity: We propose a novel generative AI-driven framework that integrates a multimodal emotional computing model (fusing facial, voice, and text) with reinforcement learning (PPO) for action-voice synchronization. This is the first work to apply such a closed-loop system to earthquake scenarios, significantly improving emotional authenticity and user trust compared to static or rule-based VR/AR training systems.

Innovative Lightweight Architecture for Real-time Performance: We design an innovative lightweight multimodal fusion engine based on edge computing. By combining Instant-NGP for rendering compression, a dynamic priority scheduling mechanism, and GPT-4 knowledge distillation, our system achieves millisecond-level end-to-end latency, overcoming the high-latency bottleneck of existing cloud-dependent or monolithic digital human systems.

Context-Aware Interface Adaptation: We introduce a context-aware generative interface that dynamically adapts the visual layout and interaction logic using Stable Diffusion and ControlNet. This interface is driven by user age, cognitive features, and real-time seismic data, providing a personalized experience that enhances comprehension and reduces cognitive load, a significant improvement over fixed-UI applications.

Systematic Integration and Validation: We present a comprehensive, end-to-end system that integrates disaster simulation (Unity + USGS data), emotional interaction, and real-time optimization. The system is rigorously validated through user studies, demonstrating a 92.7% accuracy in head protection guidance and a 41.4% reduction in synchronization error compared to baseline methods.

6 Conclusion

Focusing on the emotional authenticity and real-time requirements of the digital human interactive interface for earthquake science popularization, this paper proposes a multimodal interactive architecture that integrates generative AI and a lightweight engine. At the method level, by constructing an emotional computing framework based on the fusion of dynamic modeling of facial micro-expressions and voice spectrum features, precise recognition of user emotions such as anxiety and trust is achieved. The OmniHuman action generation model is combined with reinforcement learning for synchronous optimization to compress action-voice latency. The lightweight engine uses hash coding compression and dynamic priority scheduling to enable the system to maintain an end-to-end latency of 320 ms under thousands of concurrency levels. This paper creatively proposes a dynamic interface generation mechanism driven by vertical domain knowledge and a multimodal load balancing strategy dedicated to earthquake scenarios. Experiments show that during the interaction, the accuracy of the user's head protection posture reaches 92.7%, and the F1-score of anxiety emotion recognition exceeds 0.84. The limitation of the current research includes that the training data is concentrated on urban earthquake scenarios; the geological characteristics of rural areas are insufficiently covered; physical tremor simulation equipment is not integrated. This data bias may lead to performance disparities across different populations, and user privacy and ethical considerations in emotion recognition require further investigation. In future work, multi-source seismic data fusion can be expanded. Wearable devices can be combined to achieve emotion recognition with enhanced physiological signals, and cross-platform AR or VR collaborative interaction paradigms can be explored. Additionally, a cross-cultural calibration mechanism will be introduced to improve the generalization of the emotional computing model by accounting for cultural differences in micro-expression interpretation. This study provides a reusable technical framework for the application of intelligent interactive systems in earthquake safety education and science popularization, which is expected to be extended to other multiple disaster emergency scenarios such as fire escape and typhoon response.

Acknowledgments

The authors would like to thank the Tianjin Science and Technology Plan Project (249KPXMRC0023) for their support and assistance.

Funding

This study was supported by Tianjin Science and Technology Plan Project “Research and Demonstration Application of Mass Creation Science Popularization Products for Youth Disaster Prevention and Reduction” (249KPXMRC0023).

Conflicts of interest

The authors declare that there is no conflict of interest with any financial organizations regarding the material reported in this manuscript.

Data availability statement

Data is available upon reasonable request.

Author contribution statement

Conceptualization, B.Z.; Methodology, X. (x)Y.; Software, X. (q)Y.; Validation, B.Z., X. (x)Y. and X. (q)Y.; Formal Analysis, X. (q)Y.; Investigation, B.Z.; Resources, X. (q)Y.; Data Curation, X. (q)Y.; Writing − Original Draft Preparation, B.Z., X. (x)Y. and X. (q)Y. All authors were aware of the submission of the manuscript and agreed to its publication.

References

H.O. Demirel, S. Ahmed, V.G. Duffy, Digital human modeling: a review and reappraisal of origins, present, and expected future methods for representing humans computationally, Int. J. Hum. Comput. Interact. 38, 897–937 (2022) [Google Scholar]
B. Wang, H. Zhou, X. Li et al., Human digital twin in the context of industry 5.0, Robot. Comput. Int. Manuf. 85, 102626 (2024) [Google Scholar]
K. Ling'ai, H. Zichao, Design and application of popular science information release system for earthquake prevention and disaster reduction, Prog. Earthquake Sci. 54, 346–352 (2024) [Google Scholar]
R. Damaševičius, N. Bacanin, S. Misra, From sensors to safety: Internet of Emergency Services (IoES) for emergency response and disaster management, J. Sens. Actuator Netw. 12, 41 (2023) [Google Scholar]
N. Li, N. Sun, C. Cao et al., Review on visualization technology in simulation training system for major natural disasters, Nat. Hazards 112, 1851–1882 (2022) [Google Scholar]
V. Plevris, AI-driven innovations in earthquake risk mitigation: a future-focused perspective, Geosciences 14, 244 (2024) [Google Scholar]
M. Aboualola, K. Abualsaud, T. Khattab et al., Edge technologies for disaster management: a survey of social media and artificial intelligence integration, IEEE Access 11, 73782–73802 (2023) [Google Scholar]
C. Gao, S. Ajith, M.V. Peelen, Object representations drive emotion schemas across a large and diverse set of daily-life scenes, Commun. Biol. 8, 697 (2025) [Google Scholar]
X. Fang, Y. Zhang, H. Tan et al., Performance evaluation and optimization of 3D gaussian splatting in indoor scene generation and rendering, ISPRS Int. J. Geo-Inf. 14, 21 (2025) [Google Scholar]
C. Shan, C. Wu, Y. Xia et al., Adaptive resource allocation for workflow containerization on Kubernetes, J. Syst. Eng. Electron. 34, 723–743 (2023) [Google Scholar]
P. Marques, P. Váz, J. Silva et al., Real-time gesture-based hand landmark detection for optimized mobile photo capture and synchronization, Electronics 14, 704 (2025) [Google Scholar]
J. Jia, W. Ye, Deep learning for earthquake disaster assessment: objects, data, models, stages, challenges, and opportunities, Remote Sens. 15, 4098 (2023) [Google Scholar]
X. Cheng, Q. Li, R. Hai et al., Research progress and prospects of seismic performance on underground structure embedded in soft soil foundation, Sci. Rep. 14, 21883 (2024) [Google Scholar]
A. Melnik, M. Miasayedzenkau, D. Makaravets et al., Face generation and editing with stylegan: a survey, IEEE Trans. Pattern Anal. Mach. Intell. 46, 3557–3576 (2024) [Google Scholar]
Z. Huang, S.M. Erfani, S. Lu et al., Efficient neural implicit representation for 3D human reconstruction, Pattern Recognit. 156, 110758 (2024) [Google Scholar]
F. Liu, H.Y. Wang, S.Y. Shen et al., OPO-FCM: a computational affection based OCC-PAD-OCEAN federation cognitive modeling approach, IEEE Trans. Comput. Soc. Syst. 10, 1813–1825 (2022) [Google Scholar]
Y. Zhao, Z. Liu, J. Xiao et al., Research on emotion modeling of intelligent agents in earthquake evacuation simulation, Cogn. Syst. Res. 87, 101242 (2024) [Google Scholar]
S. Mok, S. Park, M. Whang, Examining the impact of digital human gaze expressions on engagement induction, Biomimetics 8, 610 (2023) [Google Scholar]
W. Li, Q. Wang, W. Cheng et al., Development and application of a smart emergency response platfrom for earthquake disasters based on multi-source monitoring data, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 48, 25–30 (2022) [Google Scholar]
N. Algiriyage, R. Prasanna, K. Stock et al., Multi-source multimodal data and deep learning for disaster response: a systematic review, SN Comput. Sci. 3, 92 (2022) [Google Scholar]
S. Muralidharan, S. Turuvekere Sreenivas, R. Joshi et al., Compact language models via pruning and knowledge distillation, Adv. Neural Inf. Process. Syst. 37, 41076–41102 (2024) [Google Scholar]
A. AlAbdulaali, A. Asif, S. Khatoon et al., Designing multimodal interactive dashboard of disaster management systems, Sensors 22, 4292 (2022) [Google Scholar]
L. Zhao, W.Z. Song, L. Shi, X. Ye, Decentralised seismic tomography computing in cyber-physical sensor systems, Cyber-Phys. Syst. 1, 91–112 (2015) [Google Scholar]
J. Mok, N. Kwak, Performance improvement of facial gesture-based user interface using MediaPipe face mesh, J. Internet Things Converg. 9, 125–134 (2023) [Google Scholar]
K. Amara, O. Kerdjidj, N. Ramzan, Emotion recognition for affective human digital twin by means of virtual reality enabling technologies, IEEE Access 11, 74216–74227 (2023) [Google Scholar]
X. Zhang, X. Xie, H. Zhao et al., Seismic response prediction method of train-bridge coupled system based on convolutional neural network-bidirectional long short-term memory-attention modeling, Adv. Struct. Eng. 28, 341–357 (2025) [Google Scholar]
J. Gonzalez, W. Yu, L. Telesca, Gated recurrent units based recurrent neural network for forecasting the characteristics of the next earthquake, Cybern. Syst. 53, 209–222 (2022) [Google Scholar]
S. Hazmoune, F. Bougamouza, Using transformers for multimodal emotion recognition: taxonomies and state of the art review, Eng. Appl. Artifi. Intell. 133, 108339 (2024) [Google Scholar]
S. Banar, R. Mohammadi, Seismonet: a proximal policy optimization-based earthquake early warning system using dilated convolution layers and online data augmentation, Expert Syst. Appl. 253, 124337 (2024) [Google Scholar]

Cite this article as: Boyu Zhao, Xiuxia Yue, Xinqiang Yao, Integration of artificial intelligence in user experience and interface visual design − earthquake simulation and multimodal optimization, Int. J. Simul. Multidisci. Des. Optim. 16, 20 (2025), https://doi.org/10.1051/smdo/2025027

All Tables

Table 1

Lightweight engine parameter settings.

In the text

Table 2

Success rate of action sequence generation in complex obstacle scenarios.

In the text

Table 3

User statistics and task completion.

In the text

Table 4

System performance indicators with different earthquake magnitudes.

In the text

Table 5

Comparison of performance of synchronous optimization algorithms.

In the text

All Figures

	Fig. 1 Overall architecture of the interaction process.
In the text

	Fig. 2 System workflow.
In the text

	Fig. 3 VR earthquake simulation interface and shock avoidance simulation interface.
In the text

	Fig. 4 Simulation of crouching posture with head protection.
In the text

	Fig. 5 Comparison of head protection posture accuracy and standard deviation. a. Comparison of head protection posture accuracy. b. Standard deviation trend.
In the text

	Fig. 6 Emotion recognition F1-score results.
In the text

	Fig. 7 Quantifying the impact of multimodal synchronization error on user satisfaction.
In the text

	Fig. 8 System stability results.
In the text

	Fig. 9 Comparison of concurrency performance.
In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.

[R1] H.O. Demirel, S. Ahmed, V.G. Duffy, Digital human modeling: a review and reappraisal of origins, present, and expected future methods for representing humans computationally, Int. J. Hum. Comput. Interact. 38, 897–937 (2022) [Google Scholar]

[R2] B. Wang, H. Zhou, X. Li et al., Human digital twin in the context of industry 5.0, Robot. Comput. Int. Manuf. 85, 102626 (2024) [Google Scholar]

[R3] K. Ling'ai, H. Zichao, Design and application of popular science information release system for earthquake prevention and disaster reduction, Prog. Earthquake Sci. 54, 346–352 (2024) [Google Scholar]

[R4] R. Damaševičius, N. Bacanin, S. Misra, From sensors to safety: Internet of Emergency Services (IoES) for emergency response and disaster management, J. Sens. Actuator Netw. 12, 41 (2023) [Google Scholar]

[R5] N. Li, N. Sun, C. Cao et al., Review on visualization technology in simulation training system for major natural disasters, Nat. Hazards 112, 1851–1882 (2022) [Google Scholar]

[R6] V. Plevris, AI-driven innovations in earthquake risk mitigation: a future-focused perspective, Geosciences 14, 244 (2024) [Google Scholar]

[R7] M. Aboualola, K. Abualsaud, T. Khattab et al., Edge technologies for disaster management: a survey of social media and artificial intelligence integration, IEEE Access 11, 73782–73802 (2023) [Google Scholar]

[R8] C. Gao, S. Ajith, M.V. Peelen, Object representations drive emotion schemas across a large and diverse set of daily-life scenes, Commun. Biol. 8, 697 (2025) [Google Scholar]

[R9] X. Fang, Y. Zhang, H. Tan et al., Performance evaluation and optimization of 3D gaussian splatting in indoor scene generation and rendering, ISPRS Int. J. Geo-Inf. 14, 21 (2025) [Google Scholar]

[R10] C. Shan, C. Wu, Y. Xia et al., Adaptive resource allocation for workflow containerization on Kubernetes, J. Syst. Eng. Electron. 34, 723–743 (2023) [Google Scholar]

[R11] P. Marques, P. Váz, J. Silva et al., Real-time gesture-based hand landmark detection for optimized mobile photo capture and synchronization, Electronics 14, 704 (2025) [Google Scholar]

[R12] J. Jia, W. Ye, Deep learning for earthquake disaster assessment: objects, data, models, stages, challenges, and opportunities, Remote Sens. 15, 4098 (2023) [Google Scholar]

[R13] X. Cheng, Q. Li, R. Hai et al., Research progress and prospects of seismic performance on underground structure embedded in soft soil foundation, Sci. Rep. 14, 21883 (2024) [Google Scholar]

[R14] A. Melnik, M. Miasayedzenkau, D. Makaravets et al., Face generation and editing with stylegan: a survey, IEEE Trans. Pattern Anal. Mach. Intell. 46, 3557–3576 (2024) [Google Scholar]

[R15] Z. Huang, S.M. Erfani, S. Lu et al., Efficient neural implicit representation for 3D human reconstruction, Pattern Recognit. 156, 110758 (2024) [Google Scholar]

[R16] F. Liu, H.Y. Wang, S.Y. Shen et al., OPO-FCM: a computational affection based OCC-PAD-OCEAN federation cognitive modeling approach, IEEE Trans. Comput. Soc. Syst. 10, 1813–1825 (2022) [Google Scholar]

[R17] Y. Zhao, Z. Liu, J. Xiao et al., Research on emotion modeling of intelligent agents in earthquake evacuation simulation, Cogn. Syst. Res. 87, 101242 (2024) [Google Scholar]

[R18] S. Mok, S. Park, M. Whang, Examining the impact of digital human gaze expressions on engagement induction, Biomimetics 8, 610 (2023) [Google Scholar]

[R19] W. Li, Q. Wang, W. Cheng et al., Development and application of a smart emergency response platfrom for earthquake disasters based on multi-source monitoring data, Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 48, 25–30 (2022) [Google Scholar]

[R20] N. Algiriyage, R. Prasanna, K. Stock et al., Multi-source multimodal data and deep learning for disaster response: a systematic review, SN Comput. Sci. 3, 92 (2022) [Google Scholar]

[R21] S. Muralidharan, S. Turuvekere Sreenivas, R. Joshi et al., Compact language models via pruning and knowledge distillation, Adv. Neural Inf. Process. Syst. 37, 41076–41102 (2024) [Google Scholar]

[R22] A. AlAbdulaali, A. Asif, S. Khatoon et al., Designing multimodal interactive dashboard of disaster management systems, Sensors 22, 4292 (2022) [Google Scholar]

[R23] L. Zhao, W.Z. Song, L. Shi, X. Ye, Decentralised seismic tomography computing in cyber-physical sensor systems, Cyber-Phys. Syst. 1, 91–112 (2015) [Google Scholar]

[R24] J. Mok, N. Kwak, Performance improvement of facial gesture-based user interface using MediaPipe face mesh, J. Internet Things Converg. 9, 125–134 (2023) [Google Scholar]

[R25] K. Amara, O. Kerdjidj, N. Ramzan, Emotion recognition for affective human digital twin by means of virtual reality enabling technologies, IEEE Access 11, 74216–74227 (2023) [Google Scholar]

[R26] X. Zhang, X. Xie, H. Zhao et al., Seismic response prediction method of train-bridge coupled system based on convolutional neural network-bidirectional long short-term memory-attention modeling, Adv. Struct. Eng. 28, 341–357 (2025) [Google Scholar]

[R27] J. Gonzalez, W. Yu, L. Telesca, Gated recurrent units based recurrent neural network for forecasting the characteristics of the next earthquake, Cybern. Syst. 53, 209–222 (2022) [Google Scholar]

[R28] S. Hazmoune, F. Bougamouza, Using transformers for multimodal emotion recognition: taxonomies and state of the art review, Eng. Appl. Artifi. Intell. 133, 108339 (2024) [Google Scholar]

[R29] S. Banar, R. Mohammadi, Seismonet: a proximal policy optimization-based earthquake early warning system using dilated convolution layers and online data augmentation, Expert Syst. Appl. 253, 124337 (2024) [Google Scholar]