Int. J. Simul. Multidisci. Des. Optim., Volume 16, 2025
Issue: Multi-modal Information Learning and Analytics on Cross-Media Data Integration
Article Number: 26
Number of page(s): 15
DOI: https://doi.org/10.1051/smdo/2025026
Published online: 21 October 2025
Research Article
User interface layout optimization design based on eye tracking simulation
1 School of Art and Design, North China Institute of Aerospace Engineering, Langfang 065000, Hebei, PR China
2 Faculty of Art, Sustainability and Creative Industry, Sultan Idris Education University, Tanjong Malim 359000, Perak, Malaysia
* e-mail: xicuiyu@nciae.edu.cn
Received: 5 July 2025
Accepted: 3 September 2025
To solve the problem of user visual attention distribution and spatial misalignment of interface elements in current user interface layout design due to reliance on subjective experience, this paper proposes an optimization method for fashion brand e-commerce user interface design that uses eye tracking simulation and integrates brand characteristics. This paper builds a fashion-specific visual behavior database through multimodal eye movement data collection, and uses the spatiotemporal attention mechanism of the Transformer-XL (Transformer with Extra Long Context) model to predict the gaze hotspots and scanning paths of users in dynamic tasks; then, this paper designs a multi-agent reinforcement learning optimizer, encodes aesthetic rules such as brand logo size and main color ratio as hard constraints, and generates an interface space distribution plan through element competition-collaboration game; finally, this paper develops a real-time interactive prototype based on the Unity engine to achieve dynamic layout closed-loop optimization driven by the collaborative efforts of brand constraints and eye movement simulation heat maps. Experimental results show that this method reduces the task completion time by 32% in the optimization of the light luxury clothing homepage, the overlap between the simulated and real eye movement hot spots reaches 88%, and the brand consistency score increases by 46% (from 3.2 to 4.7), significantly reducing the user's cognitive load and improving the user experience. The conclusion confirms that by integrating eye movement behavior quantification with brand characteristics, it is possible to break through the traditional design's reliance on static aesthetics, provide a “cognitive adaptation-brand expression” dual-goal collaborative intelligent design paradigm for fashion interfaces, and promote the simultaneous improvement of user conversion rate and brand value.
Key words: User interface design optimization / eye tracking simulation / dynamic reinforcement learning / spatiotemporal attention mechanism / multi-agent optimizer
© C. Xi and M.Z. Idris, Published by EDP Sciences, 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Fashion brand user interface design [1] has gradually become the core carrier of brand value transmission and user interaction experience in the digitalization wave. E-commerce platforms and customized service interfaces in the fields of clothing, jewelry, and other related products need to meet the dual needs of visual aesthetic expression [2] and efficient cognitive guidance, and the rationality of their layout directly affects user decision-making efficiency [3] and brand image perception. The current mainstream design methods depend heavily on the designer's subjective experience and static aesthetic rules, resulting in significant deviations between the spatial distribution of interface elements and users' actual visual behavior patterns. Studies have shown that when users browse fashion product interfaces, their attention distribution [4] exhibits dynamic evolution characteristics. For example, in a clothing matching scenario, the eyes frequently jump between product pictures, price tags, and recommended accessories, and traditional fixed layouts struggle to adapt to such dynamic needs. Although existing work has introduced eye tracking [5] for post-hoc evaluation, data collection is limited by laboratory environments and small sample sizes, and the optimization process lags behind the design stage, making real-time intervention impossible. More importantly, there is a lack of a systematic framework for the coordinated optimization of the aesthetic rules of fashion brand interfaces (such as LOGO position and main color ratio) with user cognitive rules (such as scanning paths and gaze duration): algorithms often take only click-through rate or task time as a single goal, ignoring the core business demand of brand consistency.
The above problems lead to a dilemma in interface design between user experience and brand value: excessive pursuit of layout efficiency may weaken the brand's unique visual language, while sticking to the aesthetic paradigm may easily lead to lengthy user operation paths and reduced conversion rates.
Empirical research in the field of interface layout design has explored user experience optimization paths from multiple dimensions, laying the foundation for methodological innovation. Santoso M F [6] applied UI/UX (User Interface/User Experience) design methods and Figma tools to develop web interfaces, and verified that a systematic design process can significantly improve the consistency, aesthetics, and user experience of the interface, especially in terms of color coordination and font standardization, which meets the requirements of the preset design system and provides a practical reference for interface innovation in the Internet era. However, this type of tool-driven static design model lacks adaptability to user dynamic behavior feedback. Li W et al. [7] analyzed the multi-terminal interface layouts of 40 responsive websites through a hierarchical group experiment, combined eye tracking and questionnaire surveys to extract the core design factors that affect interface consistency, and used sales management software as an example to verify the effectiveness of the cross-terminal responsive design method. They successfully achieved consistency and sustainability of user experience. Their method further expanded the objectivity of design verification at the level of behavioral data fusion. Wang X et al. [8] proposed a comprehensive interface aesthetics evaluation model based on multivariate regression and the entropy weight method. By integrating seven types of aesthetic indicators, such as density and symmetry, and combining them with automatic image segmentation technology, they developed a prototype system that supports rapid evaluation. This significantly improved the efficiency and flexibility of aesthetic evaluation of human-computer interaction interfaces and provided a lightweight solution for design iteration. 
Although the above studies have made progress in tool flow, multi-terminal adaptation, and cognitive mechanism analysis, they have not yet solved the problem of accurate matching of interface layout and real-time user attention in dynamic scenarios, and lack a systematic framework for the coordinated optimization of brand feature constraints and cognitive laws.
The integration of reinforcement learning methods with behavioral data and brand specifications is driving the evolution of user interface optimization design towards dynamic and personalized directions. Yan H et al. [9] proposed a fashion interface generation framework based on style transfer, which uses a GAN (Generative Adversarial Network) to convert brand design specifications into layout templates. Although this improves design efficiency, it lacks dynamic user adaptation capabilities. Zhan X et al. [10] proposed a deep learning method based on a VAE-GAN hybrid architecture. By capturing user preferences and behavior data, they generated personalized UI layouts in real time, achieving a personalization accuracy of 0.89 and a fast response time of 1.2 s, significantly better than existing systems. Their effectiveness was verified by cross-cultural users, providing a new paradigm for AI-driven adaptive interface design that balances efficiency and ethics. Although its generative model breaks through static constraints, its adaptability to the long-term evolution of user behavior remains to be verified. Khamaj A et al. [11] proposed a dynamic user interface optimization method based on reinforcement learning and a deep Q-network. The method realizes personalized adjustment of the interface by analyzing user behavior data in real time. While improving user engagement, satisfaction, and task completion rate, it solves the problems of low interactivity and lack of adaptability that static design imposes on traditional interfaces. This method compensates for the limitations of short-term data-driven approaches through continuous interactive learning optimization strategies.
Although the above studies have verified the potential of algorithm-enabled interface optimization in different fields, none of them solves the problem of universal adaptation across scenarios, and they lack deep collaborative modeling of multimodal data such as eye movements and behaviors, which restricts the ability to dynamically balance cognitive load in complex scenarios and thereby degrades the user experience of the interface.
To address the above challenges, this study proposes a collaborative optimization architecture based on eye tracking simulation and dynamic reinforcement learning. First, this paper constructs a multimodal eye movement database, covering typical fashion e-commerce web scenarios such as clothing browsing and jewelry customization, collects gaze point coordinates, scanning trajectories, and pupil diameter data of more than 200 users, and establishes a spatiotemporal visual behavior model. The Transformer-XL neural network is used to process long-sequence eye movement data, and the self-attention mechanism is used to capture the migration rules of users' eye movement patterns across tasks, such as the transition from global scanning in the product overview stage to local focus in the size selection stage. On this basis, a multi-agent reinforcement learning framework with brand feature embedding is designed, in which interface elements (product display area, purchase button, brand story bar, etc.) are defined as independent agents. The action space of each agent includes three types of operations: position offset, size scaling, and transparency adjustment. Brand design specifications (such as minimum LOGO display area, primary color ratio threshold, etc.) are encoded as hard constraints, which, together with the attention heat map predicted by user eye movement simulation, constitute a reward function, driving the intelligent agent to find the Pareto optimal solution through a competition-cooperation game. To verify the closed-loop optimization effect, a real-time interactive prototype system is developed based on the Unity engine, integrating the eye movement simulation module and the layout rendering engine to achieve the ability to dynamically adjust the interface at 20 frames per second. This method theoretically establishes for the first time a dual-objective optimization model of “cognitive adaptation-aesthetic expression” for fashion brand interfaces. 
It technically breaks through the limitation that traditional eye movement data is only used for a posteriori evaluation, and integrates behavior prediction and layout generation into a unified process. At the practical level, it provides designers with a visual tool chain, supports full-cycle collaboration from data collection, simulation optimization to prototype iteration, and promotes the transformation of fashion interface design from experience-driven to data-intelligent-driven paradigm.
2 Fashion interface eye movement simulation and dynamic optimization architecture
2.1 Overall architecture of interface optimization design
The system architecture is shown in Figure 1. The “incremental training model iteration” is triggered when the intersection-over-union (IoU) ratio between the predicted attention heatmap and the real user interaction data is below a threshold of 0.8. The “resource reallocation” is initiated when the model update jitter exceeds 5 ms to ensure rendering smoothness.
Figure 1 shows the eye-tracking interface optimization closed-loop system, which starts with multimodal data collection. The eye-tracking device captures the user's gaze points and scanning paths, and the brand design specifications supply hard constraints such as LOGO size and primary color; the two together constitute the basic input of the data layer. At the core processing layer, the spatiotemporal attention engine integrates eye movement and interaction data and generates a prediction heat map through the Transformer-XL model. When the intersection-over-union ratio reaches the preset threshold, the heat map is passed to the reinforcement learning optimizer; otherwise, incremental training model iteration is triggered. The optimizer combines the heat map with the brand constraints and solves the layout plan through a multi-agent game; if a brand hard constraint conflict occurs, agent competition-collaboration re-optimization is initiated. The feedback loop continuously monitors user interactions and records behavioral data to drive online incremental learning, and when the model update jitter exceeds the latency threshold, resource reallocation is triggered to ensure smoothness; new data then flows back to the spatiotemporal attention engine to form a closed loop. The system achieves the quantitative goals of increasing hot spot overlap and reducing task time while guaranteeing the hard constraints of brand aesthetics through the four-stage collaboration of “prediction-optimization-rendering-feedback”, and dynamically balances computing load and user experience under a low latency threshold.
Fig. 1 Overall design architecture.
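The two dispatch triggers in this architecture (incremental retraining when the heat-map IoU drops below 0.8, resource reallocation when update jitter exceeds 5 ms) can be sketched as a minimal control step; the function name and returned action labels are illustrative, not part of the paper's system:

```python
def control_step(iou, jitter_ms, iou_threshold=0.8, jitter_threshold=5.0):
    """Dispatch the closed-loop triggers: retrain when heat-map
    prediction quality drops, rebalance when update jitter grows."""
    actions = []
    if iou < iou_threshold:
        actions.append("incremental_training")
    if jitter_ms > jitter_threshold:
        actions.append("resource_reallocation")
    # With both indicators healthy, the loop simply keeps rendering.
    return actions or ["render"]
```

In the real system these actions would run asynchronously; the sketch only pins down the decision logic.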
2.2 Multimodal fashion interface eye movement data collection and modeling
In response to the data needs of fashion brand user interface layout optimization, this study constructed a multimodal eye movement behavior database for scenes such as clothing and jewelry. The data acquisition system uses a Tobii Pro Fusion eye tracker, a Logitech MX Keys keyboard, and a Microsoft Surface Dial knob to form a multi-channel interactive recording platform, which simultaneously captures the user's gaze point coordinates, scanning path, pupil diameter, and physical operation events during browsing, filtering, and matching tasks [12]. The experimental design includes three types of typical fashion interface prototypes: the homepage of a light luxury clothing e-commerce, the jewelry fashion interface, and the virtual fitting room. Each prototype is embedded with brand visual specification parameters (LOGO size, main color RGB value, font level Fi). The participants were 200 users of different ages, who were divided into three groups according to their fashion consumption frequency: high, medium, and low. They completed a preset task sequence in a controlled lighting environment and generated raw eye movement data Draw={(tk,xk,yk,dk,ok)|k=1,…,N}, where tk represents the timestamp, (xk,yk) represents the gaze point coordinates on the screen, dk represents the pupil diameter, and ok∊{click, zoom, rotate} represents the operation event.
In the data preprocessing stage, the timestamp alignment model is used to eliminate the multi-device collection delay. The formula is:

t′k = αtk + β (1)

In the formula, α is the scaling factor, which corrects the clock frequency difference between the eye tracker and the interactive devices; β is the offset, which compensates for the initial startup time deviation of the devices. The timing offset parameters α and β are solved by the least squares method to ensure that the time synchronization error between the eye movement data and the interaction events is minimized. The original gaze points are subjected to Kalman filter noise reduction, and the state equation is defined as:

xk = Axk−1 + wk, zk = Hxk + vk (2)
In the formula, A is the state transfer matrix, which describes the dynamic evolution of the gaze point from time k-1 to time k; H is the observation matrix, which maps the state vector to the measurement space. The state vector xk=[xk,yk,ẋk,ẏk]T contains the gaze point position and velocity, and the covariance matrix of the process noise wk and the observation noise vk is iteratively estimated by the EM algorithm.
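The least-squares fit of the two alignment parameters can be sketched as follows. This is a minimal illustration with synthetic timestamps; numpy's `lstsq` stands in for whatever solver the authors used:

```python
import numpy as np

def fit_time_alignment(t_eye, t_ref):
    """Least-squares fit of t_ref ~ alpha * t_eye + beta, where t_eye are
    eye-tracker timestamps and t_ref the same events on the reference
    (interaction-device) clock."""
    t_eye = np.asarray(t_eye, dtype=float)
    t_ref = np.asarray(t_ref, dtype=float)
    # Design matrix [t_eye, 1] for the linear model.
    A = np.stack([t_eye, np.ones_like(t_eye)], axis=1)
    (alpha, beta), *_ = np.linalg.lstsq(A, t_ref, rcond=None)
    return alpha, beta

# Synthetic check: reference clock runs 1% fast with a 0.5 s offset.
t_eye = np.linspace(0.0, 10.0, 50)
t_ref = 1.01 * t_eye + 0.5
alpha, beta = fit_time_alignment(t_eye, t_ref)
```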
The two steps are complementary: formula (1) corrects the macro-level timing misalignment between different devices, while formula (2) filters the micro-level noise in the gaze point signal.
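The constant-velocity Kalman filter over gaze points can be sketched as below; the noise scales `q` and `r` are illustrative constants rather than the EM-estimated covariances described in the text, and the sampling interval assumes a 120 Hz tracker:

```python
import numpy as np

def kalman_smooth_gaze(points, dt=1 / 120, q=50.0, r=25.0):
    """Kalman-filter a gaze trajectory with state x = [px, py, vx, vy].
    points: iterable of (x, y) gaze samples; returns filtered positions."""
    A = np.eye(4)
    A[0, 2] = A[1, 3] = dt            # position integrates velocity
    H = np.zeros((2, 4))
    H[0, 0] = H[1, 1] = 1.0           # we observe position only
    Q, R = q * np.eye(4), r * np.eye(2)
    x = np.array([points[0][0], points[0][1], 0.0, 0.0])
    P = np.eye(4) * 100.0
    out = [x[:2].copy()]
    for z in points[1:]:
        # Predict step
        x = A @ x
        P = A @ P @ A.T + Q
        # Update step
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ (np.asarray(z, float) - H @ x)
        P = (np.eye(4) - K @ H) @ P
        out.append(x[:2].copy())
    return np.array(out)
```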
The feature extraction module constructs spatiotemporal correlation features from the cleaned data: gaze point clustering generates visual hotspots Hi = (µx, µy, σx, σy, T) (mean, variance, duration), and the scanning path is modeled as a Markov transfer matrix M = [p(sj|si)]n×n, which represents the visual correlation strength between interface elements [13,14]. Brand features are quantified through a style transfer network (StyleGAN2), which maps the interface screenshots to a latent space vector zb ∈ R512 and constructs a brand aesthetic feature set B = {zb(1),…, zb(m)} [15]. Finally, the multimodal dataset Dmulti consists of the eye movement feature matrix E ∈ RN×6 (including coordinates, duration, and pupil diameter), the interaction event sequence, and the brand feature tensor B ∈ Rm×512. The cross-modal association is achieved through the tensor decomposition model:

T ≈ Σr λr er ⊗ or ⊗ br

In the formula, the tensor product ⊗ realizes the cross-modal joint representation of eye movement features, interaction events, and brand features. T ∈ RN×m×6 is the trimodal tensor; er, or, br are the latent factor vectors extracted from the trimodal data by Tucker decomposition, and λr is the weight coefficient, which represents the contribution weight of each latent factor to the global tensor. This modeling approach provides a joint representation space of cross-domain associations for subsequent spatiotemporal attention simulation.
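The Markov transfer matrix M = [p(sj|si)] over interface elements can be estimated by counting fixation transitions along a scanpath; the element labels and example path below are hypothetical:

```python
def scanpath_transition_matrix(scanpath, elements):
    """Estimate M[i][j] = p(next fixation on element j | currently on i)
    from a sequence of fixated element labels (a minimal sketch of the
    scanning-path model)."""
    idx = {e: i for i, e in enumerate(elements)}
    n = len(elements)
    counts = [[0.0] * n for _ in range(n)]
    for a, b in zip(scanpath, scanpath[1:]):
        counts[idx[a]][idx[b]] += 1.0
    # Row-normalise counts to probabilities; all-zero rows stay zero.
    M = []
    for row in counts:
        total = sum(row)
        M.append([c / total if total else 0.0 for c in row])
    return M

elements = ["image", "price", "accessory"]
path = ["image", "price", "image", "accessory", "image", "price"]
M = scanpath_transition_matrix(path, elements)
```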
2.3 Construction of an eye movement behavior simulation engine based on spatiotemporal attention
Based on the multimodal dataset, this study constructs an eye movement behavior simulation engine driven by spatiotemporal attention to achieve dynamic prediction of user visual attention [16]. The input data includes the eye movement feature matrix E ∈ ℝN×6, the interaction event sequence, and the brand feature tensor ℬ ∈ ℝm×512, which are mapped to a unified representation space through an embedding layer [17]. The eye movement feature vector ek = [xk,yk,Δxk,Δyk,dk,τk] (coordinates, displacement, pupil diameter, fixation duration) is position-encoded to generate a temporal tag sequence, where the position encoding function is:

PE(pos, 2i) = sin(pos/10000^(2i/dmodel)), PE(pos, 2i+1) = cos(pos/10000^(2i/dmodel))
In the formula, dmodel = 512 is the model dimension, pos represents the time step position, and i is the dimension index. The brand feature vector zb is projected into the brand context vector cb = Wbzb+bb through the fully connected layer, where Wb ∈ R512×512 is the weight matrix and bb ∈ R512 is the bias term. The projected brand features and temporal tags are concatenated and input into the improved Transformer-XL architecture, which introduces a two-stream attention mechanism to model temporal dependency and spatial association, respectively [18,19].
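The sinusoidal position encoding applied to the temporal tag sequence is the standard Transformer form; a minimal sketch:

```python
import math

def positional_encoding(pos, d_model=512):
    """Sinusoidal encoding: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pe = [0.0] * d_model
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe[i] = math.sin(angle)      # even dimensions
        pe[i + 1] = math.cos(angle)  # odd dimensions
    return pe
```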
The temporal attention flow adopts a segmented loop strategy. The memory module Mt ∈ RM×d stores the hidden state of the previous segment, and the current segment input Ht = Concat(Mt, Et) calculates the self-attention weight:

At = softmax((QtKtᵀ + Srel)/√dk)

In the formula, the relative position encoding matrix Srel ∈ RL×L is generated through learnable parameters, initialized as a zero-mean Gaussian distribution. Qt = WqHt and Kt = WkHt are the query matrix and key matrix, respectively, Wq and Wk ∈ Rd×dk are projection weights, and dk = 64 is the dimension of the attention head.
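The self-attention weights with an additive relative-position bias can be sketched as follows; matrix shapes are illustrative, and the bias is passed in rather than learned:

```python
import numpy as np

def rel_attention_weights(H, Wq, Wk, S_rel):
    """A = softmax((Q K^T + S_rel) / sqrt(d_k)) with Q = H Wq, K = H Wk.
    A sketch of the segment self-attention weights only (no values/output)."""
    Q, K = H @ Wq, H @ Wk
    d_k = Wq.shape[1]
    scores = (Q @ K.T + S_rel) / np.sqrt(d_k)
    # Numerically stable row-wise softmax.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```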
The spatial attention flow divides the interface into G×G grid regions, where each region g corresponds to a spatial position code pg ∈ Rd and aggregates neighborhood information through a graph convolutional network [20]:

hg(l+1) = σ(Σg′∈𝒩(g) αgg′ W(l) hg′(l))

In the formula, 𝒩(g) represents the neighborhood of region g, W(l) ∈ ℝd×d is the weight matrix of layer l, and σ is the GELU activation function [21]. The attention coefficient αgg′ is determined by the Euclidean distance dgg′ between regions and the similarity of brand features:

αgg′ = softmax(wᵀ[dgg′, cos(zb(g), zb(g′))])

In the formula, w ∈ ℝ2 is the learnable parameter vector, cos(·) calculates the cosine similarity, and zb(g) is the brand feature vector of region g.
The dual-stream outputs are combined by a gated fusion module to generate a joint representation:

g = sigmoid(Wgᵀ[htime; hspace]), hfuse = g ⊙ htime + (1 − g) ⊙ hspace

In the formula, Wg ∈ R2d×1 is the gate weight matrix, and htime and hspace are the temporal and spatial stream outputs. The prediction module uses a mixture density network to output the distribution parameters of the gaze points in the future Δt steps:

p(xt+Δt) = Σk πk 𝒩(x; μk, Σk)

In the formula, the mixing coefficient πk, mean μk, and covariance Σk are calculated through the fully connected layer and satisfy Σk πk = 1. The loss function jointly optimizes the position prediction error and the distribution likelihood, combining the expected-position error with the negative log-likelihood of the mixture.
The engine is trained with the AdamW optimizer (learning rate η = 3 × 10⁻⁴, momentum parameters β1 = 0.9, β2 = 0.98), gradient clipping is used to prevent divergence, and overfitting is controlled by early stopping on the validation set.
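The mixture-density likelihood that the prediction head optimizes can be illustrated with an isotropic 2-D simplification (the full covariances Σk are reduced to a single variance per component; this is a didactic sketch, not the paper's head):

```python
import math

def mdn_nll(x, y, pis, means, variances):
    """Negative log-likelihood of gaze point (x, y) under a 2-D
    isotropic Gaussian mixture: -log sum_k pi_k * N((x,y); mu_k, var_k*I)."""
    total = 0.0
    for pi, (mx, my), var in zip(pis, means, variances):
        norm = 1.0 / (2.0 * math.pi * var)         # 2-D isotropic normaliser
        d2 = (x - mx) ** 2 + (y - my) ** 2
        total += pi * norm * math.exp(-d2 / (2.0 * var))
    return -math.log(total)
```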
2.4 Design of reinforcement learning optimizer for embedding brand visual features
To achieve the coordinated optimization of brand aesthetic rules and user cognitive laws, a multi-agent deep reinforcement learning framework is designed, and interface elements (such as product display area, purchase button, brand logo) are defined as independent agents [22,23]. The state space of each agent contains local observation information: the current position (xi,yi), size (wi,hi), transparency αi, brand constraint satisfaction ρi of the element, and the predicted gaze heat value ϕi ∈ [0,1] obtained from the spatiotemporal attention engine. The action space Ai is defined as a three-dimensional continuous vector ai = [Δxi,Δwi,Δαi], which controls the position offset (pixel), width scaling (ratio), and transparency adjustment (0-1), respectively. The global state Sg integrates all agent states and the brand feature vector zb, and models the spatial relationship between elements through the graph attention network:
hi′ = Σj∈N(i) softmaxj((Wqhi)ᵀ(Wkhj)/√d) Wvhj

In the formula, Wq, Wk, and Wv ∈ Rd×d are projection matrices, d = 256 is the hidden layer dimension, and N(i) represents the set of agents that are spatially adjacent to element i. Multi-agent modeling demonstrates the game mechanism of independent element decision-making and global state integration, balancing local optimization and overall brand consistency requirements.
Brand visual constraints are embedded in the optimization process through a hard reward function and a soft penalty term [24]. Hard constraints include brand logo visibility: the LOGO display area satisfies wlogo × hlogo ≥ Smin (Smin = 64 × 64 px) and occupies no less than 10% of the screen area (based on Nielsen Norman Group's web usability guidelines); primary color ratio: the proportion of pixels whose RGB value matches the brand's primary color cbrand is ηcolor ≥ 30% (derived from analysis of 50 luxury brand websites in the Interbrand 2023 report); and font level consistency: the title font size and the body text font size satisfy Ftitle/Fbody ≥ 2.5.
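The three hard constraints can be checked with a simple validator; the function name and argument layout are illustrative, with the thresholds taken from the text:

```python
def check_hard_constraints(logo_w, logo_h, primary_ratio, f_title, f_body,
                           s_min=64 * 64, min_color=0.30, min_font=2.5):
    """Validate the brand hard constraints: minimum LOGO display area,
    minimum primary-color pixel ratio, and title/body font-size ratio."""
    return {
        "logo_area": logo_w * logo_h >= s_min,
        "primary_color": primary_ratio >= min_color,
        "font_hierarchy": f_title / f_body >= min_font,
    }
```

A layout that fails any entry would be rejected (or repaired) before the soft reward is even evaluated, which is what makes these constraints "hard".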
The soft reward function r(Sg,a) consists of three weighted parts:
Attention reward ratt measures how well the layout matches the predicted eye movement hotspots:

ratt = Σi ϕi · IoU(Bi, Hi)

In the formula, IoU calculates the intersection-over-union ratio of the element bounding box and the hot zone, and ϕi is the thermal value weight. Brand aesthetics reward rbrand evaluates the cosine similarity between the interface style and brand characteristics:

rbrand = cos(zrender, zb)

In the formula, zrender is the StyleGAN2 latent vector of the current interface rendering. Conflict penalty rconflict suppresses element overlap and visual confusion:

rconflict = −Σi<j Overlap(i, j) − λ · Entropy(Hcolor)

In the formula, Overlap calculates the overlapping area between elements, Entropy(Hcolor) is the information entropy of the interface color histogram, and λ = 0.1 balances the weights of the two terms. The hard constraints embedded in brand constraints ensure basic design specifications, and the soft rewards guide the Pareto optimality of aesthetics and functionality, avoiding the oscillation problem of the traditional penalty function method.
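The attention term of the composite reward pairs each element's bounding box with a predicted hot zone. A sketch under the simplifying assumptions of axis-aligned (x, y, w, h) boxes and a one-to-one element/hot-zone pairing:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def attention_reward(elements, hot_zones, heat_weights):
    """r_att = sum_i phi_i * IoU(element_i, hot_zone_i); the weights phi_i
    come from the predicted attention heat map."""
    return sum(w * iou(e, h)
               for e, h, w in zip(elements, hot_zones, heat_weights))
```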
The policy network πθ(a | 𝒮g) is optimized using the twin delayed deep deterministic policy gradient (TD3) algorithm, and the critic network Qϕ(𝒮g, a) is fully connected (layer structure 512-256-128). The TD3 algorithm improves the stability of the strategy through a dual-critic network and priority sampling, and is suitable for layout optimization tasks in high-dimensional continuous action spaces. The objective function maximizes the discounted cumulative reward:

J(θ) = E[Σt γ^t rt]
The discount factor γ = 0.99, and the exploration noise is a truncated normal distribution 𝒩(0, 0.1²). Prioritized experience replay is used in the training phase, and the sampling probability pi is adjusted based on the temporal difference error δi:

pi = (|δi| + ϵ)^α / Σj (|δj| + ϵ)^α

Hyperparameters ϵ = 0.01, α = 0.6. The optimizer uses the Adam algorithm (η = 10⁻⁴, β1 = 0.9, β2 = 0.999) and updates the target network parameters every 10,000 steps.
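The prioritized-replay sampling distribution above is direct to compute; a minimal sketch using the paper's default hyperparameters:

```python
def priority_probs(td_errors, eps=0.01, alpha=0.6):
    """p_i = (|delta_i| + eps)^alpha / sum_j (|delta_j| + eps)^alpha —
    transitions with larger TD error are replayed more often."""
    prios = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(prios)
    return [p / total for p in prios]
```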
The reward function design of the optimizer quantifies the collaborative goal of “cognitive adaptation-brand expression” through the ternary balance of attention matching, brand similarity, and conflict penalty.
2.5 Real-time interactive interface generation and feedback loop
Based on the layout parameter matrix L = [l1, l2, ..., ln] (where li = (xi, yi, wi, hi, αi) represents element position, size, and transparency) output by the reinforcement learning optimizer [25], a real-time interface rendering engine is constructed. The Unity URP rendering pipeline is used to implement dynamic layout mapping, and the geometric transformation of interface element primitive Pi is defined as:

Pi′ = R(θi) · diag(sx, sy) · Pi

In the formula, sx and sy are screen scaling factors (calculated dynamically based on device resolution), and R(θi) is the rotation matrix (default θi = 0 to preserve brand design specifications). The alpha channel is dynamically blended via the fragment shader:

Cout = αi · Csrc + (1 − αi) · Cdst
The rendering engine updates the interface at a frame rate of 60 FPS and receives the layout parameter stream Lt through the WebSocket protocol. The transmission delay constraint is Δtlatency ≤ 16.7 ms (matching the single-frame time).
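The scale-then-rotate vertex transform and the shader's alpha blend can be sketched in a few lines; these are pure-Python stand-ins for the URP pipeline operations, and the function names are illustrative:

```python
import math

def transform_primitive(x, y, sx, sy, theta=0.0):
    """Apply screen scaling diag(sx, sy) then rotation R(theta) to a
    primitive vertex; theta defaults to 0 per the brand constraint."""
    xs, ys = sx * x, sy * y
    c, s = math.cos(theta), math.sin(theta)
    return (c * xs - s * ys, s * xs + c * ys)

def alpha_blend(src, dst, alpha):
    """Per-channel blend C_out = alpha * C_src + (1 - alpha) * C_dst,
    the operation the fragment shader performs on the alpha channel."""
    return tuple(alpha * s + (1 - alpha) * d for s, d in zip(src, dst))
```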
User interaction data Dinteract = {(tk, uk, ak, fk)} is captured through the browser event monitoring module, where uk ∈ {click, hover, scroll} is the operation type, ak is the operation target element ID, and fk = (xk, yk, Δtk) records the cursor coordinates and dwell time. The data stream is converted into a time series feature vector vt ∈ R8 through a sliding window aggregator, where T = 5 s is the window length, σx and σy are the standard deviations of the cursor position, Entropy(Ha) calculates the operation target distribution entropy, CLS is the cross-attention score of brand characteristics and historical behaviors, and ||ΔLt||2 measures the intensity of layout changes.
The feedback loop updates the spatiotemporal attention engine and reinforcement learning policy network through online incremental learning. The attention engine parameters Θe are updated using the elastic weight consolidation (EWC) algorithm:

L(Θe) = Lnew(Θe) + (λ/2) Σi Fi (θi − θi*)²

In the formula, Fi is the diagonal element of the Fisher information matrix, calculated at the end of each incremental learning cycle (every 15 min), θi* are the parameters retained from the previous cycle, and λ = 10³ controls the retention strength of old knowledge. The reinforcement learning policy network πθ performs online fine-tuning through PPO (Proximal Policy Optimization):

LCLIP(θ) = Et[min(rt(θ)At, clip(rt(θ), 1 − ϵ, 1 + ϵ)At)]
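The EWC regularizer added to each incremental update can be written directly from its formula; this sketch treats the parameters as flat lists rather than network tensors:

```python
def ewc_penalty(theta, theta_old, fisher, lam=1e3):
    """(lambda / 2) * sum_i F_i * (theta_i - theta_i*)^2 — penalises
    drift of parameters that were important (high Fisher value) for
    the previous learning cycle."""
    return 0.5 * lam * sum(f * (t - t0) ** 2
                           for f, t, t0 in zip(fisher, theta, theta_old))
```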
The advantage function At is calculated by generalized advantage estimation (GAE):

At = Σl (γλgae)^l δt+l, where δt = rt + γV(st+1) − V(st)

Hyperparameters λgae = 0.95, ϵ = 0.2, and learning rate η = 3 × 10⁻⁵. The system performs an incremental update every 15 min, and a double buffering mechanism is used to maintain interface fluency during the update.
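The GAE sum is usually computed as a backward recursion; a sketch in which `values` carries one extra bootstrap entry beyond the rewards:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised advantage estimation: A_t = sum_l (gamma*lam)^l * delta_{t+l},
    with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    values must have len(rewards) + 1 entries (bootstrap value appended)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running   # accumulate the geometric sum
        advantages[t] = running
    return advantages
```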
The performance monitoring module calculates key indicators in real time, including layout rendering delay τrender = tGPU + tnetwork, strategy inference time τinfer ≤ 2 ms, and model update jitter Jupdate. When Jupdate > 5 ms, the resource reallocation strategy is triggered to dynamically adjust the computing node load.
2.6 UI design effects of fashion brand products
Among the several typical fashion interfaces studied in this paper, fashion products such as clothing and jewelry are taken as the research objects, and corresponding UI interfaces are designed. The interface effects are shown in Figures 2–4.
Fig. 2 Apparel fashion brand UI display.
Fig. 3 Jewelry fashion brand display.
Fig. 4 UI display of a high-end jewelry fashion brand.
3 Simulation optimization experiment design and evaluation
3.1 Experimental design process
To verify the effectiveness of the eye tracking simulation and dynamic optimization framework proposed in this paper, the experimental design includes two parts, a simulation experiment and an optimization experiment, which respectively evaluate the performance of the system in the simulation environment and the layout optimization effect after actual deployment.
The simulation experiment aims to verify the prediction accuracy of the spatiotemporal attention engine on user eye movement behavior and to build a virtual user interaction environment and a multi-scenario test set. The experimental platform is developed based on the Unity engine and integrates the Tobii Pro SDK simulator to generate an eye movement data stream. The test set includes three types of fashion interface scenarios: the apparel e-commerce homepage simulates users browsing new product recommendations, filtering products, viewing details and other tasks; the jewelry fashion interface simulates users selecting materials, adjusting sizes, previewing 3D models and other operations; the virtual fitting room simulates users trying on clothes, rotating perspectives, saving and sharing and other behaviors.
500 sets of test cases are generated for each scenario, covering different brand styles (fast fashion, affordable luxury, high-end customization) and user groups. The simulated eye movement data is injected with random noise (Gaussian noise, σ = 5 px) through a preset attention distribution model, based on a Gaussian mixture model fitted from real user data, to simulate the measurement error of a real device. In the evaluation phase, the gaze hotspots predicted by the engine are compared with the true hotspots generated by the simulator, and the intersection over union (IoU), fixation displacement error (FDE), and scanning path similarity (DTW) are calculated. The experimental results are shown in Table 1.
According to Table 1, the predicted hot zone IoU of the clothing e-commerce homepage reaches 85.2%, the gaze point offset error is 12.4 pixels, and the scanning path similarity is 0.78, all better than the other scenes, thanks to its strong layout regularity and clear user goals. The jewelry customization tool involves dynamic parameter adjustment, so the IoU drops to 82.7% and the FDE increases to 14.1 pixels, reflecting the challenge that interaction complexity poses to prediction accuracy. Affected by dynamic perspective switching, the IoU of the virtual fitting room further drops to 79.8%, and the FDE and DTW reach 16.3 pixels and 0.65, respectively, highlighting the difficulty of modeling the rapid migration of visual focus in dynamic scenes. The IoU standard deviations (3.1%∼5.2%) and FDE fluctuation ranges (2.8∼4.1 pixels) of the three types of scenes show that the engine is robust to noise interference; the IoU stays above 79% and the FDE below 17 pixels in the actual measurements, which meets the practical accuracy requirements of the eye tracker and proves that the spatiotemporal attention engine can provide a reliable attention distribution prior for layout optimization.
The optimization experiment targeted the actual deployed interface layout optimization system and selected the online platforms of three fashion brands (fast fashion brand A, affordable luxury brand B, and high-end jewelry brand C) for A/B testing. The experimental group adopted the dynamic optimization framework in this paper, and the control group maintained the manual layout solution of the original design team. Each experiment lasted for 4 weeks and covered 3 types of core pages: the brand homepage optimized the product display area, promotion entrance, and navigation bar layout; the product details page adjusted the image carousel area, purchase button, and recommended accessories location; the user center page reconstructed the visual hierarchy of order tracking, favorites, and customer service entrance.
During the experiment, user behavior data collected include task completion time, click-through rate, page dwell time, etc.; eye movement data include hot spot overlap, gaze entropy, scanning path length, etc.; brand indicators include style consistency score, main color ratio, LOGO visibility, etc.; business indicators include conversion rate, bounce rate, user satisfaction, etc. The results of some indicators of the optimization experiment are shown in Table 2.
In Table 2, the task completion time of fast fashion brand A is reduced from 8.3 s to 5.6 s, and the conversion rate increases by 22.4%, which confirms the significant gain of layout optimization on decision-making efficiency in high information density interfaces. The brand consistency score of affordable luxury brand B increases from 3.5 to 4.6, and the overlap of hot spots reaches 87%, reflecting the advantage of the framework in balancing aesthetic norms and functional accessibility. The hot zone overlap of high-end jewelry brand C increases by 19 percentage points (63%→82%), and user satisfaction NPS (Net Promoter Score) increases from 6.5 to 8.0, reflecting the strengthening effect of cognitive alignment on emotional experience in deep customization scenarios. The conversion rates of the three types of brands increase by 15.3% to 22.4%, and the NPS exceeds 8 points, indicating that the optimization not only shortens the operation path but also enhances user recognition through the collaborative design of “brand expression-user experience”. Data differences reveal the differentiated demands of brand characteristics for optimization goals: fast fashion focuses on efficiency leaps, high-end brands rely on deepening experience, and affordable luxury brands need to take both into account. The framework achieves generalized adaptation across brand scenarios through dynamic constraint embedding and multi-objective game, providing empirical support for commercial interface deployment.
The optimized UI page is shown in Figure 5. It can be seen that both the UI details and content have been effectively improved and supplemented.
Simulation experiments verified the cross-scenario robustness of the eye movement behavior prediction engine, and optimization experiments confirmed the effectiveness of the dynamic layout framework in a real business environment. The two together support the practical value of the “data-driven design” paradigm.
Table 1. Data results under different simulation scenarios.
Table 2. Optimization results of indicators for three fashion brands.
Fig. 5 Optimized UI page.
3.2 User interface layout effect evaluation indicators
In order to comprehensively evaluate the layout optimization effect, this paper defines five categories of quantitative indicators covering dimensions such as user experience and interface performance.
The task completion time measures the time it takes for the user to enter the interface and complete the target operation, and is calculated by recording the difference between the start and end timestamps of the operation through the log system.
The hot zone overlap uses the intersection over union (IoU) ratio to quantify the spatial match between the predicted gaze area and the actual user gaze distribution.
The brand consistency score is evaluated by three senior designers independently to see whether the interface meets the brand visual specifications (LOGO position, primary color ratio, font hierarchy, etc.). The score ranges from 1 to 5 points, and the average is taken as the final result.
User satisfaction is measured by the Net Promoter Score (NPS), and user recommendation willingness (0–10 points) is collected through a questionnaire.
Conversion rate improvement is the difference in the proportion of users who completed the target action between the optimization group and the control group during the experimental period. Five indicators systematically verify the layout optimization effect from the perspectives of efficiency, cognitive alignment, aesthetic compliance, emotional identification, and commercial value.
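The NPS calculation can be shown in its standard form; note that the study reports satisfaction on a 0-10 scale, which may be a mean recommendation rating rather than the classic promoters-minus-detractors percentage sketched here:

```python
def nps(scores):
    """Classic Net Promoter Score from 0-10 recommendation ratings:
    % promoters (9-10) minus % detractors (0-6), range -100..100."""
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100.0 * (promoters - detractors) / len(scores)
```

Either convention works for before/after comparisons, as long as it is applied consistently to both the optimization and control groups.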
4 Verification and analysis of the optimization effect of fashion interface layout
In order to systematically evaluate the optimization effect of interface layout, this paper combined multi-source data with visualization statistical methods to analyze from three dimensions: spatial distribution, temporal behavior, and brand expression, to support the verification of the collaborative goal of “cognitive adaptation-brand expression”.
4.1 Comparative analysis of hot zone spatial distribution
Figure 6 shows the spatial distribution comparison of the heat map predicted by the spatiotemporal attention engine and the real eye-tracking data in the clothing e-commerce homepage scenario. The horizontal axis represents the horizontal coordinate of the interface (0–800 px), and the vertical axis represents the vertical coordinate (0–500 px), increasing from top to bottom, with each grid cell corresponding to 50 px. The coordinates of the five core areas are the promotion entrance (100, 200), navigation bar (250, 150), product display area (400, 300), brand story column (550, 250), and customer service entrance (700, 400). The predicted heat map shows that the promotion entrance has the highest heat value (0.92) and the customer service entrance the lowest (0.61). The real heat map (b) shows the same trend but with slightly lower values (0.88 for the promotion entrance and 0.54 for the customer service entrance). The matching degree of hot zones is quantified by the intersection over union (IoU): the promotion entrance (89.5%) and the brand story column (85.1%) match best, reflecting accurate prediction of the core functional areas; the customer service entrance matches worst (62.8%), exposing prediction deviation in edge areas. Scatter point size maps thermal intensity, and the color shifts from blue to red to form a visual intensity gradient.
The heat differences stem from the cognitive modeling characteristics of the spatiotemporal attention engine: the high match of the promotion entrance and the brand story column (IoU > 85%) is due to the Transformer-XL model accurately capturing the visual appeal mechanism of the brand's main elements, strengthening the feature expression of the area around the LOGO through self-attention weights. The low match of the customer service entrance reveals the model's prediction limitations for low-frequency interaction areas: when the user's gaze quickly sweeps the edge of the screen, the spatial attention coefficient of the graph convolutional network is not fully activated, because the joint weight of Euclidean distance and brand feature similarity is under-computed in edge areas. The generally higher predicted values reflect underfitting of the Gaussian mixing coefficients, output as distribution parameters by the mixture density network, to long-tail gaze behavior; in particular, when user fatigue causes the pupil diameter d_t to fluctuate, the Kalman filter's state equation fails to fully eliminate the noise.
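The pupil-diameter denoising mentioned here can be illustrated with a minimal one-dimensional Kalman filter; the process and measurement noise variances below are illustrative assumptions, not the paper's values:

```python
def kalman_1d(measurements, q=1e-3, r=1.0):
    """Minimal 1-D Kalman filter smoothing a noisy pupil-diameter stream.
    q: process noise variance, r: measurement noise variance (assumed)."""
    x, p = measurements[0], 1.0
    out = [x]
    for z in measurements[1:]:
        p += q                 # predict: state carries over, uncertainty grows
        k = p / (p + r)        # Kalman gain
        x += k * (z - x)       # update toward the new measurement
        p *= (1 - k)
        out.append(x)
    return out
```

When fatigue makes the signal drift rather than jitter, a constant-state model like this lags behind, which is one plausible reading of the residual noise the text describes.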
Fig. 6 Comparison between simulation prediction and actual.
4.2 Cross-brand verification of dynamic layout optimization framework
Figure 7 uses the number of experimental weeks (1–4 weeks) as the horizontal axis and the task completion time as the vertical axis. The solid line is the experimental group, and the dotted line is the control group. It shows the efficiency evolution trend of three types of fashion brands (fast fashion A, light luxury B, and high-end jewelry C) under the dynamic layout optimization framework. The task time of the fast fashion brand A (blue) experimental group drops sharply from 8.3 s in the first week to 5.6 s in the fourth week, a decrease of 32.5%, while the control group remains stable at around 8.0 s, and a significant difference is seen from the second week (**p<0.01). The light luxury brand B (orange) experimental group drops from 12.1 s to 8.7 s, a decrease of 28.1%. The optimization range continues to expand, and in the fourth week, it drops by another 4.6% compared with the third week. The experimental group of high-end jewelry brand C (yellow) shows gradual improvement, dropping from 15.4 s to 10.9 s, which becomes significant from the third week, and the rate of decline is the fastest in the later period (a drop of 7.6% in the fourth week). The data of the three-brand control group maintains a stable fluctuation (±0.3 s), proving that the effect of the experimental group is due to layout optimization rather than time factors. The final optimization amount in the fourth week is marked on the right side of the figure: Brand A reduces by 2.4 s, a decrease of 30.0%, Brand B reduces by 3.0 s, a decrease of 25.6%, and Brand C reduces by 3.9 s, a decrease of 26.4%.
The step-by-step improvement in task efficiency maps to the underlying mechanism of the dynamic reinforcement learning framework: Brand A's early, rapid response stems from the multi-agent competition algorithm precisely positioning the promotion area-purchase button path, improving visual hotspot matching through the reward function and thereby raising the transfer probability of key operations; Brand C's later acceleration benefits from PPO incremental learning gradually adapting to complex tasks: when the gaze entropy of the user center page decreases, the algorithm automatically strengthens the transparency adjustment of the order tracking control to reduce cognitive interference. The later onset of significance for the high-end brand reflects the initial bargaining cost of brand constraints: to satisfy the hard constraint on LOGO visibility, the first week's layout sacrifices some hot zone overlap; over time, the attention stream accumulates enough state data, the dual-stream fusion gating coefficient is adjusted, and time-dependent modeling is prioritized, ultimately achieving an efficiency leap under strict aesthetic standards. The data show that for Brand C, the temporal attention stream's weight in the gated fusion module increased from 0.52 in Week 1 to 0.78 in Week 4. The cross-brand differentiation confirms that the framework achieves the Pareto optimality of “efficiency-brand” through adaptive weighting of brand feature vectors.
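The dual-stream gated fusion implied by these weights (0.52 rising to 0.78) can be sketched as follows; the parameterisation, a learned sigmoid gate over the concatenated stream features, is an assumption, since the paper does not specify the exact form:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(h_temporal, h_spatial, W_g, b_g):
    """Gated fusion of the temporal and spatial attention streams:
    g = sigmoid(W_g . [h_t; h_s] + b_g), output = g*h_t + (1-g)*h_s.
    W_g and b_g are hypothetical learned gate parameters."""
    g = sigmoid(W_g @ np.concatenate([h_temporal, h_spatial]) + b_g)
    return g * h_temporal + (1.0 - g) * h_spatial
```

As the gate saturates toward 1, the output is dominated by the temporal stream, matching the reported shift toward time-dependent modeling in later weeks.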
Fig. 7 Trends in multi-brand task completion time optimization.
4.3 Brand visual consistency assessment
Figure 8 quantifies the brand visual consistency optimization effect through a five-dimensional radar chart. The five vertices correspond to the LOGO position, primary color proportion, font level, white space ratio, and dynamic consistency evaluation dimensions. After optimization, the five-dimensional scores of fast fashion brand A (blue series) are comprehensively improved: the logo position increases from 3.8 to 4.5, an increase of 18.4%, the font level increases from 3.9 to 4.6, an increase of 17.9%, the dynamic consistency improves most significantly from 3.0 to 4.0, and the polygon area increases by 28.6%. The optimization of light luxury brand B (orange series) is even greater: the proportion of the main color jumps from 3.3 to 4.5, an increase of 36.4%, the LOGO position increases from 3.5 to 4.6, an increase of 31.4%, and the dynamic consistency achieves a leapfrog growth from 2.8 to 4.3, an increase of 53.6%. After optimization, the polygons of both brands approach the ideal pentagon (maximum radius 5.0), with Brand B establishing an advantage in LOGO position and primary color ratio, while Brand A maintains its lead in font level and white space ratio. The dynamic consistency dimension specifically marks the improvement value (Brand A+1.0, Brand B+1.5), revealing the core contribution of the framework to dynamic interface adaptation.
The score improvements map to the multi-constraint collaborative mechanism of the brand feature embedding optimizer: the increase in Brand B's main color proportion comes from reinforcement learning that binds the reward function to the StyleGAN2 latent vector; when the proportion of main-color pixels in the interface falls low, a penalty is triggered, driving the algorithm to improve through color histogram reweighting. The optimization of Brand A's font hierarchy is attributed to the multi-agent action space design. The outstanding progress in the dynamic consistency dimension (Brand B, +1.5 points) verifies the closed-loop coupling of the spatiotemporal attention engine and reinforcement learning: when the deviation between the user's gaze hotspot and the brand LOGO area exceeds a set range, the gated fusion coefficient is dynamically adjusted and brand constraint compensation is executed first, reducing layout jitter when the interface changes during the promotion season. The balanced improvement across all five dimensions verifies that the framework achieves Pareto optimality between the aesthetics and functionality of the fashion UI through adaptive reward weights.
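The color-ratio penalty described here might be encoded in the reward as sketched below; the target ratio, penalty weight, and function name are illustrative assumptions, not values from the paper:

```python
def layout_reward(hotzone_iou, main_color_ratio,
                  target_ratio=0.30, penalty_weight=2.0):
    """Sketch of a brand-constrained reward: attention alignment (hot-zone
    IoU) minus a penalty proportional to how far the primary-color pixel
    ratio falls below the brand target. All weights are illustrative."""
    shortfall = max(0.0, target_ratio - main_color_ratio)
    return hotzone_iou - penalty_weight * shortfall
```

A layout meeting the color target keeps its full IoU-based reward; one that falls below it is penalized linearly, which is what pushes the optimizer toward histogram reweighting.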
Fig. 8 Brand consistency results in each dimension.
4.4 User glance transfer probability verification
Figure 9 uses a Sankey diagram to visualize the optimization of users' scanning path transfer probabilities. The left nodes are the starting points of the visual path (navigation, products, purchase, recommendations, customer service), and the right nodes are the end points (products, purchase, recommendations, customer service, navigation). In the pre-optimization paths (warm colors), the probability of the navigation-product transfer is 35% (wide red band), product-purchase is 28%, and purchase-recommendation is 18%, forming the main path. After optimization (cool colors), the probabilities of the same paths increase significantly: navigation-product jumps to 52% (wide blue band), up 48.6%; product-purchase rises to 41%, up 46.4%; purchase-recommendation rises to 32%; and total core path bandwidth expands by 62%. Edge paths are optimized in parallel: the probability of recommendation-customer service drops from 12% to 8%, a decrease of 33.3%, reflecting the simplification of redundant paths; customer service-navigation rises from 7% to 11%, an increase of 57.1%, reflecting a strengthened operational closed loop. The cool-colored flow bands are overall wider than the warm-colored ones, visually forming the optimization pattern of “core focus, edge simplification”.
The probability jumps map to the cognitive guidance mechanism of dynamic layout optimization: the 48.6% increase in the navigation-product path is due to the rightward shift and expansion of the product display area, which increases its overlap with the navigation bar's hot zone and strengthens the attention anchoring effect; the increase in the purchase-recommendation path is attributed to the multi-agent competition algorithm: when the recommendation bar agent detects that gaze duration on the purchase button exceeds a threshold, it automatically triggers a transparency adjustment to enhance visual relevance, and the transfer probability weight is increased in the reward function. Edge path optimization reflects the conflict penalty mechanism: the probability of recommendation-customer service decreases because, when the algorithm detects that the overlap area of element bounding boxes is too large, the customer service entrance is shrunk and moved to reduce visual interference; the growth of customer service-navigation verifies the effectiveness of gated fusion feedback: when users dwell in the customer service area beyond a certain duration, the spatiotemporal attention engine's output drives the navigation LOGO to increase transparency, guiding the gaze back. The expansion of core path bandwidth confirms that the reinforcement learning strategy successfully converges users' visual flow onto the optimal “brand expression-functional accessibility” path.
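The transfer probabilities visualized in the Sankey diagram are first-order transition frequencies, which can be estimated from recorded scan paths as sketched below (region names are illustrative):

```python
import numpy as np

def transition_probabilities(paths, nodes):
    """Estimate first-order gaze transition probabilities (the Sankey
    band widths) from scan paths over named interface regions.
    Each row of the result sums to 1 where transitions exist."""
    idx = {n: i for i, n in enumerate(nodes)}
    counts = np.zeros((len(nodes), len(nodes)))
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[idx[a], idx[b]] += 1
    row = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row, out=np.zeros_like(counts), where=row > 0)

P = transition_probabilities(
    [["navigation", "product", "purchase"],
     ["navigation", "product", "recommendation"]],
    ["navigation", "product", "purchase", "recommendation"])
```

Comparing the matrix estimated before optimization with the one after directly yields the percentage changes reported in the text.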
Fig. 9 Comparison of user scanning path transition probability optimization.
4.5 Optimization and verification of the layout of core elements of fashion interface
Figure 10 shows the layout optimization effect of the core elements of the fashion interface through the displacement vector in the Cartesian coordinate system. The horizontal axis represents the horizontal displacement, with positive values indicating rightward displacement and negative values indicating leftward displacement; the vertical axis represents the vertical displacement, with positive values indicating upward displacement and negative values indicating downward displacement. The product display area is a blue arrow, which is significantly moved to the right by 120 px and down by 30 px, and the size is increased by 15% (solid circle); the purchase button is an orange arrow that moves 80 px to the left and 50 px up, increasing its size by 20%; the brand logo is a yellow arrow that remains in the same horizontal position but moves 40 px up, increasing its size by 10%; the recommended accessories column is a purple arrow that moves 150 px to the right and reduces its size by 5% (dashed circle); the customer service entrance moves 60 px to the left and 20 px down, increasing its size by 8%. The length of the colored arrows precisely quantifies the displacement distance, the size of the rings maps the element scaling, and the element labels with a translucent white background indicate the key adjustment parameters.
The displacement pattern reflects the reinforcement learning optimizer's collaborative solution to the dual goals of “cognitive adaptation-brand expression”: the rightward shift of the product display area follows the spatiotemporal attention engine's prediction of high hotspot overlap on the right, with multi-agent decisions driven by position offset rewards; the leftward shift and enlargement of the purchase button respond to the need to reduce cognitive load after gaze entropy decreases, with the +20% scaling action triggered when the button's gaze duration exceeds a threshold. The purely vertical upward movement of the brand LOGO verifies the hard constraint priority mechanism: to satisfy the visibility constraint, the agent gives up horizontal displacement freedom and chooses the Pareto optimal solution of vertical displacement. The shrinking of the recommended accessories column reflects the conflict penalty's suppression of element overlap; the leftward shift of the customer service entrance results from brand feature embedding: when its RGB value is too similar to the background color, the reward function drives an increase in visual contrast. The overall distribution of displacement vectors confirms that the framework balances user cognition rules and brand aesthetic rules through dynamic weighting.
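The overlap check behind the conflict penalty can be sketched as follows; the (x, y, w, h) box convention and function name are assumptions for illustration:

```python
def bbox_overlap_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x, y, w, h).
    A positive result signals a layout conflict that the optimizer
    penalizes by shrinking or moving one of the elements."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(0.0, w) * max(0.0, h)
```

Disjoint boxes return zero, so the penalty term vanishes once the displacement and scaling actions separate the elements.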
Fig. 10 Comparison of core element layout optimization results.
5 Conclusion
This study proposes a reinforcement learning optimization framework that integrates eye tracking simulation and brand feature embedding. It predicts users' visual paths through a spatiotemporal attention mechanism, encodes brand aesthetic rules through a multi-agent competition game, and realizes a dynamic layout closed loop through a real-time rendering engine, successfully constructing the “cognitive adaptation-brand expression” dual-goal collaborative paradigm for fashion interfaces. Empirical evidence shows that hot spot overlap increased by 19–27 percentage points (82% for the high-end jewelry brand), task completion time decreased by 28–32% (5.6 s for the fast fashion brand), and the brand consistency score reached 4.8 points (affordable luxury brand, +31.4%), while driving the conversion rate up by 22.4%. Eye-tracking data thus shifts from an evaluation tool to a design input, breaking through the limitations of traditional static layouts; however, prediction accuracy for dynamic controls and the cold start problem with small samples still need to be overcome. Future work will integrate additional multimodal biological signals to enhance prediction robustness, develop a 3D optimization engine that supports virtual reality, and build a fashion knowledge graph to achieve zero-shot style transfer, continuing to lead user interfaces from empirical design toward cognitive intelligence.
Funding
This work was supported by: The Development of Virtual Tour Digital Platform Technology for Commercial Complexes (grant number YS-2024-10-H).
Conflicts of interest
The authors have nothing to disclose.
Data availability statement
This article has no associated data generated and/or analyzed.
Author contribution statement
Conceptualization, C. X. and M. Z. I.; Methodology, C. X.; Software, C. X.; Validation, M. Z. I.; Formal Analysis, M. Z. I.; Investigation, M. Z. I.; Resources, M. Z. I.; Writing − Original Draft Preparation C. X. and M. Z. I.; Visualization, C. X.; Supervision, C. X., all authors were aware of the publication of the article and agreed to its publication.
References
- M. Setyawan, R. Perkins, Desain user interface sistem order berbasis mobile untuk produk brand clothing pada rown division, IT-Explore: J. Penerapan Teknologi Informasi dan Komunikasi, 1, 62–76 (2022) [Google Scholar]
- A.L.S. Lima, C. Gresse von Wangenheim, Assessing the visual esthetics of user interfaces: a ten-year systematic mapping, Int. J. Hum. −Comput. Interact. 38, 144–164 (2022) [Google Scholar]
- H. Guo, P. Fang, S. Liu, S. Yang, Research on information visualization design strategy of intelligent parking interface based on situational experience, J. Anhui Univ. Technol. (Social Sciences), 40, 36–39 (2023) [Google Scholar]
- M Xu, Research on interface design of ship intelligent navigation system based on visual communication, Ship Sci. Technol. 45, 166–169 (2023) [Google Scholar]
- A. Plopski, T. Hirzle, N. Norouzi, L. Qian, G. Bruder, T. Langlotz, The eye in extended reality: a survey on gaze interaction and eye tracking in head-worn extended reality, ACM Comput. Surv. (CSUR) 55, 1–39 (2022) [Google Scholar]
- M.F. Santoso, Implementation of UI/UX concepts and techniques in web layout design with figma, J. Teknologi Dan Sistem Informasi Bisnis, 6, 279–285 (2024) [Google Scholar]
- W. Li, Y. Zhou, S. Luo, Y. Dong, Design factors to improve the consistency and sustainable user experience of responsive interface design, Sustainability 14, 9131 (2022) [Google Scholar]
- X. Wang, M. Tong, Y. Song, C. Xue, Utilizing multiple regression analysis and entropy method for automated aesthetic evaluation of interface layouts, Symmetry 16, 523 (2024) [Google Scholar]
- H. Yan, H. Zhang, L. Liu, D. Zhou, X. Xu, Z. Zhang et al., Toward intelligent design: an AI-based fashion designer using generative adversarial networks aided by sketch and rendering generators, IEEE Trans. Multimedia 25, 2323–2338 (2022) [Google Scholar]
- X. Zhan, Y. Xu, Y. Liu, Personalized UI layout generation using deep learning: an adaptive interface design approach for enhanced user experience, J. Artif. Intell. Gen. Sci. F (JAIGS), 6, 463–478 (2024). ISSN: 3006–4023 [Google Scholar]
- A. Khamaj, A.M. Ali, Adapting user experience with reinforcement learning: personalizing interfaces based on user behavior analysis in real-time, Alex. Eng. J. 95, 164–173 (2024) [Google Scholar]
- R. Guo, N. Kim, J. Lee, Empirical insights into eye-tracking for design evaluation: applications in visual communication and new media design, Behav. Sci. 14, 1231 (2024) [Google Scholar]
- Z. Wang, P. Zhan, Eye-tracking-based hidden Markov modeling for revealing within-item cognitive strategy switching, Behav. Res. Methods, 57, 1–38 (2025) [Google Scholar]
- J. Yin, J. Sun, J. Li, K. Liu, An effective gaze-based authentication method with the spatiotemporal feature of eye movement, Sensors 22, 3002 (2022) [Google Scholar]
- O. Tov, Y. Alaluf, Y. Nitzan, O. Patashnik, D. Cohen-Or, Designing an encoder for stylegan image manipulation, ACM Trans. Graph. (TOG), 40, 1–14 (2021) [Google Scholar]
- F. Diaz-Guerra, A. Jimenez-Molina, Continuous prediction of web user visual attention on short span windows based on gaze data analytics, Sensors 23, 2294 (2023) [Google Scholar]
- L.M. Vortmann, F. Putze, Combining implicit and explicit feature extraction for eye tracking: attention classification using a heterogeneous input, Sensors 21, 8205 (2021) [Google Scholar]
- S. Chakraborty, Z. Wei, C. Kelton et al., Predicting visual attention in graphic design documents, IEEE Trans. Multimedia 25, 4478–4493 (2022) [Google Scholar]
- L. Xiao, S. Wang, Mobile marketing interface layout attributes that affect user aesthetic preference: an eye-tracking study, Asia Pac. J. Mark. Logist. 35, 472–492 (2023) [Google Scholar]
- Y. Li, Y. Tang, Design on intelligent feature graphics based on convolution operation, Mathematics 10, 384 (2022) [Google Scholar]
- M. Lee, Mathematical analysis and performance evaluation of the gelu activation function in deep learning, J. Math. 2023, 4229924 (2023) [Google Scholar]
- D. Vidmanov, A. Alfimtsev, Mobile user interface adaptation based on usability reward model and multi-agent reinforcement learning, Multimodal Technol. Interact. 8, 26 (2024) [Google Scholar]
- D. Gaspar-Figueiredo, M. Fernández-Diego, S. Abrahão, E. Insfran, A comparative study on reward models for user interface adaptation with reinforcement learning, Empir. Softw. Eng. 30, 1–48 (2025) [Google Scholar]
- A. Burnap, J.R. Hauser, A. Timoshenko, Product aesthetic design: a machine learning augmentation, Mark. Sci. 42, 1029–1056 (2023) [Google Scholar]
- S. Jang, S. Yoo, N. Kang, Generative design by reinforcement learning: enhancing the diversity of topology optimization designs, Computer-Aided Design 146, 103225 (2022) [Google Scholar]
Cite this article as: Cuiyu Xi, Muhammad Zaffwan Idris, User interface layout optimization design based on eye tracking simulation, Int. J. Simul. Multidisci. Des. Optim. 16, 26 (2025), https://doi.org/10.1051/smdo/2025026