| Issue | Int. J. Simul. Multidisci. Des. Optim., Volume 16, 2025: Innovative Multiscale Optimization and AI-Enhanced Simulation for Advanced Engineering Design and Manufacturing |
|---|---|
| Article Number | 24 |
| Number of page(s) | 17 |
| DOI | https://doi.org/10.1051/smdo/2025014 |
| Published online | 24 October 2025 |
Research Article
Rapid recognition and localization of virtual assembly components in bridge 3D point clouds based on supervoxel clustering and transformer
School of Computer Science, Semyung University, 65 Semyung-ro, Jecheon-si 27136, Chungcheongbuk-do, Republic of Korea
* e-mail: 2023624801@semyung.ac.kr
Received: 29 July 2025
Accepted: 12 August 2025
Traditional rule-based manual bridge inspection methods often suffer from low efficiency and poor accuracy, making them inadequate for the demands of industrial-scale production. This study aims to achieve rapid recognition and localization of virtual assembly components within bridge 3D point clouds by constructing an intelligent analytical framework that integrates supervoxel clustering with a Transformer architecture. Specifically, an improved supervoxel clustering algorithm is developed, deeply integrating geometric morphology, density distribution, and structural response features to generate multimodal voxel units, thereby enhancing the semantic representation of local features. A graph-based Transformer module is introduced to model spatial relationships and semantic associations among supervoxel nodes through a self-attention mechanism, effectively integrating global contextual information. Additionally, a voxel voting strategy within a pose estimation module is employed to optimize component localization accuracy, forming an end-to-end recognition and localization system. The proposed model demonstrates excellent performance across multiple datasets, including Stanford Large-Scale 3D Indoor Spaces Dataset, ETH Zurich Building Dataset, International Society for Photogrammetry and Remote Sensing Benchmark Dataset, and National Building Museum Point Cloud Dataset. Compared to baseline models, the proposed approach achieves improvements of over 21.5% in semantic segmentation Mean Intersection over Union, instance recognition accuracy, and pose regression precision. In complex multi-box girder bridge scenarios, the recognition accuracy for small-scale connectors improves by up to 37.1%. Computational efficiency increases by more than 18.7%, with inference time reductions of up to 31.5% when processing large-scale data. 
Overall improvements in bridge component recognition exceed 22.4%, with recognition accuracy for critical connection components increasing by up to 37.4%, and localization accuracy improving by over 26.2%, reaching up to 35.9% for key node localization. The results demonstrate that the proposed model effectively addresses critical challenges in processing bridge point cloud data through multimodal feature fusion and global structural reasoning, significantly enhancing component recognition accuracy and localization precision in complex scenes while maintaining a balance between algorithmic efficiency and model performance. This study provides an efficient solution for the digital delivery and quality control of intelligent bridge construction. By integrating finite element analysis with deep learning, the model enhances semantic understanding of bridge structural functions, contributing significantly to the advancement of intelligent bridge engineering.
Key words: Bridge inspection / 3D point cloud / supervoxel clustering / transformer / assembly components
© C. Huang et al., Published by EDP Sciences, 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
In modern bridge engineering, steel–concrete composite girder bridges have become a key choice for transportation infrastructure due to their superior mechanical performance and economic efficiency [1]. The assembly accuracy of prefabricated components in such bridge structures directly affects the load-bearing capacity and service life of the bridge, imposing stringent requirements on both manufacturing precision and assembly efficiency [2]. Against this background, virtual assembly technology based on bridge 3D point clouds offers an innovative approach to precise prefabricated component assembly by constructing high-precision digital twin models, gradually becoming a core technology in intelligent bridge construction [3].
Traditional methods for inspecting prefabricated bridge components mainly rely on rule-based manual interventions, such as manual measurement and visual inspection, which suffer from low efficiency and fail to meet the demands for high precision and speed in industrial production [4]. With breakthroughs in deep learning for computer vision, data-driven point cloud processing methods have been introduced for component recognition and localization in bridges, enabling automatic feature extraction and improving detection efficiency to some extent [5]. However, these methods demonstrate limited generalization capability under small-sample conditions and struggle to adapt to the complex and variable working conditions and component shapes encountered in bridge engineering [6]. Meanwhile, supervoxel clustering and Transformer architectures exhibit unique advantages in point cloud processing [7]. Supervoxel clustering achieves semantically aware over-segmentation that effectively preserves the geometric details of point cloud data but lacks the capability to model global structural relationships [8]. Transformer models, centered on self-attention mechanisms, excel at capturing long-range dependencies but face challenges such as low computational efficiency and weak local feature extraction when processing unstructured point cloud data [9]. To date, the application of these two methods to the recognition and localization of virtual assembly components in bridge 3D point clouds has not coalesced into a systematic solution, making it difficult to achieve a collaborative representation of local details and global structural information.
To address the aforementioned challenges, this study proposes an intelligent analytical framework that integrates supervoxel clustering with a Transformer architecture to achieve rapid recognition and localization of virtual assembly components in bridge 3D point clouds. The framework’s core innovations lie in three aspects: first, an improved supervoxel clustering algorithm deeply fuses geometric morphology, density distribution, and structural response features to construct lightweight multimodal voxel units, enhancing the semantic representation of local features. Second, a graph-based Transformer module is introduced, leveraging the self-attention mechanism to model spatial relationships and semantic associations among supervoxel nodes, enabling efficient integration of global contextual information. Third, a voxel voting strategy is incorporated into the pose estimation module to optimize component localization accuracy, forming an end-to-end recognition and localization system.
The subsequent sections of this study are organized as follows: first, a literature review summarizes advances in 3D point cloud processing, Building Information Modeling (BIM) and point cloud fusion, and intelligent construction applications, clarifying existing technological bottlenecks. Next, the design of the rapid recognition and localization model for 3D point cloud virtual assembly components is detailed, including the specific fusion mechanisms between supervoxel clustering and the Transformer, as well as component recognition techniques. Then, the study design and data sources are described, introducing multi-scenario datasets used for model validation. Following this, comparative experiments and multidimensional evaluations verify the proposed model’s advantages in recognition accuracy, localization precision, and computational efficiency. Finally, conclusions are drawn, limitations identified, and future research directions proposed.
The results of this study provide an efficient virtual pre-assembly solution for intelligent bridge construction, supporting digital delivery and quality control objectives. Meanwhile, by integrating finite element analysis with deep learning across disciplines, the model significantly enhances semantic understanding of bridge structural functions, achieving improved recognition accuracy and robustness in complex scenarios. This offers new theoretical and technical support for the intelligent development of bridge engineering and holds important academic value and engineering application potential.
2 Literature review
With the rapid development of BIM technology and 3D point cloud data processing techniques, the use of 3D point clouds for the accurate recognition, modeling, and assembly of construction components has become a core research focus in the field of intelligent construction. In bridge engineering, which serves as a critical sector of infrastructure development, the high-precision assembly of prefabricated components is essential for improving construction efficiency and ensuring structural safety. However, the complexity of bridge structures, the unstructured nature of point cloud data, and the limitations of traditional methods in analyzing both local details and global structures have posed significant challenges for meeting the demands of intelligent bridge construction. This section systematically reviews the relevant research progress from the perspectives of 3D point cloud data processing, the integration of BIM and point clouds, and applications in intelligent construction. It also analyzes the technical bottlenecks of current approaches, providing theoretical support for the proposed method of bridge 3D point cloud virtual assembly component recognition and localization.
In the field of 3D point cloud data processing and BIM element updating, Rausch and Haas (2021) proposed an automated method that optimized the registration algorithm between point clouds and BIM models to achieve real-time updates of the shapes and poses of construction components. Their study provided technical support for dynamic monitoring during the construction process. Although this method effectively improved the alignment between BIM models and point cloud data, its generalization capability in complex bridge structure scenarios still required further validation [10]. Lim et al. (2021) addressed the issue of dynamic object interference in static 3D point cloud map construction by proposing the ERASOR algorithm, which efficiently removed dynamic objects by calculating the centripetal ratio of pseudo-occupancy. This significantly enhanced the usability of point cloud data and the accuracy of map construction, and the real-time performance and robustness of the algorithm had been fully validated in mobile robot navigation scenarios [11].
In the field of cultural heritage building digitization and semantic analysis of point clouds, Croce et al. (2021) proposed a semi-automated method that applied machine learning techniques to process semantic point clouds, converting them into historical building information models and offering a new approach for the digital preservation and management of cultural heritage [12]. Cotella (2023) further explored the application of Artificial Intelligence (AI) in the cultural heritage domain by employing deep learning algorithms to extract structural semantic information from point cloud data, thereby achieving the automated construction of historical building information models and demonstrating the potential of intelligent algorithms in parsing non-standard structural point clouds [13]. Mirzaei et al. (2022) conducted a comprehensive review of the application of machine learning in 3D point cloud processing for buildings and infrastructure. They systematically summarized technological advancements in point cloud data acquisition, preprocessing, feature extraction, and object recognition, while highlighting the insufficient generalization ability of deep learning in small-sample scenarios as a critical challenge to be addressed [14].
Significant progress has also been made in bridge engineering and intelligent construction research. Sun et al. (2023) proposed a construction monitoring method for large and complex steel structures based on laser point clouds. By establishing a deviation analysis framework between point clouds and design models, this approach enabled real-time assessment and dynamic adjustment of construction quality, effectively ensuring the precision of steel structure assembly [15]. Huang et al. (2022) employed point cloud data obtained through unmanned aerial vehicle photogrammetry, combined with semantic segmentation techniques, to achieve three-dimensional change detection at construction sites, providing a data-driven foundation for construction progress management and resource allocation [16]. Kim et al. (2022) utilized sensor-equipped quadruped robots to collect point cloud data of scaffolding and employed deep learning algorithms for three-dimensional scaffolding reconstruction, demonstrating the feasibility and efficiency of mobile platforms in data acquisition within complex construction environments [17].
At the level of integration and application of intelligent construction technologies, Casini (2022) conducted a systematic review of the application of Extended Reality (XR) technologies in intelligent building operations and maintenance, exploring the potential value of combining XR with point cloud data and BIM models, thus providing a reference for the digital operation and maintenance of bridge engineering [18]. Moyano et al. (2022) investigated the operability of point cloud data within architectural heritage information models, proposing data optimization and integration strategies and emphasizing the importance of the structured expression of point cloud data for enhancing model application performance [19]. Pan and Zhang (2023) conducted an in-depth analysis of the current state of the integration of BIM and AI in intelligent construction management. They noted that the synergy between the two could effectively improve the level of construction process intelligence, although challenges remained in algorithmic efficiency and multi-source data integration [20].
Despite significant progress in 3D point cloud processing and intelligent construction, the existing research faces notable limitations: first, most current methods focus on the building or cultural heritage domains, with limited research on the identification and localization of complex bridge structural components, such as prefabricated elements of steel-concrete composite beam bridges; second, traditional deep learning methods lack generalization capabilities in small sample and complex conditions, making them inadequate for addressing the diverse shapes of bridge components and the high assembly accuracy required in engineering applications; third, the collaborative expression of local geometric features and global structural information in point cloud data processing is weak, leading to limited accuracy in component recognition and localization in complex environments. Therefore, innovative technological approaches are needed to overcome the limitations of existing methods and achieve efficient recognition and precise localization of bridge 3D point cloud virtual assembly components, thereby advancing the development of intelligent bridge construction technologies.
3 3D point cloud virtual assembly component rapid recognition and localization model design
3.1 Supervoxel clustering and transformer model design
The unstructured nature of bridge 3D point cloud data and the complex geometric semantics pose dual challenges for the precise recognition and localization of prefabricated components [21]. Traditional methods struggle to balance the preservation of local geometric details and the modeling of global structural relationships. To address these challenges, this study proposes an intelligent framework integrating multimodal supervoxel clustering and graph-structured Transformer, utilizing rigorous mathematical modeling to achieve comprehensive quantitative analysis from feature encoding to structural reasoning. Figure 1 illustrates the results of supervoxel clustering.
As shown in Figure 1, supervoxel clustering effectively retains 3D structural details by merging spatial proximity and feature similarity, enhancing segmentation accuracy in complex scenarios. Additionally, its computational efficiency surpasses traditional voxel methods, demonstrating significant advantages in point cloud processing. Figure 2 illustrates the design process and recognition effect of supervoxel clustering.
Fig. 1 Effect of supervoxel clustering.

Fig. 2 Supervoxel clustering process and recognition effect.
3.1.1 Multimodal feature encoding of supervoxel clustering
In the point cloud set $P = \{p_i\}_{i=1}^{N}$, where $p_i \in \mathbb{R}^3$ represents the three-dimensional coordinates and $N$ is the total number of points, the following steps generate supervoxel units containing geometric, density, and structural response features:
–Geometric feature quantification analysis
For any point $p_i$, the local centroid of its k-nearest neighbors $N(p_i)$ is defined as equation (1):

$$\bar{p}_i = \frac{1}{k} \sum_{p_j \in N(p_i)} p_j \tag{1}$$

The covariance matrix is computed based on the local point cloud distribution:

$$C_i = \frac{1}{k} \sum_{p_j \in N(p_i)} (p_j - \bar{p}_i)(p_j - \bar{p}_i)^{\mathrm{T}} \tag{2}$$

Eigen-decomposition of $C_i$ is performed as equation (3):

$$C_i v_j = \lambda_j v_j, \quad \lambda_1 \ge \lambda_2 \ge \lambda_3 \ge 0 \tag{3}$$

The normal vector $n_i$ is the unit eigenvector corresponding to the smallest eigenvalue:

$$n_i = v_3 \tag{4}$$

The local surface curvature is estimated from the smallest eigenvalue as the surface-variation ratio:

$$c_i = \frac{\lambda_{\min}(C_i)}{\lambda_1 + \lambda_2 + \lambda_3} \tag{5}$$

Here, $\lambda_{\min}(C_i)$ denotes the smallest eigenvalue of $C_i$; a small ratio indicates a locally planar surface [22].
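The geometric step above (k-nearest-neighbor centroid, covariance, eigen-decomposition, normal, and curvature) can be sketched in numpy as follows; the brute-force neighbor search and surface-variation curvature are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def local_geometric_features(points, i, k=8):
    """Centroid, normal, and curvature of point i from its k nearest neighbors.

    `points` is an (N, 3) array. Curvature is estimated as the surface
    variation lambda_min / (lambda_1 + lambda_2 + lambda_3).
    """
    # k-nearest neighbors by Euclidean distance (brute force for clarity)
    d = np.linalg.norm(points - points[i], axis=1)
    nbrs = points[np.argsort(d)[:k]]

    centroid = nbrs.mean(axis=0)                      # local centroid
    centered = nbrs - centroid
    cov = centered.T @ centered / k                   # local covariance matrix

    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues ascending
    normal = eigvecs[:, 0]                            # smallest-eigenvalue vector
    curvature = eigvals[0] / (eigvals.sum() + 1e-12)  # surface variation
    return centroid, normal, curvature
```

For points sampled from a plane, the normal aligns with the plane's axis and the curvature estimate is near zero, matching the intended use of $\lambda_{\min}$ as a flatness indicator.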
–Density feature normalization

For a point $p_i$, the number of points within a sphere of radius $r$ centered at $p_i$ is denoted $n_i$, and the density feature is normalized as:

$$\rho_i = \frac{n_i - n_{\min}}{n_{\max} - n_{\min} + \varepsilon} \tag{6}$$

In this context, $\varepsilon = 10^{-6}$ is a small value to avoid division by zero.
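A minimal sketch of the density feature, assuming min-max normalization of the radius-$r$ neighbor counts (the specific normalization form is an assumption, since only the $\varepsilon$ safeguard is stated in the text):

```python
import numpy as np

def density_features(points, r=0.5, eps=1e-6):
    """Normalized local density: neighbor count within radius r per point.

    Assumes min-max normalization; eps (1e-6, as in the text) guards
    against division by zero when all neighbor counts coincide.
    """
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    counts = (dist <= r).sum(axis=1) - 1   # exclude the point itself
    lo, hi = counts.min(), counts.max()
    return (counts - lo) / (hi - lo + eps)
```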
–Structural response feature mapping

The nodal stress field obtained from finite element analysis is mapped to the point cloud space through coordinate mapping [23]. For a point $p_i$ that does not coincide with a finite element node, the structural response feature is calculated via trilinear interpolation over the enclosing hexahedral cell:

$$s_i = \sum_{k=1}^{8} w_k\, \sigma(b_k), \qquad w_k = \prod_{d=1}^{3} \left( 1 - \frac{\lvert p_{i,d} - b_{k,d} \rvert}{h_d} \right)$$

Here, $b_k$ refers to the vertices of the hexahedral mesh cell surrounding $p_i$, $b_{k,d}$ is the coordinate of vertex $b_k$ along the $d$-th axis, and $h_d$ is the edge length of the mesh along that axis. The weights $w_k$ satisfy $\sum_{k=1}^{8} w_k = 1$ [24].
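Standard trilinear interpolation over a hexahedral cell can be sketched as below; the cell is assumed axis-aligned with its minimum-corner vertex given, and function names are illustrative.

```python
import numpy as np

def trilinear_weights(p, cell_origin, h):
    """Trilinear interpolation weights of point p inside a hexahedral cell.

    `cell_origin` is the minimum-corner vertex, `h` the edge lengths per
    axis. Returns the 8 corner weights, which sum to 1 by construction.
    """
    t = (np.asarray(p, float) - np.asarray(cell_origin, float)) / np.asarray(h, float)
    w = []
    for cx in (0, 1):            # corner index along x
        for cy in (0, 1):        # corner index along y
            for cz in (0, 1):    # corner index along z
                w.append((t[0] if cx else 1 - t[0]) *
                         (t[1] if cy else 1 - t[1]) *
                         (t[2] if cz else 1 - t[2]))
    return np.array(w)

def interpolate_stress(p, cell_origin, h, corner_stress):
    """Structural response at p as the weighted sum of the 8 nodal stresses."""
    return trilinear_weights(p, cell_origin, h) @ np.asarray(corner_stress, float)
```

A constant nodal stress field interpolates to the same constant anywhere in the cell, which is a quick sanity check for the weight construction.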
–Region growing (RG) similarity measure

The multimodal similarity measure between points $p_m$ and $p_n$ is defined as equation (11). In this measure, $\alpha$, $\beta$, $\gamma$, and $\lambda$ are weight coefficients determined via Bayesian optimization, and $c_{\max}$ and $c_{\min}$ are the global extrema of curvature. When the similarity exceeds the growth threshold, point $p_m$ is incorporated into the current supervoxel [25].
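A greedy region-growing loop driven by a multimodal similarity score might look like the sketch below. The exact form of the paper's equation (11) is not reproduced; this sketch assumes a weighted sum of spatial proximity, normal alignment, and curvature agreement, with `alpha`/`beta`/`gamma` as illustrative stand-ins for the Bayesian-optimized coefficients.

```python
import numpy as np

def grow_supervoxel(points, normals, curvature, seed, threshold,
                    alpha=0.4, beta=0.3, gamma=0.3):
    """Grow one supervoxel from a seed point by thresholded similarity.

    Assumed similarity: alpha * exp(-distance) + beta * |normal dot| +
    gamma * (1 - |curvature difference| / global curvature range).
    """
    c_rng = curvature.max() - curvature.min() + 1e-6  # global extrema of curvature
    member = {seed}
    frontier = [seed]
    while frontier:
        m = frontier.pop()
        for n in range(len(points)):
            if n in member:
                continue
            s = (alpha * np.exp(-np.linalg.norm(points[m] - points[n])) +
                 beta * abs(normals[m] @ normals[n]) +
                 gamma * (1 - abs(curvature[m] - curvature[n]) / c_rng))
            if s >= threshold:          # similar enough: absorb into supervoxel
                member.add(n)
                frontier.append(n)
    return member
```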
–Supervoxel feature vector construction

The supervoxel $v_k$ contains the point set $P_k$, and its centroid is:

$$\mu_k = \frac{1}{\lvert P_k \rvert} \sum_{p_i \in P_k} p_i$$

The multimodal feature vector integrates the normalized local statistics of $P_k$ together with $\mathrm{PCA}_k$, which denotes the first three principal component feature vectors of the supervoxel point cloud [26].
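The unambiguous geometric part of this descriptor (centroid plus principal axes) can be sketched as follows; the paper's full multimodal vector also carries density and structural-response statistics, which are omitted here.

```python
import numpy as np

def supervoxel_descriptor(cluster_points):
    """Centroid and principal directions of one supervoxel's point set."""
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    centered = pts - centroid
    # principal axes: eigenvectors of the scatter matrix
    cov = centered.T @ centered / len(pts)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    pca_axes = eigvecs[:, ::-1]              # columns sorted by descending variance
    return centroid, pca_axes
```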
–Global structure inference based on graph-structured transformer

The supervoxel set is modeled as a graph $G = (V, E)$, where the node features are the multimodal vectors $f_k$, and the edges are determined by the Euclidean distance between supervoxel centroids [27]. The Transformer module implements global relationship modeling through the following steps:
–Geometric positional encoding

To retain spatial coordinate information, an absolute positional encoding is designed as equation (20), in which $L_x$, $L_y$, and $L_z$ are the lengths of the point cloud bounding box along the respective axes. The node input feature is then updated by combining $f_k$ with this positional encoding [28].
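As an illustrative sketch of a bounding-box-based absolute encoding (the paper's equation (20) is not shown, so the simplest variant consistent with the text is assumed): each centroid coordinate is normalized by the corresponding box extent, yielding values in [0, 1] that can be added to or concatenated with the node features.

```python
import numpy as np

def positional_encoding(centroids, bbox_min, bbox_max):
    """Normalize supervoxel centroids by the point cloud bounding box.

    Division by (L_x, L_y, L_z), the box extents, makes the encoding
    invariant to the absolute scale of the scanned scene.
    """
    extent = np.asarray(bbox_max, float) - np.asarray(bbox_min, float)
    return (np.asarray(centroids, float) - np.asarray(bbox_min, float)) / extent
```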
–Multi-head self-attention mechanism

Features are linearly transformed to generate query, key, and value vectors:

$$Q_h = f W_h^{Q}, \qquad K_h = f W_h^{K}, \qquad V_h = f W_h^{V}$$

The attention weight for the $h$-th head is computed as:

$$A_h = \mathrm{softmax}\!\left( \frac{Q_h K_h^{\mathrm{T}}}{\sqrt{d_h}} \right)$$

The attention output is given by:

$$O_h = A_h V_h$$
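The standard scaled dot-product attention over supervoxel node features can be sketched in numpy as follows (projection matrices and head split are the usual Transformer conventions, not the paper's exact layer configuration):

```python
import numpy as np

def multi_head_attention(F, Wq, Wk, Wv, num_heads):
    """Multi-head self-attention over node features F of shape (n, d_model).

    Wq/Wk/Wv are (d_model, d_model) projections split evenly across heads;
    the concatenated head outputs have shape (n, d_model).
    """
    n, d = F.shape
    dh = d // num_heads
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    heads = []
    for h in range(num_heads):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        logits = q @ k.T / np.sqrt(dh)                  # scaled dot product
        logits -= logits.max(axis=1, keepdims=True)     # numerical stability
        A = np.exp(logits)
        A /= A.sum(axis=1, keepdims=True)               # row-wise softmax
        heads.append(A @ v)                             # attention output
    return np.concatenate(heads, axis=1)
```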
–Multi-layer feature aggregation

The outputs of all heads are concatenated and then passed through layer normalization and a feedforward network:

$$O = \mathrm{Concat}(O_1, \ldots, O_H)\, W^{O}, \qquad f' = \mathrm{LayerNorm}(f + O)$$

$$\mathrm{FFN}(f') = \sigma(f' W_1) \odot (f' W_2)$$

In this context, $\sigma$ is the GELU activation function, $W_1$ and $W_2$ are learnable projection matrices, and $\odot$ denotes element-wise multiplication [29].
–Graph structure update

Cross-layer connections and residual mechanisms ensure gradient flow:

$$f_k^{(l+1)} = f_k^{(l)} + F^{(l)}\!\left( f_k^{(l)} \right)$$

where $F^{(l)}$ denotes the $l$-th attention and aggregation layer.
–End-to-end recognition and localization mechanism

The supervoxel features $f_k^{(L)}$, processed by $L$ Transformer layers, are input into the classification and regression branches:
–Semantic classification model

The classification probabilities are calculated as equation (31):

$$y_k = \mathrm{softmax}\!\left( W_c f_k^{(L)} + b_c \right) \tag{31}$$

In equation (31), $C$ is the number of component categories, $W_c \in \mathbb{R}^{C \times D}$, and $b_c \in \mathbb{R}^{C}$, with $y_k \in \mathbb{R}^{C}$ representing the category probability distribution.
–Pose regression model

A quaternion $q_k = (q_w, q_x, q_y, q_z)$ is used to represent the rotation, and the regression parameters are defined as equation (32). The quaternion is normalized as follows:

$$\hat{q}_k = \frac{q_k}{\lVert q_k \rVert}$$
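Projecting a raw regression output back onto the unit-quaternion manifold is a one-liner; the small `eps` guard against an all-zero output is an implementation detail added here, not stated in the paper.

```python
import numpy as np

def normalize_quaternion(q, eps=1e-12):
    """Divide by the Euclidean norm so q encodes a valid rotation.

    eps guards against the degenerate all-zero regression output
    (an added safeguard, assumed rather than taken from the paper).
    """
    q = np.asarray(q, dtype=float)
    return q / (np.linalg.norm(q) + eps)
```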
–Voxel voting optimization strategy

The pose of the supervoxels within the same component $c$ is fused using a weighted voting scheme, e.g. for the translation:

$$t_c = \frac{\sum_{k \in c} \lvert P_k \rvert\, t_k}{\sum_{k \in c} \lvert P_k \rvert}$$

Here, $\lvert P_k \rvert$ denotes the number of points contained in supervoxel $v_k$, serving as the voting weight.
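A point-count-weighted fusion of per-supervoxel pose hypotheses might look like this sketch. The translation fusion is a plain weighted mean; for the rotation, a sign-aligned weighted quaternion average with renormalization is assumed, since the paper's exact rotation fusion rule is not shown.

```python
import numpy as np

def vote_component_pose(translations, quaternions, point_counts):
    """Fuse per-supervoxel pose hypotheses using point counts as votes."""
    w = np.asarray(point_counts, float)
    w /= w.sum()
    # translation: weighted mean of the hypotheses
    t = (w[:, None] * np.asarray(translations, float)).sum(axis=0)

    qs = np.asarray(quaternions, float).copy()
    # q and -q encode the same rotation: align signs to the first hypothesis
    for i in range(1, len(qs)):
        if qs[i] @ qs[0] < 0:
            qs[i] = -qs[i]
    q = (w[:, None] * qs).sum(axis=0)
    return t, q / np.linalg.norm(q)   # renormalize the averaged quaternion
```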
–Non-maximum suppression (NMS)

The intersection over union (IoU) of the predicted bounding boxes is computed to suppress redundant detections:

$$\mathrm{IoU}(B_i, B_j) = \frac{\lvert B_i \cap B_j \rvert}{\lvert B_i \cup B_j \rvert}$$

Predictions with a confidence level $p > 0.5$ and no significant overlap (i.e., IoU < 0.6) are retained as the final results [30]. This design achieves hierarchical encoding of multimodal features and deep reasoning over the global structure through rigorous mathematical modeling. Each formula corresponds to a specific physical meaning or algorithmic step, ensuring the model's interpretability and reproducibility [31]. Supervoxel clustering combined with the Transformer (SC-Transformer) thus forms an efficient solution for the recognition and localization of 3D point cloud virtual assembly components, providing theoretical support and a technological pathway for precision assembly in intelligent construction.
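The suppression step above, specialized to axis-aligned 3D boxes and greedy score-ordered filtering (a common simplification; the paper's box parameterization is not specified), can be sketched as:

```python
import numpy as np

def aabb_iou(a, b):
    """IoU of two axis-aligned 3D boxes given as (min_xyz, max_xyz) pairs."""
    lo = np.maximum(a[0], b[0])
    hi = np.minimum(a[1], b[1])
    inter = np.prod(np.clip(hi - lo, 0, None))    # overlap volume, 0 if disjoint
    vol = lambda box: np.prod(np.asarray(box[1]) - np.asarray(box[0]))
    return inter / (vol(a) + vol(b) - inter)

def nms_3d(boxes, scores, conf_thresh=0.5, iou_thresh=0.6):
    """Keep confident boxes, greedily suppressing overlaps (thresholds as in the text)."""
    order = np.argsort(scores)[::-1]              # highest confidence first
    keep = []
    for i in order:
        if scores[i] <= conf_thresh:
            continue
        if all(aabb_iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep
```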
As shown in Figure 3, in point cloud processing, supervoxel feature segmentation integrates geometric positions, normal vectors, color, and other features, effectively preserving 3D structural details. This approach improves segmentation accuracy in complex scenes and offers better computational efficiency compared to traditional voxel-based methods, demonstrating strong adaptability.
As displayed in Figure 4, the results of the 3D laser point cloud model and supervoxel clustering segmentation, as well as the assembly effect, are shown. This approach effectively enhances the segmentation and assembly accuracy, further demonstrating the effectiveness of the proposed method in real-world scenarios.
Fig. 3 Point cloud data processing and supervoxel feature segmentation results.

Fig. 4 3D laser point cloud model, supervoxel clustering segmentation, and assembly results.
3.2 Design of recognition technology for 3D point cloud virtual assembly components
The recognition technology for 3D point cloud virtual assembly components in bridge construction is a critical element for achieving precise positioning of prefabricated components. The primary challenge lies in effectively integrating multimodal local features and global structural semantics to address issues such as the variability of component shapes, significant scale differences, and high semantic ambiguity in complex bridge structures [32]. The recognition technology designed in this study is based on the geometric-physical features generated by supervoxel clustering and the global context information extracted by the Transformer. Through hierarchical feature fusion, the construction of an adaptive classification network, and multidimensional post-processing optimization, a component recognition system with both robustness and generalization capability is developed. Figure 5 shows the technical framework design of 3D point cloud virtual assembly components and rapid identification and positioning.
As shown in Figure 5, the innovation of this recognition technology lies in overcoming the limitations of traditional methods that rely on single features or fixed scales. By deeply integrating multimodal features, designing adaptive network structures, and optimizing through multidimensional post-processing, a complete mapping chain from point cloud features to component instances is established [33]. Table 1 shows the comparison between traditional methods and 3D laser scanning technology methods.
As shown in Table 1, the core advantage of 3D laser scanning methods lies in effectively addressing the complexity of bridge structures and the unstructured nature of the data. These methods demonstrate excellent recognition performance in engineering scenarios such as steel–concrete composite girder bridges and multi-box girder bridges, providing key technical support for automated and intelligent bridge assembly.

Notably, although the proposed SC-Transformer framework involves complex mechanisms of multimodal feature fusion and global structural reasoning, its interpretability can be realized through multidimensional visualization. The hierarchical presentation of supervoxel clustering results intuitively reveals the synergistic effects of local geometry, density, and structural response features. Furthermore, the Transformer attention weight heatmaps clearly identify global associations that decisively impact component recognition (e.g., spatial constraints between connectors and main girders), thus elucidating the model's decision-making logic.

At the deployment level, the framework has strong potential for deep integration with bridge construction workflows. On one hand, it enables real-time interaction with computer systems through standardized data interfaces, dynamically importing recognition and localization results into digital twin models to support automatic verification of prefabricated component assembly deviations and visualization of construction progress. On the other hand, the framework can form a closed-loop control system with on-site construction robots (such as hoisting manipulators and inspection drones). The accurate pose information output by the lightweight inference module (optimized single-frame processing time ≤2 seconds) guides robots to perform millimeter-level precise assembly operations.

Regarding computational requirements, the model inference phase requires a mid-range graphics processor.
When processing million-scale point clouds, memory consumption ranges from approximately 8 to 12 GB. The number of supervoxels can be dynamically adjusted (with 500–800 core supervoxels retained in complex scenarios) to balance accuracy and efficiency, thereby meeting real-time demands on construction sites. This collaborative design of interpretability, deployment adaptability, and controllable computation further strengthens the practical value of the method in digital delivery and quality control for intelligent construction.
Figure 6 shows the final pseudocode design of the proposed model.
As illustrated in Figure 6, multiple key metrics are employed to evaluate the performance of the proposed model, defined as follows:
Mean Intersection over Union (mIoU) serves as the core metric for assessing semantic segmentation quality. It calculates the average IoU across all classes, comprehensively reflecting the model’s segmentation accuracy for different semantic categories. The IoU is defined as the ratio of the area of overlap between the predicted component region and the ground-truth annotation to the area of their union. A higher mIoU value indicates better consistency between segmentation results and the ground truth.
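The mIoU definition above reduces to a few lines of numpy; this sketch assumes integer label arrays and skips classes absent from both prediction and ground truth (a common convention the paper does not spell out).

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union across semantic classes.

    pred/gt are equal-length integer label arrays; classes absent from
    both are skipped rather than counted as zero IoU.
    """
    pred, gt = np.asarray(pred), np.asarray(gt)
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))   # overlap with ground truth
        union = np.sum((pred == c) | (gt == c))   # union with ground truth
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```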
Instance recognition accuracy measures the model’s ability to correctly identify component instances. It is computed as the ratio of correctly recognized component instances to the total number of detected instances. A “correct recognition” requires two conditions to be met simultaneously: first, the component category is accurately classified; second, the IoU between the predicted bounding box and the ground-truth bounding box exceeds a predefined threshold (set to 0.5 in this study), ensuring spatial consistency.
Pose estimation accuracy evaluates the precision of component localization, covering both positional and rotational errors. Positional error refers to the Euclidean distance between the predicted and true 3D coordinates of the component center. Rotational error is represented by quaternions describing the component’s orientation, quantified by the angular difference (in degrees) between the predicted and ground-truth quaternions. Pose estimation accuracy is quantified by the mean or maximum of positional and rotational errors; smaller errors indicate higher localization precision.
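The two pose-error terms above can be computed as follows; the angular distance uses the standard $\theta = 2\arccos(\lvert\langle q_{\text{pred}}, q_{\text{true}}\rangle\rvert)$ formula for unit quaternions, with the absolute value handling the $q$ / $-q$ sign ambiguity.

```python
import numpy as np

def rotation_error_deg(q_pred, q_true):
    """Angular difference in degrees between two unit quaternions."""
    dot = abs(float(np.dot(q_pred, q_true)))
    dot = min(dot, 1.0)                     # clamp numerical overshoot
    return float(np.degrees(2.0 * np.arccos(dot)))

def position_error(t_pred, t_true):
    """Euclidean distance between predicted and true component centers."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_true)))
```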
Computational efficiency metrics primarily include the number of points processed per second, model inference time, and GPU memory consumption. Points processed per second reflect the model’s data throughput, while inference time measures the total duration from input point cloud to output recognition and localization results. These indicators assess the model’s real-time capability and deployment feasibility in practical engineering scenarios.
Together, these metrics comprehensively and objectively quantify the model’s performance in complex bridge point cloud environments from semantic segmentation, instance recognition, localization accuracy, and computational efficiency perspectives, ensuring the evaluation’s engineering applicability.
Fig. 5 Technical framework design for 3D point cloud virtual assembly components and rapid recognition and positioning.

Table 1 Comparison between traditional methods and 3D laser scanning technology.

Fig. 6 Final pseudocode design of the proposed model.
4 Research design and data sources
This study focuses on the recognition and localization of bridge 3D point cloud virtual assembly components, constructing a technical framework of "supervoxel feature encoding – Transformer structural inference – end-to-end prediction". The first step involves generating supervoxel units that integrate geometric, density, and structural response features using an improved RG algorithm, thus building multimodal local features. Next, a graph-structured Transformer module is employed to capture spatial relationships and semantic associations between supervoxels, enabling the integration of global contextual information. Finally, a voxel voting strategy is used to regress the component pose parameters, resulting in a complete recognition and localization system. The study integrates real bridge point cloud data with synthetic data to construct a multi-scenario test set, primarily focusing on validating the algorithm's robustness in complex bridge types (e.g., multi-box girder bridges) and noisy environments. Additionally, engineering validation is carried out in the context of bridge digital pre-assembly processes, assessing the model's real-time performance and error tolerance.
This study utilizes four publicly available datasets:
Stanford Large-Scale 3D Indoor Spaces (S3DIS) Dataset: This dataset includes high-precision point cloud data of six large indoor scenes, annotated with semantic information such as room layouts and architectural components (e.g., beams, columns). It is suitable for validating component recognition algorithms in complex structures and supports the analysis of multimodal features (e.g., color, normal vectors).
ETH Zurich Building (ETH-Building) Dataset: This dataset provides laser point clouds and BIM model alignment data for multiple buildings, covering detailed geometry and material properties of steel-concrete composite structures. It can be used to validate the positioning accuracy and structural semantic understanding of bridge prefabricated components.
International Society for Photogrammetry and Remote Sensing Benchmark (ISPRS) Dataset: This dataset contains city-scale 3D point clouds and semantic segmentation annotations, covering infrastructure such as bridges and roads. It supports noise robustness testing and multi-source data fusion research in complex scenarios.
National Building Museum Point Cloud (NBM) Dataset: This dataset includes detailed point cloud data of large public buildings, featuring diverse structural components (e.g., steel trusses, concrete beams). It is suitable for validating the generalization capability of recognition algorithms for multi-scale components, ranging from millimeter-level connectors to meter-scale components.
The proposed method effectively addresses inevitable occlusions and data loss in practical scanning environments through a multi-module collaborative mechanism. Its core logic builds a robust processing framework by leveraging the complementarity of multi-source information and global structural relationships.
During feature extraction, the improved supervoxel clustering algorithm integrates geometric morphology, density distribution, and structural response features to form multimodal voxel units. In regions where occlusion causes distortion or absence of geometric features (such as coordinates and normals), density features (point distribution density within local space) and structural response features (mechanical attributes mapped from finite element analysis) provide alternative semantic cues. Structural response features, as quantitative representations of physical properties, assist in classifying and localizing components through their mechanical functional correlations, even when geometric information is incomplete, thus mitigating the limitations of relying solely on geometric features.
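As an illustration of how such multimodal voxel units might be assembled, the sketch below concatenates simple geometric, density, and structural-response statistics into one descriptor per supervoxel. The function name and the exact feature set are assumptions for illustration; the paper does not fix them here.

```python
import numpy as np

def fuse_supervoxel_features(coords, normals, response):
    """Hypothetical multimodal descriptor for one supervoxel.
    coords: (N,3) points; normals: (N,3) estimated normals;
    response: (N,) per-point structural response mapped from an FE model."""
    # Geometric morphology: centroid, bounding extent, mean normal.
    centroid = coords.mean(axis=0)
    extent = coords.max(axis=0) - coords.min(axis=0)
    mean_normal = normals.mean(axis=0)
    # Density distribution: points per unit bounding volume.
    density = len(coords) / np.prod(np.maximum(extent, 1e-6))
    # Structural response: summary statistics that stay informative even
    # when the geometric channel is distorted by occlusion.
    resp_stats = [response.mean(), response.std()]
    return np.concatenate([centroid, extent, mean_normal, [density], resp_stats])

rng = np.random.default_rng(1)
coords = rng.uniform(0.0, 1.0, size=(50, 3))
normals = np.tile([0.0, 0.0, 1.0], (50, 1))
response = rng.uniform(0.0, 10.0, size=50)
descriptor = fuse_supervoxel_features(coords, normals, response)
```

The point of concatenation is that a downstream classifier can still separate components by density and mechanical response when the geometric entries degrade.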
The graph-based Transformer module enhances fault tolerance to data loss by modeling global relationships. Treating supervoxels as graph nodes, it leverages the self-attention mechanism to capture spatial positions and semantic associations among nodes. When local regions suffer scanning gaps (e.g., point cloud discontinuities caused by blind spots), the model infers and completes missing information through feature propagation from neighboring nodes and global structural constraints. The introduction of geometric positional encoding further reinforces spatial logical consistency, ensuring that component spatial distributions conform to the bridge’s topological rules despite incomplete data, thereby avoiding global inference biases triggered by local missing areas.
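A minimal sketch of neighborhood-restricted self-attention with a geometric positional term is shown below; `knn_graph_attention` is a hypothetical single-head simplification (relative offsets padded into the key vectors), not the paper's module.

```python
import numpy as np

def knn_graph_attention(node_feats, node_pos, k=4):
    # Each supervoxel node attends only to itself and its k nearest
    # neighbors; relative offsets act as a geometric positional encoding,
    # letting nodes with missing data borrow from adjacent nodes.
    n, d = node_feats.shape
    dists = np.linalg.norm(node_pos[:, None] - node_pos[None, :], axis=-1)
    out = np.zeros_like(node_feats)
    for i in range(n):
        nbrs = np.argsort(dists[i])[:k + 1]        # self + k neighbors
        rel = node_pos[nbrs] - node_pos[i]         # geometric positional code
        keys = node_feats[nbrs] + np.pad(rel, ((0, 0), (0, d - 3)))
        scores = keys @ node_feats[i] / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ node_feats[nbrs]
    return out

rng = np.random.default_rng(2)
positions = rng.uniform(0.0, 5.0, size=(8, 3))
features = rng.normal(size=(8, 6))
mixed = knn_graph_attention(features, positions, k=3)
```

Restricting attention to spatial neighbors is also what keeps the cost of the real module sub-quadratic on large point clouds.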
In the localization refinement stage, the voxel voting strategy aggregates predictions from multiple supervoxels within the same component to reduce the impact of occlusion on localization accuracy. For partially occluded components (such as connectors blocked by large girders), this strategy assigns weights based on the number of points contained in each supervoxel: higher weights are given to unoccluded regions with more points, while lower weights correspond to occluded regions with fewer points, minimizing interference from outliers in the final localization. Additionally, a non-maximum suppression (NMS) algorithm based on bounding-box Intersection over Union (IoU) filters false detections caused by fragmented data, retaining the most reliable recognition results and further enhancing model stability in complex scanning scenarios.
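The voting and filtering steps can be illustrated as follows; `weighted_vote` and `nms` are simplified stand-ins using 2D axis-aligned boxes, whereas the paper operates on 3D component poses.

```python
import numpy as np

def weighted_vote(centroids, point_counts):
    # Supervoxels with more points (unoccluded regions) get larger weights.
    w = np.asarray(point_counts, dtype=float)
    c = np.asarray(centroids, dtype=float)
    return (w[:, None] * c).sum(axis=0) / w.sum()

def box_iou(a, b):
    # Axis-aligned boxes as (xmin, ymin, xmax, ymax).
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # Greedily keep the highest-scoring detection, discard overlaps.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        mask = np.array([box_iou(boxes[i], boxes[j]) <= thresh for j in rest],
                        dtype=bool)
        order = rest[mask]
    return keep

# A 3-point supervoxel and a 1-point (occluded) one voting on a centroid:
pose = weighted_vote([[0, 0, 0], [2, 2, 2]], [3, 1])
# Two near-duplicate detections and one distinct detection:
boxes = np.array([[0, 0, 2, 2], [0.1, 0.1, 2.1, 2.1], [5, 5, 6, 6]])
kept = nms(boxes, np.array([0.9, 0.8, 0.7]))
```

In this toy case the vote lands at (0.5, 0.5, 0.5), pulled toward the denser supervoxel, and NMS keeps only the stronger of the two overlapping detections.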
This multi-level processing mechanism combines physical priors (structural response features), data-driven contextual reasoning, and statistical aggregation strategies (voxel voting) to achieve effective fault tolerance against occlusion and data loss, ensuring stable recognition and localization of bridge components in real-world scanning environments.
5 Evaluation of the 3D point cloud virtual assembly component recognition and localization model
5.1 Evaluation of the SC-Transformer model
To validate the effectiveness of the proposed supervoxel clustering and Transformer fusion model, a comparative evaluation is conducted against current mainstream point cloud processing models and traditional methods. The selected comparison models include: RG, Point Neural Network (PointNet), Hierarchical Point Neural Network (PointNet++), Graph Convolutional Network (GCN), Point Cloud Transformer (Point Transformer), and Random Sampling and Local Feature Aggregation Network (RandLA-Net). Figure 7 displays the results of the model accuracy comparison evaluation.
As shown in Figure 7, in the comparative evaluation, the proposed SC-Transformer model demonstrates significant advantages in bridge component recognition and localization tasks. Compared to models such as RG, PointNet, and Point Transformer, the SC-Transformer model achieves improvements in semantic segmentation mIoU, instance recognition accuracy, and pose regression accuracy, with an increase of over 21.5%. In complex multi-box girder bridge scenarios, the recognition accuracy of small-scale connectors improves by up to 37.1%, highlighting the effectiveness of multimodal feature fusion and global structural reasoning.
As shown in Figure 8, in multi-scenario testing covering multiple data sources, the SC-Transformer model exhibits significant advantages in computational efficiency. The evaluation was based on key metrics such as the number of point clouds processed per second, model inference time, and Graphics Processing Unit (GPU) memory usage, with systematic comparison to advanced models like PointNet++ and RandLA-Net. The results indicate that, in complex bridge point cloud data processing, the SC-Transformer model achieves efficiency improvements of over 18.7%. Particularly when handling bridge point cloud data at the scale of millions of points, the lightweight supervoxel clustering algorithm and optimized Transformer architecture reduce inference time by up to 31.5%. At the same time, the model effectively reduces GPU memory consumption, significantly shortening computation time, thereby fully validating its efficiency and practicality in algorithm design and engineering applications.
Fig. 7 Comparison of model computational accuracy (a: S3DIS dataset, b: ETH-Building dataset, c: ISPRS dataset, d: NBM dataset).
Fig. 8 Comparison of model computational efficiency (a: S3DIS dataset, b: ETH-Building dataset, c: ISPRS dataset, d: NBM dataset).
5.2 Evaluation of the bridge 3D point cloud virtual assembly component rapid recognition and localization
To comprehensively validate the practicality of the SC-Transformer model in bridge 3D point cloud scenarios, effect evaluations were conducted across dimensions such as component recognition accuracy, localization precision, robustness in complex conditions, and model efficiency. The evaluation utilized both field measurement data and public datasets, with quantitative metrics and visual analysis revealing the model's advantages. Figure 9 presents the results of the model's recognition efficiency evaluation.
As shown in Figure 9, in the evaluation of complex bridge structures, the SC-Transformer model demonstrated exceptional performance. In addressing the challenges of varying component forms, significant scale differences, and high semantic ambiguity in complex structures such as steel-concrete composite bridges and multi-box girder bridges, the study compared the model with mainstream methods like Point Transformer and GCN using core metrics such as average precision, localization error, and segmentation accuracy. The results indicate that, in tests with multi-source datasets such as S3DIS and ETH-Building, the model's overall performance in bridge component recognition improved by more than 22.4%. Particularly in the complex scenarios of multi-box girder bridges, the model's recognition accuracy for critical connecting components improved by up to 37.4%, effectively overcoming the limitations of traditional methods and fully validating its effectiveness and superiority in complex environments.
As shown in Figure 10, in the evaluation of bridge 3D point cloud localization tasks, the study employed core metrics such as root mean square error, mean absolute error, and pose estimation accuracy to conduct a systematic comparison between the SC-Transformer model and mainstream models such as PointNet and RandLA-Net. The evaluation results revealed that the model's localization performance improved by more than 26.2%, with the highest improvement of 35.9% achieved in the localization of key nodes in complex scenarios. This effectively reduced localization errors caused by data noise, structural occlusions, and other factors, highlighting the model's technological superiority and robustness in the field of precise bridge 3D point cloud localization, thereby providing reliable assurance for high-precision assembly in bridge intelligent construction.
Fig. 9 Model recognition efficiency evaluation results (a: S3DIS dataset, b: ETH-Building dataset, c: ISPRS dataset, d: NBM dataset).
Fig. 10 Model localization efficiency evaluation results (a: S3DIS dataset, b: ETH-Building dataset, c: ISPRS dataset, d: NBM dataset).
5.3 Model robustness evaluation
To validate the robustness of the proposed model across different scenarios, this study selects multiple heterogeneous datasets with varying levels of complexity and clutter for testing. These datasets encompass indoor spaces, building structures, urban-scale environments, and large public buildings, exhibiting significant differences in point cloud structural complexity, noise levels, and component diversity. This diverse testing environment provides a comprehensive assessment of the model’s adaptability. The robustness evaluation results are presented in Figure 11.
As shown in Figure 11, the SC-Transformer model achieves over 21.5% overall improvement in semantic segmentation mIoU, instance recognition accuracy, and pose regression precision compared to baseline methods such as RG, PointNet, and PointNet++. This advancement stems from the model’s synergistic representation of local details and global structures. The improved supervoxel clustering algorithm integrates geometric morphology, density distribution, and structural response features to construct multimodal voxel units that encompass physical properties and spatial characteristics, effectively addressing semantic ambiguities caused by reliance on single-feature methods. Meanwhile, the graph-based Transformer module captures long-range spatial correlations and functional semantics among supervoxel nodes via a self-attention mechanism, compensating for supervoxel clustering’s limitations in global structure modeling. This enables recognition accuracy improvements of up to 37.1% for small-scale connectors in complex bridge structures such as multi-box girder bridges, fully demonstrating the effectiveness of multimodal feature fusion in fine component recognition.
In terms of computational efficiency, the SC-Transformer model realizes a significant leap when processing large-scale point cloud data through lightweight design and architectural optimization. Compared to advanced models like PointNet++, it improves computational efficiency by over 18.7% and reduces inference time by up to 31.5%. This advantage derives from two key innovations: first, supervoxel clustering compresses massive point clouds into semantically rich voxel units via over-segmentation, substantially reducing the computational burden on the subsequent Transformer module; second, the graph-structured Transformer optimizes the attention mechanism by limiting interaction computations to spatially adjacent and semantically relevant nodes, avoiding redundant processing typical of conventional Transformers applied to global point clouds. This balance of accuracy and efficiency enables real-time processing in engineering applications.
The robustness of the SC-Transformer model is further highlighted in bridge-specific complex scenarios, including structural occlusion, significant component scale variation, and data noise interference. In key connector recognition tasks, accuracy improves by up to 37.4%, key node localization precision increases by as much as 35.9%, and overall localization accuracy improves by over 26.2%. These gains result from the combined effect of the voxel voting strategy and NMS: voxel voting aggregates weighted predictions from multiple supervoxels within the same component to suppress noise from occluded regions, while global structural reasoning constrains functional associations to reduce localization biases caused by local data loss. This ensures stable model performance across heterogeneous datasets and complex operating conditions.
The model exhibits certain limitations during testing. Under extreme noise conditions, such as strong electromagnetic interference or severe weather effects causing feature distortion in point cloud data, recognition accuracy for small-scale connectors declines significantly. For specialized components in novel structural bridges, generalization remains limited due to insufficient training samples. Additionally, while inference time outperforms baseline models when processing ultra-large-scale point clouds (tens of millions of points or more), peak instantaneous memory usage remains high, leading to performance lag on lower-end hardware. These cases indicate that improvements are still needed in adapting to extreme conditions and expanding sample coverage.
Fig. 11 Model robustness evaluation results (a: S3DIS dataset, b: ETH-Building dataset, c: ISPRS dataset, d: NBM dataset).
6 Conclusion
Amid the intelligent development of modern bridge engineering, steel–concrete composite girder bridges have become a core choice for transportation infrastructure due to their superior mechanical performance and cost-effectiveness. However, the assembly accuracy of prefabricated components directly impacts the load-bearing capacity and service life of bridges. Traditional manual inspection methods suffer from low efficiency and insufficient precision, failing to meet the demands of industrial-scale production. Existing deep learning–based point cloud processing methods exhibit limited generalization in small-sample scenarios and lack a systematic integration of supervoxel clustering with Transformer architectures, making it difficult to balance the synergistic representation of local details and global structural information.
To address these challenges, this study proposes an intelligent analytical framework that integrates supervoxel clustering with a Transformer architecture to enable rapid recognition and localization of virtual assembly components in bridge 3D point clouds. The study improves the supervoxel clustering algorithm by deeply integrating geometric morphology, density distribution, and structural response features to construct multimodal voxel units that enhance the semantic representation of local features. A graph-based Transformer module is introduced to model spatial relationships and semantic associations among supervoxel nodes via a self-attention mechanism, enabling efficient integration of global contextual information. Combining a voxel voting strategy with a pose estimation module optimizes localization accuracy, forming an end-to-end recognition and localization system.
Validated on multiple heterogeneous datasets, the model demonstrates significant advantages. Compared to traditional methods such as RG algorithms and PointNet, semantic segmentation mIoU, instance recognition accuracy, and pose regression precision improve by over 21.5%. In complex multi-box girder bridge scenarios, recognition accuracy for small-scale connectors increases by up to 37.1%. Regarding computational efficiency, improvements exceed 18.7% compared to advanced models like PointNet++, with inference time reductions of up to 31.5% when processing large-scale data. Overall, bridge component recognition and localization performance improve by more than 22.4%, with key connector recognition accuracy increasing by 37.4%, localization accuracy improving by over 26.2%, and key node localization precision achieving a maximum gain of 35.9%. These results indicate that through deep multimodal feature fusion and global structural reasoning, the model effectively addresses core challenges in processing bridge point cloud data, significantly enhancing recognition and localization accuracy in complex scenes while balancing algorithmic efficiency and performance. This provides an efficient solution for digital delivery and quality control in intelligent bridge construction.
Nevertheless, certain limitations remain. Under extreme conditions, strong electromagnetic interference causes a surge in point cloud noise, while heavy rain or dense fog results in data loss or distortion. In such cases, recognition accuracy for small-scale connectors (e.g., bolts and shear keys) significantly declines, as these components inherently have small physical dimensions and weak signals easily masked by environmental noise. For novel bridge structures (such as modular steel truss bridges and cable-stayed–suspension hybrid systems), their specialized components (e.g., new nodes and composite connectors) are underrepresented in existing training datasets, leading to insufficient semantic understanding and frequent misclassification (e.g., misidentifying new nodes as common steel components). To address these limitations, future work may introduce domain adaptation techniques, employing adversarial training to enable the model to learn environment-invariant features and reduce the impact of electromagnetic interference and adverse weather on feature extraction. Additionally, integrating physics engines to generate synthetic point clouds simulating extreme environments (e.g., adding Gaussian noise or simulating partial occlusion) can augment training datasets and enhance model robustness.
Funding
This research received no external funding.
Conflicts of interest
The authors declare that they have no conflicts of interest.
Data availability statement
The data used to support the findings of the research are available from the corresponding author upon reasonable request.
Author contribution statement
Chenglong Huang: Conceptualization; Methodology; Software; Validation; Investigation; Data curation; Writing—original draft preparation; Visualization. Chi-Ho Lin: Conceptualization; Validation; Resources; Writing—review and editing; Supervision. Suan Lee: Formal analysis; Writing—review and editing; Supervision. All authors have read and agreed to the published version of the manuscript.
Cite this article as: Chenglong Huang, Chi-Ho Lin, Suan Lee, Rapid recognition and localization of virtual assembly components in bridge 3D point clouds based on supervoxel clustering and transformer, Int. J. Simul. Multidisci. Des. Optim. 16, 24 (2025), https://doi.org/10.1051/smdo/2025014
All Figures

Fig. 1 Effect of supervoxel clustering.
Fig. 2 Supervoxel clustering process and recognition effect.
Fig. 3 Point cloud data processing and supervoxel feature segmentation results.
Fig. 4 3D laser point cloud model, supervoxel clustering segmentation, and assembly results.
Fig. 5 Technical framework design for 3D point cloud virtual assembly components and rapid recognition and positioning.
Fig. 6 Final pseudocode design of the proposed model.