Issue 
Int. J. Simul. Multisci. Des. Optim.
Volume 5, 2014



Article Number  A05  
Number of page(s)  13  
DOI  https://doi.org/10.1051/smdo/2013003  
Published online  04 February 2014 
Article
Interpretability and variability of metamodel validation statistics in engineering system design optimization: a practical study
^{1}
Electronics and Communications Engineering Department, College of Engineering at Al-Lith, Umm Al-Qura University, Makkah Al-Mukarramah, KSA
^{2}
Biomedical Engineering Department, Hijjawi College of Engineering Technology, Yarmouk University, Irbid, Jordan
^{3}
Biomedical Engineering Department, College of Engineering, Jordan University of Science and Technology, Irbid, Jordan
^{*} email: husam@yu.edu.jo
Received: 14 December 2012
Accepted: 5 November 2013
Prediction accuracy of a metamodel of an engineering system, relative to the simulation model it approximates, is one fundamental criterion used in metamodel validation. Many statistics are used to quantify the prediction accuracy of metamodels in deterministic simulations. The most frequently used ones include the root-mean-square error (RMSE) and the R-square metric derived from it, and to a lesser degree the average absolute error (AAE) and its derivatives such as the relative average absolute error (RAAE). In this paper, we compare two aspects of these statistics: the interpretability of the results they return and their sample-to-sample variations, with more emphasis on the latter. We use the difference-mode to common-mode ratio (DMCMR) as a measure of sample-to-sample variations for these statistics. Preliminary results are obtained and discussed via a number of analytic and electronic engineering examples.
Key words: Simulation / Modeling / Metamodel validation
© H. Hamad et al., Published by EDP Sciences, 2014
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Simulators are used in the design of many engineering systems. For example, the design of electronic integrated circuits usually involves transistor-level simulators such as PSPICE. Because such simulators are often expensive in terms of simulation time, approximation models, often termed metamodels, may be used to replace simulation models. Metamodels are built and validated using simulation results for samples of data points in the input space. Two fundamental criteria are used as the basis for accepting or rejecting a metamodel: efficiency and accuracy. Efficiency indicates how quickly predictions can be obtained; accuracy indicates how good these predictions are.
Efficiency of a metamodel can be determined prior to metamodel construction, and without any computational cost in terms of the simulation runs needed; e.g., the time taken to evaluate a second-order polynomial metamodel in a given number of dimensions is the same regardless of the underlying simulation model. On the other hand, determining the accuracy of a metamodel is closely linked to the number of data points used in error calculations.
The accuracy of a metamodel is determined using quantitative methods, which are mostly based on average statistics, or subjective methods using data displays such as box plots, as in Sargent [1] and Kleijnen and Deflandre [2]. Hamad and Al-Hamdan use circle, ordinal, and marksman plots in [3] and [4].
Two of the most popular quantitative measures used in deterministic simulations to validate metamodels in terms of their prediction accuracy are the root-mean-square error (RMSE) and the average absolute error (AAE), or some of their derivatives. These statistics are usually calculated using all of the available points in validation test samples, but some techniques employ cross-validation methods using subsets of the available test data; see Martin and Simpson [5] and Meckesheimer et al. [6].
Derivatives of the RMSE and AAE statistics, which are in essence relative error averages, are sometimes used in the literature to make the results returned by RMSE and AAE more interpretable, or to enable comparisons of metamodels, especially when responses from different disciplines are approximated. Two of the most widely used are R-square and the relative average absolute error (RAAE) or similar; see for example Jin et al. [7]. The R-square metric is in essence based on the square of the RMSE relative to the variance of the response data in the test sample, while RAAE may be obtained by relating AAE to the standard deviation (defining equations for these statistics are given later). Note that some applications use AAE and RMSE relative to the average response instead, e.g., Qu et al. [8]. Other work relates RMSE to the range of response values in the test data samples, e.g., Eeckelaert et al. [9].
An important assumption for the validity of the results obtained by average-based statistics concerns the number of data points used in test samples. Results of these statistics are meaningful only if the data used are sufficient in number. Obtaining a sufficient number of observations is often impractically expensive for complex simulation models. In such cases, average-based metrics such as RMSE and AAE may be “sensitive” to the number of observations used.
This paper provides a comparative study of four of the average-based statistics used for metamodel assessment in terms of prediction accuracy: RMSE and R-square, and AAE and RAAE. Two aspects of these metrics are considered: interpretability of results and sample-to-sample variation in the results. We put more emphasis on sample-to-sample variation, introducing a measure to quantify it. This measure is termed the difference-mode to common-mode ratio (DMCMR), as defined in the next section.
The remainder of this paper is organized as follows. In Section 2, the four statistics mentioned above are defined and contrasted in terms of their results interpretability and variability, after defining what we mean by these terms. Preliminary results are presented via examples in Section 3 with a discussion in Section 4. The paper is then concluded by Section 5.
2. Statistics for metamodel prediction accuracy
In this section we define and compare the four statistics RMSE, R-square, AAE, and RAAE used for expressing the prediction accuracy of metamodels in relation to their respective simulation models. We compare two aspects of these statistics: interpretability of the results they return and the sample-to-sample variation of these results. We start by defining the four statistics. We then clarify the term interpretability as used in this context, followed by the definition of DMCMR, the measure we use to quantify sample-to-sample variations.
2.1 RMSE and R-square
Two of the more important measures used for model accuracy assessment, including for deterministic simulation models, are RMSE and R-square. They are defined by
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{1}$$
$$R^2 = 1 - \frac{\mathrm{MSE}}{\sigma^2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{2}$$
where MSE is the mean square error, σ^{2} is the variance of the response data, and ȳ is their mean. In these equations and the discussion to follow, y_{i} denotes the simulation response, modeled by the metamodel prediction ŷ_{i}, for the i-th data point in a validation test sample having n observations. Note that R-square is essentially derived from RMSE by squaring it and then relating the result to the variance of the y_{i} data. The theoretical thresholds for best accuracy are zero and unity for RMSE and R-square, respectively.
2.2 AAE and RAAE
The other two statistics compared in this paper are the AAE and its derivative RAAE. AAE is defined by
$$\mathrm{AAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{3}$$
Rather than relating AAE to the standard deviation of the response data y_{i} as in Jin et al. [7], we obtain RAAE by relating AAE to the average of the absolute values of the response data y_{i} in the validation sample. Defining RAAE this way makes the results more interpretable, as explained shortly, and provides a simpler means for comparing results for different responses. RAAE is then defined as
$$\mathrm{RAAE} = \frac{\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|}{\sum_{i=1}^{n}\left|y_i\right|} \tag{4}$$
As can be seen from these defining equations, RAAE can be thought of as a measure of the percentage error in the metamodel. The lower thresholds indicative of best accuracy for both AAE and RAAE are zero.
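The four statistics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the population form of the variance (division by n, so that n cancels in R-square), and the function names and the four-point data set are hypothetical.

```python
import math

def rmse(y, y_hat):
    """Root-mean-square error between responses y and predictions y_hat."""
    n = len(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n)

def r_square(y, y_hat):
    """One minus MSE over the variance of y (variance taken with 1/n: an assumption)."""
    n = len(y)
    mean_y = sum(y) / n
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
    variance = sum((a - mean_y) ** 2 for a in y) / n
    return 1.0 - mse / variance

def aae(y, y_hat):
    """Average absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def raae(y, y_hat):
    """AAE relative to the average absolute response; n cancels out."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / sum(abs(a) for a in y)

# Hypothetical validation sample: simulated responses and metamodel predictions.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]

print(rmse(y, y_hat), r_square(y, y_hat), aae(y, y_hat), raae(y, y_hat))
```

For this toy sample the absolute errors sum to 0.6 against responses summing to 10, so RAAE is 6%, which reads directly as an average percentage error.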
2.3 Interpretability of results
Assume that RMSE is found to be 0.05 in one test problem and 50 in another; which of the two metamodels has better prediction accuracy for its respective simulation model? Assume, on the other hand, that R-square values of 0.78 and 0.98 are given instead for the two metamodels, respectively; which one is better?
Note that results returned by RMSE cannot be interpreted without referring to the context of the problem. For the example given, if the response values are within the range 0.04–0.06 for the first metamodel with RMSE = 0.05, and 10,000–20,000 for the second metamodel with RMSE = 50, then obviously the second metamodel is superior: its RMSE is well under 1% of the response values, whereas the RMSE of the first metamodel is comparable to the response itself. On the other hand, the closer the R-square result for a metamodel is to unity, the better its fit quality, regardless of response values. Looking at interpretability from another perspective, what thresholds can we set on the values of the four statistics defined above for adequate metamodels? Thresholds can be readily set for R-square and RAAE, but not for RMSE and AAE without reference to the context of the problem. For example, the lower threshold for R-square and the upper threshold for RAAE in a given situation may be set at 0.95 and 5%, respectively.
Based on this discussion, we use two classifiers in this paper for interpretability of statistical results in terms of prediction accuracy. The results are either: (1) interpretable, or (2) not interpretable without context. Hence, R-square and RAAE as defined in equations (2) and (4) above are classified as having “interpretable results”, while results returned by RMSE and AAE are classified as “not interpretable without context”.
2.4 Sample-to-sample variation using DMCMR
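The distinction between the two classes can be illustrated numerically. The sketch below, with hypothetical data and a hypothetical helper `stats`, rescales the same response and predictions by a factor of 1000 (as if the response were reported in different units): RMSE and AAE scale with the response, while R-square and RAAE are unchanged. The variance is again assumed to use division by n.

```python
import math

def stats(y, y_hat):
    """Return RMSE, R-square, AAE and RAAE for one validation sample."""
    n = len(y)
    errors = [a - b for a, b in zip(y, y_hat)]
    mean_y = sum(y) / n
    mse = sum(e * e for e in errors) / n
    variance = sum((a - mean_y) ** 2 for a in y) / n  # 1/n form: an assumption
    abs_error_sum = sum(abs(e) for e in errors)
    return {
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - mse / variance,
        "AAE": abs_error_sum / n,
        "RAAE": abs_error_sum / sum(abs(a) for a in y),
    }

# Hypothetical data; the second call uses the same data in units 1000 times smaller.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]
scale = 1000.0

small = stats(y, y_hat)
large = stats([scale * v for v in y], [scale * v for v in y_hat])

for name in ("RMSE", "R2", "AAE", "RAAE"):
    print(name, small[name], large[name])
```

A threshold such as "R-square above 0.95" or "RAAE below 5%" therefore means the same thing in both unit systems, whereas a threshold on RMSE or AAE does not.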
Another, probably more important, issue related to the results returned by these statistics concerns their sample-to-sample variations. Ideally, variations should approach zero for deterministic simulation models. However, achieving such ideal results comes at the cost of increased test sample sizes. In this work, we quantify sample-to-sample changes in results using DMCMR, the difference-mode to common-mode ratio, where DMCMR for the two quantities ζ_{1} and ζ_{2} is defined by
$$\mathrm{DMCMR} = \frac{\zeta_1 - \zeta_2}{\left(\left|\zeta_1\right| + \left|\zeta_2\right|\right)/2} \tag{5}$$
where absolute values are taken to calculate the common-mode component in the denominator of equation (5).
Like all other average-based quantities, the statistics defined by equations (1)–(4) above are expected to vary noticeably with sample size n for test sample sizes which are not adequate for taking averages. However, since the sample size n appears in different forms in these equations (RMSE ∝ n^{−1/2}, AAE ∝ n^{−1}, while n cancels out in the numerator and denominator for R-square and for RAAE), variations with n are expected to be different for these four statistics, as will be demonstrated by the examples of the next section.
3. Examples
We compare in this section the variation with sample size for RMSE and R-square on the one hand, and AAE and RAAE on the other, via three examples. For these examples:
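A minimal sketch of the DMCMR calculation follows. The exact form used here, the difference of the two quantities divided by the mean of their absolute values, is an assumption consistent with the description of the difference-mode and common-mode components; the function name and the zero-denominator guard are hypothetical additions.

```python
def dmcmr(zeta_1, zeta_2):
    """Difference-mode to common-mode ratio of two quantities.

    Difference mode: zeta_1 - zeta_2.
    Common mode: mean of the absolute values (assumed form).
    """
    common_mode = (abs(zeta_1) + abs(zeta_2)) / 2.0
    if common_mode == 0.0:
        return 0.0  # both quantities are zero: no variation (a guard, not from the paper)
    return (zeta_1 - zeta_2) / common_mode

# Two hypothetical RMSE values obtained from two different test samples.
print(dmcmr(110.0, 100.0))   # positive: first value is larger
print(dmcmr(90.0, 100.0))    # negative: first value is smaller
```

Because the ratio is dimensionless, the same ±10% band can be applied to all four statistics regardless of their units.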

Polynomial metamodels are used. The number of coefficients q for a polynomial of degree d in k dimensions is q = (k + d)!/(k! d!). These q coefficients are determined by the method of least squares in the examples.

Latin hypercube validation test samples are used to determine the four statistics above. Sample sizes of ωq are used, where the number of coefficients multiplier ω is varied in steps of 1 starting at ω = 1. Latin hypercube sampling is used to provide flexibility with sample sizes and good uniformity over the input space.
The examples are taken from Hamad [10]. Example 1 uses a one-dimensional analytic function for the response. For this example, two metamodels having different numbers of coefficients q are studied. In Example 2, a two-dimensional function that is frequently used in the literature is modeled, also via two metamodels with different complexities. The third example involves simulation results for an electronic circuit with three design variables (inputs).
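The test setup described above can be sketched as follows: the coefficient count q for a given dimension k and degree d, the sequence of sample sizes ωq, and a basic Latin hypercube sampler. This is an illustrative sketch only; the sampler is a plain textbook construction written for this example (not necessarily the one used by the authors), and all function names and the random seed are hypothetical.

```python
import math
import random

def num_coefficients(k, d):
    """Number of coefficients of a full polynomial of degree d in k dimensions."""
    return math.comb(k + d, d)  # equals (k + d)! / (k! d!)

def latin_hypercube(n, k, lower, upper, rng):
    """Return n points in [lower, upper]^k with one point per stratum in each dimension."""
    width = (upper - lower) / n
    columns = []
    for _ in range(k):
        strata = list(range(n))
        rng.shuffle(strata)  # random pairing of strata across dimensions
        columns.append([lower + (s + rng.random()) * width for s in strata])
    return [tuple(column[i] for column in columns) for i in range(n)]

rng = random.Random(0)            # fixed seed, an arbitrary choice for repeatability
q = num_coefficients(k=1, d=2)    # second-order polynomial in one dimension
sizes = [omega * q for omega in range(1, 51)]  # test sample sizes 1q, 2q, ..., 50q
sample = latin_hypercube(sizes[3], 1, -1.0, 1.0, rng)  # one test sample of size 4q

print(q, sizes[:5], len(sample))
```

The same coefficient count gives q = 6 for a fifth-order polynomial in one dimension and for a second-order polynomial in two dimensions, and q = 10 for a second-order polynomial in three dimensions, matching the values quoted in the examples.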
3.1 Example 1
The following response is defined for the space x ∈ [−1,1]:(6)
Two metamodels are derived for this response: the first is a second-order polynomial built using a minimum bias design having four points, and the second is a fifth-order polynomial derived using another minimum bias design with 10 points.
Accuracy tests are carried out using 50 samples for each metamodel. The numbers of observations for these samples are ωq; ω = 1, 2, …, 50. The number of coefficients q for the second-order and fifth-order polynomials is three and six, respectively. Calculations of RMSE, R-square, AAE, and RAAE are carried out for the second-order polynomial metamodel using the 50 test samples in turn. Results are shown in Figure 1 for one Latin hypercube sampling trial for each of the 50 samples.
Figure 1. Validation results for the second-order polynomial metamodel vs. the number of coefficients multiplier ω. 
Note from the figure that each metric settles at a “final” value shown by solid lines crossing the plots in the middle. These final values for RMSE in part (a) of the figure and AAE in part (c) are approximately 750 and 250, respectively. Based on these two results, is the second-order polynomial an acceptable metamodel? Moving on to parts (b) and (d) of the figure, with final values of approximately 0 and 19% for R-square and RAAE, respectively, is the metamodel now acceptable?
Note that results from RMSE and AAE cannot be interpreted without reference to the context of the problem to examine the range of response values. On the other hand, merely noting that the value of R-square is much lower than the upper limit of one leads to the conclusion that there is a problem with the metamodel prediction capability. Similarly, RAAE results can be interpreted by mere consideration of the values returned. A value of 19% for RAAE also indicates that there is a problem. Some quantification of the size of the problem may also be inferred from RAAE results: the prediction performance of the metamodel is in error by 19% on average. In light of the discussion so far, we may therefore say that results returned by R-square and RAAE are classified as “interpretable”, in contrast to RMSE and AAE results, which are “not interpretable without context”.
Sample-to-sample variations measured by DMCMR, defined in equation (5) above, are depicted in Figures 2 and 3. Figure 2a shows variation results for RMSE superimposed on those for R-square. Similarly, Figure 2b shows superimposition of results for AAE and RAAE. Figure 3 shows sample-to-sample variations for all four metrics for sample sizes of 10q or less.
Figure 2. Sample-to-sample variations in DMCMR vs. ω. 
Figure 3. Sample-to-sample variations in DMCMR vs. ω for “smaller” samples. 
Note from Figure 2a that the minimum sample size for RMSE to settle to within ±10% of the final value (the two dotted lines running across the figure) is approximately 25q, and approximately 30q for R-square. For smaller sample sizes of 10q or less, Figure 3 reveals that, as a whole, R-square performs worse than RMSE vis-à-vis sample-to-sample variations. This unexpected result may be explained by noting that the response y in equation (6) above is almost constant for most of the input space and increases sharply for x > 0.8, a condition which is not favorable for R-square calculation. To explain, refer to equation (2) above. The denominator of the second term in equation (2) is small for nearly constant response values, and even if the metamodel returns nearly zero MSE, the second term, which results from dividing two small numbers, is prone to numerical problems. A similar situation that casts doubt on the validity of R-square results, and that is dealt with in the literature, arises when n is close to the number of coefficients q in the metamodel. For such cases, R-square is “adjusted” to accommodate the relative size of n to q; see Kleijnen and Deflandre [2].
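The effect of a nearly constant response on R-square can be seen in a small numerical sketch. The data below are hypothetical and are not taken from equation (6); they only mimic a test sample that happens to fall in a flat region of the response. The variance is assumed to use division by n.

```python
def r_square(y, y_hat):
    """One minus MSE over the variance of y (1/n form: an assumption)."""
    n = len(y)
    mean_y = sum(y) / n
    mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / n
    variance = sum((a - mean_y) ** 2 for a in y) / n
    return 1.0 - mse / variance

def raae(y, y_hat):
    """Sum of absolute errors over the sum of absolute responses."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / sum(abs(a) for a in y)

# Hypothetical test sample from a flat part of the response: values near 5.
y_flat = [5.0, 5.001, 4.999, 5.0]
y_hat_flat = [5.002, 4.999, 5.001, 4.998]   # every prediction is within 0.002 of the truth

print(r_square(y_flat, y_hat_flat))  # strongly negative despite tiny errors
print(raae(y_flat, y_hat_flat))      # about 0.0004, i.e. 0.04% average error
```

Here both the MSE and the variance are of order 1e-6, so their ratio is dominated by small fluctuations in the sample: R-square comes out near −7 even though the predictions are within 0.04% of the response on average. A different draw of test points from the same flat region could give a very different R-square, which is one way to read the large sample-to-sample variation reported for R-square in this example.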
Sample-to-sample variations for AAE and RAAE can be compared by reference to Figure 2b, and to Figure 3 for the smaller sample sizes. It can be seen from both figures that RAAE performs slightly better than AAE. Note from Figure 2b that both AAE and RAAE settle down to ±10% variations at a minimum sample size of around 25q, which is the same result as for RMSE discussed above. Referring to Figure 3, it can be seen that RMSE behaves only marginally worse than AAE or RAAE.
In order to investigate the sample-to-sample variations of the four metrics in more detail, 10 trials are carried out for each of the 50 Latin hypercube samples and DMCMRs are calculated for each metric. Then, the minimum sample sizes after which the corresponding metric is confined within DMCMR levels of ±10% are noted and plotted in Figure 4. The average of these minimum sample sizes over the 10 trials is also calculated for each metric and given in Table 1. Figure 5 explains how the minimum size is determined for trial 1 for DMCMR calculations for RMSE.
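The procedure for finding the minimum sample size can be sketched as below. Two assumptions are made that the text does not spell out: DMCMR is taken between each sample's statistic and the “final” value (here, the value at the largest sample), and DMCMR has the form difference over mean absolute value. The function names and the data are hypothetical.

```python
def dmcmr(zeta_1, zeta_2):
    """Difference over the mean of absolute values (assumed form of DMCMR)."""
    common_mode = (abs(zeta_1) + abs(zeta_2)) / 2.0
    return 0.0 if common_mode == 0.0 else (zeta_1 - zeta_2) / common_mode

def min_adequate_multiplier(values, tolerance=0.10):
    """Smallest multiplier omega from which the statistic stays within the band.

    values[i] is the statistic computed from the test sample of size (i + 1) * q.
    The reference is the last value in the list (an assumption).
    """
    final = values[-1]
    answer = len(values)
    # Walk backwards from the largest sample until the band is first violated.
    for i in range(len(values) - 1, -1, -1):
        if abs(dmcmr(values[i], final)) > tolerance:
            break
        answer = i + 1
    return answer

# Hypothetical RMSE values for omega = 1, 2, ..., 8 in one sampling trial.
rmse_by_omega = [140.0, 95.0, 120.0, 108.0, 97.0, 103.0, 99.0, 100.0]
omega_min = min_adequate_multiplier(rmse_by_omega)
print(omega_min)  # 4: the value at omega = 3 is outside the band, later ones are inside

# Averaging over several trials (hypothetical multipliers from three trials).
trial_minima = [omega_min, 6, 5]
print(sum(trial_minima) / len(trial_minima))
```

Note that the value at ω = 2 lies inside the band but does not count, because the statistic leaves the band again at ω = 3; only confinement from some size onwards is taken as “settled”.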
Figure 4. Minimum sample sizes for confinement within ±10% DMCMR using 10 trials for each of the 50 test samples. 
Figure 5. RMSE sample-to-sample variation for Latin hypercube sampling trial 1 showing the minimum sample size of 24q for confinement within ±10% DMCMR. 
Table 1. Average of minimum sample sizes for ±10% DMCMR levels confinement using 10 trials for each test sample.
As can be seen from Figure 4, R-square consistently performs the worst for all 10 trials, while RAAE consistently performs the best. The results in Table 1 also show that RAAE has the best performance on average in terms of sample-to-sample variations in its results, with an average minimum size of 11.7q. This means that a test sample of size 11.7q on average is considered “adequate” in the sense that the results returned will not deviate by more than ±10% from the “true” value that would be obtained if the test sample size were infinite. By the same token, adequate sample sizes for RMSE, R-square, and AAE are 21q, 29.2q, and 16.2q respectively, as given in Table 1. Note that for this example R-square unexpectedly performs worse than RMSE. The reason for the unexpected performance of R-square was mentioned above, where it was explained that R-square results cannot be used in two situations: (1) if the number of observations n is close to the number of coefficients q, and (2) if the response is nearly constant for a sizable portion of the input space. Note in addition that R-square is close to zero for most test samples; see Figure 1b above.
In summary, RAAE outperforms the other three statistics for Metamodel 1 of Example 1 because: (1) it has the most interpretable results, and (2) it has the smallest adequate sample size of 11.7q, i.e., the minimum sample size for ±10% deviation from the “true” result is 11.7q, or nearly 35 observations for this case (q = 3).
3.1.1 Metamodel 2
The order of the metamodel polynomial is changed to five. The final values are approximately 450, 0.65, 170, and 13% for RMSE, R-square, AAE, and RAAE, respectively. Sample-to-sample variations for the four metrics are calculated, again using 50 Latin hypercube samples. Results for DMCMRs are shown for one sampling trial in Figure 6, superimposed for RMSE and R-square in part (a), and for AAE and RAAE in part (b) of the figure.
Figure 6. Sample-to-sample variations in DMCMR vs. ω. 
As seen in Figure 6a, the performance of R-square in terms of sample-to-sample variation is overall better than that of RMSE, as would be expected. Note for example that the minimum sample size for ±10% confinement of DMCMR level variations for the sampling trial shown is 28q for RMSE and 17q for R-square. The situation for AAE and RAAE is similar to that for Metamodel 1 above, with RAAE performing marginally better than AAE, as seen in Figure 6b. Sample-to-sample variations for the four metrics are shown in Figure 7 for sample sizes of 10q or smaller. It can be seen from the figure that for the smaller samples with sizes between 2q and 3q the variability is largest for R-square, while variability becomes largest for RMSE for sample sizes of 4q up to 10q.
Figure 7. Sample-to-sample variations vs. ω for “smaller” samples. 
In order to investigate the sample-to-sample variations of the four metrics in more detail, the scenario used for Metamodel 1 above is followed here: 10 sampling trials are carried out for each of the 50 Latin hypercube samples and DMCMRs are calculated for each metric. Then, the minimum sample sizes after which the corresponding metric is confined within DMCMR levels of ±10% are noted and plotted in Figure 8. The average of these minimum sample sizes over the 10 trials is also calculated for each metric and given in Table 1 above.
Figure 8. Minimum sample sizes for confinement within ±10% DMCMR using 10 trials for each of the 50 test samples. 
Note from Figure 8 that RMSE consistently performs the worst for all 10 trials. It can also be seen from the figure that R-square performs marginally better than AAE and RAAE for all 10 trials except trial 9. RAAE performs slightly better than AAE for all 10 trials; see Figure 8. The average minimum sample sizes in Table 1 lead to the same conclusions: RMSE has the worst average of 24.9q, and the other three metrics have nearly equal averages, with R-square slightly the best.
Therefore, we can conclude for Metamodel 2 of Example 1 that RMSE has the worst performance in relation to sample-to-sample variations, and the interpretability of its results does not place it ahead of the other three metrics either. While the adequate sample size for RAAE is slightly larger than that for R-square (16.7q for RAAE and 14.6q for R-square, as given in Table 1), its interpretability may make it the best choice in this situation as well. The grounds for rejecting a metamodel because of an R-square of 0.65 are shaky, whereas a metamodel with an RAAE of 13% is understood to have an error of 13% on average, which may give a clearer indication of whether to accept or reject the metamodel.
3.2 Example 2
The second example involves the two-dimensional response studied in other work including Martin and Simpson [5], Jin et al. [7], and Hamad et al. [11], given by: (7)
Two metamodels are derived and tested for this response. Metamodel 1 is a second-order polynomial having q = 6, while Metamodel 2 is a piecewise metamodel consisting of two second-order polynomials, one for each part of the input space partitioned along x_{1} in two halves, with q = 6 for each polynomial; see Hamad et al. [11].
3.2.1 Metamodel 1
Figure 9 shows validation results for Metamodel 1 using one hundred Latin hypercube samples with sizes ωq = 1q, 2q, …, 100q observations. Part (a) of the figure shows that most of the one hundred test samples have RMSEs around 0.32, while AAEs for most samples are around 0.27, as depicted in part (c). What can be inferred about the prediction validity of Metamodel 1 from these results? Again, prediction accuracy cannot be judged by merely referring to the results for RMSE or AAE.
Figure 9. Validation results for Metamodel 1 vs. the number of coefficients multiplier ω. 
Referring to the R-square results of Figure 9b, however, it can be immediately concluded that the metamodel is not acceptable in terms of its prediction merits because R-square is far from the upper threshold of unity for all test samples. What is wrong with Metamodel 1? This question can be answered by interpreting the RAAE results of Figure 9d. For the one hundred samples tested, RAAE levels are between 10% and 17%, with most points having 12–14% RAAE as depicted in the figure, i.e., the error for observations in most of these samples is between 12% and 14% on average.
Sample-to-sample variations for the four metrics are calculated, again using one hundred Latin hypercube samples. Results for DMCMRs are shown in Figure 10 for one sampling trial for the first 50 samples, superimposed for RMSE and R-square in part (a), and for AAE and RAAE in part (b) of the figure. Superimposition of the four metrics is shown in Figure 11 for the first 10 test samples. As can be seen from these figures, R-square performance is poor; however, the results for the other three metrics are comparable.
Figure 10. Sample-to-sample variations in DMCMR vs. ω. 
Figure 11. Sample-to-sample variations vs. ω for “smaller” samples. 
In order to investigate the sample-to-sample variations of the four metrics in more detail, we use 10 sampling trials as before for each of the first 50 Latin hypercube samples, and DMCMRs are calculated for each metric. Then, the minimum sample sizes after which the corresponding metric is confined within DMCMR levels of ±10% are noted and plotted in Figure 12. The average of these minimum sample sizes over the 10 trials is also calculated for each metric and given in Table 2.
Figure 12. Min. sample sizes for confinement within ±10% DMCMR using 10 trials for each of the 50 test samples. 
Table 2. Average of minimum sample sizes for ±10% DMCMR levels confinement using 10 trials for each test sample.
It can be seen from Figure 12 and Table 2 that sample-to-sample variation is worst for R-square, with RMSE slightly better than AAE and RAAE. Note from Figure 9b that R-square is close to zero; this is a similar situation to that obtained for Metamodel 1 of Example 1 and shown in Figure 1b.
The response considered in this example is wavy in the direction of x_{1}, and the global second-order Metamodel 1 cannot follow this wavy response consistently, leading to increased sample sizes as shown in Table 2. Results are improved by using the piecewise Metamodel 2, as demonstrated below.
3.2.2 Metamodel 2
The piecewise Metamodel 2 is validated to test its prediction accuracy using the same scheme outlined above for Metamodel 1. Results are shown in Figure 13 for R-square and RAAE only in order to save space. Part (a) of the figure shows that R-square is around 0.82 for most of the test samples, indicating that the piecewise second-order metamodel is a good improvement relative to the global Metamodel 1. With this improved R-square value of 0.82, is Metamodel 2 acceptable? RAAE provides more information that helps in answering this question. Referring to Figure 13b, it is seen that RAAE is 4–5% for most samples, meaning that Metamodel 2 has 4–5% error on average. This allows for a more informed basis for judging the acceptability of the metamodel.
Figure 13. Validation statistics for Metamodel 2 vs. the number of coefficients multiplier ω. 
Sample-to-sample variations are shown in Figure 14a for RMSE and R-square, and in Figure 14b for AAE and RAAE; Figure 15 shows these variations for the smaller samples with sizes of 10q or less. To investigate these variations in more detail, the method used above for Metamodel 1 is repeated here for Metamodel 2, where 10 sampling trials for each of the 50 Latin hypercube test samples are examined for the minimum sizes after which the respective metrics are confined to ±10% variations. The results are summarized in Figure 16 and Table 2.
Figure 14. Superimposition of DMCMR results. 
Figure 15. Variations vs. ω for “smaller” samples. 
Figure 16. Minimum sample sizes for confinement within ±10% DMCMR. 
It can be seen from Figure 16 and Table 2 that R-square performs marginally better than the other three metrics. RAAE performs the same as AAE for all 10 trials, and RMSE is slightly worse for the first seven trials; see Figure 16.
Similar conclusions can be made for Metamodel 2 in this example as those given above for Metamodel 2 in Example 1: while the sample-to-sample variation for RAAE is slightly worse than that for R-square, its interpretability may make it the best choice. The grounds for rejecting a metamodel because of an R-square of 0.82 are shaky (see Figure 13a), whereas a metamodel with an RAAE of 4–5% is understood to have an error of 4–5% on average, which may give a clearer indication of whether to accept or reject the metamodel.
3.3 Example 3
The three-dimensional problem in this subsection is an electronic engineering problem concerning the portion H of the input signal that appears at the output of the circuit of Figure 17.
Figure 17. Electric circuit for Example 3. 
The portion H is dependent upon the three design variables R _{1}, R _{2}, and R _{3} connected as shown in the figure. Using a circuit simulator gives results which are identical to those given by the following equation obtained from elementary circuit analysis techniques(8)
A second-order polynomial is constructed from a minimum bias experimental design having seventeen points in the space [1,100]^{3}. Accuracy tests are then carried out using one hundred Latin hypercube samples. The numbers of observations for these samples are ωq; ω = 1, 2, …, 100. The number of coefficients q for this case is 10. Calculations of RMSE, R-square, AAE, and RAAE are carried out for the metamodel using the one hundred test samples in turn. Results are shown in Figure 18 for RMSE and RAAE using one sampling trial for each of the one hundred Latin hypercube test samples used. Results for AAE and R-square are omitted to save space.
Figure 18. Validation statistics for Example 3 vs. the number of coefficients multiplier ω. 
Figure 18a shows that RMSE varies from a little above 0.02 to nearly 0.043, being approximately 0.033 ± 0.002 for most samples. Is this metamodel accurate based on these RMSE results? Note that in equation (8) above 0 < H < 1 for the entire input space; only once this information about the response is known can the quality of the metamodel be judged from RMSE. Interpreting the results from R-square can be carried out without reference to the context of the problem; R-square values (not shown) are between 0.959 and 0.991 for all of the one hundred test samples, a performance which is close to the upper theoretical threshold of unity.
Note that interpreting the results for RAAE shown in Figure 18b gives more information about the prediction accuracy of the metamodel. The figure reveals that for all of the one hundred test samples RAAE is 3.1–4.8%, indicating that the error in metamodel predictions is on average between 3.1% and 4.8%.
Sample-to-sample variations are depicted in Figure 19a for RMSE and R-square, and in Figure 19b for AAE and RAAE, while Figure 20 depicts these variations superimposed for small samples having 10q observations or less. The same scales are used in both parts of Figure 19 for easy comparison. Note from the figures that sample-to-sample variation is worst for RMSE, with R-square performing ahead of the other three statistics.
Figure 19. Superimposition of DMCMR results. 
Figure 20. Sample-to-sample variations vs. ω for “smaller” samples. 
These sample-to-sample variations are investigated further by carrying out 10 sampling trials for each of the first 50 of the 100 Latin hypercube samples, and DMCMRs are calculated for each of the four metrics considered. The minimum sample sizes after which the corresponding metric is confined within DMCMR levels of ±10% are plotted in Figure 21. The average of these minimum sample sizes over the 10 sampling trials is also calculated for each metric and given in Table 3.
Figure 21. Minimum sample sizes for confinement within ±10% DMCMR. 
Table 3. Average of minimum sample sizes for confinement within ±10% DMCMR levels.
It can be seen from Figure 21 and Table 3 that R-square far outperforms the other three metrics, with an average minimum size of 2q, while RMSE shows the worst results in terms of sample-to-sample variations, with an average minimum size of nearly 40q. RAAE performs the same as AAE for nearly all 10 trials, at averages of about 26q. Thus, R-square is the most robust of the four against sample-to-sample variations. In addition, the results it returns can be interpreted without context, unlike those returned by RMSE or AAE.
4. Discussion
Five metamodels are considered in this comparative study: two metamodels for each problem in Examples 1–2, and one metamodel in Example 3. These metamodels are validated in terms of their prediction accuracy using four statistics: RMSE and its derivate R-square, and AAE and its derivate RAAE. Results obtained using these four statistics are compared based on two criteria: interpretability and sample-to-sample variations. With regard to interpretability of results:

RMSE and AAE results are not interpretable without referring to the problem’s context; a metamodel with an RMSE of 0.05 and an AAE of 0.02, say, may or may not be an accurate model, depending on the magnitude of the response values.

On the other hand, there is no need to refer to the problem’s context for R-square and RAAE results; an R-square of 0.999 is always welcome and a RAAE of 0.5% is also always taken positively. However, there is more “substance” in the information returned by RAAE, since it points to the size of the average error in prediction rather than merely giving a yes/no answer on metamodel acceptability, as is the case for R-square.
Therefore, RAAE has some preference over R-square in terms of interpretability of results, with both RMSE and AAE well behind in this respect. This leading role for RAAE vis-à-vis interpretability does not necessarily guarantee it the first position among the four statistics; the other, more important aspect, sample-to-sample variability, is still to be considered.
The performance of the four statistics in terms of sample-to-sample variation is thoroughly studied in this work: at least 50 test samples are used for each of the five metamodels, with the number of observations in these samples being 1q, 2q, up to 50q, and with each of the 50 test samples undergoing 10 Latin hypercube sampling trials. Variability is quantified by the newly introduced measure referred to as the difference-mode to common-mode ratio (DMCMR). The minimum sample size for which DMCMR is confined within ±10% (see Figure 5) is noted for each of the 10 sampling trials and plotted for the five metamodels in Figures 4, 8, 12, 16, and 21. Averages of the results in these figures are summarized in Tables 1–3. These five figures and three tables, summarized in Figure 22 for convenience, are the basis upon which preliminary results for sample-to-sample variability are given. The following can be concluded by reference to the summary in Figure 22:

Sample-to-sample variability for RAAE shows a marginal improvement over AAE variability in the worst cases. Another, more important merit of RAAE over AAE relates to interpretability of results. Therefore, if a choice has to be made between AAE and RAAE, then RAAE would be the better choice, since it conveys more information about the validity of metamodel predictions with reduced sample-to-sample variability.

R-square has the best performance in relation to sample-to-sample variability, provided that the following conditions are satisfied by the response data used in its calculation: (1) the number of such data points is larger than the number of coefficients in the metamodel, and more importantly (2) the data variance is not close to zero.
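Condition (2) above can be illustrated with a small sketch. It assumes the usual definition of R-square as one minus the ratio of the error sum of squares to the total sum of squares of the response; the data and the helper name `r_square` are hypothetical.

```python
def r_square(y_true, y_pred):
    """Assumed standard form: 1 - SSE / SST."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - y_mean) ** 2 for y in y_true)
    return 1.0 - sse / sst

# The same absolute prediction errors are applied to both data sets.
errors = [0.01, -0.01, 0.01, -0.01]

spread = [0.0, 0.3, 0.6, 0.9]          # response with ample variance
flat = [0.500, 0.501, 0.500, 0.501]    # response variance close to zero

r2_spread = r_square(spread, [y + e for y, e in zip(spread, errors)])
r2_flat = r_square(flat, [y + e for y, e in zip(flat, errors)])
```

With identical absolute errors, the well-spread response gives an R-square above 0.99, while the nearly constant response gives a strongly negative value (about -399 here). The denominator, not the quality of the predictions, dominates the result when the data variance approaches zero.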
Figure 22. Summary of sample-to-sample variations results (averages in the last column are for, from top: RMSE, R-square, AAE, and RAAE).
Based on this preliminary study we are inclined to conclude that using both R-square and RAAE in metamodel predictability validation serves two purposes at the same time: (1) it minimizes the number of observations in test samples for robustness against sample-to-sample variations, and (2) it provides more solid ground for judging the acceptability of the metamodel. To clarify this latter point, recall the results obtained for Metamodel 2 in Example 2 above, with R-square values close to 0.82 in Figure 13a and RAAE values of 4–5% in Figure 13b. Using R-square results alone only permits a yes or no choice on the acceptability of the metamodel. However, if in addition it is known that the prediction error is 4–5% on average, then this may constitute better grounds for accepting (or rejecting) the metamodel.
Finally, it can be seen by referring to the summary in Figure 22 that minimum sample sizes for confinement within ±10% variability are impractically high for most cases. For instance, the average of minimum sample sizes for the 10 sampling trials is 4.6q for the R-square results related to Metamodel 2 in Example 2, and as can be seen from the figure, this is the second best case among the 20 averages given. This result means that the required sample size is nearly twenty-eight observations for the two-dimensional problem considered in the example, with six coefficients used for the corresponding metamodel. This then raises the question of whether a “substitute” statistic should be developed for exclusive use in validating metamodels for deterministic simulations. Such a statistic should be chosen to reduce the minimum size of adequate validation samples. The Metamodel Acceptance Score (MAS) discussed in [10] is one such statistic. MAS sample-to-sample variability is reduced because the MAS value depends on the error “count” in a sample rather than on the error average.
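The sample size quoted above for Metamodel 2 in Example 2 follows directly from the average multiplier and the number of coefficients. Writing the average minimum multiplier as ω_min (a symbol introduced here only for illustration) and taking q = 6 coefficients:

```latex
n_{\min} = \omega_{\min}\, q = 4.6 \times 6 = 27.6 \approx 28 \ \text{observations}
```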
5. Conclusions
This paper presented a comparative study of four statistics used in validating metamodels in deterministic simulations: RMSE along with its derivate R-square, and AAE and its derivate RAAE. RMSE and AAE are “purely” average-based, while their derivates are “less” dependent on averages through the cancellation of the sample size terms in the numerators and denominators of their defining equations. It is shown that the derivates are less prone to variations in sample sizes than their respective parent statistics, at least marginally as in the case of AAE and RAAE, provided that the derivate is used correctly in the first place, e.g., that the variance of the response data used for R-square calculations does not approach zero. In addition to being more robust against sample size variations, the derivate statistics provide better interpretability of results without reference to the context of the problem. Based on the results of this preliminary study, we are not hesitant to advocate the use of R-square and RAAE together in reporting metamodel validation results for deterministic simulations instead of RMSE or AAE. However, due to the excessive sample sizes required for ±10% sample-to-sample variations, we are even less hesitant to report the need to develop a substitute statistic which can be used exclusively for deterministic simulations.
References
1. Sargent R. 2004. Validation and verification of simulation models. In Proceedings of IEEE Winter Simulation Conference. IEEE, p. 13–24.
2. Kleijnen J, Deflandre D. 2006. Validation of regression metamodels in simulation: bootstrap approach. European Journal of Operational Research, 170, 120–131.
3. Hamad H, Al-Hamdan S. 2005. Two new subjective validation methods using data displays. In Proceedings of IEEE Winter Simulation Conference. IEEE, p. 2542–2545.
4. Hamad H, Al-Hamdan S. 2007. Discovering metamodels’ quality-of-fit via graphical techniques. European Journal of Operational Research, 178, 543–559.
5. Martin J, Simpson T. 2005. Use of Kriging models to approximate deterministic computer models. American Institute of Aeronautics and Astronautics Journal, 43, 853–863.
6. Meckesheimer M, Booker A, Barton R, Simpson T. 2002. Computationally inexpensive metamodel assessment strategies. American Institute of Aeronautics and Astronautics Journal, 40, 2053–2056.
7. Jin R, Chen W, Simpson T. 2002. Comparative studies of metamodeling techniques under multiple modeling criteria. Journal of Structural Optimization, 23, 1–5.
8. Qu X, Venter G, Haftka R. 2004. New formulation of a minimum-bias central composite experimental design and Gauss quadrature. Structural and Multidisciplinary Optimization, 28, 231–242.
9. Eeckelaert T, Daems W, Gielen G, Sansen W. 2004. Generalized simulation-based posynomial model generation for analog integrated circuits. Analog Integrated Circuits and Signal Processing, 40, 193–203.
10. Hamad H. 2011. Validation of metamodels in simulation: a new metric. Engineering with Computers, 27, 309–317.
11. Hamad H, Al-Hamdan S, Al-Zaben A. 2010. Space partitioning in engineering design: a graphical approach. Structural and Multidisciplinary Optimization, 41, 441–452.
Cite this article as: Hamad H, Al-Zaben A & Owies R: Interpretability and variability of metamodel validation statistics in engineering system design optimization: a practical study. Int. J. Simul. Multisci. Des. Optim., 2014, 5, A05.