initial commit FESurrogateModelTutorial

2026-05-21 17:03:51 +09:00
parent 93665d9ee6
commit 43b86669fa
122 changed files with 7929 additions and 0 deletions
@@ -0,0 +1,114 @@
+# DOE, Sampling, Validation
+
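+## Purpose
+The quality of a surrogate model depends on the data design as much as on the choice of model. For an FEM surrogate, the input variable ranges, the sampling method, the train/test split, and the validation metrics all affect how the model's results should be interpreted.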
+
+## Defining the Input Space
+The default input variables in this tutorial are as follows.
+
+| Variable | Meaning | Unit | Example range |
+| --- | --- | --- | --- |
+| `L_m` | beam length | m | 1.0-3.0 |
+| `b_m` | rectangular section width | m | 0.02-0.08 |
+| `h_m` | rectangular section height | m | 0.04-0.16 |
+| `E_pa` | Young's modulus | Pa | 100e9-220e9 |
+| `P_n` | tip point load magnitude | N | 100-2000 |
+
+Derived variables are computed immediately before the FEM analysis.
+
+```text
+A = b h
+I = b h^3 / 12
+```
+
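+`h` enters the bending stiffness cubed, so it has a large effect on tip displacement and bending stress. Because of this structural nonlinearity, comparing a range of surrogates is more meaningful than relying on simple linear regression alone.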
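
The cubic dependence can be checked with a short calculation. This is a minimal sketch that assumes a cantilever fixed at one end with the point load at the free tip, for which the Euler-Bernoulli tip deflection magnitude is `P L^3 / (3 E I)`. The function name `tip_deflection` and the sample values are illustrative, not taken from the tutorial code.

```python
def tip_deflection(L_m, b_m, h_m, E_pa, P_n):
    """Tip deflection magnitude of an end-loaded cantilever (assumed boundary conditions)."""
    I = b_m * h_m**3 / 12.0  # second moment of area of the rectangular section
    return P_n * L_m**3 / (3.0 * E_pa * I)


# Same beam, with only the section height doubled (both values lie in the table range).
base = tip_deflection(L_m=2.0, b_m=0.05, h_m=0.08, E_pa=200e9, P_n=1000.0)
taller = tip_deflection(L_m=2.0, b_m=0.05, h_m=0.16, E_pa=200e9, P_n=1000.0)

ratio = base / taller
print(f"deflection ratio when h doubles: {ratio:.3f}")
```

Doubling `h` reduces the deflection by a factor of about 8, which a model that is linear in `h` cannot reproduce over the whole range.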
+
+## Latin Hypercube Sampling
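+Latin Hypercube Sampling (LHS) stratifies the marginal distribution of each input variable so that even a small number of samples covers each variable's range fairly evenly. The 1979 paper by McKay, Beckman, and Conover is the classic reference comparing sampling plans for the analysis of computer code output.
+
+SciPy's `scipy.stats.qmc.LatinHypercube` generates samples in the unit hypercube `[0, 1)^d`, which the user then scales to the physical ranges.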
+
+```text
+u ~ LHS([0,1)^d)
+x_j = lower_j + u_j (upper_j - lower_j)
+```
+
+This tutorial uses the following defaults.
+
+- `n_samples = 300`
+- `seed = 20260521`
+- `target = tip_uy_m`
+- The same dataset and the same split are used in every model notebook
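
The sampling and scaling steps above can be sketched as follows. The bounds come from the input table and the sample count and seed from the defaults; the variable names `lower`, `upper`, `u`, and `x` are illustrative, and the final check is only a sanity test of the stratification property, not part of the tutorial pipeline.

```python
import numpy as np
from scipy.stats import qmc

# Bounds copied from the input table, in the order L_m, b_m, h_m, E_pa, P_n.
names = ["L_m", "b_m", "h_m", "E_pa", "P_n"]
lower = [1.0, 0.02, 0.04, 100e9, 100.0]
upper = [3.0, 0.08, 0.16, 220e9, 2000.0]

n_samples = 300
seed = 20260521

sampler = qmc.LatinHypercube(d=len(names), seed=seed)
u = sampler.random(n=n_samples)   # shape (300, 5), values in [0, 1)
x = qmc.scale(u, lower, upper)    # x_j = lower_j + u_j * (upper_j - lower_j)

# Stratification check: for every variable, each of the n_samples
# equal-width bins of [0, 1) should contain exactly one sample.
bins = np.floor(u * n_samples).astype(int)
one_per_bin = all(
    len(np.unique(bins[:, j])) == n_samples for j in range(len(names))
)
print(x.shape, one_per_bin)
```

Fixing the seed makes the design reproducible, which is what allows every model notebook to work from the same dataset.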
+
+## Train/Test Split
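+Judging a model on a single test set alone can be sensitive to how the samples happen to fall. The following two are therefore used together.
+
+- Hold-out test set: for the final performance comparison.
+- K-fold cross validation: for checking model stability within the training data.
+
+Recommended defaults: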
+
+```text
+test_size = 0.2
+cv_folds = 5
+random_state = 20260521
+```
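
As an illustration of how these defaults combine, the sketch below builds a hold-out split and the K-fold partitions by hand using only the standard library. It is an assumption-level sketch of the index bookkeeping; in practice a library such as scikit-learn (`train_test_split`, `KFold`) would typically do this, and its shuffling would assign different rows to each set.

```python
import random

n_samples = 300
test_size = 0.2
cv_folds = 5
random_state = 20260521

# Shuffle row indices once with a fixed seed so every notebook gets the same split.
indices = list(range(n_samples))
random.Random(random_state).shuffle(indices)

n_test = round(n_samples * test_size)
test_idx = indices[:n_test]    # hold-out set, used only for the final comparison
train_idx = indices[n_test:]   # used for fitting and for cross validation

# K-fold: partition the training indices into cv_folds validation blocks.
folds = [train_idx[k::cv_folds] for k in range(cv_folds)]
for k, val_idx in enumerate(folds):
    val_set = set(val_idx)
    fit_idx = [i for i in train_idx if i not in val_set]
    print(f"fold {k}: fit on {len(fit_idx)} rows, validate on {len(val_idx)} rows")
```

With 300 samples this leaves 240 training rows and 60 test rows, and each fold validates on 48 rows that never overlap the hold-out set.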
+
+## Evaluation Metrics
+### RMSE
+Sensitive to large errors. Useful for spotting dangerous outliers in a structural-analysis surrogate.
+
+```text
+RMSE = sqrt(mean((y - y_hat)^2))
+```
+
+### MAE
+Shows the average absolute error in an intuitive way.
+
+```text
+MAE = mean(abs(y - y_hat))
+```
+
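+### R2
+Indicates the proportion of variance explained. Because it is normalized by the spread of the target values in the evaluated set, its interpretation can be distorted by the target scale and the test set distribution, so read it together with RMSE/MAE.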
+
+```text
+R2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
+```
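
The three formulas above translate directly into code. The sketch below uses only the standard library and a made-up four-point example with one large error; the function names and the data are illustrative.

```python
import math


def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    """Coefficient of determination relative to the mean of y."""
    y_mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Three exact predictions and one that is off by 2.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 6.0]

scores = {"rmse": rmse(y, y_hat), "mae": mae(y, y_hat), "r2": r2(y, y_hat)}
print(scores)
```

Here the single outlier gives an RMSE of 1.0 against an MAE of 0.5, and R2 falls to about 0.2 even though three of the four predictions are exact. This is why the metrics are read together.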
+
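+## Plot-Based Diagnostics
+- Parity plot: check whether predicted and actual values lie around the `y=x` line.
+- Residual plot: check for error patterns against the predicted values or key input variables.
+- Error histogram: check the error distribution and outliers.
+- Model comparison bar plot: compare RMSE, MAE, R2, training time, and prediction time.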
+
+## Dataset Metadata
+A CSV alone makes it hard to tell under what conditions the dataset was generated. A metadata JSON file is therefore saved alongside it.
+
+```json
+{
+  "dataset_name": "beam2d_lhs_300",
+  "sample_count": 300,
+  "random_seed": 20260521,
+  "unit_system": "SI",
+  "fea_model": "2D Euler-Bernoulli beam/frame, linear static",
+  "target_columns": [
+    "tip_uy_m",
+    "max_abs_bending_stress_pa",
+    "mass_kg",
+    "compliance_j"
+  ]
+}
+```
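
One way to write such a file next to the CSV is sketched below. The output directory and the `<dataset_name>.metadata.json` naming are hypothetical choices for illustration; the tutorial's actual paths are not shown here.

```python
import json
import tempfile
from pathlib import Path

metadata = {
    "dataset_name": "beam2d_lhs_300",
    "sample_count": 300,
    "random_seed": 20260521,
    "unit_system": "SI",
    "fea_model": "2D Euler-Bernoulli beam/frame, linear static",
    "target_columns": [
        "tip_uy_m",
        "max_abs_bending_stress_pa",
        "mass_kg",
        "compliance_j",
    ],
}

# Hypothetical location: a temporary directory stands in for the dataset folder.
out_dir = Path(tempfile.mkdtemp())
meta_path = out_dir / f"{metadata['dataset_name']}.metadata.json"

# Write the metadata, then read it back to confirm it round-trips unchanged.
meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
loaded = json.loads(meta_path.read_text(encoding="utf-8"))
print(meta_path.name, loaded["sample_count"])
```

Deriving the file name from `dataset_name` keeps the CSV and its metadata easy to pair up when several datasets sit in the same folder.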
+
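+## Caveats
+- The DOE range is the surrogate's valid domain. Predictions outside that range get a separate warning.
+- Do not allow physically impossible combinations.
+- Do not silently drop FEM solver failure cases; record them in the metadata.
+- If units are mixed, model comparison becomes meaningless.
+- If the same dataset is not used, the comparison across models is not fair.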
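
The first caveat can be enforced with a small guard around prediction. This is a minimal sketch: `BOUNDS` repeats the ranges from the input table, while `out_of_range`, `checked_predict`, and the stand-in `predict` callable are hypothetical names, not part of the tutorial code.

```python
import warnings

# DOE ranges from the input table; these define the surrogate's valid domain.
BOUNDS = {
    "L_m": (1.0, 3.0),
    "b_m": (0.02, 0.08),
    "h_m": (0.04, 0.16),
    "E_pa": (100e9, 220e9),
    "P_n": (100.0, 2000.0),
}


def out_of_range(sample):
    """Return the names of inputs that fall outside the DOE range."""
    return [
        name for name, (lo, hi) in BOUNDS.items()
        if not lo <= sample[name] <= hi
    ]


def checked_predict(predict, sample):
    """Call a surrogate's predict function, warning when it extrapolates."""
    bad = out_of_range(sample)
    if bad:
        warnings.warn(f"inputs outside DOE range: {bad}")
    return predict(sample)


inside = {"L_m": 2.0, "b_m": 0.05, "h_m": 0.08, "E_pa": 200e9, "P_n": 1000.0}
outside = dict(inside, L_m=3.5)  # beam longer than any training sample

print(out_of_range(inside), out_of_range(outside))
```

The guard warns rather than raises, so an extrapolated prediction is still returned but cannot pass unnoticed.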
+
+## References
+- McKay, M. D., Beckman, R. J., and Conover, W. J. (1979), "Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code", Technometrics. https://doi.org/10.1080/00401706.1979.10489755
+- OSTI bibliographic record for McKay et al. https://www.osti.gov/biblio/5236110
+- SciPy `LatinHypercube` documentation. https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.qmc.LatinHypercube.html