Vibe Coder Academy – AI Systems for Real-Money Work

Models do not fail with an error. They get gradually worse as real-world data moves away from what they were trained on (for example, a bank redesigns its transfer slip). The job of MLOps is to know before the customer does: measure drift, alert promptly, and deploy a new version without downtime.

The problem: models degrade silently

The accuracy measured at deploy time is only a snapshot of that day. When the distribution of the inputs changes (data drift), or the relationship between input and output changes (concept drift), the model performs worse without raising any error. We therefore need to compare the distribution of features and predictions against a baseline continuously.
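
To make the distinction concrete, here is a minimal simulation of concept drift. Everything in it is synthetic: the one-feature "model", the 0.5 and 0.7 thresholds, and the function names are assumptions chosen for illustration. The inputs keep the same distribution throughout, only the rule that links input to label moves, and the model never raises an error.

```python
import random

def model(x: float) -> int:
    # A frozen "trained" model: it learned that the label flips at 0.5
    return int(x > 0.5)

def accuracy(xs, true_threshold: float) -> float:
    # The true label is 1 when x exceeds the threshold currently in force
    hits = sum(model(x) == int(x > true_threshold) for x in xs)
    return hits / len(xs)

rng = random.Random(0)
xs = [rng.random() for _ in range(10_000)]  # same input distribution throughout

acc_at_deploy = accuracy(xs, true_threshold=0.5)    # world matches training
acc_after_drift = accuracy(xs, true_threshold=0.7)  # concept drift: the rule moved

print(f"at deploy: {acc_at_deploy:.2f}, after concept drift: {acc_after_drift:.2f}")
```

Accuracy starts at 1.0 and drops to roughly 0.8, because about a fifth of the inputs now sit between the old and new thresholds. In this toy case the inputs never move, so a distribution check on the input alone would not flag the change; catching it needs labelled outcomes once they arrive.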

Population Stability Index

PSI measures how far the current distribution has shifted from the baseline. It splits the values into bins and compares the share of data in each bin. The conventional thresholds used in risk work are:

PSI < 0.1 - stable, no action needed
0.1 ≤ PSI < 0.25 - starting to shift, watch closely
PSI ≥ 0.25 - large shift, consider retrain/rollback

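Formula: PSI = Σ (actual% − expected%) × ln(actual% / expected%), summed over all bins, where expected% is a bin's share of the baseline data and actual% is its share of the recent data.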
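
As a worked example of the formula, the sketch below computes PSI by hand for three bins. The bin shares are hypothetical numbers picked for illustration, not data from a real model.

```python
import math

# Hypothetical share of data per bin (each list sums to 1.0)
expected = [0.50, 0.30, 0.20]  # baseline
actual = [0.30, 0.30, 0.40]    # recent window

# One term per bin: (actual% - expected%) * ln(actual% / expected%)
terms = [(a - e) * math.log(a / e) for a, e in zip(actual, expected)]
psi = sum(terms)

for i, t in enumerate(terms, start=1):
    print(f"bin {i}: {t:.4f}")
print(f"PSI = {psi:.4f}")
```

The result is about 0.2408, which falls in the 0.1 to 0.25 "watch" band. The middle bin contributes nothing because its share did not change, and every term is non-negative because (actual% − expected%) and ln(actual% / expected%) always share the same sign.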

Computing PSI in Python

psi.py

Python

import numpy as np

def population_stability_index(
    expected: np.ndarray,   # baseline (training-time data)
    actual: np.ndarray,     # recent production data
    bins: int = 10,
) -> float:
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if expected.size == 0 or actual.size == 0:
        raise ValueError("expected and actual must both be non-empty")

    # Use baseline quantiles as bin edges so each baseline bin is balanced
    quantiles = np.linspace(0, 1, bins + 1)
    edges = np.unique(np.quantile(expected, quantiles))
    # Open the outer bins so values outside the baseline range still count.
    # Built this way, a constant baseline (one unique edge) gives one valid bin.
    edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))

    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)

    eps = 1e-6  # guards against log(0) and division by zero
    exp_pct = exp_counts / exp_counts.sum() + eps
    act_pct = act_counts / act_counts.sum() + eps

    psi = np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct))
    return float(psi)

def drift_level(psi: float) -> str:
    if psi < 0.10:
        return "stable"
    if psi < 0.25:
        return "watch"
    return "alert"

Alerting to Discord / Slack

When PSI enters the alert zone, post to a channel the team will see right away, using an incoming webhook. Do not repeat the alert every hour: use a dedupe mechanism to prevent alert fatigue.

alerts.py

Python

import os
import httpx

DISCORD_WEBHOOK = os.environ["DISCORD_WEBHOOK_URL"]
SLACK_WEBHOOK = os.environ["SLACK_WEBHOOK_URL"]

async def alert_drift(feature: str, psi: float, model_version: str) -> None:
    title = f":rotating_light: Drift detected: {feature}"
    detail = (f"PSI = {psi:.3f} (alert threshold ≥ 0.25)\n"
              f"model: {model_version}")

    async with httpx.AsyncClient(timeout=10) as client:
        # Discord: embed payload
        discord_resp = await client.post(DISCORD_WEBHOOK, json={
            "embeds": [{
                "title": title,
                "description": detail,
                "color": 0xE0B34F,
            }],
        })
        # Slack: plain-text payload
        slack_resp = await client.post(SLACK_WEBHOOK, json={
            "text": f"{title}\n{detail}",
        })

    # Raise on HTTP errors so a rejected webhook is not mistaken for a sent alert
    discord_resp.raise_for_status()
    slack_resp.raise_for_status()

monitor_job.py

Python

# Runs as a scheduled task (for example, hourly) on a worker
from .psi import population_stability_index, drift_level
from .alerts import alert_drift

DEDUPE_TTL = 6 * 3600  # do not re-alert within 6 hours

async def check_drift(redis, store, model_version: str) -> None:
    for feature in ["slip_brightness", "ocr_text_len", "prediction_conf"]:
        baseline = store.baseline(feature)
        recent = store.recent(feature, window="24h")
        if len(recent) == 0:
            continue  # no traffic in the window, nothing to compare
        psi = population_stability_index(baseline, recent)

        if drift_level(psi) == "alert":
            key = f"alerted:{model_version}:{feature}"
            # SET NX EX: only the first caller inside the TTL gets a truthy result
            if await redis.set(key, "1", nx=True, ex=DEDUPE_TTL):
                try:
                    await alert_drift(feature, psi, model_version)
                except Exception:
                    # Delivery failed: release the key so the next run retries
                    await redis.delete(key)
                    raise
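
The dedupe rule in monitor_job.py relies on Redis `SET` with `nx=True` and `ex=...`: the first caller sets the key and alerts, and later callers are suppressed until the key expires. The sketch below reproduces that behaviour in memory so the rule can be reasoned about and tested without Redis. `AlertDeduper`, its injectable clock, and the example key are hypothetical and only stand in for the Redis call.

```python
import time

class AlertDeduper:
    """In-memory stand-in for the Redis SET NX EX pattern (illustrative only)."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expires_at = {}  # key -> time at which the suppression ends

    def should_alert(self, key: str) -> bool:
        now = self.clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and now < expiry:
            return False  # already alerted inside the TTL window
        self._expires_at[key] = now + self.ttl
        return True

# Simulated clock so the example runs instantly
fake_now = [0.0]
dedupe = AlertDeduper(ttl_seconds=6 * 3600, clock=lambda: fake_now[0])

key = "alerted:v2:ocr_text_len"     # hypothetical model version and feature
first = dedupe.should_alert(key)    # first alert goes out
fake_now[0] = 3600                  # one hour later
second = dedupe.should_alert(key)   # suppressed
fake_now[0] = 6 * 3600 + 1          # just past the TTL
third = dedupe.should_alert(key)    # allowed again
print(first, second, third)
```

An in-memory version like this only works inside a single process. With several workers running the job, a shared store such as Redis is what keeps them from each sending the same alert.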

Blue-green zero-downtime deployment

When you need to deploy a new model version, do not replace the old one in place. Run “green” (new) alongside “blue” (current), run a health check and a smoke test against green, and only then switch traffic at the load balancer. If green breaks, point traffic back to blue at once, with no downtime.

deploy.sh

Shell

#!/usr/bin/env bash
set -euo pipefail

NEW_VERSION="${1:?usage: deploy.sh <new-version>}"

# Stop green and exit; blue keeps serving traffic untouched
abort() { echo "$1, aborting deploy" >&2; docker compose stop model-green; exit 1; }

# 1) Start green alongside blue (separate port/container)
docker compose up -d --no-deps model-green
echo "Waiting for green to warm up..."
sleep 5

# 2) Health check + smoke test on green before it takes real traffic
healthy=0
for i in $(seq 1 10); do
  if curl -fsS http://localhost:9091/health >/dev/null; then healthy=1; break; fi
  sleep 2
done
[ "$healthy" -eq 1 ] || abort "green never became healthy"

# Capture the response first so the check does not depend on pipe exit codes
response="$(curl -fsS -X POST http://localhost:9091/predict \
  -F file=@./fixtures/golden_slip.jpg)" || abort "smoke test request failed"
[[ "$response" == *'"status":"done"'* ]] || abort "smoke test failed"

# 3) Repoint the active upstream at green, validate the config, then reload
#    gracefully (in-flight connections are not cut)
ln -sfn /etc/nginx/upstreams/green.conf /etc/nginx/upstreams/active.conf
nginx -t
nginx -s reload

echo "green (${NEW_VERSION}) is now taking traffic - blue stays up for rollback"
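
Rollback is the same switch in reverse: point `active.conf` back at blue and reload. The sketch below shows one way to make the symlink swap atomic at the filesystem level, by creating the new link under a temporary name and renaming it over the old one. `point_symlink` is a hypothetical helper, and the demo runs in a temporary directory rather than `/etc/nginx`. In the real setup a `nginx -s reload` would still follow, as in deploy.sh.

```python
import os
import tempfile

def point_symlink(link_path: str, target: str) -> None:
    """Repoint link_path at target via rename, which is atomic on POSIX."""
    tmp_path = f"{link_path}.tmp.{os.getpid()}"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)  # leftover from an interrupted run
    os.symlink(target, tmp_path)      # build the new link under a temporary name
    os.replace(tmp_path, link_path)   # then rename it over the old link in one step

# Demo in a scratch directory standing in for /etc/nginx/upstreams
with tempfile.TemporaryDirectory() as d:
    blue = os.path.join(d, "blue.conf")
    green = os.path.join(d, "green.conf")
    active = os.path.join(d, "active.conf")
    for path in (blue, green):
        open(path, "w").close()

    point_symlink(active, green)          # cut over to green
    after_deploy = os.readlink(active)
    point_symlink(active, blue)           # rollback: point back at blue
    after_rollback = os.readlink(active)

print(os.path.basename(after_deploy), os.path.basename(after_rollback))
```

Because the rename either happens fully or not at all, a reader of `active.conf` sees the old target or the new one, never a missing link.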

health.py

Python

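from contextlib import asynccontextmanager

from fastapi import FastAPI, Response

# Assumed project helpers (hypothetical module path; adjust to your codebase)
from .model import load_model_weights, run_dummy_prediction

MODEL_READY = False

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup hook: runs before the app starts serving requests
    global MODEL_READY
    await load_model_weights()      # load weights before declaring ready
    await run_dummy_prediction()    # warm the inference path
    MODEL_READY = True
    yield
    MODEL_READY = False             # shutting down: stop advertising readiness

app = FastAPI(lifespan=lifespan)

@app.get("/health")
def health(response: Response):
    # Readiness, not just liveness: until the model is ready, return 503
    # so the load balancer sends no traffic here
    if not MODEL_READY:
        response.status_code = 503
        return {"status": "warming"}
    return {"status": "ok"}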

Production checklist

Store the baseline distribution at deploy time, then measure PSI on features and predictions continuously
Set the alert threshold at PSI ≥ 0.25 and dedupe so repeated alerts do not cause alert fatigue
Deploy blue-green: always smoke test green before switching traffic
The readiness probe must reflect real readiness, including model warm-up
Keep a rollback switch ready to use at once - blue stays up until you are confident green is stable
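
The first checklist item can be sketched as follows: rather than keeping the raw training data around, store the bin cut points and baseline shares as a small JSON snapshot, then compute PSI against that snapshot later. This is a dependency-free illustration, not the psi.py implementation. `make_baseline`, `psi_from_baseline`, and the snapshot layout are hypothetical, and `statistics.quantiles` may interpolate cut points slightly differently from `np.quantile`.

```python
import bisect
import json
import math
import statistics

def make_baseline(values, bins=10):
    # Interior cut points at baseline quantiles; the outer bins are open-ended
    cuts = sorted(set(statistics.quantiles(values, n=bins)))
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[bisect.bisect_right(cuts, v)] += 1
    return {"cuts": cuts, "expected_pct": [c / len(values) for c in counts]}

def psi_from_baseline(baseline, recent, eps=1e-6):
    # Bin the recent values with the stored cut points, then apply the PSI formula
    cuts = baseline["cuts"]
    counts = [0] * (len(cuts) + 1)
    for v in recent:
        counts[bisect.bisect_right(cuts, v)] += 1
    total = 0.0
    for e, c in zip(baseline["expected_pct"], counts):
        e, a = e + eps, c / len(recent) + eps
        total += (a - e) * math.log(a / e)
    return total

training_values = [float(v) for v in range(1000)]  # synthetic stand-in data
baseline = make_baseline(training_values)

snapshot = json.dumps(baseline)   # what would be saved next to the model
restored = json.loads(snapshot)   # what the monitoring job would read back

psi_same = psi_from_baseline(restored, training_values)
psi_shifted = psi_from_baseline(restored, [v + 500 for v in training_values])
print(f"same: {psi_same:.3f}  shifted: {psi_shifted:.3f}")
```

Feeding the baseline data back in gives a PSI of zero, while shifting every value by half the range lands far above the 0.25 alert threshold. Bins that end up empty in the recent window contribute heavily because of the small `eps`, so the exact value of a large PSI should be read as "alert" rather than as a precise distance.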

Key takeaways

PSI measures how far the current distribution has shifted from the baseline; the alert threshold is PSI ≥ 0.25
Deploy blue-green: always smoke test green before switching traffic
The readiness probe must reflect real readiness, including model warm-up

Test your understanding

End-of-chapter quiz

01. What does the Population Stability Index (PSI) measure?
02. Under the standard thresholds used in risk work, at what PSI level should you consider a retrain or rollback?
03. What is the advantage of measuring PSI on the distribution of predictions (for example, confidence)?
04. What is the key to making a blue-green deployment truly zero-downtime?
