fix: address design review ACT NOW items (6 risk gaps)

- Add migrate_config() to ConfigGenerator protocol for schema version upgrades - Add per-server operation lock to ProcessManager to prevent start/stop races - Add busy_timeout retry/backoff strategy (exponential: 1s, 2s, 4s) for DB lock exhaustion - Add ConfigForm testing strategy and error boundary for malformed schemas - Add schema cache invalidation on adapter version change - Add ConfigMigrationError to typed adapter exceptions
2026-04-16 17:29:19 +07:00
parent 624d7594e2
commit b17d199301
6 changed files with 94 additions and 4 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -211,6 +211,7 @@ def start(self, server_id: int) -> dict:
 ### ProcessManager (Core)
 - Singleton that owns all subprocess handles
 - Thread-safe dict: `{server_id: subprocess.Popen}`
+- **Per-server operation lock** (`_operation_locks: dict[int, threading.Lock]`) — serializes start/stop/restart for the same server. Prevents race conditions when two admins hit "start" simultaneously or when start+stop overlap. Each server gets its own lock; different servers operate independently.
 - `start()` sets `cwd=servers/{server_id}/` so relative config paths resolve correctly
 - On Windows: `terminate()` = `TerminateProcess` (hard kill, no SIGTERM) — graceful shutdown must go through adapter's RemoteAdmin
 - Provides: `start()`, `stop()`, `restart()`, `is_running()`, `send_command()`
@@ -232,6 +233,7 @@ def start(self, server_id: int) -> dict:
 - **Validation is delegated to adapter's Pydantic models** — core never inspects config content
 - **Sensitive field encryption**: calls `adapter.get_config_generator().get_sensitive_fields(section)` to identify which JSON keys to encrypt/decrypt via Fernet
 - **Optimistic locking**: each row includes `config_version` (integer). On PUT, client sends the version they read. If version mismatch, return 409 Conflict.
+- **Schema migration**: on read, if `schema_version` differs from `adapter.get_config_version()`, core calls `adapter.migrate_config(old_version, config_json)`. On success, updates the row with migrated JSON and new `schema_version`. On `ConfigMigrationError`, keeps original config and logs a warning.
 - Provides: `get_section()`, `get_all_sections()`, `upsert_section()`, `delete_sections()`

 ### Adapter Exceptions (Standard Error Types)
@@ -245,6 +247,7 @@ Adapters raise specific exception types so the core can handle errors precisely:
 | `LaunchArgsError` | build_launch_args() fails (missing mod, bad path) | Set server status='error', return 400 |
 | `RemoteAdminError` | Remote admin connection/command fails | Log warning, return 503 with detail |
 | `ExeNotAllowedError` | Executable not in adapter's allowlist | Return 400 with allowed list |
+| `ConfigMigrationError` | `migrate_config()` fails to transform old schema | Keep original config, log warning, server runs with old schema |

 ---

@@ -529,10 +532,11 @@ Steps:
 | Sync vs async DB | **Sync SQLAlchemy only** | All DB access is synchronous; background threads are non-async; no aiosqlite dependency |
 | WebSocket auth | JWT in query param on connect | Browser WS API doesn't support headers |
 | Process ownership | **ProcessManager singleton** | Single source of truth; prevents duplicate launches |
+| Server operation safety | **Per-server operation lock** | ProcessManager holds a lock per server_id that serializes start/stop/restart. Two concurrent start requests for the same server: the second waits for the first to complete, then sees the server is already running. Different servers are independent (no cross-server locking). |
 | Config files | **Adapter regenerates on each start** | Always fresh from DB; no sync drift; adapter's structured builder prevents config injection |
 | Config write failure | **Atomic write + rollback** | Adapter writes to temp files first, then atomic rename. On failure, temp files are cleaned up — original files remain untouched. Server start never proceeds with partial config. |
 | Sensitive field encryption | **Adapter declares via get_sensitive_fields()** | ConfigGenerator protocol returns list of JSON keys per section that need Fernet encryption. Core's ConfigRepository handles encrypt/decrypt transparently. |
-| Adapter schema versioning | **config_version in game_configs row** | Each config section row stores a version string. On adapter update, if version differs, adapter provides a migration function. |
+| Adapter schema versioning | **config_version in game_configs row** | Each config section row stores a version string. On adapter update, if version differs, core calls `adapter.migrate_config(old_version, config_json)` which returns the migrated dict. On migration failure (ConfigMigrationError), core keeps the original config and logs a warning — the server can still run with the old schema. |
 | Adapter error communication | **Typed adapter exceptions** | Adapters raise specific exception types (ConfigWriteError, ConfigValidationError, LaunchArgsError, RemoteAdminError). Core catches specifically and sets appropriate DB status + returns clear API errors. |
 | Remote admin thread safety | **Core wraps with lock** | Core wraps RemoteAdminClient calls with a threading.Lock. Adapter clients don't need to be thread-safe. One lock per server — API requests and poller thread share safely. |
 | Third-party adapter loading | **Setuptools entry_points** | Third-party adapters register via `languard.adapters` entry_point group. Core scans entry_points at startup and auto-registers. Built-in adapters registered on import. |
--- a/DATABASE.md
+++ b/DATABASE.md
@@ -6,6 +6,7 @@
 - WAL mode enabled: `PRAGMA journal_mode=WAL` — allows concurrent reads during writes
 - Foreign keys enabled: `PRAGMA foreign_keys=ON`
 - Busy timeout: `PRAGMA busy_timeout=5000` — prevents "database is locked" errors under concurrent thread writes
+- **Retry on exhaustion**: If busy_timeout is exceeded (5s), writes fail with `OperationalError("database is locked")`. Background threads retry with exponential backoff (1s, 2s, 4s), then skip the tick. API handlers retry up to 2 times with 1s backoff, then return 503. See THREADING.md for the full retry implementation.

 ---

@@ -537,7 +538,17 @@ The `config_version` column in `game_configs` prevents lost updates when two adm
 4. If match: increment version, write new JSON, return 200
 5. If no match (version changed): return **409 Conflict** with current config for client-side merge

-## game_data JSON Schema
+## Config Schema Migration
+
+When the adapter is updated and `get_config_version()` returns a newer version than what's stored in `game_configs.schema_version`, the core automatically migrates:
+
+1. On read, detect `stored_schema_version != adapter.get_config_version()`
+2. Call `adapter.migrate_config(stored_schema_version, config_json)` → returns migrated dict
+3. Update the row: `SET config_json = ?, schema_version = ? WHERE id = ?`
+4. On `ConfigMigrationError`: keep original config, log warning, server runs with old schema
+5. Migration is per-section — each section can have a different stored version
+
+This ensures config data is always compatible with the current adapter without manual intervention.

 The `game_data` columns on `players`, `missions`, `mods`, and `bans` are validated by adapter-provided Pydantic models. Each capability protocol optionally provides a schema:

--- a/FRONTEND.md
+++ b/FRONTEND.md
@@ -895,6 +895,14 @@ frontend/
 - Optimistic locking (send `config_version` on save)
 - 409 conflict resolution (show diff, allow override/merge)
 - Validation errors from adapter (field-level, from Pydantic)
+- **Error boundary**: if `jsonSchemaToFields()` encounters an unsupported or malformed schema type (deeply nested objects, unknown formats), render a fallback "This section uses an unsupported field type. Edit raw JSON." with a raw JSON editor instead of crashing the form
+
+**ConfigForm test plan** (critical — highest-risk component):
+- **Unit: `jsonSchemaToFields`** — test every supported type mapping (string→text, integer→number, boolean→toggle, enum→select, array→repeatable), test sensitive field masking, test unknown type → fallback descriptor with `type: "raw_json"`
+- **Unit: `jsonSchemaToFields`** — test malformed schema input (missing `properties`, nested `$ref`, `oneOf`/`anyOf`) → returns fallback descriptor, never throws
+- **Integration: `ConfigForm`** — render with a 2-section schema, verify field rendering, toggle a boolean, verify dirty state, submit and verify request payload
+- **Integration: 409 conflict** — after save, mock a 409 response, verify ConflictDialog appears with diff
+- **Integration: error boundary** — mount `ConfigForm` with a schema that has unsupported `type: "object"` nested 3 levels deep, verify raw JSON fallback renders instead of crash

 **`LogViewer`** — performance-critical:
 - Virtualized rendering (react-window or similar)
@@ -929,7 +937,7 @@ All API data uses TanStack Query with appropriate stale times:
 | Mods | 5min | — |
 | Bans | 2min | — |
 | Game types | 30min | — (rarely changes) |
-| Config schema | 30min | — (tied to adapter version) |
+| Config schema | 30min | — (tied to adapter version) **Invalidation**: when `GET /games/{type}` returns a different `schema_version` than previously cached, TanStack Query invalidates all config schema queries for that game type. This prevents stale forms after an adapter update. |

 **Optimistic updates** on:
 - Server start/stop → immediately update status in cache, rollback on error
@@ -1071,7 +1079,7 @@ const { hasMissions, hasMods, hasRemoteAdmin } = useCapability(serverId);
 interface FieldDescriptor {
  name: string;
  label: string;
-  type: "text" | "number" | "boolean" | "select" | "textarea" | "password" | "array";
+  type: "text" | "number" | "boolean" | "select" | "textarea" | "password" | "array" | "raw_json";
  default?: unknown;
  min?: number;
  max?: number;
@@ -1087,6 +1095,9 @@ function jsonSchemaToFields(
 ): FieldDescriptor[] {
  // Walk schema.properties, map each to a FieldDescriptor
  // Mark fields in sensitiveFields as type: "password"
+  // Unknown types (nested objects, oneOf/anyOf, $ref) → type: "raw_json"
+  //   The form renders a textarea with JSON editing for these fields
+  //   rather than crashing or hiding the field silently
 }
 ```

@@ -1105,6 +1116,31 @@ function jsonSchemaToFields(

 Future adapters register themselves; the card list auto-populates from `GET /games`.

+### Schema Cache Invalidation
+
+Adapter config schemas are cached with `staleTime: 30min`. When an adapter is updated and its `schema_version` changes, the frontend must not serve a stale schema. The invalidation pattern:
+
+```typescript
+// When fetching game type info, compare schema_version
+function useGameTypeWithInvalidation(gameType: string) {
+  const queryClient = useQueryClient();
+  return useQuery({
+    queryKey: ["gameType", gameType],
+    staleTime: 30 * 60 * 1000,
+    onSuccess: (data) => {
+      const prevVersion = localStorage.getItem(`schema_version_${gameType}`);
+      if (prevVersion && prevVersion !== data.schema_version) {
+        // Adapter was updated — invalidate all config schema queries
+        queryClient.invalidateQueries({ queryKey: ["configSchema", gameType] });
+        localStorage.setItem(`schema_version_${gameType}`, data.schema_version);
+      } else if (!prevVersion) {
+        localStorage.setItem(`schema_version_${gameType}`, data.schema_version);
+      }
+    },
+  });
+}
+```
+
 ---

 ## Error Handling UX
@@ -1121,6 +1157,7 @@ Future adapters register themselves; the card list auto-populates from `GET /gam
 | WS disconnect | StatusBar indicator + stale data with timestamp |
 | WS reconnect | Automatic; no user action needed |
 | Server crashed | Toast notification + status dot turns red |
+| Malformed adapter schema | Raw JSON fallback in ConfigForm section ("This section uses an unsupported field type. Edit raw JSON.") |

 **No `alert()` calls.** All feedback uses toast (transient) or inline (persistent) patterns.

--- a/IMPLEMENTATION_PLAN.md
+++ b/IMPLEMENTATION_PLAN.md
@@ -447,6 +447,7 @@ Verify React app can:
 ### Adapter contract tests
 - Template test suite that any new adapter should pass
 - Tests: ConfigGenerator produces valid sections and valid config files, ProcessConfig returns allowed executables, LogParser parses sample lines
+- ConfigGenerator migration test: `migrate_config(old_version, config_json)` returns valid migrated dict; `ConfigMigrationError` on invalid old_version

 ### Load notes
 - SQLite with WAL handles concurrent reads from 4 threads per server well
--- a/MODULES.md
+++ b/MODULES.md
@@ -281,10 +281,14 @@ class ProcessManager:
    _instance = None
    _processes: dict[int, subprocess.Popen] = {}
    _lock: threading.Lock
+    _operation_locks: dict[int, threading.Lock]  # per-server operation lock

    @classmethod
    def get() -> ProcessManager

+    def get_operation_lock(server_id) -> threading.Lock
+      # Returns a per-server lock that serializes start/stop/restart
+      # Prevents concurrent start+stop races for the same server
    def start(server_id, exe_path, args: list[str], cwd: str) -> int  # returns PID
    def stop(server_id, timeout=30) -> bool
    def kill(server_id) -> bool
@@ -579,6 +583,12 @@ class ConfigGenerator(Protocol):
    def get_config_version(self) -> str:
        """Current adapter schema version. Stored in game_configs.schema_version."""
        ...
+    def migrate_config(self, old_version: str, config_json: dict[str, dict]) -> dict[str, dict]:
+        """Transform config JSON from an older schema version to the current one.
+        Called by ConfigRepository when a section's stored schema_version
+        differs from get_config_version(). Returns the migrated config dict.
+        Raises ConfigMigrationError on failure (core rolls back; original stays)."""
+        ...
    def write_configs(self, server_id: int, server_dir: Path,
                      config_sections: dict[str, dict]) -> list[Path]: ...
    def build_launch_args(self, config_sections: dict[str, dict],
--- a/THREADING.md
+++ b/THREADING.md
@@ -137,8 +137,16 @@ This means:
 # Each background thread creates its own SQLAlchemy connection
 # from the same engine (WAL mode allows concurrent reads)
 # PRAGMA busy_timeout=5000 prevents "database is locked" errors
+#
+# If busy_timeout is exhausted (5s), the write fails with
+# OperationalError. Background threads retry with exponential
+# backoff: 1s, 2s, 4s — then log and skip the tick.
+# API request handlers retry up to 2 times with 1s backoff,
+# then return 503 "database temporarily unavailable".

 class BaseServerThread(threading.Thread):
+    _db_retry_delays = [1.0, 2.0, 4.0]  # seconds, exponential backoff
+
    def run(self):
        engine = get_engine()
        self._db = engine.connect()
@@ -147,6 +155,13 @@ class BaseServerThread(threading.Thread):
            while not self._stop_event.is_set():
                try:
                    self.tick()
+                except OperationalError as e:
+                    if "database is locked" in str(e):
+                        retried = self._retry_db_write(self.tick)
+                        if not retried:
+                            logger.warning(f"{self.name}: DB locked after all retries, skipping tick")
+                    else:
+                        self.on_error(e)
                except Exception as e:
                    self.on_error(e)
                self._stop_event.wait(self.interval)
@@ -155,6 +170,18 @@ class BaseServerThread(threading.Thread):
        finally:
            self.teardown()
            self._db.close()
+
+    def _retry_db_write(self, fn, max_retries=3):
+        for i, delay in enumerate(self._db_retry_delays[:max_retries]):
+            self._stop_event.wait(delay)
+            if self._stop_event.is_set():
+                return False
+            try:
+                fn()
+                return True
+            except OperationalError:
+                continue
+        return False
 ```

 ---