fix: address design review ACT NOW items (6 risk gaps)

- Add migrate_config() to ConfigGenerator protocol for schema version upgrades
- Add per-server operation lock to ProcessManager to prevent start/stop races
- Add busy_timeout retry/backoff strategy (exponential: 1s, 2s, 4s) for DB lock exhaustion
- Add ConfigForm testing strategy and error boundary for malformed schemas
- Add schema cache invalidation on adapter version change
- Add ConfigMigrationError to typed adapter exceptions
This commit is contained in:
Tran G. (Revernomad) Khoa
2026-04-16 17:29:19 +07:00
parent 624d7594e2
commit b17d199301
6 changed files with 94 additions and 4 deletions

View File

@@ -211,6 +211,7 @@ def start(self, server_id: int) -> dict:
### ProcessManager (Core)
- Singleton that owns all subprocess handles
- Thread-safe dict: `{server_id: subprocess.Popen}`
- **Per-server operation lock** (`_operation_locks: dict[int, threading.Lock]`) — serializes start/stop/restart for the same server. Prevents race conditions when two admins hit "start" simultaneously or when start+stop overlap. Each server gets its own lock; different servers operate independently.
- `start()` sets `cwd=servers/{server_id}/` so relative config paths resolve correctly
- On Windows: `terminate()` = `TerminateProcess` (hard kill, no SIGTERM) — graceful shutdown must go through adapter's RemoteAdmin
- Provides: `start()`, `stop()`, `restart()`, `is_running()`, `send_command()`
@@ -232,6 +233,7 @@ def start(self, server_id: int) -> dict:
- **Validation is delegated to adapter's Pydantic models** — core never inspects config content
- **Sensitive field encryption**: calls `adapter.get_config_generator().get_sensitive_fields(section)` to identify which JSON keys to encrypt/decrypt via Fernet
- **Optimistic locking**: each row includes `config_version` (integer). On PUT, client sends the version they read. If version mismatch, return 409 Conflict.
- **Schema migration**: on read, if `schema_version` differs from `adapter.get_config_version()`, core calls `adapter.migrate_config(old_version, config_json)`. On success, updates the row with migrated JSON and new `schema_version`. On `ConfigMigrationError`, keeps original config and logs a warning.
- Provides: `get_section()`, `get_all_sections()`, `upsert_section()`, `delete_sections()`
### Adapter Exceptions (Standard Error Types)
@@ -245,6 +247,7 @@ Adapters raise specific exception types so the core can handle errors precisely:
| `LaunchArgsError` | build_launch_args() fails (missing mod, bad path) | Set server status='error', return 400 |
| `RemoteAdminError` | Remote admin connection/command fails | Log warning, return 503 with detail |
| `ExeNotAllowedError` | Executable not in adapter's allowlist | Return 400 with allowed list |
| `ConfigMigrationError` | `migrate_config()` fails to transform old schema | Keep original config, log warning, server runs with old schema |
---
@@ -529,10 +532,11 @@ Steps:
| Sync vs async DB | **Sync SQLAlchemy only** | All DB access is synchronous; background threads are non-async; no aiosqlite dependency |
| WebSocket auth | JWT in query param on connect | Browser WS API doesn't support headers |
| Process ownership | **ProcessManager singleton** | Single source of truth; prevents duplicate launches |
| Server operation safety | **Per-server operation lock** | ProcessManager holds a lock per server_id that serializes start/stop/restart. Two concurrent start requests for the same server: the second waits for the first to complete, then sees the server is already running. Different servers are independent (no cross-server locking). |
| Config files | **Adapter regenerates on each start** | Always fresh from DB; no sync drift; adapter's structured builder prevents config injection |
| Config write failure | **Atomic write + rollback** | Adapter writes to temp files first, then atomic rename. On failure, temp files are cleaned up — original files remain untouched. Server start never proceeds with partial config. |
| Sensitive field encryption | **Adapter declares via get_sensitive_fields()** | ConfigGenerator protocol returns list of JSON keys per section that need Fernet encryption. Core's ConfigRepository handles encrypt/decrypt transparently. |
| Adapter schema versioning | **config_version in game_configs row** | Each config section row stores a version string. On adapter update, if version differs, adapter provides a migration function. |
| Adapter schema versioning | **config_version in game_configs row** | Each config section row stores a version string. On adapter update, if version differs, core calls `adapter.migrate_config(old_version, config_json)` which returns the migrated dict. On migration failure (ConfigMigrationError), core keeps the original config and logs a warning — the server can still run with the old schema. |
| Adapter error communication | **Typed adapter exceptions** | Adapters raise specific exception types (ConfigWriteError, ConfigValidationError, LaunchArgsError, RemoteAdminError). Core catches specifically and sets appropriate DB status + returns clear API errors. |
| Remote admin thread safety | **Core wraps with lock** | Core wraps RemoteAdminClient calls with a threading.Lock. Adapter clients don't need to be thread-safe. One lock per server — API requests and poller thread share safely. |
| Third-party adapter loading | **Setuptools entry_points** | Third-party adapters register via `languard.adapters` entry_point group. Core scans entry_points at startup and auto-registers. Built-in adapters registered on import. |

View File

@@ -6,6 +6,7 @@
- WAL mode enabled: `PRAGMA journal_mode=WAL` — allows concurrent reads during writes
- Foreign keys enabled: `PRAGMA foreign_keys=ON`
- Busy timeout: `PRAGMA busy_timeout=5000` — prevents "database is locked" errors under concurrent thread writes
- **Retry on exhaustion**: If busy_timeout is exceeded (5s), writes fail with `OperationalError("database is locked")`. Background threads retry with exponential backoff (1s, 2s, 4s), then skip the tick. API handlers retry up to 2 times with 1s backoff, then return 503. See THREADING.md for the full retry implementation.
---
@@ -537,7 +538,17 @@ The `config_version` column in `game_configs` prevents lost updates when two adm
4. If match: increment version, write new JSON, return 200
5. If no match (version changed): return **409 Conflict** with current config for client-side merge
## game_data JSON Schema
## Config Schema Migration
When the adapter is updated and `get_config_version()` returns a newer version than what's stored in `game_configs.schema_version`, the core automatically migrates:
1. On read, detect `stored_schema_version != adapter.get_config_version()`
2. Call `adapter.migrate_config(stored_schema_version, config_json)` → returns migrated dict
3. Update the row: `SET config_json = ?, schema_version = ? WHERE id = ?`
4. On `ConfigMigrationError`: keep original config, log warning, server runs with old schema
5. Migration is per-section — each section can have a different stored version
This ensures config data is always compatible with the current adapter without manual intervention.
The `game_data` columns on `players`, `missions`, `mods`, and `bans` are validated by adapter-provided Pydantic models. Each capability protocol optionally provides a schema:

View File

@@ -895,6 +895,14 @@ frontend/
- Optimistic locking (send `config_version` on save)
- 409 conflict resolution (show diff, allow override/merge)
- Validation errors from adapter (field-level, from Pydantic)
- **Error boundary**: if `jsonSchemaToFields()` encounters an unsupported or malformed schema type (deeply nested objects, unknown formats), render a fallback "This section uses an unsupported field type. Edit raw JSON." with a raw JSON editor instead of crashing the form
**ConfigForm test plan** (critical — highest-risk component):
- **Unit: `jsonSchemaToFields`** — test every supported type mapping (string→text, integer→number, boolean→toggle, enum→select, array→repeatable), test sensitive field masking, test unknown type → fallback descriptor with `type: "raw_json"`
- **Unit: `jsonSchemaToFields`** — test malformed schema input (missing `properties`, nested `$ref`, `oneOf`/`anyOf`) → returns fallback descriptor, never throws
- **Integration: `ConfigForm`** — render with a 2-section schema, verify field rendering, toggle a boolean, verify dirty state, submit and verify request payload
- **Integration: 409 conflict** — after save, mock a 409 response, verify ConflictDialog appears with diff
- **Integration: error boundary** — mount `ConfigForm` with a schema that has unsupported `type: "object"` nested 3 levels deep, verify raw JSON fallback renders instead of crash
**`LogViewer`** — performance-critical:
- Virtualized rendering (react-window or similar)
@@ -929,7 +937,7 @@ All API data uses TanStack Query with appropriate stale times:
| Mods | 5min | — |
| Bans | 2min | — |
| Game types | 30min | — (rarely changes) |
| Config schema | 30min | — (tied to adapter version) |
| Config schema | 30min | — (tied to adapter version) **Invalidation**: when `GET /games/{type}` returns a different `schema_version` than previously cached, TanStack Query invalidates all config schema queries for that game type. This prevents stale forms after an adapter update. |
**Optimistic updates** on:
- Server start/stop → immediately update status in cache, rollback on error
@@ -1071,7 +1079,7 @@ const { hasMissions, hasMods, hasRemoteAdmin } = useCapability(serverId);
interface FieldDescriptor {
name: string;
label: string;
type: "text" | "number" | "boolean" | "select" | "textarea" | "password" | "array";
type: "text" | "number" | "boolean" | "select" | "textarea" | "password" | "array" | "raw_json";
default?: unknown;
min?: number;
max?: number;
@@ -1087,6 +1095,9 @@ function jsonSchemaToFields(
): FieldDescriptor[] {
// Walk schema.properties, map each to a FieldDescriptor
// Mark fields in sensitiveFields as type: "password"
// Unknown types (nested objects, oneOf/anyOf, $ref) → type: "raw_json"
// The form renders a textarea with JSON editing for these fields
// rather than crashing or hiding the field silently
}
```
@@ -1105,6 +1116,31 @@ function jsonSchemaToFields(
Future adapters register themselves; the card list auto-populates from `GET /games`.
### Schema Cache Invalidation
Adapter config schemas are cached with `staleTime: 30min`. When an adapter is updated and its `schema_version` changes, the frontend must not serve a stale schema. The invalidation pattern:
```typescript
// When fetching game type info, compare schema_version
function useGameTypeWithInvalidation(gameType: string) {
const queryClient = useQueryClient();
return useQuery({
queryKey: ["gameType", gameType],
staleTime: 30 * 60 * 1000,
onSuccess: (data) => {
const prevVersion = localStorage.getItem(`schema_version_${gameType}`);
if (prevVersion && prevVersion !== data.schema_version) {
// Adapter was updated — invalidate all config schema queries
queryClient.invalidateQueries({ queryKey: ["configSchema", gameType] });
localStorage.setItem(`schema_version_${gameType}`, data.schema_version);
} else if (!prevVersion) {
localStorage.setItem(`schema_version_${gameType}`, data.schema_version);
}
},
});
}
```
---
## Error Handling UX
@@ -1121,6 +1157,7 @@ Future adapters register themselves; the card list auto-populates from `GET /gam
| WS disconnect | StatusBar indicator + stale data with timestamp |
| WS reconnect | Automatic; no user action needed |
| Server crashed | Toast notification + status dot turns red |
| Malformed adapter schema | Raw JSON fallback in ConfigForm section ("This section uses an unsupported field type. Edit raw JSON.") |
**No `alert()` calls.** All feedback uses toast (transient) or inline (persistent) patterns.

View File

@@ -447,6 +447,7 @@ Verify React app can:
### Adapter contract tests
- Template test suite that any new adapter should pass
- Tests: ConfigGenerator produces valid sections and valid config files, ProcessConfig returns allowed executables, LogParser parses sample lines
- ConfigGenerator migration test: `migrate_config(old_version, config_json)` returns valid migrated dict; `ConfigMigrationError` on invalid old_version
### Load notes
- SQLite with WAL handles concurrent reads from 4 threads per server well

View File

@@ -281,10 +281,14 @@ class ProcessManager:
_instance = None
_processes: dict[int, subprocess.Popen] = {}
_lock: threading.Lock
_operation_locks: dict[int, threading.Lock] # per-server operation lock
@classmethod
def get() -> ProcessManager
def get_operation_lock(server_id) -> threading.Lock
# Returns a per-server lock that serializes start/stop/restart
# Prevents concurrent start+stop races for the same server
def start(server_id, exe_path, args: list[str], cwd: str) -> int # returns PID
def stop(server_id, timeout=30) -> bool
def kill(server_id) -> bool
@@ -579,6 +583,12 @@ class ConfigGenerator(Protocol):
def get_config_version(self) -> str:
"""Current adapter schema version. Stored in game_configs.schema_version."""
...
def migrate_config(self, old_version: str, config_json: dict[str, dict]) -> dict[str, dict]:
"""Transform config JSON from an older schema version to the current one.
Called by ConfigRepository when a section's stored schema_version
differs from get_config_version(). Returns the migrated config dict.
Raises ConfigMigrationError on failure (core rolls back; original stays)."""
...
def write_configs(self, server_id: int, server_dir: Path,
config_sections: dict[str, dict]) -> list[Path]: ...
def build_launch_args(self, config_sections: dict[str, dict],

View File

@@ -137,8 +137,16 @@ This means:
# Each background thread creates its own SQLAlchemy connection
# from the same engine (WAL mode allows concurrent reads)
# PRAGMA busy_timeout=5000 prevents "database is locked" errors
#
# If busy_timeout is exhausted (5s), the write fails with
# OperationalError. Background threads retry with exponential
# backoff: 1s, 2s, 4s — then log and skip the tick.
# API request handlers retry up to 2 times with 1s backoff,
# then return 503 "database temporarily unavailable".
class BaseServerThread(threading.Thread):
_db_retry_delays = [1.0, 2.0, 4.0] # seconds, exponential backoff
def run(self):
engine = get_engine()
self._db = engine.connect()
@@ -147,6 +155,13 @@ class BaseServerThread(threading.Thread):
while not self._stop_event.is_set():
try:
self.tick()
except OperationalError as e:
if "database is locked" in str(e):
retried = self._retry_db_write(self.tick)
if not retried:
logger.warning(f"{self.name}: DB locked after all retries, skipping tick")
else:
self.on_error(e)
except Exception as e:
self.on_error(e)
self._stop_event.wait(self.interval)
@@ -155,6 +170,18 @@ class BaseServerThread(threading.Thread):
finally:
self.teardown()
self._db.close()
def _retry_db_write(self, fn, max_retries=3):
for i, delay in enumerate(self._db_retry_delays[:max_retries]):
self._stop_event.wait(delay)
if self._stop_event.is_set():
return False
try:
fn()
return True
except OperationalError:
continue
return False
```
---