feat: implement full backend + frontend server detail, settings, and create server pages
Backend:
- Complete FastAPI backend with 42+ REST endpoints (auth, servers, config, players, bans, missions, mods, games, system)
- Game adapter architecture with Arma 3 as the first-class adapter
- WebSocket real-time events for status, metrics, logs, players
- Background thread system (process monitor, metrics, log tail, RCon poller)
- Fernet encryption for sensitive config fields at rest
- JWT auth with admin/viewer roles, bcrypt password hashing
- SQLite with WAL mode, parameterized queries, migration system
- APScheduler cleanup jobs for logs, metrics, events

Frontend:
- Server Detail page with 7 tabs (overview, config, players, bans, missions, mods, logs)
- Settings page with password change and admin user management
- Create Server wizard (4-step; known bug: silent validation failure)
- New hooks: useServerDetail, useAuth, useGames
- New components: ServerHeader, ConfigEditor, PlayerTable, BanTable, MissionList, ModList, LogViewer, PasswordChange, UserManager
- WebSocket onEvent callback for real-time log accumulation
- 120 unit tests passing (Vitest + React Testing Library)

Docs:
- Added .gitignore, CLAUDE.md, README.md
- Updated FRONTEND.md and ARCHITECTURE.md to reflect the current implementation state
- Added .env.example for backend configuration

Known issues:
- Create Server form: "Next" buttons don't validate before advancing, causing a silent submit failure when fields are invalid
- Config sub-tabs need a UX redesign for non-technical users

THREADING.md
@@ -1,782 +1,173 @@

# Languard Servers Manager — Threading & Concurrency Model

## Overview

Languard uses a hybrid concurrency model:

- **FastAPI (asyncio)** handles HTTP requests and WebSocket connections on the main event loop
- **Python `threading.Thread`** handles long-running background work per server
- **`queue.Queue`** bridges the thread world to the asyncio world for WebSocket broadcasting
- **SQLAlchemy sync sessions** with thread-local connections provide thread-safe database access

The key change for multi-game support: **core threads are game-agnostic** and receive game-specific behavior (log parsers, remote admin clients) via dependency injection from the adapter.

---

## Thread Map

For N running servers, the system runs up to 4N+1 background threads:

```
Main Process (FastAPI / asyncio event loop)
│
├── [uvicorn] HTTP/WS event loop (asyncio)
│     ├── REST request handlers (async def / plain def)
│     └── WebSocket handlers (async def)
│
├── BroadcastThread (daemon thread, 1 global)
│     └── Reads from broadcast_queue (thread-safe)
│         Calls asyncio.run_coroutine_threadsafe()
│         → ConnectionManager.broadcast()
│
└── Per-running-server thread group (started when server starts, stopped when server stops):
      ├── ProcessMonitorThread (1 per server, 1s interval) — CORE
      ├── LogTailThread (1 per server, 100ms interval) — CORE + adapter LogParser
      ├── MetricsCollectorThread (1 per server, 5s interval) — CORE
      └── RemoteAdminPollerThread (1 per server, 10s interval) — CORE + adapter RemoteAdmin
```

| Thread Type | Count | Purpose |
|---|---|---|
| `BroadcastThread` | 1 (global) | Bridges `queue.Queue` to asyncio WebSocket broadcasts |
| `ProcessMonitorThread` | 1 per server | Monitors the server process every second, detects crashes, triggers auto-restart |
| `LogTailThread` | 1 per server | Tails the game's log files, parses lines via the adapter's parser, persists to DB, broadcasts events |
| `MetricsCollectorThread` | 1 per server | Collects CPU/RAM metrics via psutil every 5 seconds |
| `RemoteAdminPollerThread` | 1 per server | Polls the player list via the adapter's remote admin client every 10 seconds, syncs join/leave events |

For **N running servers**, there are:

- `4*N` per-server threads + 1 BroadcastThread = `4N+1` background threads total
- If the adapter has no `remote_admin`, RemoteAdminPollerThread is skipped → `3N+1`

All server-specific threads are managed by `ThreadRegistry`, which creates and destroys thread bundles as servers start and stop.

---

## Adapter Injection into Threads

The `ThreadRegistry` resolves the adapter at thread creation time and injects game-specific components into the generic core threads:

```python
class ThreadRegistry:
    @classmethod
    def start_server_threads(cls, server_id: int, db: Connection) -> None:
        server = ServerRepository(db).get_by_id(server_id)
        adapter = GameAdapterRegistry.get(server["game_type"])

        threads: dict[str, BaseServerThread] = {}

        # Core threads — always present
        threads["process_monitor"] = ProcessMonitorThread(server_id)
        threads["metrics_collector"] = MetricsCollectorThread(server_id)

        # Core thread with the adapter's log parser injected
        log_parser = adapter.get_log_parser()
        threads["log_tail"] = LogTailThread(
            server_id,
            log_parser=log_parser,
            log_file_resolver=log_parser.get_log_file_resolver(server_id),
        )

        # Core thread with the adapter's remote admin injected (if supported)
        remote_admin = adapter.get_remote_admin()
        if remote_admin is not None:
            threads["remote_admin_poller"] = RemoteAdminPollerThread(
                server_id,
                remote_admin_factory=lambda: remote_admin.create_client(
                    host="127.0.0.1",
                    port=server["rcon_port"],
                    password=_get_remote_admin_password(server_id, db),
                ),
            )

        # Adapter-declared custom threads (for game-specific background work)
        for thread_factory in adapter.get_custom_thread_factories():
            thread = thread_factory(server_id, db)
            threads[thread.name_key] = thread

        with cls._lock:
            cls._threads[server_id] = threads

        for thread in threads.values():
            thread.start()
```
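
For orientation, here is a hypothetical minimal adapter showing the seam the registry relies on. The method names follow the registry code above; the `GameAdapter` base class, the `QuakeAdapter` example, and the `register()` call are illustrative assumptions, not the real plugin API.

```python
# A sketch only — QuakeAdapter, GameAdapter, and register() are assumed names.
class QuakeAdapter(GameAdapter):
    game_type = "quake"

    def get_log_parser(self) -> "LogParser":
        # Game-specific line format and log file discovery
        return QuakeLogParser()

    def get_remote_admin(self) -> "RemoteAdmin | None":
        # No remote admin → RemoteAdminPollerThread is skipped (3N+1 threads)
        return None

    def get_custom_thread_factories(self) -> list:
        # No game-specific background work
        return []

# Adapters are registered once at startup and are read-only afterwards,
# so threads can resolve them without locking.
GameAdapterRegistry.register(QuakeAdapter())
```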

---

## Thread Safety Rules

| Resource | Access Pattern | Protection |
|----------|---------------|------------|
| `ProcessManager._processes` | read/write from multiple threads | `threading.Lock` |
| `ThreadRegistry._threads` | read/write from main + shutdown | `threading.Lock` |
| `broadcast_queue` | multi-writer, single reader | `queue.Queue` (thread-safe built-in) |
| `ConnectionManager._connections` | async, single event loop | `asyncio.Lock` |
| SQLite connections | one connection per thread | Thread-local via `threading.local()` |
| Config files on disk | write on start, read-only during run | No lock needed (regenerated before start) |
| Adapter objects | read-only after registration | No lock needed (registered once at startup) |
| RemoteAdminClient calls | called from RemoteAdminPollerThread only | **Core wraps with per-server `threading.Lock`** (see below) |
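
The "one connection per thread" row is the pattern behind `get_thread_db()`, referenced throughout this document. A minimal sketch, assuming `get_engine()` returns the shared engine with the WAL and `busy_timeout` pragmas already applied:

```python
# database.py — thread-local connection helper (sketch)
import threading

_local = threading.local()

def get_thread_db():
    """Return the calling thread's connection, creating it on first use."""
    conn = getattr(_local, "conn", None)
    if conn is None or conn.closed:
        conn = get_engine().connect()  # each thread gets its own connection
        _local.conn = conn
    return conn
```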

### RemoteAdminClient Thread Safety

Adapters do NOT need to make their `RemoteAdminClient` implementations thread-safe. The core wraps every RemoteAdminClient call with a **per-server `threading.Lock`**, so only one call executes at a time against a given server's admin client.

```python
# In RemoteAdminPollerThread
class RemoteAdminPollerThread(BaseServerThread):
    def __init__(self, server_id: int,
                 remote_admin_factory: Callable[[], "RemoteAdminClient"]):
        super().__init__(server_id, self.interval)
        self._client_factory = remote_admin_factory
        self._client: RemoteAdminClient | None = None
        self._connected = False
        self._call_lock = threading.Lock()  # per-server lock

    def _call(self, method, *args, **kwargs):
        """All RemoteAdminClient calls go through this to serialize access."""
        with self._call_lock:
            return method(*args, **kwargs)

# In tick(), replace direct self._client.get_players() with:
#     players = self._call(self._client.get_players)
```

This means:

- Adapter authors write simple, non-thread-safe clients
- The core guarantees no concurrent calls to the same client
- Different servers' clients can run calls concurrently (different locks)

### SQLite Thread Safety

```python
# Each background thread creates its own SQLAlchemy connection
# from the same engine (WAL mode allows concurrent reads).
# PRAGMA busy_timeout=5000 prevents most "database is locked" errors.
#
# If busy_timeout is exhausted (5s), the write fails with
# OperationalError. Background threads retry with exponential
# backoff: 1s, 2s, 4s — then log and skip the tick.
# API request handlers retry up to 2 times with 1s backoff,
# then return 503 "database temporarily unavailable".
from sqlalchemy.exc import OperationalError

class BaseServerThread(threading.Thread):
    _db_retry_delays = [1.0, 2.0, 4.0]  # seconds, exponential backoff

    def run(self):
        engine = get_engine()
        self._db = engine.connect()
        try:
            self.setup()
            while not self._stop_event.is_set():
                try:
                    self.tick()
                except OperationalError as e:
                    if "database is locked" in str(e):
                        retried = self._retry_db_write(self.tick)
                        if not retried:
                            logger.warning(f"{self.name}: DB locked after all retries, skipping tick")
                    else:
                        self.on_error(e)
                except Exception as e:
                    self.on_error(e)
                self._stop_event.wait(self.interval)
        except Exception as e:
            logger.error(f"{self.name} setup error: {e}")
        finally:
            self.teardown()
            self._db.close()

    def _retry_db_write(self, fn, max_retries=3):
        for delay in self._db_retry_delays[:max_retries]:
            self._stop_event.wait(delay)
            if self._stop_event.is_set():
                return False
            try:
                fn()
                return True
            except OperationalError:
                continue
        return False
```

---

## BroadcastThread — Asyncio Bridge

This is the critical bridge between background threads and the asyncio WebSocket layer. **Game-agnostic.**

```
Background Thread                      Asyncio Event Loop
─────────────────                      ──────────────────
Any background thread                  uvicorn runs here
        │
        ▼
BroadcastThread.enqueue(               loop = asyncio.get_running_loop()
    server_id=1,                       (stored at app startup)
    msg_type='log',
    data={...}
)
        │
        ▼
broadcast_queue.put({                  asyncio.run_coroutine_threadsafe(
    'server_id': 1,                        connection_manager.broadcast(
    'type': 'log',                             server_id=1,
    'data': {...}                              message={type, data}
})                                         ),
        │                                  loop=self._loop
        ▼                              )
BroadcastThread.run()    ──────────────────►
    while True:
        msg = queue.get()
        fut = run_coroutine_threadsafe(
            broadcast_coro,
            self._loop
        )
        fut.result(timeout=5)
```

### Implementation Sketch
```python
# core/websocket/broadcaster.py
import asyncio
import concurrent.futures
import logging
import queue
import threading

logger = logging.getLogger(__name__)

_broadcast_queue: queue.Queue = queue.Queue(maxsize=10000)

class BroadcastThread(threading.Thread):
    daemon = True

    def __init__(self, loop: asyncio.AbstractEventLoop, manager):
        super().__init__(name="BroadcastThread")
        self._loop = loop
        self._manager = manager
        self._running = True

    def run(self):
        while self._running:
            try:
                msg = _broadcast_queue.get(timeout=1.0)
                server_id = msg['server_id']
                outgoing = {
                    'type': msg['type'],
                    'server_id': server_id,
                    'data': msg['data'],
                }
                future = asyncio.run_coroutine_threadsafe(
                    self._manager.broadcast(str(server_id), outgoing, channel=msg['type']),
                    self._loop
                )
                try:
                    future.result(timeout=5.0)
                except concurrent.futures.TimeoutError:
                    logger.warning(f"Broadcast timeout for server {server_id} msg type {msg['type']}")
            except queue.Empty:
                continue
            except Exception as e:
                logger.error(f"BroadcastThread error: {e}")

    def stop(self):
        self._running = False

    @staticmethod
    def enqueue(server_id: int, msg_type: str, data: dict):
        """Thread-safe. Called from any background thread."""
        try:
            _broadcast_queue.put_nowait({
                'server_id': server_id,
                'type': msg_type,
                'data': data,
            })
        except queue.Full:
            logger.warning(f"Broadcast queue full, dropping {msg_type} for server {server_id}")
```
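
Wiring at a glance: the event loop is captured once at app startup (as the diagram above notes) and handed to the single global `BroadcastThread`. The startup-hook name below is illustrative:

```python
import asyncio

broadcast_thread: BroadcastThread | None = None

async def on_app_startup(manager: "ConnectionManager"):
    """Illustrative startup hook: must run on the event loop to capture it."""
    global broadcast_thread
    loop = asyncio.get_running_loop()          # stored once, reused by the bridge
    broadcast_thread = BroadcastThread(loop, manager)
    broadcast_thread.start()

# From any background thread, fire-and-forget:
# BroadcastThread.enqueue(server_id=1, msg_type="log", data={"message": "..."})
```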

---

## ProcessMonitorThread — Crash Detection & Auto-Restart

**Game-agnostic.** This thread only checks OS-level process status and updates the core `servers` table.

```python
from datetime import datetime

class ProcessMonitorThread(BaseServerThread):
    interval = 1.0

    def tick(self):
        proc = ProcessManager.get().get_process(self.server_id)
        if proc is None:
            self.stop()
            return

        exit_code = proc.poll()
        if exit_code is not None:
            self._handle_process_exit(exit_code)
            self.stop()

    def _handle_process_exit(self, exit_code: int):
        is_crash = (exit_code != 0)
        status = 'crashed' if is_crash else 'stopped'

        server = ServerRepository(self._db).get_by_id(self.server_id)
        ServerRepository(self._db).update_status(
            self.server_id, status, pid=None,
            stopped_at=datetime.utcnow().isoformat()
        )
        PlayerRepository(self._db).clear(self.server_id)
        ServerEventRepository(self._db).insert(
            self.server_id, status,
            actor='system',
            detail={'exit_code': exit_code}
        )

        BroadcastThread.enqueue(self.server_id, 'status', {'status': status})
        BroadcastThread.enqueue(self.server_id, 'event', {
            'event_type': status,
            'detail': {'exit_code': exit_code}
        })

        # Stop the other threads for this server via a daemon cleanup thread
        # (avoids this thread joining itself)
        def _cleanup_and_maybe_restart():
            try:
                ThreadRegistry.get().stop_server_threads(self.server_id)
                if is_crash and server.get('auto_restart'):
                    self._schedule_auto_restart(server)
            except Exception as e:
                logger.error(f"Cleanup/restart failed for server {self.server_id}: {e}")
                BroadcastThread.enqueue(self.server_id, 'event', {
                    'event_type': 'auto_restart_failed',
                    'detail': {'error': str(e)}
                })

        threading.Thread(
            target=_cleanup_and_maybe_restart,
            daemon=True,
            name=f"StopCleanup-{self.server_id}"
        ).start()

    def _schedule_auto_restart(self, server: dict):
        # IMPORTANT: Runs in the daemon cleanup thread, NOT ProcessMonitorThread.
        # Must create its own DB connection.
        from database import get_thread_db
        db = get_thread_db()

        restart_count = server['restart_count']
        max_restarts = server['max_restarts']
        window = server['restart_window_seconds']
        last_restart = server.get('last_restart_at')

        if last_restart:
            last_dt = datetime.fromisoformat(last_restart)
            elapsed = (datetime.utcnow() - last_dt).total_seconds()
            if elapsed > window:
                ServerRepository(db).reset_restart_count(self.server_id)
                restart_count = 0

        if restart_count < max_restarts:
            delay = min(10 * (restart_count + 1), 60)  # linear backoff, capped at 60s
            logger.info(f"Auto-restarting server {self.server_id} in {delay}s (attempt {restart_count+1}/{max_restarts})")
            threading.Timer(delay, self._auto_restart).start()
        else:
            logger.warning(f"Server {self.server_id} exceeded max auto-restarts ({max_restarts})")
            BroadcastThread.enqueue(self.server_id, 'event', {
                'event_type': 'max_restarts_exceeded',
                'detail': {'restart_count': restart_count}
            })

    def _auto_restart(self):
        from core.servers.service import ServerService
        try:
            # ServerService.start() increments restart_count as part of the restart
            ServerService().start(self.server_id)
        except Exception as e:
            logger.error(f"Auto-restart failed for server {self.server_id}: {e}")
```

---

## LogTailThread — Generic File Tailing with Adapter Parser

**Core thread** that takes an adapter-provided `LogParser` for game-specific log line parsing and file discovery (e.g., the Arma 3 adapter's parser resolves and parses `.rpt` files).

```python
import time
from pathlib import Path
from typing import Callable, TextIO

class LogTailThread(BaseServerThread):
    interval = 0.1  # 100ms

    def __init__(self, server_id: int, log_parser: "LogParser",
                 log_file_resolver: Callable[[Path], Path | None]):
        super().__init__(server_id, self.interval)
        self._parser = log_parser
        self._log_file_resolver = log_file_resolver
        self._file: TextIO | None = None
        self._current_path: Path | None = None
        self._last_size: int = 0
        self._last_glob_time: float = 0.0

    def setup(self):
        self._open_latest_log()

    def _open_latest_log(self):
        """
        Uses the adapter-provided log_file_resolver to find the current log file.
        Opens it and seeks to the end (tail behavior).

        NOTE: Do NOT use os.stat().st_ino for rotation detection — on Windows/NTFS
        st_ino is always 0. Instead, track filename and file size.
        """
        server_dir = get_server_dir(self.server_id)
        log_path = self._log_file_resolver(server_dir)
        if log_path is None:
            return  # Server hasn't created the log yet; retry on the next tick

        try:
            self._file = open(log_path, 'r', encoding='utf-8', errors='replace')
            self._file.seek(0, 2)  # seek to end
            self._current_path = log_path
            self._last_size = self._file.tell()
        except OSError:
            self._file = None

    def tick(self):
        if self._file is None:
            self._open_latest_log()
            return

        # Rotation detection: only re-resolve the log path every 5 seconds
        now = time.monotonic()
        if now - self._last_glob_time > 5.0:
            self._last_glob_time = now
            server_dir = get_server_dir(self.server_id)
            log_path = self._log_file_resolver(server_dir)
            if log_path is not None and log_path != self._current_path:
                self._file.close()
                self._open_latest_log()
                return

        try:
            current_size = self._current_path.stat().st_size
        except OSError:
            return

        # Truncation (in-place rotation): the file shrank, reopen from the start
        if current_size < self._last_size:
            self._file.close()
            self._open_latest_log()
            return

        # Read new lines and parse them with the adapter's parser
        while True:
            line = self._file.readline()
            if not line:
                break
            self._last_size = self._file.tell()
            line = line.rstrip('\n')
            if not line:
                continue

            # Adapter parses the line — game-specific format
            entry = self._parser.parse_line(line)
            if entry:
                LogRepository(self._db).insert(self.server_id, entry)
                BroadcastThread.enqueue(self.server_id, 'log', entry)

    def teardown(self):
        if self._file is not None:
            try:
                self._file.close()
            except OSError:
                pass
            self._file = None
```
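
For reference, the `LogParser` contract this thread depends on, written here as a `typing.Protocol`. The entry fields (`timestamp`, `level`, `message`) follow the parsing description used elsewhere in this document; the exact dict shape is an assumption:

```python
from pathlib import Path
from typing import Callable, Protocol

class LogParser(Protocol):
    def parse_line(self, line: str) -> dict | None:
        """Parse one raw line into {'timestamp', 'level', 'message'},
        or return None for lines the game format says to ignore."""
        ...

    def get_log_file_resolver(self, server_id: int) -> Callable[[Path], Path | None]:
        """Return a callable mapping the server directory to the current log
        file path, or None if the server hasn't created a log yet."""
        ...
```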

---

## MetricsCollectorThread — Game-Agnostic Resource Monitoring

**Fully game-agnostic.** Uses psutil to monitor any process.

```python
import psutil

class MetricsCollectorThread(BaseServerThread):
    interval = 5.0

    def tick(self):
        pid = ProcessManager.get().get_pid(self.server_id)
        if pid is None:
            return

        try:
            proc = psutil.Process(pid)
            cpu = proc.cpu_percent(interval=0.5)  # blocks 500ms; acceptable at a 5s interval
            ram = proc.memory_info().rss / (1024 * 1024)  # MB
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            return

        player_count = PlayerRepository(self._db).count(self.server_id)

        MetricsRepository(self._db).insert(self.server_id, cpu, ram, player_count)
        BroadcastThread.enqueue(self.server_id, 'metrics', {
            'cpu_percent': cpu,
            'ram_mb': ram,
            'player_count': player_count,
        })
```

---

## RemoteAdminPollerThread — Generic Polling with Adapter Client

**Core thread** that takes an adapter-provided `RemoteAdmin` factory for game-specific admin protocol communication (for Arma 3, `Arma3RemoteAdmin` creating a BattlEye RCon `BERConClient`). Skipped entirely if the adapter has no `remote_admin` capability. `PlayerService.update_from_remote_admin()` diffs the polled list against the previous state: joins are upserted into `players` and `player_history`; leaves are removed from `players` and get `left_at` set in `player_history`.

```python
class RemoteAdminPollerThread(BaseServerThread):
    interval = 10.0
    STARTUP_DELAY = 30.0

    def __init__(self, server_id: int,
                 remote_admin_factory: Callable[[], "RemoteAdminClient"]):
        super().__init__(server_id, self.interval)
        self._client_factory = remote_admin_factory
        self._client: RemoteAdminClient | None = None
        self._connected = False
        self._reconnect_attempts = 0
        self._call_lock = threading.Lock()  # see "RemoteAdminClient Thread Safety" for _call()

    def setup(self):
        # Wait for the server to start up before attempting a connection.
        # Uses _stop_event.wait() instead of time.sleep() for immediate shutdown.
        startup_delay = self._get_startup_delay()
        if self._stop_event.wait(startup_delay):
            return  # stop was requested during the wait
        self._connect()

    def _get_startup_delay(self) -> float:
        # Default delay; an adapter may override via RemoteAdmin.get_startup_delay()
        return self.STARTUP_DELAY

    def _connect(self):
        try:
            self._client = self._client_factory()
            self._connected = True
        except Exception as e:
            logger.warning(f"Remote admin connection failed for server {self.server_id}: {e}")
            self._connected = False

    def tick(self):
        if not self._connected:
            self._reconnect_attempts += 1
            delay = min(10 * 2 ** self._reconnect_attempts, 120)  # exponential backoff
            if self._reconnect_attempts > 1:
                logger.info(f"Remote admin reconnect attempt {self._reconnect_attempts} for server {self.server_id}")
            if self._stop_event.wait(delay):
                return
            self._connect()
            if not self._connected:
                return
            self._reconnect_attempts = 0

        try:
            players = self._call(self._client.get_players)
            PlayerService(self._db).update_from_remote_admin(self.server_id, players)
            BroadcastThread.enqueue(self.server_id, 'players', {
                'players': list(players),
                'count': len(players),
            })
        except ConnectionError:
            self._connected = False
            logger.warning(f"Remote admin connection lost for server {self.server_id}")
        except RemoteAdminError as e:
            logger.error(f"Remote admin adapter error for server {self.server_id}: {e}")
            self._connected = False

    def teardown(self):
        if self._client is not None:
            try:
                self._client.disconnect()
            except Exception:
                pass
            self._client = None
```
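
And the client contract the poller calls through `_call()`, with only the methods this document actually uses. Implementations do not need to be thread-safe — the core serializes calls (see Thread Safety Rules):

```python
from typing import Protocol

class RemoteAdminClient(Protocol):
    def get_players(self) -> list[dict]:
        """Return the current player list; raise ConnectionError if the link dropped."""
        ...

    def disconnect(self) -> None:
        """Close the underlying socket; called from teardown()."""
        ...
```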

---

## Thread Lifecycle

### Start Server Flow
```
POST /servers/{id}/start
  │
  ├── ServerService.start()
  │     ├── adapter = GameAdapterRegistry.get(server.game_type)
  │     ├── check_server_ports_available(server_id)
  │     │     └── For ALL running servers, resolve each adapter,
  │     │         get port conventions, check the full derived port set
  │     │         (cross-game: Arma 3 game+steam query + other games' ports)
  │     ├── adapter.config_generator.write_configs()
  │     │     └── Atomic write: write to .tmp files first, then os.replace()
  │     │         On failure: .tmp files cleaned up, originals untouched
  │     ├── launch_args = adapter.config_generator.build_launch_args()
  │     ├── ProcessManager.start()  ← creates subprocess.Popen
  │     └── ThreadRegistry.start_server_threads(id, db)
  │           ├── ProcessMonitorThread(id)                    ← core, always
  │           ├── LogTailThread(id, adapter.log_parser)       ← core + adapter
  │           ├── MetricsCollectorThread(id)                  ← core, always
  │           └── RemoteAdminPollerThread(id, adapter.remote_admin)
  │                                                           ← core + adapter (if available)
  │
  └── BroadcastThread.enqueue(id, 'status', {status: 'starting'})

Error paths on start:
  ├── ConfigWriteError      → roll back .tmp files, return 500 to client
  ├── ConfigValidationError → return 422 with validation details
  ├── LaunchArgsError       → return 400 with invalid arg info
  ├── ExeNotAllowedError    → return 403 with executable name
  └── PortInUseError        → return 409 with conflicting port info
```
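
The atomic-write step above guarantees a crash mid-write never leaves a half-written config, because `os.replace()` is atomic on both POSIX and Windows when source and destination are on the same volume. A sketch with an illustrative helper name:

```python
import os
from pathlib import Path

def write_config_atomic(path: Path, content: str) -> None:
    """Write to a sibling .tmp file, then atomically swap it into place."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    try:
        tmp.write_text(content, encoding="utf-8")
        os.replace(tmp, path)            # atomic swap; original untouched on failure
    except OSError:
        tmp.unlink(missing_ok=True)      # clean up the .tmp file, then re-raise
        raise
```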

### Stop Server Flow
```
POST /servers/{id}/stop
  │
  ├── adapter.remote_admin.shutdown()  ← if the adapter has remote_admin
  ├── Wait up to 30s for process exit (ProcessManager.stop(timeout=30))
  ├── If still running: ProcessManager.kill()
  ├── ThreadRegistry.stop_server_threads(id)
  │     ├── ProcessMonitorThread.stop()
  │     ├── LogTailThread.stop()
  │     ├── MetricsCollectorThread.stop()
  │     ├── RemoteAdminPollerThread.stop()  ← if present
  │     └── Thread.join(timeout=5) for each
  │
  └── BroadcastThread.enqueue(id, 'status', {status: 'stopped'})
```

### App Shutdown Flow
```
FastAPI shutdown event
  │
  ├── ThreadRegistry.stop_all()  ← stop all threads for all servers
  ├── BroadcastThread.stop()
  ├── ConnectionManager.close_all()
  └── database engine dispose
```

---

## Stop Event Pattern

All background threads use a `threading.Event` for graceful shutdown:

- **Stop event**: `threading.Event`; `stop()` sets it, and `_stop_event.wait(interval)` doubles as an interruptible sleep
- **Thread-local DB**: each thread creates a fresh SQLAlchemy connection via `get_thread_db()`
- **Exception handling**: unhandled exceptions are routed to `on_error()`; "database is locked" errors retry with exponential backoff (see SQLite Thread Safety above)
- **Abstract `tick()` method**: subclasses implement one iteration of the loop; `run()` calls it repeatedly until the stop event is set

```python
class BaseServerThread(threading.Thread):
    def __init__(self, server_id: int, interval: float):
        super().__init__(name=f"{self.__class__.__name__}-{server_id}", daemon=True)
        self.server_id = server_id
        self.interval = interval
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def is_stopped(self) -> bool:
        return self._stop_event.is_set()

    def setup(self):
        """Override to acquire resources before the loop starts."""
        pass

    def teardown(self):
        """Override to release resources (close files, sockets) after the loop ends."""
        pass

    def run(self):
        try:
            self.setup()
        except Exception as e:
            logger.error(f"{self.name} setup error: {e}")
            return  # setup failed completely

        try:
            while not self._stop_event.is_set():
                try:
                    self.tick()
                except Exception as e:
                    self.on_error(e)
                self._stop_event.wait(self.interval)
        finally:
            self.teardown()

    def on_error(self, error: Exception):
        """Default error handler. Adapter exceptions are typed for specific handling."""
        if isinstance(error, RemoteAdminError):
            logger.error(f"{self.name} remote admin error: {error}")
            # RemoteAdminPollerThread overrides this to set _connected = False
        elif isinstance(error, ConfigWriteError):
            logger.critical(f"{self.name} config write error (atomic write failed): {error}")
        elif isinstance(error, ConfigValidationError):
            logger.error(f"{self.name} config validation error: {error}")
        else:
            logger.error(f"{self.name} unhandled error: {error}")
```

---

## WebSocket Connection Manager (asyncio)

**Game-agnostic.** Runs entirely on the main event loop. Clients connect to `/ws?token=JWT&server_id=N`; the JWT is validated on connection, and invalid tokens are closed with code 4001. Clients subscribe to a specific `server_id` or to `'all'` for all servers.

```python
# core/websocket/manager.py
import asyncio
from collections import defaultdict

from fastapi import WebSocket

class ConnectionManager:
    def __init__(self):
        self._connections: dict[str, set[WebSocket]] = defaultdict(set)
        self._channel_subs: dict[WebSocket, set[str]] = defaultdict(set)
        self._lock = asyncio.Lock()

    async def connect(self, ws: WebSocket, server_id: str):
        await ws.accept()
        async with self._lock:
            self._connections[server_id].add(ws)
            self._channel_subs[ws].add('status')

    async def disconnect(self, ws: WebSocket, server_id: str):
        async with self._lock:
            self._connections[server_id].discard(ws)
            self._connections['all'].discard(ws)
            self._channel_subs.pop(ws, None)

    async def subscribe(self, ws: WebSocket, channels: list[str]):
        async with self._lock:
            self._channel_subs[ws].update(channels)

    async def unsubscribe(self, ws: WebSocket, channels: list[str]):
        async with self._lock:
            self._channel_subs[ws].difference_update(channels)

    async def broadcast(self, server_id: str, message: dict, channel: str = None):
        async with self._lock:
            server_clients = self._connections.get(server_id, set())
            all_clients = self._connections.get('all', set())
            candidates = server_clients | all_clients

            if channel:
                targets = {ws for ws in candidates
                           if channel in self._channel_subs.get(ws, set())}
            else:
                targets = candidates

        # Send outside the lock so one slow client can't block the registry
        dead = []
        for ws in targets:
            try:
                await ws.send_json(message)
            except Exception:
                dead.append(ws)

        if dead:
            async with self._lock:
                for ws in dead:
                    for bucket in self._connections.values():
                        bucket.discard(ws)
                    self._channel_subs.pop(ws, None)
```
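
An endpoint sketch tying the manager to FastAPI. The `verify_jwt` helper and the subscribe message shape are assumptions; the query parameters and close code 4001 follow the connection rules above:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
manager = ConnectionManager()

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket, token: str, server_id: str = "all"):
    if not verify_jwt(token):            # hypothetical auth helper
        await ws.close(code=4001)        # invalid token → close with 4001
        return
    await manager.connect(ws, server_id)
    try:
        while True:
            msg = await ws.receive_json()
            if msg.get("action") == "subscribe":
                await manager.subscribe(ws, msg.get("channels", []))
            elif msg.get("action") == "unsubscribe":
                await manager.unsubscribe(ws, msg.get("channels", []))
    except WebSocketDisconnect:
        await manager.disconnect(ws, server_id)
```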

---

## ThreadRegistry

`ThreadRegistry` manages the thread lifecycle per server:

- **`start_server_threads(server_id, db)`** — Creates and starts all 4 thread types for a server
- **`stop_server_threads(server_id)`** — Sets stop events and joins all threads for a server
- **`reattach_server_threads(server_id, db)`** — Recovers threads for a server process that survived an app restart
- **`stop_all()`** — Stops all threads for all servers (called on shutdown)

Thread bundles are stored in a dict: `{server_id → ThreadBundle}`, where `ThreadBundle` is a dataclass holding all thread references, as sketched below.
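
A sketch of that dataclass; the field names mirror the four core thread types, and anything beyond that is an assumption:

```python
from dataclasses import dataclass

@dataclass
class ThreadBundle:
    process_monitor: ProcessMonitorThread
    log_tail: LogTailThread
    metrics_collector: MetricsCollectorThread
    remote_admin_poller: RemoteAdminPollerThread | None = None  # None if no remote admin

    def all_threads(self) -> list[BaseServerThread]:
        """The threads to stop and join, skipping the optional poller."""
        return [t for t in (self.process_monitor, self.log_tail,
                            self.metrics_collector, self.remote_admin_poller)
                if t is not None]
```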

## Memory & Performance Considerations

| Thread | Memory Impact | CPU Impact |
|--------|--------------|-----------|
| ProcessMonitorThread | Minimal (one `proc.poll()` check) | Negligible |
| LogTailThread | Buffer for unread log lines | Low (file I/O + adapter parsing) |
| MetricsCollectorThread | psutil process scan | Low-Medium |
| RemoteAdminPollerThread | Adapter client socket + buffer | Low (varies by adapter protocol) |
| BroadcastThread | Queue buffer (max 10000 entries) | Low |

### Recommendations

- Set all threads as `daemon=True` — they die automatically if the main process exits
- `broadcast_queue.maxsize=10000` — backpressure; drop on `queue.Full` (log a warning)
- `LogTailThread` buffers at most ~100 lines per tick before writing to the DB in a batch
- `MetricsCollectorThread` uses `psutil.Process.cpu_percent(interval=0.5)` — blocks 500ms, acceptable at a 5s interval
- For N=10 servers: 31-41 background threads — well within Python's thread limits
- Games without remote admin skip the RemoteAdminPollerThread entirely

## ProcessManager

`ProcessManager` is a singleton that manages server processes via `subprocess.Popen`:

- **`start_process(server_id, cmd, cwd, env)`** — Starts a new subprocess and stores the PID
- **`stop_process(server_id, timeout)`** — Sends a terminate signal, waits for exit, force-kills after the timeout
- **`kill_process(server_id)`** — Force-kills the process immediately
- **`recover_on_startup(db)`** — On startup, checks all stored PIDs against running processes via `psutil.pid_exists()`. If a process is still alive, marks the server as running; if not, marks it as stopped.
- Thread-safe via per-server `threading.Lock`s for start/stop operations
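
A sketch of `recover_on_startup()` under those rules. Repository method names reuse those from earlier sections where possible; `get_all_with_pid()` is a hypothetical query:

```python
import psutil

class ProcessManager:
    def recover_on_startup(self, db) -> None:
        """Reconcile stored PIDs with reality after an app restart."""
        for server in ServerRepository(db).get_all_with_pid():   # hypothetical query
            if psutil.pid_exists(server["pid"]):
                # Process survived the app restart — keep tracking it
                ServerRepository(db).update_status(server["id"], "running", pid=server["pid"])
            else:
                ServerRepository(db).update_status(server["id"], "stopped", pid=None)
```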

## Scheduled Jobs

APScheduler's `BackgroundScheduler` runs 3 cleanup cron jobs:

| Job | Schedule | Cleanup |
|---|---|---|
| Clean up old log entries | Daily at 03:00 | `DELETE FROM logs WHERE created_at < datetime('now', '-7 days')` |
| Clean up old metrics | Every 6 hours | `DELETE FROM metrics WHERE timestamp < datetime('now', '-1 day')` |
| Clean up old events | Weekly (Sunday 04:00) | `DELETE FROM server_events WHERE created_at < datetime('now', '-30 days')` |
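
A wiring sketch for these jobs. The SQL and cadences come straight from the table; the `_cleanup` helper and the use of a fresh connection per run are assumptions:

```python
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.cron import CronTrigger

scheduler = BackgroundScheduler()

def _cleanup(sql: str) -> None:
    with get_engine().begin() as conn:   # fresh connection per run, auto-commit
        conn.exec_driver_sql(sql)

scheduler.add_job(lambda: _cleanup("DELETE FROM logs WHERE created_at < datetime('now', '-7 days')"),
                  CronTrigger(hour=3, minute=0))
scheduler.add_job(lambda: _cleanup("DELETE FROM metrics WHERE timestamp < datetime('now', '-1 day')"),
                  CronTrigger(hour="*/6"))
scheduler.add_job(lambda: _cleanup("DELETE FROM server_events WHERE created_at < datetime('now', '-30 days')"),
                  CronTrigger(day_of_week="sun", hour=4))
scheduler.start()
```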

## Startup Sequence

1. Initialize the DB engine and run pending migrations
2. Register built-in adapters (Arma 3) and scan for third-party plugins
3. Create the `WebSocketManager` (asyncio-only)
4. Create the global `BroadcastThread` (queue → asyncio bridge)
5. Create the `ThreadRegistry` with the `ProcessManager` and adapter registry
6. Recover processes that survived a restart (PID validation via psutil)
7. Re-attach monitoring threads for running servers
8. Seed the default admin user if no users exist
9. Register and start the APScheduler cleanup jobs

## Shutdown Sequence

1. Stop all server threads via `ThreadRegistry.stop_all()`
2. Stop the `BroadcastThread` and join with a 5s timeout
3. Stop APScheduler