arbitrage-engine/docs/ai/04-data-model.md
fanziqi 22787b3e0a docs: add AI documentation suite and comprehensive code review report
- Generate full AI-consumable docs (docs/ai/): system overview, architecture,
  module cheatsheet, API contracts, data model, build guide, decision log,
  glossary, and open questions (deep tier coverage)
- Add PROBLEM_REPORT.md: categorized bug/risk summary
- Add DETAILED_CODE_REVIEW.md: full line-by-line review of all 15 backend
  files, documenting 4 fatal issues, 5 critical deployment bugs, 4 security
  vulnerabilities, and 6 architecture defects with prioritized fix plan
2026-03-03 19:01:18 +08:00

302 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
generated_by: repo-insight
version: 1
created: 2026-03-03
last_updated: 2026-03-03
source_commit: 0d9dffa
coverage: standard
---
# 04 — Data Model
## Purpose
Documents all PostgreSQL tables, columns, relations, constraints, storage design, and partitioning strategy.
## TL;DR
- Single PostgreSQL database `arb_engine`; 15+ tables defined in `db.py` `SCHEMA_SQL` + `auth.py` `AUTH_SCHEMA`.
- `agg_trades` is a range-partitioned table (by `time_ms` in milliseconds); monthly partitions auto-created by `ensure_partitions()`.
- Dual-write: local PG is primary; Cloud SQL at `10.106.0.3` receives same writes via a secondary psycopg2 pool (non-fatal if down).
- All timestamps: `ts` columns are Unix seconds (integer); `time_ms` columns are Unix milliseconds (bigint); `created_at` columns are PG `TIMESTAMP`.
- JSONB used for `score_factors` in `paper_trades`/`live_trades`, `detail` in `live_events`, `value` in `market_indicators`.
- Auth tokens stored in DB: refresh tokens in `refresh_tokens` table (revocable); no session table.
## Canonical Facts
### Tables
#### `rate_snapshots` — Funding Rate Snapshots
Populated every 2 s by `background_snapshot_loop()` in `main.py`.
| Column | Type | Description |
|--------|------|-------------|
| `id` | BIGSERIAL PK | |
| `ts` | BIGINT NOT NULL | Unix seconds |
| `btc_rate` | DOUBLE PRECISION | BTC funding rate (decimal) |
| `eth_rate` | DOUBLE PRECISION | ETH funding rate |
| `btc_price` | DOUBLE PRECISION | BTC mark price USD |
| `eth_price` | DOUBLE PRECISION | ETH mark price USD |
| `btc_index_price` | DOUBLE PRECISION | BTC index price |
| `eth_index_price` | DOUBLE PRECISION | ETH index price |
Index: `idx_rate_snapshots_ts` on `ts`.
---
#### `agg_trades` — Aggregate Trades (Partitioned)
Partitioned by `RANGE(time_ms)`; monthly child tables named `agg_trades_YYYYMM`.
| Column | Type | Description |
|--------|------|-------------|
| `agg_id` | BIGINT NOT NULL | Binance aggTrade ID |
| `symbol` | TEXT NOT NULL | e.g. `BTCUSDT` |
| `price` | DOUBLE PRECISION | Trade price |
| `qty` | DOUBLE PRECISION | Trade quantity (BTC/ETH/etc.) |
| `time_ms` | BIGINT NOT NULL | Trade timestamp ms |
| `is_buyer_maker` | SMALLINT | 0=taker buy, 1=taker sell |
PK: `(time_ms, symbol, agg_id)`.
Indexes: `idx_agg_trades_sym_time` on `(symbol, time_ms DESC)`, `idx_agg_trades_sym_agg` on `(symbol, agg_id)`.
Partitions auto-created for current + next 2 months. Named `agg_trades_YYYYMM`.
---
#### `agg_trades_meta` — Collection State
| Column | Type | Description |
|--------|------|-------------|
| `symbol` | TEXT PK | e.g. `BTCUSDT` |
| `last_agg_id` | BIGINT | Last processed aggTrade ID |
| `last_time_ms` | BIGINT | Timestamp of last trade |
| `earliest_agg_id` | BIGINT | Oldest buffered ID |
| `earliest_time_ms` | BIGINT | Oldest buffered timestamp |
| `updated_at` | TEXT | Human-readable update time |
---
#### `signal_indicators` — Signal Engine Output (15 s cadence)
| Column | Type | Description |
|--------|------|-------------|
| `id` | BIGSERIAL PK | |
| `ts` | BIGINT | Unix seconds |
| `symbol` | TEXT | |
| `cvd_fast` | DOUBLE PRECISION | CVD 30 m window |
| `cvd_mid` | DOUBLE PRECISION | CVD 4 h window |
| `cvd_day` | DOUBLE PRECISION | CVD UTC day |
| `cvd_fast_slope` | DOUBLE PRECISION | CVD momentum |
| `atr_5m` | DOUBLE PRECISION | ATR (5 m candles, 14 periods) |
| `atr_percentile` | DOUBLE PRECISION | ATR rank in 24 h history |
| `vwap_30m` | DOUBLE PRECISION | VWAP 30 m |
| `price` | DOUBLE PRECISION | Current mark price |
| `p95_qty` | DOUBLE PRECISION | P95 large-order threshold |
| `p99_qty` | DOUBLE PRECISION | P99 large-order threshold |
| `buy_vol_1m` | DOUBLE PRECISION | 1 m buy volume |
| `sell_vol_1m` | DOUBLE PRECISION | 1 m sell volume |
| `score` | INTEGER | Signal score 0100 |
| `signal` | TEXT | `LONG`, `SHORT`, or null |
Indexes: `idx_si_ts`, `idx_si_sym_ts`.
---
#### `signal_indicators_1m` — 1-Minute Signal Snapshot
Subset of `signal_indicators` columns; written at 1 m cadence for lightweight chart queries.
---
#### `signal_trades` — Signal Engine Trade Tracking
| Column | Type | Description |
|--------|------|-------------|
| `id` | BIGSERIAL PK | |
| `ts_open` | BIGINT | Open timestamp (Unix s) |
| `ts_close` | BIGINT | Close timestamp |
| `symbol` | TEXT | |
| `direction` | TEXT | `LONG` / `SHORT` |
| `entry_price` | DOUBLE PRECISION | |
| `exit_price` | DOUBLE PRECISION | |
| `qty` | DOUBLE PRECISION | |
| `score` | INTEGER | Signal score at entry |
| `pnl` | DOUBLE PRECISION | Realized PnL |
| `sl_price` | DOUBLE PRECISION | Stop-loss level |
| `tp1_price` | DOUBLE PRECISION | Take-profit 1 level |
| `tp2_price` | DOUBLE PRECISION | Take-profit 2 level |
| `status` | TEXT DEFAULT `open` | `open`, `closed`, `stopped` |
---
#### `paper_trades` — Paper Trading Records
| Column | Type | Description |
|--------|------|-------------|
| `id` | BIGSERIAL PK | |
| `symbol` | TEXT | |
| `direction` | TEXT | `LONG`/`SHORT` |
| `score` | INT | Signal score |
| `tier` | TEXT | `light`/`standard`/`heavy` |
| `entry_price` | DOUBLE PRECISION | |
| `entry_ts` | BIGINT | Unix ms |
| `exit_price` | DOUBLE PRECISION | |
| `exit_ts` | BIGINT | |
| `tp1_price` | DOUBLE PRECISION | |
| `tp2_price` | DOUBLE PRECISION | |
| `sl_price` | DOUBLE PRECISION | |
| `tp1_hit` | BOOLEAN DEFAULT FALSE | |
| `status` | TEXT DEFAULT `active` | `active`, `tp1`, `tp2`, `sl`, `timeout` |
| `pnl_r` | DOUBLE PRECISION | PnL in R units |
| `atr_at_entry` | DOUBLE PRECISION | ATR snapshot at entry |
| `score_factors` | JSONB | Breakdown of signal score components |
| `strategy` | VARCHAR(32) DEFAULT `v51_baseline` | Strategy name |
| `created_at` | TIMESTAMP | |
---
#### `live_trades` — Live Trading Records
| Column | Type | Description |
|--------|------|-------------|
| `id` | BIGSERIAL PK | |
| `symbol` | TEXT | |
| `strategy` | TEXT | |
| `direction` | TEXT | `LONG`/`SHORT` |
| `status` | TEXT DEFAULT `active` | |
| `entry_price` / `exit_price` | DOUBLE PRECISION | |
| `entry_ts` / `exit_ts` | BIGINT | Unix ms |
| `sl_price`, `tp1_price`, `tp2_price` | DOUBLE PRECISION | |
| `tp1_hit` | BOOLEAN | |
| `score` | DOUBLE PRECISION | |
| `tier` | TEXT | |
| `pnl_r` | DOUBLE PRECISION | |
| `fee_usdt` | DOUBLE PRECISION | Exchange fees |
| `funding_fee_usdt` | DOUBLE PRECISION | Funding fees paid while holding |
| `risk_distance` | DOUBLE PRECISION | Entry to SL distance |
| `atr_at_entry` | DOUBLE PRECISION | |
| `score_factors` | JSONB | |
| `signal_id` | BIGINT | FK → signal_indicators.id |
| `binance_order_id` | TEXT | Binance order ID |
| `fill_price` | DOUBLE PRECISION | Actual fill price |
| `slippage_bps` | DOUBLE PRECISION | Slippage in basis points |
| `protection_gap_ms` | BIGINT | Time between SL order and fill |
| `signal_to_order_ms` | BIGINT | Latency: signal → order placed |
| `order_to_fill_ms` | BIGINT | Latency: order → fill |
| `qty` | DOUBLE PRECISION | |
| `created_at` | TIMESTAMP | |
---
#### `live_config` — Runtime Configuration KV Store
| Column | Type | Description |
|--------|------|-------------|
| `key` | TEXT PK | Config key |
| `value` | TEXT | Config value (string) |
| `label` | TEXT | Human label |
| `updated_at` | TIMESTAMP | |
Known keys: `risk_per_trade_usd`, `max_positions`, `circuit_break` (inferred).
---
#### `live_events` — Trade Event Log
| Column | Type | Description |
|--------|------|-------------|
| `id` | BIGSERIAL PK | |
| `ts` | BIGINT | Unix ms (default: NOW()) |
| `level` | TEXT | `info`/`warning`/`error` |
| `category` | TEXT | Event category |
| `symbol` | TEXT | |
| `message` | TEXT | |
| `detail` | JSONB | Structured event data |
---
#### `signal_logs` — Legacy Signal Log
Kept for backwards compatibility with the original funding-rate signal system.
| Column | Type |
|--------|------|
| `id` | BIGSERIAL PK |
| `symbol` | TEXT |
| `rate` | DOUBLE PRECISION |
| `annualized` | DOUBLE PRECISION |
| `sent_at` | TEXT |
| `message` | TEXT |
---
#### Auth Tables (defined in `auth.py` AUTH_SCHEMA)
**`users`**
| Column | Type |
|--------|------|
| `id` | BIGSERIAL PK |
| `email` | TEXT UNIQUE NOT NULL |
| `password_hash` | TEXT NOT NULL |
| `discord_id` | TEXT |
| `role` | TEXT DEFAULT `user` |
| `banned` | INTEGER DEFAULT 0 |
| `created_at` | TEXT |
**`subscriptions`**
| Column | Type |
|--------|------|
| `user_id` | BIGINT PK → users |
| `tier` | TEXT DEFAULT `free` |
| `expires_at` | TEXT |
**`invite_codes`**
| Column | Type |
|--------|------|
| `id` | BIGSERIAL PK |
| `code` | TEXT UNIQUE |
| `created_by` | INTEGER |
| `max_uses` | INTEGER DEFAULT 1 |
| `used_count` | INTEGER DEFAULT 0 |
| `status` | TEXT DEFAULT `active` |
| `expires_at` | TEXT |
**`invite_usage`**
| Column | Type |
|--------|------|
| `id` | BIGSERIAL PK |
| `code_id` | BIGINT → invite_codes |
| `user_id` | BIGINT → users |
| `used_at` | TEXT |
**`refresh_tokens`**
| Column | Type |
|--------|------|
| `id` | BIGSERIAL PK |
| `user_id` | BIGINT → users |
| `token` | TEXT UNIQUE |
| `expires_at` | TEXT |
| `revoked` | INTEGER DEFAULT 0 |
---
#### `market_indicators` — Market Indicator JSONB Store
Populated by `market_data_collector.py`.
| Column | Type | Description |
|--------|------|-------------|
| `symbol` | TEXT | |
| `indicator_type` | TEXT | `long_short_ratio`, `top_trader_position`, `open_interest_hist`, `coinbase_premium`, `funding_rate` |
| `timestamp_ms` | BIGINT | |
| `value` | JSONB | Raw indicator payload |
Query pattern: `WHERE symbol=? AND indicator_type=? ORDER BY timestamp_ms DESC LIMIT 1`.
### Storage Design Decisions
- **Partitioning**: `agg_trades` partitioned by month to avoid table bloat; partition maintenance is automated.
- **Dual-write**: Cloud SQL secondary is best-effort (errors logged, never fatal).
- **JSONB `score_factors`**: allows schema-free storage of per-strategy signal breakdowns without migrations.
- **Timestamps**: mix of Unix seconds (`ts`), Unix ms (`time_ms`, `timestamp_ms`, `entry_ts`), ISO strings (`created_at` TEXT in auth tables), and PG `TIMESTAMP`; be careful when querying across tables.
## Interfaces / Dependencies
- `db.py:init_schema()` — creates all tables in `SCHEMA_SQL`
- `auth.py:ensure_tables()` — creates auth tables from `AUTH_SCHEMA`
- `db.py:ensure_partitions()` — auto-creates monthly `agg_trades_YYYYMM` partitions
## Unknowns & Risks
- [unknown] `market_indicators` table schema not in `SCHEMA_SQL`; likely created by `market_data_collector.py` separately — verify before querying.
- [risk] Timestamp inconsistency: some tables use TEXT for timestamps (auth tables), others use BIGINT, others use PG TIMESTAMP — cross-table JOINs on time fields require explicit casting.
- [inference] `live_config` circuit-break key name not confirmed from source; inferred from `risk_guard.py` behavior.
- [risk] `users` table defined in both `SCHEMA_SQL` (db.py) and `AUTH_SCHEMA` (auth.py); duplicate CREATE TABLE IF NOT EXISTS; actual schema diverges between the two definitions (db.py version lacks `discord_id`, `banned`).
## Source Refs
- `backend/db.py:166-356``SCHEMA_SQL` with all table definitions
- `backend/auth.py:28-71``AUTH_SCHEMA` auth tables
- `backend/db.py:360-414``ensure_partitions()`, `init_schema()`
- `backend/signal_engine.py:123-158``market_indicators` query pattern