Add CI deployment to app LXC
This commit is contained in: parent 3d33a78f1f, commit 9049d367ea
6 changed files with 531 additions and 2 deletions

AGENTS.md (208 lines changed)
@@ -89,7 +89,7 @@ Notes:

- Forgejo webhooks should POST to `/api/forgejo/webhook`; when `FORGEJO_WEBHOOK_SECRET` is set, the backend validates Forgejo/Gitea-style HMAC headers.
- API clients can query with `Authorization: token ...` or `Authorization: Bearer ...`.
- `CALENDAR_FEED_URLS` is optional and accepts comma-separated `webcal://` or `https://` ICS feeds.
-- Do not commit `.env` or `.env.local`.
+- Do not commit `.env`, `.env.local`, or `.env.proxmox`.
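The HMAC validation mentioned above can be sketched as follows. This is a minimal illustration, not the repo's actual handler: the hex-encoded HMAC-SHA256 of the raw body in an `X-Gitea-Signature` header is the usual Gitea/Forgejo convention, and `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Check a Gitea/Forgejo-style webhook signature.

    Assumes the webhook sends a hex-encoded HMAC-SHA256 of the raw
    request body (Gitea-style X-Gitea-Signature header).
    """
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information
    return hmac.compare_digest(expected, signature_header)
```

The comparison must run against the raw request bytes, before any JSON parsing, or the digest will not match.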
## Main Start Command
@@ -120,6 +120,13 @@ export FORGEJO_API_TOKEN=...
./scripts/bootstrap_ci_clone_key.py
```

Bootstrap or rotate the Forgejo Actions LXC deploy credentials:

```bash
export FORGEJO_API_TOKEN=...
./scripts/bootstrap_lxc_deploy_key.py
```

Validate production environment before starting:

```bash
@@ -139,6 +146,205 @@ Non-container production start after building `frontend/dist`:
HOST=0.0.0.0 PORT=8000 ./scripts/run_prod.sh
```
## Current Proxmox Deployment

Current app host:

- Proxmox node: `proxmox`
- LXC VMID: `108`
- LXC hostname: `robotu-app`
- LXC IP: `192.168.1.220/24`
- LXC gateway: `192.168.1.2`
- LXC DNS: `192.168.1.2`
- SSH target: `root@192.168.1.220`
- App directory on LXC: `/opt/robot-u-site`
- Public runtime URL: `https://discourse.onl`
- Internal app URL: `http://192.168.1.220:8800`
- Compose service: `robot-u-site`
- Container port mapping: host `8800` to container `8000`
- Reverse proxy: LXC `102` routes `discourse.onl` to `192.168.1.220:8800`
The local `.env.proxmox` file contains Proxmox credentials and LXC settings. It is ignored by git and must not be printed, committed, or copied into the app container.

The deployed app uses `/opt/robot-u-site/.env` on the LXC. That file contains Forgejo OAuth settings, `AUTH_SECRET_KEY`, an optional `FORGEJO_TOKEN` for the server-side public content cache, calendar feeds, and the deployed `APP_BASE_URL`. Treat it as secret material and do not print values.

The current deployed OAuth redirect URI is:

```text
https://discourse.onl/api/auth/forgejo/callback
```

Forgejo OAuth sign-in from the public URL requires that exact callback URL to be allowed in the Forgejo OAuth app.
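As a sanity check when `APP_BASE_URL` changes, the expected redirect URI can be derived from the base URL. The callback path comes from the deployed URI above; the helper itself is only an illustrative sketch, not code from this repo:

```python
def forgejo_callback_url(app_base_url: str) -> str:
    """Build the OAuth redirect URI from APP_BASE_URL.

    Trailing slashes on the base URL are stripped so the result
    never contains a double slash in the path.
    """
    return app_base_url.rstrip("/") + "/api/auth/forgejo/callback"

# The value registered in the Forgejo OAuth app must match exactly:
assert forgejo_callback_url("https://discourse.onl") == \
    "https://discourse.onl/api/auth/forgejo/callback"
```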
Important deployment notes:

- The LXC was initially created with gateway/DNS `192.168.1.1`, but this network uses `192.168.1.2`. If package installs hang or outbound network fails, check `ip route` and `/etc/resolv.conf` first.
- The persistent Proxmox LXC config was updated so `net0` uses `gw=192.168.1.2` and the nameserver is `192.168.1.2`.
- Docker inside the unprivileged LXC requires the Proxmox features `nesting=1,keyctl=1`; those are set on the current container.
- Ubuntu package installs were made reliable by adding `/etc/apt/apt.conf.d/99force-ipv4` with `Acquire::ForceIPv4 "true";`.
- The current LXC has `512MiB` memory and `512MiB` swap. It runs the app, but large builds or future services may need more memory.
- `FORGEJO_TOKEN` is needed server-side if anonymous Forgejo API discovery returns no content. Without that token, `/api/prototype` can return zero courses/posts/discussions even though the app is healthy.
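A quick way to tell this "healthy but empty" state apart from a real outage is to inspect the `/api/prototype` payload. The `courses`/`posts`/`discussions` keys are assumed from the note above, and `looks_empty`/`check_prototype` are hypothetical helpers, not part of the repo:

```python
import json
from urllib.request import urlopen

DISCOVERY_KEYS = ("courses", "posts", "discussions")  # assumed response keys

def looks_empty(payload: dict) -> bool:
    """True when the app answered but discovery found no content,
    which usually means FORGEJO_TOKEN is missing or invalid."""
    return all(not payload.get(key) for key in DISCOVERY_KEYS)

def check_prototype(base_url: str = "http://192.168.1.220:8800") -> None:
    """Fetch /api/prototype and report per-key item counts."""
    with urlopen(base_url + "/api/prototype", timeout=10) as resp:
        payload = json.load(resp)
    for key in DISCOVERY_KEYS:
        print(f"{key}: {len(payload.get(key) or [])}")
    if looks_empty(payload):
        print("app is up but discovery is empty -- check FORGEJO_TOKEN")
```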
Useful checks:

```bash
ssh root@192.168.1.220 'cd /opt/robot-u-site && docker compose ps'
curl -fsS http://192.168.1.220:8800/health
curl -fsS https://discourse.onl/health
curl -fsS https://discourse.onl/api/prototype
```
Manual redeploy to the current LXC:

```bash
ssh root@192.168.1.220 'mkdir -p /opt/robot-u-site'
rsync -az --delete \
  --exclude='.git/' \
  --exclude='.venv/' \
  --exclude='__pycache__/' \
  --exclude='.pytest_cache/' \
  --exclude='.ruff_cache/' \
  --exclude='.env' \
  --exclude='.env.*' \
  --exclude='frontend/node_modules/' \
  --exclude='frontend/dist/' \
  --exclude='frontend/.vite/' \
  --exclude='examples/quadrature-encoder-course/' \
  ./ root@192.168.1.220:/opt/robot-u-site/
ssh root@192.168.1.220 'cd /opt/robot-u-site && ./scripts/check_deploy_config.py && docker compose up --build -d'
curl -fsS http://192.168.1.220:8800/health
```

Do not overwrite `/opt/robot-u-site/.env` during rsync. Update it deliberately when runtime config changes.
Current production env notes:

- `/opt/robot-u-site/.env` should use `APP_BASE_URL=https://discourse.onl`.
- `AUTH_COOKIE_SECURE=true` is required for the public HTTPS site.
- `CORS_ALLOW_ORIGINS=https://discourse.onl` is the current public origin.
- A pre-domain backup exists on the app LXC at `/opt/robot-u-site/.env.backup.20260415T101957Z`.
CI state:

- `.forgejo/workflows/ci.yml` runs on `docker`.
- The `check` job manually installs `CI_REPO_SSH_KEY`, clones `git@aksal.cloud:Robot-U/robot-u-site.git`, installs `uv` and Bun, then runs Python and frontend checks.
- The `deploy` job runs after `check` on `push` events, installs `DEPLOY_SSH_KEY`, clones the repo, rsyncs it to `root@192.168.1.220:/opt/robot-u-site/`, rebuilds Docker Compose, and checks `/health`.
- The repo has a read-only deploy key and a matching Forgejo Actions secret for the CI clone.
- The app LXC has a CI deploy public key in `root`'s `authorized_keys`, and the matching private key is stored in the Forgejo Actions secret `DEPLOY_SSH_KEY`.
- `scripts/bootstrap_lxc_deploy_key.py` recreates or rotates the LXC deploy key. It uses `FORGEJO_API_TOKEN`, appends the generated public key to the LXC user's `authorized_keys`, verifies SSH, and stores the generated private key in `DEPLOY_SSH_KEY`.
- The deploy rsync excludes `.env` and `.env.*`, so production runtime secrets and backups under `/opt/robot-u-site` are preserved.
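The secret-update step of the rotation flow above can be sketched against the Gitea-style API. Everything here is an assumption for illustration: the `PUT /api/v1/repos/{owner}/{repo}/actions/secrets/{name}` endpoint and its `{"data": ...}` body are the Gitea convention, not verified against this Forgejo instance, and the function names are hypothetical:

```python
import json
from urllib.request import Request, urlopen

def secret_url(base: str, owner: str, repo: str, name: str) -> str:
    """Gitea-style repo-level Actions secret endpoint (assumed)."""
    return f"{base}/api/v1/repos/{owner}/{repo}/actions/secrets/{name}"

def put_actions_secret(base: str, owner: str, repo: str,
                       name: str, value: str, token: str) -> int:
    """Create or update an Actions secret via PUT with a JSON body
    {"data": <secret value>} (assumed schema)."""
    req = Request(secret_url(base, owner, repo, name), method="PUT",
                  data=json.dumps({"data": value}).encode(),
                  headers={"Authorization": f"token {token}",
                           "Content-Type": "application/json"})
    with urlopen(req, timeout=15) as resp:
        return resp.status

# Example call shape (never print or log the key material):
# put_actions_secret("https://aksal.cloud", "Robot-U", "robot-u-site",
#                    "DEPLOY_SSH_KEY", private_key_pem, token)
```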
## Reverse Proxy LXC 102

The reverse proxy host is Proxmox LXC `102`:

- LXC hostname: `reverse-proxy`
- LXC IP: `192.168.1.203/24`
- Gateway: `192.168.1.2`
- Main jobs: nginx reverse proxy, LiteLLM proxy, and a custom Porkbun DDNS script
- nginx service: `nginx.service`
- LiteLLM service: `litellm.service`
- Porkbun service: `porkbun-ddns.service`
- Robot U public site: `discourse.onl`
- Robot U nginx config: `/etc/nginx/sites-available/discourse.onl`
- Robot U certificate: `/etc/letsencrypt/live/discourse.onl/`
- Robot U upstream: `http://192.168.1.220:8800`

Do not bundle unrelated maintenance. If asked to update LiteLLM, do not change nginx or Porkbun DNS config unless explicitly requested. As of the last LiteLLM update, `porkbun-ddns.service` was in a failed state and was intentionally left untouched.
The `discourse.onl` nginx site was created on April 15, 2026 following the existing `aksal.cloud` pattern:

```bash
nginx -t && systemctl reload nginx
certbot --nginx -d discourse.onl --redirect --non-interactive
```
Certbot issued a Let's Encrypt certificate expiring on July 14, 2026. Validate the route with:

```bash
curl -fsS https://discourse.onl/health
curl -fsS -o /tmp/discourse-home.html -w '%{http_code} %{content_type}\n' https://discourse.onl/
```

`curl -I https://discourse.onl/` returns `405` because the FastAPI app does not handle `HEAD`; use GET-based checks instead.
The `discourse.onl` Porkbun DDNS copy is intentionally separate from the existing `aksal.*` setup:

- Script directory: `/opt/porkbun-ddns-discourse-onl`
- Service user/group: `porkbun-discourse:porkbun-discourse`
- Service: `porkbun-ddns-discourse-onl.service`
- Timer: `porkbun-ddns-discourse-onl.timer`
- Managed records: `A discourse.onl` and `A *.discourse.onl`
- Managed IP as of setup: `64.30.74.112`
The `discourse.onl` copy of `updateDNS.sh` was patched locally to make the Porkbun curl calls use `--fail` and stronger retries, preventing transient 503 HTML bodies from being concatenated with JSON. A PR with the same fix was opened against the upstream Porkbun DDNS repo: `https://aksal.cloud/Amargius_Commons/porkbun_ddns_script/pulls/1`.

Direct SSH to `root@192.168.1.203`, `litellm@192.168.1.203`, or `root@192.168.1.200` may not work from this workspace. If SSH fails, use the Proxmox API credentials in the ignored `.env.proxmox` file to open a Proxmox node terminal and run `pct exec 102 -- ...`.
Proxmox API terminal access pattern:

1. Read `.env.proxmox`; never print credentials.
2. `POST /api2/json/access/ticket` with the Proxmox username/password.
3. `POST /api2/json/nodes/proxmox/termproxy` using the returned ticket and CSRF token.
4. Connect to `wss://<proxmox-host>:8006/api2/json/nodes/proxmox/vncwebsocket?port=<port>&vncticket=<ticket>`.
5. Send the binary login payload `root@pam:<term-ticket>\n`; expect `OK`.
6. Send shell commands through the xterm websocket protocol: command payloads are framed as `0:<byte-length>:<command>`, followed by `0:1:\n`.
7. Prefer adding a unique sentinel to each command so the runner can detect completion instead of treating websocket read timeouts as command failure.
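Steps 6 and 7 can be sketched as pure framing helpers. The `0:<byte-length>:<command>` framing and the trailing `0:1:\n` come from the list above; the function names and the sentinel format are illustrative only, not code from this repo:

```python
import uuid

def frame_command(command: str) -> bytes:
    r"""Frame one shell command for the Proxmox term websocket:
    '0:<byte-length>:<command>' followed by '0:1:\n' (per step 6)."""
    data = command.encode()
    return b"0:%d:%s" % (len(data), data) + b"0:1:\n"

def with_sentinel(command: str) -> tuple[str, str]:
    """Wrap a command with a unique sentinel (per step 7) so the
    output reader can scan for the marker instead of relying on
    websocket read timeouts to decide the command finished."""
    sentinel = f"__DONE_{uuid.uuid4().hex}__"
    return f"{command}; echo {sentinel}", sentinel
```

Note the byte length is computed from the encoded command, not the character count, so non-ASCII commands frame correctly.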
Useful discovery commands from the Proxmox node shell:

```bash
pct status 102
pct config 102
pct exec 102 -- bash -lc 'hostname; systemctl list-units --type=service --all --no-pager | grep -Ei "lite|llm|nginx|porkbun|dns"'
pct exec 102 -- bash -lc 'systemctl status litellm --no-pager; systemctl cat litellm --no-pager'
```
LiteLLM current layout:

- Service unit: `/etc/systemd/system/litellm.service`
- Service user/group: `litellm:litellm`
- Working directory: `/opt/litellm/`
- Virtualenv: `/opt/litellm/venv`
- Config file: `/opt/litellm/config.yaml`
- Service command: `/opt/litellm/venv/bin/litellm --config /opt/litellm/config.yaml --port 4000`
- Local liveliness check: `http://127.0.0.1:4000/health/liveliness`
- Local readiness check: `http://127.0.0.1:4000/health/readiness`
LiteLLM update checklist:

1. Inspect current state and versions.

```bash
pct exec 102 -- bash -lc '/opt/litellm/venv/bin/python -m pip show litellm; curl -fsS -m 5 http://127.0.0.1:4000/health/liveliness'
```

2. Back up the config and the installed package set.

```bash
pct exec 102 -- bash -lc 'set -euo pipefail; stamp=$(date -u +%Y%m%dT%H%M%SZ); mkdir -p /opt/litellm/backups; cp -a /opt/litellm/config.yaml /opt/litellm/backups/config.yaml.$stamp; /opt/litellm/venv/bin/python -m pip freeze > /opt/litellm/backups/pip-freeze.$stamp.txt; chown -R litellm:litellm /opt/litellm/backups'
```

3. Stop LiteLLM before upgrading. Container `102` has only `512MiB` RAM and tends to use swap; stopping the proxy keeps pip from competing with the running process for memory.

```bash
pct exec 102 -- bash -lc 'systemctl stop litellm; systemctl is-active litellm || true'
```

4. Upgrade pip and LiteLLM as the `litellm` user.

```bash
pct exec 102 -- bash -lc 'set -euo pipefail; runuser -u litellm -- /opt/litellm/venv/bin/python -m pip install --upgrade pip; runuser -u litellm -- /opt/litellm/venv/bin/python -m pip install --upgrade "litellm[proxy]"'
```

5. Restart and verify.

```bash
pct exec 102 -- bash -lc 'set -euo pipefail; systemctl start litellm; sleep 8; systemctl is-active litellm; /opt/litellm/venv/bin/python -m pip show litellm | sed -n "1,8p"; curl -fsS -m 10 http://127.0.0.1:4000/health/liveliness; echo; curl -fsS -m 10 http://127.0.0.1:4000/health/readiness; echo; /opt/litellm/venv/bin/python -m pip check; systemctl show litellm -p ActiveState -p SubState -p NRestarts -p MainPID -p ExecMainStatus --no-pager'
```

After the April 15, 2026 update, LiteLLM was upgraded from `1.81.15` to `1.83.7`, `/health/liveliness` returned `"I'm alive!"`, `/health/readiness` reported `db=connected`, and `pip check` reported no broken requirements. Startup logs may briefly print `Unable to connect to DB. DATABASE_URL found in environment, but prisma package not found.`; treat readiness and the Prisma process/import check as the source of truth before deciding whether that message is an actual failure.
## Development Commands

### Backend only