7 Deployment View

Docker Compose Stack (Development + Production)

All builds run via Docker containers - no local Go, Elixir, or buf installation required.

Port map:

| Service | External Port | Purpose |
|---|---|---|
| Go Gateway | 8008 | Matrix Client-Server API + Admin UI + Admin API |
| Go Gateway | 8080 (dev) / 8443 (TLS) | Health, Readiness, Prometheus metrics |
| Go Media Gateway | 8009 | Matrix Media API (upload + download) |
| Elixir Core | 4000 | Health endpoint (internal Docker network only) |
| Elixir Core | 9000 | gRPC CoreService (internal Docker network only) |
| PostgreSQL | 5432 | Database (internal Docker network only) |
| Dex (dev) | 5556 | OIDC provider (development only) |
| MinIO S3 API | 9000 | Object storage API (S3-compatible) - local dev only |
| MinIO Console | 9001 | Object storage management UI - local dev only |

Network Boundaries

External boundary (exposed to clients):
  Internet → [TLS 1.3] → Go Gateway     (Port 8008)
  Internet → [TLS 1.3] → Go Media GW    (Port 8008 or separate)

Internal boundary (Docker network, not exposed):
  Go Gateway  → [gRPC]          → Elixir Core  (Port 9000)
  Go Media GW → [gRPC]          → Elixir Core  (Port 9000)
  Go Gateway  → [HTTP PSK]      → /internal/nodes/* (Node Registry)
  All services → [TLS]          → PostgreSQL   (Port 5432)

Makefile Targets

Database Migration Strategy (Story 13-14)

Squashed Seed Migration

The gateway ships a single seed migration (gateway/migrations/000001_init.up.sql) that represents the complete current schema in its final state. The previous 49 incremental migration files (000002–000049) have been deleted.

Rationale: Incremental migrations accumulated role-specific ownership transfers (ALTER TABLE … OWNER TO nebu_migrate) and backward-compatibility grants (GRANT EXECUTE … TO nebu) that required the nebu_migrate and nebu PostgreSQL roles to exist before migrations ran. On managed database services (Stackit PostgresFlex, AWS Aurora Serverless v2) where the gateway cannot run CREATE ROLE DDL, this caused fresh deployments to fail at migration 24.

Single-role model: The squashed migration is designed to run as nebu_app (the runtime DB user). All objects are created by nebu_app; no ownership transfers or legacy grants are needed.

Key characteristics of 000001_init.up.sql:

  • Contains all 26+ tables, indexes, triggers, functions, and RLS policies in final state
  • No references to nebu_migrate in ownership transfers or function grants
  • No GRANT EXECUTE … TO nebu (legacy backward-compat grants removed)
  • Uses CREATE INDEX IF NOT EXISTS instead of CREATE INDEX CONCURRENTLY - golang-migrate wraps each file in a transaction, and CONCURRENTLY cannot run inside a transaction block
  • NEBU_DB_URL_MIGRATE is still accepted for backward compatibility but is no longer required; NEBU_DB_URL (pointing to nebu_app) is used for both migrations and runtime on fresh deployments
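
The fallback between the two connection strings can be sketched in Go as follows. This is a minimal illustration, not the gateway's actual code: the function name migrationDSN and the precedence (the legacy variable wins when both are set) are assumptions; only the two environment variable names come from the text above.

```go
package main

import (
	"errors"
	"fmt"
)

// migrationDSN picks the connection string used to run migrations.
// getenv is injected (os.Getenv in real code) so the logic stays testable.
// Assumption: a legacy NEBU_DB_URL_MIGRATE takes precedence when present.
func migrationDSN(getenv func(string) string) (string, error) {
	// Existing deployments may still pass a dedicated migration DSN.
	if dsn := getenv("NEBU_DB_URL_MIGRATE"); dsn != "" {
		return dsn, nil
	}
	// Fresh deployments: the runtime DSN (nebu_app) also runs migrations.
	if dsn := getenv("NEBU_DB_URL"); dsn != "" {
		return dsn, nil
	}
	return "", errors.New("neither NEBU_DB_URL_MIGRATE nor NEBU_DB_URL is set")
}

func main() {
	// Example environment of a fresh deployment (illustrative DSN).
	env := map[string]string{"NEBU_DB_URL": "postgres://nebu_app@db:5432/nebu"}
	dsn, err := migrationDSN(func(k string) string { return env[k] })
	fmt.Println(dsn, err)
}
```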

000001_init.down.sql drops all tables and functions in reverse dependency order. Extensions (pgcrypto, uuid-ossp) are intentionally NOT dropped - they may be shared at the PostgreSQL server level.

Local dev reset:

After a reset, the gateway logs "migrations complete" and GET /ready returns {"migrations":{"status":"UP","version":1}}.

Stackit PostgresFlex Role Provisioning

The Stackit OpenTofu module (deploy/tofu/examples/stackit/database.tf) provisions three PostgresFlex users:

| User | Purpose |
|---|---|
| nebu_app | Runtime role: owns all objects in the squashed migration; used for both migrations and application queries |
| nebu_migrate | Kept for backward compatibility with existing deployments that pass NEBU_DB_URL_MIGRATE. No longer required for fresh deployments. |
| nebu (legacy) | Placeholder required by the original migration history's GRANT EXECUTE … TO nebu statements. Not needed for fresh deployments using the squashed migration. |

On fresh Stackit deployments, only nebu_app is used. The nebu_migrate and nebu users are provisioned to avoid breaking existing environments that may reference them, but they do not connect at runtime.

Multi-Stage Dockerfiles

Go Gateway pattern:

Elixir Core pattern:

Health Checks

Gateway readiness (GET :8080/ready) reflects the core status GRÜN/GELB/ROT (green/yellow/red), which is derived from the health of the gRPC stream. Docker Compose health checks use the Elixir Core liveness endpoint (GET :4000/health).

Secret Management

Secrets are never passed as environment variables directly. They are mounted via Docker Compose secrets and referenced via NEBU_*_FILE environment variables pointing to the mounted file:
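
A minimal Go sketch of the *_FILE indirection is shown below. It is an illustration under assumptions, not the gateway's implementation: the helper name loadSecret, the concrete variable NEBU_INTERNAL_SECRET_FILE, and the mount path /run/secrets/internal_secret are hypothetical instances of the NEBU_*_FILE pattern. The environment lookup and file read are injected so the example runs without touching the real filesystem.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// loadSecret resolves a secret through the <NAME>_FILE indirection:
// the environment only carries a path, the value lives in the mounted file.
func loadSecret(name string, getenv func(string) string, readFile func(string) ([]byte, error)) (string, error) {
	path := getenv(name + "_FILE")
	if path == "" {
		return "", fmt.Errorf("%s_FILE is not set", name)
	}
	raw, err := readFile(path)
	if err != nil {
		return "", fmt.Errorf("reading %s: %w", path, err)
	}
	// Secret files frequently end with a newline; strip it before use.
	return strings.TrimRight(string(raw), "\r\n"), nil
}

func main() {
	// In-memory stand-ins; real code would pass os.Getenv and os.ReadFile.
	env := map[string]string{"NEBU_INTERNAL_SECRET_FILE": "/run/secrets/internal_secret"}
	files := map[string]string{"/run/secrets/internal_secret": "s3cr3t\n"}
	getenv := func(k string) string { return env[k] }
	readFile := func(p string) ([]byte, error) {
		if v, ok := files[p]; ok {
			return []byte(v), nil
		}
		return nil, errors.New("file not found")
	}

	secret, err := loadSecret("NEBU_INTERNAL_SECRET", getenv, readFile)
	// Print only the length so the secret itself never reaches the logs.
	fmt.Printf("%d bytes loaded, err=%v\n", len(secret), err)
}
```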

MinIO IAM Hardening (Story 12.3)

The media gateway uses a dedicated nebu-app MinIO user with a least-privilege IAM policy:

Intentionally absent: s3:DeleteObject (soft-delete only), s3:ListBucket (prevents enumeration), s3:* (no admin ops from app). The createbuckets init container creates and attaches this policy at startup. The media gateway runs as this user, never as the MinIO root. Root credentials are never passed to the media gateway.

Policy source: dev/minio/nebu-app-policy.json.

Source: _bmad-output/planning-artifacts/architecture.md, §Infrastructure & Deployment, §Build-Container-Strategie, §Resilienz & Selbst-Heilung


Production Deployment (OpenTofu IaC)

Story 13-1 introduces a production-grade Infrastructure-as-Code layer under deploy/. Three target platforms are supported - see ADR-014 for the full decision rationale.

deploy/ Directory Structure

deploy/
  tofu/
    modules/
      nebu-core/      # Shared variables, validations, outputs (no provider resources)
      nebu-aws/       # AWS: ECS Fargate + RDS + S3 + ACM (Story 13-2)
      nebu-stackit/   # STACKIT: VMs + Docker Compose + ALB + DBaaS (Story 13-3)
      nebu-k8s/       # Kubernetes: Helm Release wrapper (Story 13-4)
    examples/
      aws/            # AWS quick-start root module
      stackit/        # STACKIT quick-start root module
      k8s/            # Kubernetes quick-start root module
  helm/
    nebu/             # Standalone Helm Chart (usable without OpenTofu)

Platform Targets

| Platform | Mechanism | Database | Backend |
|---|---|---|---|
| AWS | ECS Fargate + Aurora Serverless v2 | AWS Aurora PostgreSQL 16 (Serverless v2) | S3 + DynamoDB |
| STACKIT | VMs + Docker Compose + ALB | STACKIT PostgresFlex (managed PostgreSQL 16) | STACKIT Object Storage (S3-compatible) |
| Kubernetes | Helm Chart (deploy/helm/nebu/) | External (operator-provided) | S3-compatible or PostgreSQL |

Local IaC Validation

Equivalent CI gate: validate-iac job in .gitlab-ci.yml (runs on every push touching deploy/**).

Shared Module: nebu-core

deploy/tofu/modules/nebu-core/ defines shared input variables consumed by all platform modules: nebu_version, domain_name, admin_email, postgres_db_name, image_registry. All variables carry validation constraints (non-empty checks, semver regex for nebu_version).
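
The effect of the nebu_version constraint can be approximated in Go as below. This is a sketch only: the module's real regular expression is not shown in this document, so the pattern used here (MAJOR.MINOR.PATCH with an optional pre-release suffix) and the function name validateVersion are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// semverRe is an assumed pattern, not the module's actual regex:
// MAJOR.MINOR.PATCH with an optional pre-release suffix such as -rc.1.
var semverRe = regexp.MustCompile(`^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$`)

// validateVersion mirrors the two documented checks: non-empty and semver.
// Floating tags such as "latest" fail the semver check.
func validateVersion(v string) error {
	if v == "" {
		return errors.New("nebu_version must not be empty")
	}
	if !semverRe.MatchString(v) {
		return fmt.Errorf("nebu_version %q is not a semver tag", v)
	}
	return nil
}

func main() {
	for _, v := range []string{"1.4.2", "1.4.2-rc.1", "latest", ""} {
		fmt.Printf("%q -> %v\n", v, validateVersion(v))
	}
}
```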

AWS Networking Module (nebu-aws)

deploy/tofu/modules/nebu-aws/network.tf provisions the AWS network foundation: one VPC, two public and two private subnets across two AZs, a single NAT Gateway (cost-optimized default; use one NAT Gateway per AZ for HA in production), and scoped security groups for the ALB (80/443 from the internet), ECS (ports 8008 and 9000 from the ALB SG), and RDS (5432 from the ECS SG only, egress limited to the VPC CIDR). Resource names incorporate the environment variable (e.g. nebu-prod-alb-sg) to support multi-environment deployments.

AWS Database Module (nebu-aws - database.tf)

deploy/tofu/modules/nebu-aws/database.tf provisions the Aurora Serverless v2 layer (Story 13-8 - replaces the previous RDS Multi-AZ aws_db_instance):

  • aws_db_subnet_group - uses private subnets from network.tf (no public access); retained as Aurora clusters also require DB subnet groups
  • aws_rds_cluster - Aurora PostgreSQL 16.6, engine_mode = "provisioned" (Serverless v2 requirement - NOT "serverless"), storage encrypted, 7-day automated backups. Serverless v2 scaling is controlled by the serverlessv2_scaling_configuration block:
    • min_capacity = 0 (default dev) - scales to zero and auto-pauses after 1 hour idle (seconds_until_auto_pause = 3600). Set min_capacity = 0.5 for production to avoid cold-start latency.
    • max_capacity = 4 (default) - sufficient for expected Nebu MVP load (approx. 2 vCPU equivalent). Increase for high-traffic production.
  • aws_rds_cluster_instance - one instance per cluster, instance_class = "db.serverless" (required for Serverless v2); Performance Insights enabled explicitly (7-day retention).
  • DB master password: var.db_password (sensitive, minimum 8 chars). Secrets Manager integration is provided via secrets.tf.
  • db_endpoint output references aws_rds_cluster.this.endpoint (writer endpoint) - Aurora clusters expose a dedicated writer endpoint; the reader endpoint is separate and not used for Nebu's primary DB connection.

Key variables: aurora_min_capacity (default 0), aurora_max_capacity (default 4), db_password, skip_final_snapshot (default true for dev). The former db_instance_class and enable_performance_insights variables have been removed - Aurora Serverless v2 uses a fixed db.serverless class, and the module now always enables Performance Insights.

Cost model: Aurora Serverless v2 charges per ACU-second consumed. At min_capacity = 0, a completely idle cluster costs near zero (vs. ~$50–100/month for the previous always-on db.t3.medium Multi-AZ instance).

AWS ALB Module (nebu-aws - alb.tf)

deploy/tofu/modules/nebu-aws/alb.tf provisions the internet-facing Application Load Balancer:

  • aws_lb - internet-facing ALB, placed in public subnets, drop_invalid_header_fields = true (header injection protection)
  • aws_lb_target_group.gateway - target type ip (required for Fargate awsvpc networking), port 8008, health check on GET /_matrix/client/v3/versions (HTTP 200)
  • aws_lb_listener.https - port 443, TLS policy ELBSecurityPolicy-TLS13-1-2-2021-06 (TLS 1.3 supported, TLS 1.2 minimum), ACM certificate via var.acm_certificate_arn, forwards to gateway target group
  • aws_lb_listener.http_redirect - port 80, permanent HTTP_301 redirect to HTTPS/443; no plaintext traffic ever reaches the gateway

Module outputs added: alb_dns_name, alb_zone_id (for Route 53 ALIAS records).

Key variable: acm_certificate_arn - required for tofu apply; empty string is accepted for tofu validate only.

AWS Secrets Manager Module (nebu-aws - secrets.tf)

deploy/tofu/modules/nebu-aws/secrets.tf provisions all Nebu runtime credentials as individual Secrets Manager secrets under the nebu/${environment}/ namespace:

| Secret path | Purpose |
|---|---|
| nebu/{env}/db_url | Full PostgreSQL DSN (used by gateway + core) |
| nebu/{env}/db_password | RDS master password (plain string; kept separate for RDS rotation support) |
| nebu/{env}/internal_secret | PSK for gateway ↔ core node registration (ADR-008) |
| nebu/{env}/oidc_client_secret | OIDC client secret for identity provider registration |
| nebu/{env}/oidc_issuer | OIDC issuer URL |
| nebu/{env}/release_cookie | Erlang distribution cookie for OTP cluster authentication |

All aws_secretsmanager_secret_version resources are provisioned with PLACEHOLDER_* initial values and lifecycle { ignore_changes = [secret_string] } - preventing tofu apply from overwriting operator-rotated values. Operators must set real values before go-live (see RUNBOOK.md).
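
A pre-go-live check for leftover placeholders could look like the following Go sketch. It is a hypothetical helper, not part of the repository: it assumes the secret values have already been fetched into a map, and the suffixes after PLACEHOLDER_ in the sample data are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// unsetSecrets returns the secret paths whose value still carries the
// PLACEHOLDER_ prefix written at provisioning time, sorted for stable output.
func unsetSecrets(secrets map[string]string) []string {
	var pending []string
	for path, value := range secrets {
		if strings.HasPrefix(value, "PLACEHOLDER_") {
			pending = append(pending, path)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	// Sample values; real code would read them from Secrets Manager.
	secrets := map[string]string{
		"nebu/prod/db_url":         "PLACEHOLDER_DB_URL",
		"nebu/prod/oidc_issuer":    "https://idp.example.org",
		"nebu/prod/release_cookie": "PLACEHOLDER_COOKIE",
	}
	for _, path := range unsetSecrets(secrets) {
		fmt.Println("still a placeholder:", path)
	}
}
```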

AWS Compute Module (nebu-aws - compute.tf)

deploy/tofu/modules/nebu-aws/compute.tf provisions the full ECS Fargate compute layer:

  • aws_ecs_cluster - Fargate cluster with Container Insights enabled ("nebu-${environment}")
  • aws_iam_role.ecs_task_execution - Task execution role with AmazonECSTaskExecutionRolePolicy + inline policy for secretsmanager:GetSecretValue scoped to arn:aws:secretsmanager:*:*:secret:nebu/${environment}/* (least-privilege; limited to the environment's secret namespace rather than a bare * resource)
  • aws_cloudwatch_log_group - /ecs/nebu-{env}-gateway and /ecs/nebu-{env}-core, 30-day retention
  • aws_ecs_task_definition.gateway - image {registry}/nebu-gateway:{version}, CPU 256 / Memory 512, port 8008, health check GET /_matrix/client/v3/versions, secrets field references individual Secrets Manager ARNs for NEBU_DB_URL, NEBU_OIDC_ISSUER, NEBU_OIDC_CLIENT_SECRET, NEBU_INTERNAL_SECRET
  • aws_ecs_task_definition.core - image {registry}/nebu-core:{version}, CPU 256 / Memory 512, port 9000, health check GET /health, secrets field references for DATABASE_URL, RELEASE_COOKIE, NEBU_INTERNAL_SECRET
  • aws_ecs_service.gateway - Fargate service in private subnets, attached to the ALB target group (port 8008), lifecycle { ignore_changes = [task_definition, desired_count] } for GitOps rolling deployments
  • aws_ecs_service.core - Fargate service in private subnets (no ALB attachment; accessed by gateway via gRPC on port 9000 within the VPC)

Security invariants: No environment field used for secrets in any task definition - all sensitive values are injected exclusively via the secrets field with Secrets Manager ARNs. IAM permissions are scoped to the namespace nebu/${environment}/* only.

Key variables: aws_region (CloudWatch Logs), image_registry, nebu_version, acm_certificate_arn, ecs_desired_count (default 1).

Module outputs: ecs_cluster_arn, db_endpoint, task_execution_role_arn, alb_dns_name, alb_zone_id.

AWS Deployment Topology (Complete)

Internet
  │ HTTPS/443 (TLS 1.3)
  ▼
aws_lb (ALB, internet-facing)
  │ port 443 → aws_lb_listener.https → aws_lb_target_group.gateway (port 8008)
  │ port 80  → aws_lb_listener.http_redirect (HTTP 301 → HTTPS)
  ▼
aws_ecs_service.gateway (private subnets, ECS Fargate, no public IP)
  │ gRPC port 9000 (VPC-internal, ECS SG)
  ▼
aws_ecs_service.core (private subnets, ECS Fargate, no public IP)
  │ port 5432 (private subnets, RDS SG)
  ▼
aws_rds_cluster (Aurora PostgreSQL 16, Serverless v2, private subnets)
  ├── aws_rds_cluster_instance (db.serverless - scales 0–4 ACUs)
  └── writer endpoint: aws_rds_cluster.this.endpoint (port 5432)

Secrets Manager (nebu/{env}/*) ──► ECS task execution role (secretsmanager:GetSecretValue)
                                        │
                              injected into gateway + core containers
                              via ECS secrets field at task start

Day-2 operations (rolling updates, secret rotation, teardown) are documented in deploy/tofu/examples/aws/RUNBOOK.md.

Stackit VM + Networking Module (nebu-stackit - 13-3a)

deploy/tofu/examples/stackit/main.tf provisions the STACKIT compute and network foundation:

  • stackit_network - private routed network with configurable CIDR (var.network_cidr, default 10.0.0.0/24)
  • stackit_security_group + rules - stateful SG with inbound rules for 443 (HTTPS), 8008 (Matrix API), 22 (SSH); egress unrestricted
  • stackit_network_interface - VM NIC attached to the network and SG
  • stackit_key_pair - account-level SSH key pair (global resource; no project_id)
  • stackit_server - Ubuntu 24.04 LTS VM; machine type via var.vm_plan_id (default g2i.2); AZ via var.availability_zone (default eu01-1)
  • stackit_public_ip - Floating IP associated with the VM NIC. If the VM is recreated, re-attach manually via STACKIT portal or stackit beta network-interface public-ip attach
  • stackit_loadbalancer - ALB with PROTOCOL_TCP listener on port 443 → target pool on VM port 8008; health check is TCP-only (no HTTP path checks); plan via var.alb_plan_id (default p10)

HTTPS at ALB (upgrade path): enable_beta_resources = true is already set in the provider block. Once stackit provider >= 0.96 exposes PROTOCOL_HTTPS in its stable schema, change the listener protocol to PROTOCOL_HTTPS and set certificate_reference.name = var.stackit_tls_certificate_arn (Stackit-managed certificate ARN). Until then, TLS is terminated at the gateway on port 8008.

Authentication: provider uses service_account_key_path (JSON key file) instead of a token. Path configured via var.stackit_key_path (sensitive).

Key variables: stackit_project_id, stackit_key_path, ssh_public_key, ubuntu_image_id, network_cidr, availability_zone, vm_plan_id, alb_plan_id, stackit_tls_certificate_arn.

Stackit Managed Database (nebu-stackit - Story 13-8)

deploy/tofu/examples/stackit/main.tf provisions a dedicated STACKIT PostgresFlex instance alongside the application VM. This replaces the previous postgres:16-alpine container bundled in docker-compose.

Resources provisioned:

  • stackit_postgresflex_instance.nebu - managed PostgreSQL 16 cluster with daily automated backups (0 2 * * * UTC), configurable replicas (var.postgres_replicas, default 1; use 3 for production HA), and ACL locked to the VM's private network CIDR (var.network_cidr) - no public exposure
  • stackit_postgresflex_user.nebu (nebu_app) - application user with ["login"] role; owns the nebu database
  • stackit_postgresflex_user.keycloak (keycloak_app) - dedicated Keycloak user with ["login"] role; owns the keycloak database. Using separate DB users per application is a defence-in-depth measure: a compromise of the Nebu application user does not grant access to Keycloak data.
  • stackit_postgresflex_database.nebu - nebu database, owner nebu_app
  • stackit_postgresflex_database.keycloak - keycloak database, owner keycloak_app

PostgresFlex connection details (host, port, username, password) are computed by the provider after instance creation and passed directly into the cloud-init template as pg_* / kc_* variables. The password is not exposed as a Tofu output - operators retrieve it from Terraform state or via the STACKIT portal.

Stackit Sizing Variables:

| Variable | Default | Production recommendation |
|---|---|---|
| postgres_replicas | 1 | 3 (HA replication) |
| postgres_cpu | 1 | 2+ (depending on load) |
| postgres_ram | 4 GB | 8+ GB |
| postgres_storage_size | 20 GB | 50+ GB |

Stackit OIDC Deployment Profiles (nebu-stackit - Story 13-9)

The Stackit deployment supports two OIDC profiles selected via var.oidc_mode:

| oidc_mode | Bundled IdP | Use case |
|---|---|---|
| "dex" | Dex IdP sidecar (dexidp/dex:v2.45.1) | Test, demo, integration environments |
| "external" (default) | None | Production - operator provides an external OIDC provider |

"dex" profile:

  • Dex runs as a Docker Compose sidecar on port 5556 with a static in-memory config (no database).
  • /opt/nebu/dex/config.yaml is written at boot (mode 0600) with a static client (nebu-gateway) and a static password user (operator@example.com).
  • effective_oidc_issuer is auto-derived as http://<server_name>:5556/dex; the gateway resolves Dex via Docker hairpin NAT (Linux SNAT/masquerade - no extra routing required).
  • A conditional SG rule inbound_dex opens port 5556 to var.dex_allowed_cidr (default 0.0.0.0/0 for demos; restrict to VPN/developer CIDR in shared environments).
  • var.dex_static_password_hash (required when oidc_mode = "dex") must be a bcrypt hash ($2a$/$2b$/$2y$ prefix). Generate with: htpasswd -bnBC 12 '' 'yourpassword' | tr -d ':' | sed 's/$2y/$2a/'.
  • Expected docker compose ps output: dex (healthy), core (healthy), gateway (healthy).
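
The shape of an acceptable dex_static_password_hash can be checked with a small Go sketch like the one below. The documented rule only names the $2a$/$2b$/$2y$ prefix; the stricter full-format check here (two-digit cost followed by 53 characters of salt and digest, 60 characters in total) is an added assumption based on the standard bcrypt encoding, and the function name looksLikeBcrypt is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bcryptRe matches the standard 60-character bcrypt encoding:
// "$2a$", "$2b$" or "$2y$", a two-digit cost, "$", then 53 characters
// (22 salt + 31 digest) from the bcrypt base64 alphabet.
var bcryptRe = regexp.MustCompile(`^\$2[aby]\$\d{2}\$[./A-Za-z0-9]{53}$`)

// looksLikeBcrypt checks the format only; it cannot tell whether the
// hash belongs to any particular password.
func looksLikeBcrypt(hash string) bool {
	return bcryptRe.MatchString(hash)
}

func main() {
	// Syntactically valid dummy value, not a hash of a real password.
	sample := "$2a$12$" + strings.Repeat("a", 53)
	fmt.Println(looksLikeBcrypt(sample))
	fmt.Println(looksLikeBcrypt("plaintext-password"))
}
```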

"external" profile:

  • No bundled IdP is deployed.
  • var.oidc_issuer (required, non-empty) is passed directly into .env as NEBU_OIDC_ISSUER.
  • var.oidc_client_secret must match the secret registered in the external OIDC provider.
  • Expected docker compose ps output: core (healthy), gateway (healthy).

Keycloak is fully removed in both profiles. stackit_postgresflex_user.keycloak and stackit_postgresflex_database.keycloak no longer exist. There are no Keycloak-specific docker-compose services or secrets.

New variables (Story 13-9):

| Variable | Type | Default | Purpose |
|---|---|---|---|
| oidc_mode | string | "external" | Profile selection; validated against ["dex", "external"] |
| dex_allowed_cidr | string | "0.0.0.0/0" | Source CIDR for SG rule on Dex port 5556 (dex mode only) |
| dex_static_password_hash | string (sensitive) | null | bcrypt hash for Dex static user; required when oidc_mode = "dex" |

A lifecycle { precondition } block on stackit_server.nebu enforces: oidc_mode == "dex" → dex_static_password_hash != null and oidc_mode == "external" → length(oidc_issuer) > 0.
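
The same two implications, written out as a Go function for readers who find the rule easier to follow in code. This is a restatement for illustration, not the HCL used in the module; the function name checkOIDC is hypothetical, and a nil pointer stands in for a null Tofu value.

```go
package main

import (
	"errors"
	"fmt"
)

// checkOIDC restates the precondition: each profile requires its own input.
// dexHash is a pointer so that nil can represent an unset (null) variable.
func checkOIDC(mode string, dexHash *string, issuer string) error {
	switch mode {
	case "dex":
		if dexHash == nil {
			return errors.New(`oidc_mode = "dex" requires dex_static_password_hash`)
		}
	case "external":
		if len(issuer) == 0 {
			return errors.New(`oidc_mode = "external" requires a non-empty oidc_issuer`)
		}
	default:
		return fmt.Errorf(`oidc_mode must be "dex" or "external", got %q`, mode)
	}
	return nil
}

func main() {
	fmt.Println(checkOIDC("external", nil, "https://idp.example.org"))
	fmt.Println(checkOIDC("dex", nil, ""))
}
```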

DNS Mode - dns_mode variable (nebu-aws + nebu-stackit - Story 13-10)

Both the AWS and Stackit deployment examples support a dns_mode variable that controls whether OpenTofu creates DNS records automatically or leaves DNS registration to the operator.

| dns_mode | AWS behaviour | Stackit behaviour |
|---|---|---|
| "external" (default) | No DNS resources created. dns_name output holds the ALB DNS hostname for manual CNAME/ALIAS registration. | No DNS resources created. dns_name output holds the floating IP for manual A-record registration. |
| "default" | Creates data.aws_route53_zone + aws_route53_record (ALIAS A-record) in Route 53, guarded by count = 1. Requires the hosted zone for var.domain_name to exist in the AWS account with an exact name match. | Creates stackit_dns_zone + stackit_dns_record_set.nebu (A-record) for var.server_name → floating IP, guarded by count = 1. |

Default is "external" to prevent accidental DNS changes on existing deployments.

Stackit-only: dex_subdomain_enabled

When dns_mode = "default" and dex_subdomain_enabled = true, an additional stackit_dns_record_set.dex resource creates a dex.<server_name> A-record pointing to the same floating IP. This lays groundwork for future host-based Dex routing (see story 13-9 dev notes). The variable is validated - setting it to true when dns_mode = "external" raises a validation error at plan time.
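
The interaction between the two variables can be summarised in a short Go sketch. It restates the documented validation rules only; the function name validateDNS and the error wording are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// validateDNS restates the plan-time rules: dns_mode must be one of the two
// known values, and the Dex subdomain record is only valid with "default".
func validateDNS(dnsMode string, dexSubdomainEnabled bool) error {
	if dnsMode != "default" && dnsMode != "external" {
		return fmt.Errorf(`dns_mode must be "default" or "external", got %q`, dnsMode)
	}
	if dexSubdomainEnabled && dnsMode != "default" {
		return errors.New(`dex_subdomain_enabled = true requires dns_mode = "default"`)
	}
	return nil
}

func main() {
	fmt.Println(validateDNS("default", true))
	fmt.Println(validateDNS("external", true))
}
```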

Stackit-only: dns_contact_email

An optional dns_contact_email variable (default "") sets the contact_email on the stackit_dns_zone resource. When empty, the field is omitted via a conditional expression (var.dns_contact_email != "" ? var.dns_contact_email : null).

AWS: dns_name output

A new deploy/tofu/examples/aws/outputs.tf exports dns_name = module.nebu_aws.alb_dns_name. Operators using dns_mode = "external" run tofu output dns_name to retrieve the ALB hostname to register as a CNAME. Note: CNAME is not supported at the zone apex - use Route 53 ALIAS or an ALIAS/ANAME record at your DNS provider for apex domains.

RUNBOOK coverage: Both deploy/tofu/examples/aws/RUNBOOK.md and deploy/tofu/examples/stackit/RUNBOOK.md include a "DNS Configuration" section describing both modes, manual registration steps for external mode, and import instructions for existing Stackit DNS zones.

New variables (Story 13-10):

| Variable | Example | Type | Default | Purpose |
|---|---|---|---|---|
| dns_mode | AWS + Stackit | string | "external" | DNS record creation mode; validated against ["default", "external"] |
| dex_subdomain_enabled | Stackit only | bool | false | Create dex.<server_name> DNS record when dns_mode = "default" |
| dns_contact_email | Stackit only | string | "" | Contact email for Stackit DNS zone (omitted when empty) |

Stackit cloud-init Bootstrap (nebu-stackit - 13-3b, updated 13-8, 13-9)

deploy/tofu/examples/stackit/cloud-init.tftpl is rendered by templatefile() in main.tf and injected as user_data (base64-encoded) into the VM at provision time. On first boot, the VM:

  1. Applies OS security patches (package_upgrade: true)
  2. Installs Docker CE + Docker Compose plugin via the official Docker apt repository
  3. Writes /opt/nebu/ directory tree (permissions 0700) with:
    • .secrets/internal_secret (mode 0600, quoted scalar - no trailing newline)
    • .env (mode 0600) - all runtime secrets injected from OpenTofu variables, including NEBU_DB_URL and NEBU_DB_URL_MIGRATE pointing to the PostgresFlex managed instance
    • /opt/nebu/dex/config.yaml (mode 0600, oidc_mode = "dex" only) - Dex static configuration
    • docker-compose.yml (mode 0640) - two or three services: core, gateway, and conditionally dex (when oidc_mode = "dex"). No postgres service (managed PostgresFlex); no keycloak service.
  4. Installs /etc/systemd/system/nebu.service - Type=simple, no -d flag; systemd owns the process lifecycle with Restart=on-failure RestartSec=10
  5. Starts Docker (systemctl start docker.service && sleep 2) before starting nebu.service

Changes in Story 13-9:

  • keycloak docker-compose service removed in both profiles - Keycloak is no longer deployed
  • KC_DB_PASSWORD removed from .env - no Keycloak credentials
  • Conditional Dex write_files entry and dex: service block added (guarded by %{ if oidc_mode == "dex" ~} template directive)
  • templatefile() call no longer passes kc_user/kc_password; now passes oidc_mode and dex_static_password_hash
  • local.effective_oidc_issuer computed in main.tf - auto-set to http://<server_name>:5556/dex for dex mode; uses var.oidc_issuer for external mode
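
The derivation of local.effective_oidc_issuer is simple enough to restate as a Go function. This is an illustrative equivalent of the documented rule, not the HCL expression itself; the function name effectiveIssuer and the sample host names are hypothetical.

```go
package main

import "fmt"

// effectiveIssuer returns the Dex URL on the server itself in dex mode and
// passes the operator-supplied issuer through unchanged in external mode.
func effectiveIssuer(oidcMode, serverName, oidcIssuer string) string {
	if oidcMode == "dex" {
		return "http://" + serverName + ":5556/dex"
	}
	return oidcIssuer
}

func main() {
	fmt.Println(effectiveIssuer("dex", "matrix.example.org", ""))
	fmt.Println(effectiveIssuer("external", "matrix.example.org", "https://idp.example.org"))
}
```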

Changes in Story 13-15:

  • Conditional element Docker service added (guarded by %{ if enable_element_web ~} template directive); see below for full details
  • /opt/nebu/element-config.json write_files block added (conditional on enable_element_web)
  • nginx routing updated: added /api/v1/ → gateway and location / → Element blocks; WebSocket Upgrade/Connection headers added to the Element location

Security invariants:

  • pg_password appears only inside .env (mode 0600), never in plain environment variables or log output
  • dex_static_password_hash is written only to /opt/nebu/dex/config.yaml (mode 0600)
  • ACL on PostgresFlex is set to [var.network_cidr] - never 0.0.0.0/0
  • stackit_postgresflex_user.nebu.password is stored in Terraform state - operators MUST use encrypted Stackit Object Storage backend (see RUNBOOK.md)
  • /opt/nebu/ is 0700 (root-only); docker-compose.yml is 0640
  • nebu_version variable rejects "latest" - a specific semver tag is required
  • oidc_client_secret must not contain " or \ characters (YAML interpolation constraint, enforced by variable validation)

Day-2 operations (updates, OIDC profile switching, backup, teardown) are documented in deploy/tofu/examples/stackit/RUNBOOK.md.

Element Web Client (nebu-stackit - Story 13-15)

deploy/tofu/examples/stackit/ supports an optional Element Web Matrix client deployment controlled by the enable_element_web variable. This allows end-users to access Nebu via a browser without installing a separate client.

Variable:

| Variable | Type | Default | Purpose |
|---|---|---|---|
| enable_element_web | bool | false | Deploy vectorim/element-web:v1.12.15 as a Docker Compose service and configure nginx routing |

CORS constraint - enable_tls = true is required:

When enable_element_web = true but enable_tls = false, Element Web runs on port 7070 and the gateway runs on port 8008 - different ports mean different origins. Browsers enforce the Same-Origin Policy: every Matrix API call from Element triggers a CORS preflight that fails because the gateway does not emit Access-Control-Allow-Origin for the Element origin. All Matrix API calls from Element silently fail.

When enable_tls = true, nginx terminates TLS and serves both services under a single origin (https://<server_name>), eliminating the CORS problem entirely.

A lifecycle { precondition } block in deploy/tofu/examples/stackit/server.tf enforces this constraint:

tofu apply fails immediately if an operator sets enable_element_web = true with enable_tls = false.

Docker service topology (when enable_element_web = true):

vectorim/element-web:v1.12.15
  image:    vectorim/element-web:v1.12.15
  ports:    127.0.0.1:7070:80  (loopback only - nginx is the public entry point)
  volumes:  /opt/nebu/element-config.json:/app/config.json:ro
  depends_on: gateway (service_healthy)
  healthcheck: wget -qO- http://localhost/

Element binds to loopback only (127.0.0.1:7070) - it is never reachable directly from the internet. Port 7070 is intentionally absent from the Stackit security group rules. Only nginx (port 443) is the public entry point.

nginx routing table (when enable_tls = true AND enable_element_web = true):

| Path prefix | Proxied to | Purpose |
|---|---|---|
| /_matrix/ | http://127.0.0.1:8008 | Matrix Client-Server API |
| /.well-known/matrix/ | http://127.0.0.1:8008 | Matrix server discovery |
| /admin/ | http://127.0.0.1:8008 | Nebu Admin UI and API |
| /api/v1/ | http://127.0.0.1:8008 | Internal/compliance API |
| /dex/ | http://127.0.0.1:5556 | Dex IdP (only when oidc_mode = "dex") |
| / | http://127.0.0.1:7070 | Element Web client (catch-all) |

The location / block for Element is listed last for readability. nginx selects the longest matching prefix location regardless of declaration order, so Matrix API paths are never accidentally routed to Element.
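
Longest-prefix selection over the routing table can be modelled in a few lines of Go. This is a simplified sketch, not nginx itself: it covers plain prefix locations only (no regex or exact-match locations), and the type and function names are hypothetical. The prefixes and upstreams are taken from the table above.

```go
package main

import (
	"fmt"
	"strings"
)

// route pairs a location prefix with the upstream it proxies to.
type route struct {
	prefix   string
	upstream string
}

// routes mirrors the routing table, including the optional /dex/ entry.
var routes = []route{
	{"/_matrix/", "http://127.0.0.1:8008"},
	{"/.well-known/matrix/", "http://127.0.0.1:8008"},
	{"/admin/", "http://127.0.0.1:8008"},
	{"/api/v1/", "http://127.0.0.1:8008"},
	{"/dex/", "http://127.0.0.1:5556"},
	{"/", "http://127.0.0.1:7070"},
}

// pickUpstream returns the upstream of the longest prefix matching path.
// Declaration order does not influence the result.
func pickUpstream(path string) string {
	best := route{}
	for _, r := range routes {
		if strings.HasPrefix(path, r.prefix) && len(r.prefix) > len(best.prefix) {
			best = r
		}
	}
	return best.upstream
}

func main() {
	for _, p := range []string{"/_matrix/client/v3/sync", "/dex/auth", "/index.html"} {
		fmt.Println(p, "->", pickUpstream(p))
	}
}
```

Because "/" matches every request path, it only wins when no longer prefix applies, which is what makes it a catch-all for Element.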

The Element location includes WebSocket upgrade headers (Upgrade, Connection) to support future Matrix /sync long-polling and WebSocket transports.

element-config.json - Security hardening:

/opt/nebu/element-config.json is written by cloud-init at boot time with values interpolated from OpenTofu variables:

disable_custom_urls: true prevents users from pointing Element at a different server. UIFeature.registration and UIFeature.passwordReset are disabled because Nebu uses OIDC-only authentication. The file is absent when enable_element_web = false.

element_url output:

deploy/tofu/examples/stackit/outputs.tf exposes an element_url output:

| Condition | Output |
|---|---|
| enable_element_web = true AND enable_tls = true | https://<server_name>/ |
| enable_element_web = true AND enable_tls = false | http://<server_name>:7070/ (blocked by precondition) |
| enable_element_web = false | "disabled" |
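
The three cases of the element_url output correspond to the following decision logic, shown here as a Go sketch. It restates the table for illustration and is not the HCL expression from outputs.tf; the function name elementURL and the sample host name are hypothetical.

```go
package main

import "fmt"

// elementURL reproduces the three documented output cases.
// The non-TLS case is kept for completeness; in practice the
// precondition rejects that combination before it can be applied.
func elementURL(enableElementWeb, enableTLS bool, serverName string) string {
	switch {
	case !enableElementWeb:
		return "disabled"
	case enableTLS:
		return "https://" + serverName + "/"
	default:
		return "http://" + serverName + ":7070/"
	}
}

func main() {
	fmt.Println(elementURL(true, true, "matrix.example.org"))
	fmt.Println(elementURL(false, true, "matrix.example.org"))
}
```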

terraform.tfvars.example documentation section:

Helm Chart

deploy/helm/nebu/ is a standalone Helm Chart usable independently of OpenTofu.

Kubernetes resource topology (Story 13-4a + 13-4b):

| Resource | Name pattern | Conditional | Purpose |
|---|---|---|---|
| Deployment | {release}-nebu-gateway | always | Go gateway; loads ConfigMap + optional Secret via envFrom |
| Deployment | {release}-nebu-core | always | Elixir/OTP core |
| Service (ClusterIP) | {release}-nebu-gateway | always | Port 8008 → gateway pods |
| Service (ClusterIP) | {release}-nebu-core | always | Port 9000 (gRPC) + 4000 (HTTP) → core pods |
| ConfigMap | {release}-nebu-config | always | NEBU_OIDC_ISSUER, NEBU_SERVER_NAME, NEBU_CORE_GRPC_ADDR (non-secret) |
| Ingress | {release}-nebu | ingress.enabled: true | nginx Ingress with configurable hostname and optional TLS secret |
| PersistentVolumeClaim | {release}-nebu-postgres | postgres.external: false | Postgres storage when running in-cluster |
| HorizontalPodAutoscaler | {release}-nebu-gateway | autoscaling.gateway.enabled: true | CPU-based HPA for gateway Deployment |

NEBU_CORE_GRPC_ADDR is derived deterministically from the release name as {release}-nebu-core:9000 - rendered by the ConfigMap template, not a values entry.

Secret management (ExistingSecret pattern):

The chart never creates a Kubernetes Secret. Sensitive NEBU_* variables (NEBU_DB_URL, NEBU_INTERNAL_SECRET, NEBU_OIDC_CLIENT_SECRET) must be stored in a pre-existing Kubernetes Secret, created out-of-band by the operator:

The Secret name is referenced in values.yaml as existingSecret.name. When set, the gateway Deployment loads it via envFrom.secretRef. If the name is empty, no Secret is mounted and the gateway will fail to start - this is intentional (GitOps pre-deploy pattern) and documented in NOTES.txt.

Image tags must be set independently per component. Omitting either tag causes helm install to fail with an explicit error - preventing silent deployment of unversioned images.

OpenTofu Kubernetes Wrapper (nebu-k8s - Story 13-4c)

deploy/tofu/modules/nebu-k8s/ is a thin OpenTofu wrapper around a single helm_release resource. It manages the Nebu Helm release on any Kubernetes cluster. The Kubernetes and Helm providers must be configured in the calling root module.

Module interface:

| Variable | Type | Default | Purpose |
|---|---|---|---|
| release_name | string | "nebu" | Helm release name |
| chart_path | string | required | Path to the Nebu Helm chart directory |
| namespace | string | "nebu" | Kubernetes namespace (created if absent) |
| gateway_image_tag | string | required | Container image tag for the gateway component |
| core_image_tag | string | required | Container image tag for the core component |
| ingress_enabled | bool | false | Enable the Ingress resource |
| helm_timeout | number | 300 | Seconds to wait for all pods to reach Ready state |
| values_files | list(string) | [] | Extra values files; must be non-empty absolute paths |

wait = true is set explicitly - tofu apply blocks until all pods are Ready or helm_timeout seconds elapse. This prevents silent partial deployments.

values_files paths must be absolute (use "${path.module}/..." in the calling root module) - relative paths are resolved from the tofu apply working directory, not the module directory.

Quick-start example (deploy/tofu/examples/k8s/):

Local smoke test (kind):

Full operator procedures (upgrade, rollback, HPA configuration, teardown) are in deploy/tofu/examples/k8s/RUNBOOK.md.


Load Test Topology (Story 13-5)

docker-compose.scale.yml is a Compose override that removes the fixed host-port binding from the gateway service so Docker Compose can start multiple replicas without port conflicts.

Multi-Gateway Scale-Up

Both replicas connect to the same Core instance via core:9000 (gRPC) and share the internal_secret PSK. For 2-core clustering, see Story 13-6 below.

k6 load generator
  │ HTTP  (nebu_default Docker network)
  ▼
[gateway replica 1] ──gRPC:9000──▶ core
[gateway replica 2] ──gRPC:9000──▶ core
                                     │
                                     ▼
                               PostgreSQL 16

Core Clustering (Story 13-6): libcluster + Horde Failover

The Elixir Core supports horizontal scaling via libcluster for node discovery and Horde for distributed Room GenServer supervision. When a Core node fails, Horde automatically restarts Room GenServers on the surviving node within seconds (no manual intervention).

Architecture

[gateway] ──gRPC:9000──▶ [core (nebu@core)]   ←── Horde CRDT cluster ───▶ [core2 (nebu@core2)]
                               │                                                    │
                          Room GenServers                                   Room GenServers
                           (Horde Registry)                               (Horde Registry)
                               │                                                    │
                               └───────────────┬──────────────────────────────────┘
                                               ▼
                                        PostgreSQL 16

When core (nebu@core) is stopped:

  1. Horde CRDT reconciliation detects the node departure
  2. Room GenServers that were running on core are restarted on core2
  3. Horde Registry re-registers the rooms under core2's node
  4. Gateway's next gRPC call succeeds (Room GenServer is alive on core2)
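
The failover steps above can be pictured with a toy model. This is not Horde's implementation: Horde places processes according to its configured distribution strategy, and the round-robin reassignment below is only a stand-in. The room IDs and the `fail_over` function are hypothetical.

```python
def fail_over(registry, failed_node, survivors):
    """Return a new room -> node mapping with the failed node's rooms moved.

    registry:    dict mapping room ID to the node currently hosting it
    failed_node: node name that left the cluster
    survivors:   list of node names still alive (round-robin targets)
    """
    if not survivors:
        raise ValueError("no surviving node to restart rooms on")
    new_registry = {}
    moved = 0
    for room, node in sorted(registry.items()):
        if node == failed_node:
            # Room is restarted on a survivor and re-registered under it.
            new_registry[room] = survivors[moved % len(survivors)]
            moved += 1
        else:
            new_registry[room] = node
    return new_registry


registry = {
    "!a:example.org": "nebu@core",
    "!b:example.org": "nebu@core2",
    "!c:example.org": "nebu@core",
}
after = fail_over(registry, "nebu@core", ["nebu@core2"])
```

After the call, every room in `after` maps to `nebu@core2`, which corresponds to step 4: the gateway's next lookup finds a live Room GenServer.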

Clustering Strategies

| Environment | Strategy | Discovery |
|---|---|---|
| Docker Compose | Cluster.Strategy.Gossip | UDP broadcast (port 45892) within Docker network |
| Kubernetes | Cluster.Strategy.Kubernetes.DNS | Headless Service DNS lookup (core-headless) |

Environment Variables

| Variable | Purpose | Example |
|---|---|---|
| CLUSTER_STRATEGY | Selects libcluster strategy (gossip / kubernetes) | gossip |
| RELEASE_NODE | Erlang node name for distribution | nebu@core |
| RELEASE_DISTRIBUTION | Erlang distribution mode (name = long names) | name |
| CLUSTER_NODES | Comma-separated peer list (informational; Gossip discovers dynamically) | nebu@core |
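
As a rough illustration of how CLUSTER_STRATEGY maps to a libcluster strategy, the sketch below models the selection in Python. The Core itself is written in Elixir, so this is not its configuration code. Treating an unset variable as single-node mode and rejecting unknown values are assumptions.

```python
# Strategy names as listed in the Clustering Strategies table.
STRATEGIES = {
    "gossip": "Cluster.Strategy.Gossip",
    "kubernetes": "Cluster.Strategy.Kubernetes.DNS",
}


def select_strategy(env):
    """Map the CLUSTER_STRATEGY environment value to a libcluster strategy name.

    Returns None when the variable is unset or empty
    (assumption: single-node mode, no libcluster configured).
    """
    value = env.get("CLUSTER_STRATEGY", "").strip()
    if not value:
        return None
    if value not in STRATEGIES:
        # Assumption: fail fast on typos rather than silently running unclustered.
        raise ValueError(f"unknown CLUSTER_STRATEGY: {value!r}")
    return STRATEGIES[value]
```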

Local 2-Core Stack

The docker-compose.scale.yml override adds:

  • core service: sets RELEASE_NODE=nebu@core and CLUSTER_STRATEGY=gossip
  • core2 service: same image, RELEASE_NODE=nebu@core2, connects to core via Gossip multicast

Kubernetes Clustering

When core.replicaCount > 1 in values.yaml, the Helm chart automatically injects:

  • CLUSTER_STRATEGY=kubernetes → activates Cluster.Strategy.Kubernetes.DNS
  • RELEASE_NODE=nebu@$(MY_POD_IP) → unique long-form node name per pod
  • RELEASE_DISTRIBUTION=name → enables Erlang long-name distribution
  • KUBERNETES_SERVICE_NAME pointing to the headless Service (core-headless)

A headless Service (core-deployment-headless) is automatically provisioned alongside the standard ClusterIP Service when replicaCount > 1. The headless Service returns A records for each pod, enabling libcluster to discover peers via DNS.
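
The conditional injection can be summarized as a function of the replica count. The sketch below is a Python model of that rule, not the chart's template code. The function name `core_cluster_env` is hypothetical, and the headless Service name is a parameter because the chart determines the actual value.

```python
def core_cluster_env(replica_count, headless_service="core-headless"):
    """Return the clustering env vars the chart is described as injecting.

    With a single replica nothing is injected; with more than one, the four
    variables listed above are set. "$(MY_POD_IP)" is left unexpanded here:
    Kubernetes substitutes it per pod from another env var of that name.
    """
    if replica_count <= 1:
        return {}
    return {
        "CLUSTER_STRATEGY": "kubernetes",
        "RELEASE_NODE": "nebu@$(MY_POD_IP)",
        "RELEASE_DISTRIBUTION": "name",
        "KUBERNETES_SERVICE_NAME": headless_service,
    }
```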

Health Endpoint - Cluster Status

GET http://core:4000/health now includes cluster_nodes in its JSON response:

cluster_nodes is an empty list in single-node mode (no libcluster configured).
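
A monitoring script could read this field to tell single-node mode from clustered mode. The sketch below assumes only what is stated above: the response is a JSON object whose cluster_nodes key holds a list. Any other field, and the function name `cluster_status`, are hypothetical.

```python
import json


def cluster_status(body):
    """Summarize the cluster_nodes field of a /health response body (JSON text)."""
    data = json.loads(body)
    # Assumption: a missing key is treated like an empty list.
    nodes = data.get("cluster_nodes", [])
    return {"clustered": len(nodes) > 0, "nodes": sorted(nodes)}
```

For example, `cluster_status('{"cluster_nodes": []}')` reports `clustered` as `False`, matching single-node mode.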

Erlang distribution requires all nodes in a cluster to share the same cookie. How the cookie is provided depends on the environment:

  • Docker Compose: uses Erlang's default auto-generated cookie (acceptable for dev; nodes on the same Docker network)
  • Kubernetes/AWS: RELEASE_COOKIE is injected from Secrets Manager (see AWS secrets.tf) to ensure all Core pods use the same pre-shared cookie
  • STACKIT: operators must set a consistent RELEASE_COOKIE in the cloud-init bootstrap across all Core VMs

k6 Scenario Files

| File | Tier | VUs | Duration | Send p95 threshold |
|---|---|---|---|---|
| k6/scenarios/gold-tier.js | Gold | 1000 | 5 min | < 500 ms |
| k6/scenarios/silver-tier.js | Silver | 500 | 5 min | < 800 ms |

Each scenario exercises three endpoints per VU per iteration: POST /_matrix/client/v3/login, GET /_matrix/client/v3/sync, PUT /_matrix/client/v3/rooms/{roomId}/send/m.room.message/{txnId}.

Custom metrics reported: nebu_login_duration, nebu_sync_duration, nebu_send_duration (each as a Trend with p50/p95/p99), plus per-endpoint error rates. See k6/README.md for full operator instructions.
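
To make the p95 thresholds concrete, the sketch below checks a list of send durations against a tier's limit. k6 computes its own percentiles, so the nearest-rank method used here is a simplification and results may differ slightly from k6's. The function names are hypothetical.

```python
# Send p95 limits in milliseconds, taken from the scenario table.
THRESHOLDS_MS = {"gold": 500, "silver": 800}


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def passes(tier, send_durations_ms):
    """True when the p95 send duration is strictly below the tier's threshold."""
    return p95(send_durations_ms) < THRESHOLDS_MS[tier]
```

With 100 samples, up to five slow outliers leave the p95 unaffected; a sixth pushes the 95th-ranked sample into the slow group and fails the check.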

CI Syntax Gate

This gate runs as part of make test-iac-validate (no running stack required).


Release Pipeline (Story 13-11)

Story 13-11 introduces a versioned release track that runs alongside the existing SHA-tagged build track. The two tracks are intentionally separate and do not interfere with each other.

Two-Track Image Strategy

| Track | Trigger | Image tag pattern | Purpose |
|---|---|---|---|
| Build (existing) | Every push / MR | gateway:<SHA>, core:<SHA> | Feeds integration tests; ephemeral |
| Release (new) | SemVer Git tag (v*.*.*) | nebu-gateway:<semver>, nebu-core:<semver> | Production images; referenced by operators |

Image naming: Release images use the nebu-gateway / nebu-core prefix (not gateway / core as in build images). This distinguishes them at the registry level and prevents accidental cross-track references.

Version stripping convention: Git tag v1.0.0 → image tag 1.0.0 (the v prefix is stripped in both the CI pipeline and the make release target).
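
The stripping convention and the release naming can be sketched together. This is an illustration in Python, not the pipeline's or the Makefile's actual logic. The `<registry>/nebu-<component>:<tag>` layout and both function names are assumptions.

```python
def image_tag(git_tag):
    """Strip a single leading 'v': 'v1.0.0' becomes '1.0.0'."""
    return git_tag[1:] if git_tag.startswith("v") else git_tag


def release_images(git_tag, registry):
    """Build release image references for both components.

    Assumption: images live directly under the registry path as
    nebu-gateway and nebu-core.
    """
    tag = image_tag(git_tag)
    return [f"{registry}/nebu-{component}:{tag}" for component in ("gateway", "core")]
```

Calling `release_images("v1.0.0", "registry.example.com/nebu")` yields references ending in `nebu-gateway:1.0.0` and `nebu-core:1.0.0`; the registry host in this call is a placeholder.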

CI Release Stage

Two Kaniko jobs in the new release stage of .gitlab-ci.yml:

Both jobs:

  • Are gated by if: '$CI_COMMIT_TAG =~ /^v\d+\.\d+\.\d+$/' - stable SemVer only; RC/beta tags (v1.0.0-rc1) are excluded
  • Set interruptible: false and needs: [] - run independently of branch-only build/integration jobs
  • Use --cache=false - no reuse of branch-build Kaniko cache layers, ensuring deterministic and reproducible production images
  • Pass GIT_COMMIT, BUILD_TIME, and RELEASE_VERSION as --build-arg to the Dockerfile
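
The effect of the tag gate can be checked with a direct translation of the pattern. GitLab evaluates the rule with its own regex engine; the Python version below uses `[0-9]` and `fullmatch` in place of `\d` and the `^...$` anchors to get the same stable-SemVer-only behavior. The function name `is_release_tag` is hypothetical.

```python
import re

# Equivalent of /^v\d+\.\d+\.\d+$/ restricted to ASCII digits.
RELEASE_TAG = re.compile(r"v[0-9]+\.[0-9]+\.[0-9]+")


def is_release_tag(tag):
    """True only for stable SemVer tags such as 'v1.0.0'."""
    return RELEASE_TAG.fullmatch(tag) is not None
```

`is_release_tag("v1.0.0")` is true, while `v1.0.0-rc1`, `1.0.0`, and `v1.0` are all rejected.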

The workflow: block in .gitlab-ci.yml has been extended to allow tag pipelines:

Makefile Release Targets

make release validates that TAG matches vN.N.N before building. CI_REGISTRY_IMAGE defaults to registry.gitlab.com/philippb/open-chat and can be overridden for forks.

Operator Integration

Operators reference release images via the nebu_version variable in OpenTofu or image tags in Helm:

OpenTofu (all three platforms):

Helm:

The nebu_version variable in the nebu-core shared module already enforces a semver regex and rejects "latest" - this constraint remains in force for all release images.

Source: Story 13-11 staged changes - .gitlab-ci.yml, Makefile, deploy/helm/nebu/values.yaml, deploy/tofu/examples/*/terraform.tfvars.example; Story 13-14 (squashed seed migration - single 000001_init.up.sql, make dev-reset-db target, Stackit nebu_migrate/nebu_legacy backward-compat users)