Network Policies

Every tenant namespace gets a default-deny baseline plus a small allow-list: pods can reach DNS, traefik can reach pods on 80/443/8080, OTel telemetry leaves through the observability namespace, HTTPS to the open internet works, and pods inside the same tenant can talk to each other. Anything outside that list is dropped at the CNI layer.

The 7 NetworkPolicies are computed by paas_deploy::network_policy::build_default_network_policies (a pure function) and applied via ensure_tenant_network_policies(client, ns, tenant_id), which uses server-side apply with field manager paas-control-plane so that a re-run after a hand-edit re-converges to the canonical spec.

The 7 default policies

| # | Name | Effect |
|---|------|--------|
| 1 | default-deny-all | policyTypes: [Ingress, Egress], no rules; everything is denied unless another policy allows it |
| 2 | allow-dns-egress | egress to kube-system on UDP/TCP port 53 |
| 3 | allow-traefik-ingress | ingress from the traefik namespace on TCP 80 / 443 / 8080 |
| 4 | allow-otel-egress | egress to the observability namespace on TCP 4318 (OTLP/HTTP) |
| 5 | allow-https-egress | egress to 0.0.0.0/0 on TCP 443 (external API calls) |
| 6 | allow-intra-tenant | symmetric Ingress + Egress; both halves are required because rule 1 blocks egress too. podSelector: {} matches same-namespace pods only |
| 7 | deny-paas-system-egress | egress with namespaceSelector.matchExpressions[NotIn paas-system]; partial backbone isolation (see "K8s NetworkPolicy semantic limit" below) |

Tenant isolation comes from the namespace boundary plus the namespaceSelector clauses in rules 2/3/4: a pod in paas-tenant-acme cannot reach a pod in paas-tenant-foo unless a NetworkPolicy allows the traffic, and none of rules 1 to 6 does. Rule 7 weakens this on the egress side (see "K8s NetworkPolicy semantic limit" below).

Why allow-intra-tenant matters

Without rule 6, a Procfile like

web: node server.js
worker: node worker.js

would silently break: rule 1 (default-deny-all) drops intra-namespace traffic too, so web calling worker:8080 over the in-cluster Service would time out. Rule 6 reopens that path without weakening cross-tenant isolation — podSelector: {} matches pods in the same namespace only, never pods in another tenant's namespace.

Add-on connectivity

Add-ons (Postgres, Redis, OpenSearch) live in their own namespace and get a per-app allow-app-{type} policy emitted by build_addon_connect_policy(addon_type, addon_ns, app_ns):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-postgres
  namespace: addon-postgres-…
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: paas-tenant-acme
      ports:
        - protocol: TCP
          port: 5432

Ports follow the addon type (postgres → 5432, redis → 6379, opensearch → 9200); unknown types fall back to 80.
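The port selection described above can be sketched as a small pure function. This is a minimal illustration, not the platform's code: addon_port and addon_policy_name are hypothetical names, and exact, case-sensitive matching of the add-on type is an assumption.

```rust
/// Hypothetical helper: maps an add-on type to the TCP port its
/// allow-app-{type} policy opens. The mapping and the fallback to 80
/// come from the text above; case-sensitive matching is an assumption.
fn addon_port(addon_type: &str) -> u16 {
    match addon_type {
        "postgres" => 5432,
        "redis" => 6379,
        "opensearch" => 9200,
        // Unknown add-on types fall back to port 80.
        _ => 80,
    }
}

/// Hypothetical helper: builds the allow-app-{type} policy name.
fn addon_policy_name(addon_type: &str) -> String {
    format!("allow-app-{}", addon_type)
}
```

Keeping the mapping in one match arm list makes the silent fallback to 80 easy to spot when a new add-on type is introduced.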

Idempotency

ensure_tenant_network_policies uses server-side apply (Patch::Apply, with force() set on the PatchParams) under the platform's field manager. Consequences:

  • Re-running on a tenant that already has the policies applies zero changes (Kubernetes diffs the desired state against the manager's last-applied set).
  • An operator hand-edit on a managed field is taken back on the next reconcile (the platform owns the spec).
  • Hand-edits on other fields (annotations, labels not in our set) are preserved — server-side apply only owns what it sets.

Drift detection

The control-plane helper paas_control_plane::network_policy_helper::policy_names_for_tenant(tenant, ns) returns the 7 expected names (cycle 2). A drift-check job planned for a future cycle will compare this list against kubectl get networkpolicy -n <ns> and flag missing or extra entries. Cycles 1+2 ship the contract; the scheduled job is out of scope.
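Since the scheduled job does not exist yet, the comparison it is meant to perform can only be sketched. The version below is one possible shape, under assumptions: Drift and detect_drift are hypothetical names, and EXPECTED_POLICIES simply copies the seven names from the table of default policies rather than calling policy_names_for_tenant.

```rust
use std::collections::BTreeSet;

/// The seven default policy names, copied from the table of default
/// policies (an assumption: the real helper may derive them differently).
const EXPECTED_POLICIES: [&str; 7] = [
    "default-deny-all",
    "allow-dns-egress",
    "allow-traefik-ingress",
    "allow-otel-egress",
    "allow-https-egress",
    "allow-intra-tenant",
    "deny-paas-system-egress",
];

/// Hypothetical drift report: expected-but-absent and present-but-unexpected names.
#[derive(Debug, PartialEq)]
struct Drift {
    missing: Vec<String>,
    extra: Vec<String>,
}

/// Compares expected policy names with those observed in the namespace.
/// Both result lists come back sorted because BTreeSet iterates in order.
fn detect_drift(expected: &[&str], observed: &[&str]) -> Drift {
    let expected: BTreeSet<&str> = expected.iter().copied().collect();
    let observed: BTreeSet<&str> = observed.iter().copied().collect();
    Drift {
        missing: expected.difference(&observed).map(|s| s.to_string()).collect(),
        extra: observed.difference(&expected).map(|s| s.to_string()).collect(),
    }
}
```

The observed list would come from the names returned for the tenant namespace. This sketch compares names only, so a policy whose spec was edited in place would not show up as drift.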

Operator recipe — one-shot apply

use paas_deploy::network_policy::ensure_tenant_network_policies;
let client = kube::Client::try_default().await?;
ensure_tenant_network_policies(&client, "paas-tenant-acme", "acme").await?;

Applied at tenant-namespace creation (cycle 2 wires it into pg_app_service::create_tenant_namespace); the drift-check job, once it ships, is expected to re-run it on a schedule.

K8s NetworkPolicy semantic limit (rule 7)

Rule 7 looks like a "deny" but K8s NetworkPolicy is union-of-allows only — there's no native deny primitive. A rule with namespaceSelector.matchExpressions[NotIn paas-system] reads as "allow egress to every namespace except paas-system", which:

  • ✅ blocks ports not covered by any other allow (e.g. :8080 to paas-system pods — verified LIVE in bilans/ad33-cycle2-smoke-isolation.md).
  • ⚠️ does not block :443 to paas-system, because rule 5 (allow-https-egress on 0.0.0.0/0:443) already covers it and the union still allows.
  • ⚠️ adds permission for ports/namespaces previously not in any allow (e.g. :80 cross-tenant): the smoke caught this for web → server-b:80.

Net: rule 7 is a partial backbone isolation. It's better than nothing for non-443 traffic to paas-system, but it's not the hard guarantee its name suggests.
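The union-of-allows behavior can be reproduced with a toy model. The sketch below is a simplification under stated assumptions: it covers only the egress side (real delivery also needs an ingress allow at the destination), it models only rules 5 and 7, it treats 0.0.0.0/0 as matching in-cluster destinations as the behavior above implies, and it assumes rule 7 carries no port restriction. All type and function names are hypothetical.

```rust
/// Simplified destination selector for an egress allow rule.
enum Dest {
    /// Any address, standing in for an ipBlock of 0.0.0.0/0.
    AnyIp,
    /// Every namespace except the named one (the NotIn expression).
    NamespaceNotIn(&'static str),
}

/// One egress allow rule. `port: None` means "all ports".
struct EgressAllow {
    dest: Dest,
    port: Option<u16>,
}

/// Union-of-allows: traffic passes if ANY rule matches. No rule can
/// subtract from what another rule already permits.
fn egress_allowed(rules: &[EgressAllow], dest_ns: &str, port: u16) -> bool {
    rules.iter().any(|r| {
        let dest_ok = match &r.dest {
            Dest::AnyIp => true,
            Dest::NamespaceNotIn(excluded) => dest_ns != *excluded,
        };
        let port_ok = r.port.map_or(true, |p| p == port);
        dest_ok && port_ok
    })
}

/// Rule 5 (allow-https-egress): any destination on TCP 443.
fn rule5() -> EgressAllow {
    EgressAllow { dest: Dest::AnyIp, port: Some(443) }
}

/// Rule 7 (deny-paas-system-egress), assumed to have no port restriction.
fn rule7() -> EgressAllow {
    EgressAllow { dest: Dest::NamespaceNotIn("paas-system"), port: None }
}
```

With both rules loaded, the model blocks paas-system on :8080, allows paas-system on :443, and allows another tenant's namespace on :80. With rule 5 alone, that last case is blocked, which is the added permission the smoke test caught.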

Phase 2 — Cilium Hubble audit + CiliumNetworkPolicy deny

The production cluster runs Cilium, which ships two pieces beyond vanilla K8s:

  • Hubble — flow log of every connection attempt, allowed and denied. hubble observe --namespace paas-tenant-acme --verdict DROPPED --since 10m gives the on-call a real-time view of which pod tried to reach what and got blocked. The platform doesn't ingest Hubble flows yet; cycle 3 will pipe a filtered subset to the operator dashboard.

  • CiliumNetworkPolicy with explicit deny: — the proper fix for rule 7. The rewritten 7th rule would look like:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: deny-paas-system
spec:
  endpointSelector: {}
  egressDeny:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: paas-system

egressDeny is evaluated before any allow, so it overrides rule 5 (:443 external) and gives the hard guarantee. A follow-up cycle will switch the platform's per-tenant policy emitter to CNP when the cluster advertises the cilium.io/v2 API.
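The difference from the union-of-allows model is the evaluation order. A minimal sketch of deny-before-allow follows; it is not Cilium's implementation. Verdict and evaluate are hypothetical names, denied_namespaces stands in for the egressDeny selectors, and allowed_ports stands in for allow rules that match any destination (such as rule 5).

```rust
/// Outcome of one egress attempt in this simplified model.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allowed,
    Denied,
}

/// Deny-before-allow evaluation: a matching deny wins regardless of allows.
fn evaluate(
    denied_namespaces: &[&str],
    allowed_ports: &[u16],
    dest_ns: &str,
    port: u16,
) -> Verdict {
    // Step 1: deny rules are checked first and cannot be overridden.
    if denied_namespaces.iter().any(|ns| *ns == dest_ns) {
        return Verdict::Denied;
    }
    // Step 2: only then is the union of allow rules consulted.
    if allowed_ports.iter().any(|p| *p == port) {
        return Verdict::Allowed;
    }
    // Step 3: nothing matched, so the default-deny baseline applies.
    Verdict::Denied
}
```

In this model, :443 to paas-system is denied even though an allow covers port 443, while :443 to any other destination still passes.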

CNI requirement

NetworkPolicies are enforced only when the cluster's networking layer supports them; Calico, Cilium, and Weave Net all do. Flannel, the default in K3s, does not enforce NetworkPolicies by itself: K3s relies on its embedded network policy controller for that, so a cluster with that controller disabled gets no enforcement. The production cluster runs Cilium, so the policies bite. The dev / sandbox cluster is fine to ship without enforcement; rules are declarative and become active the moment the CNI starts honoring them.

  • Custom Domain: allow-traefik-ingress is what lets the public Ingress reach app pods
  • TLS Auto: cert-manager's HTTP-01 challenge goes through the same traefik namespace
  • Apps: per-tenant namespace model the policies hang off