# Network Policies
Every tenant namespace gets a default-deny baseline plus a small
allow-list: pods can reach DNS, traefik can reach pods on
80/443/8080, OTel telemetry leaves through the observability
namespace, HTTPS to the open internet works, and pods inside the
same tenant can talk to each other. Anything outside that list is
dropped at the CNI layer.
The 7 NetworkPolicies are computed by
paas_deploy::network_policy::build_default_network_policies (a
pure function) and applied via
ensure_tenant_network_policies(client, ns, tenant_id) —
server-side apply with field manager paas-control-plane so a
re-run after a hand-edit re-converges to the canonical spec.
## The 7 default policies
| # | Name | Effect |
|---|---|---|
| 1 | `default-deny-all` | `policyTypes: [Ingress, Egress]`, no rules: everything is denied unless another policy allows it |
| 2 | `allow-dns-egress` | egress to kube-system on UDP/TCP port 53 |
| 3 | `allow-traefik-ingress` | ingress from the traefik namespace on TCP 80 / 443 / 8080 |
| 4 | `allow-otel-egress` | egress to the observability namespace on TCP 4318 (OTLP/HTTP) |
| 5 | `allow-https-egress` | egress to 0.0.0.0/0 on TCP 443 (external API calls) |
| 6 | `allow-intra-tenant` | Ingress + Egress, symmetric; both halves are required because rule 1 blocks egress too. `podSelector: {}` matches same-namespace pods only |
| 7 | `deny-paas-system-egress` | egress with `namespaceSelector.matchExpressions[NotIn paas-system]`; partial backbone isolation (see "K8s NetworkPolicy semantic limit" below) |
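To make the "pure function" point concrete, here is a hedged Rust sketch that models the table as data. `PolicyType`, `PolicySummary` and `default_policy_summaries` are hypothetical names, not the real `build_default_network_policies` signature, and the per-rule policyTypes are inferred from the Effect column rather than read from the emitted specs.

```rust
/// The direction(s) a policy constrains, i.e. its `policyTypes`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PolicyType {
    Ingress,
    Egress,
}

/// Stand-in for one NetworkPolicy: only its name and policyTypes (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
struct PolicySummary {
    name: &'static str,
    policy_types: &'static [PolicyType],
}

/// Hypothetical pure builder: no I/O and no cluster access, so every call
/// returns the same seven entries in table order.
fn default_policy_summaries() -> Vec<PolicySummary> {
    use PolicyType::{Egress, Ingress};
    vec![
        PolicySummary { name: "default-deny-all", policy_types: &[Ingress, Egress] },
        PolicySummary { name: "allow-dns-egress", policy_types: &[Egress] },
        PolicySummary { name: "allow-traefik-ingress", policy_types: &[Ingress] },
        PolicySummary { name: "allow-otel-egress", policy_types: &[Egress] },
        PolicySummary { name: "allow-https-egress", policy_types: &[Egress] },
        PolicySummary { name: "allow-intra-tenant", policy_types: &[Ingress, Egress] },
        PolicySummary { name: "deny-paas-system-egress", policy_types: &[Egress] },
    ]
}
```

Because the builder has no side effects, comparing two calls is a cheap way to pin the canonical set in a unit test.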
Tenant isolation comes from the namespace boundary plus the
namespaceSelector clauses in rules 2/3/4: a pod in
paas-tenant-acme can NOT reach a pod in paas-tenant-foo because
the receiving namespace admits ingress only from traefik (rule 3)
and from its own pods (rule 6).
## Why allow-intra-tenant matters
Without rule 6, a Procfile with a web process that calls a worker
process would silently break: rule 1 (default-deny-all) drops
intra-namespace traffic too, so web calling worker:8080 over
the in-cluster Service would time out. Rule 6 reopens that path
without weakening cross-tenant isolation — podSelector: {}
matches pods in the same namespace only, never pods in another
tenant's namespace.
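A toy Rust model of why both halves of rule 6 are needed: under default-deny, a call between two pods succeeds only if the sender's namespace allows the egress and the receiver's namespace allows the ingress. `IntraTenantRule` and `pod_to_pod_allowed` are hypothetical, and only rule 6 is modeled (the DNS lookup for the worker Service, traefik and the other rules are ignored).

```rust
/// Rule 6 split into its two halves so either can be switched off (toy model).
#[derive(Debug, Clone, Copy)]
struct IntraTenantRule {
    egress_half: bool,
    ingress_half: bool,
}

/// Toy model of a pod-to-pod call under default-deny-all (rule 1).
/// `policies_for` returns the rule-6 state configured in a namespace.
fn pod_to_pod_allowed(
    src_ns: &str,
    dst_ns: &str,
    policies_for: impl Fn(&str) -> IntraTenantRule,
) -> bool {
    // podSelector: {} selects peers in the policy's own namespace only,
    // so rule 6 never opens a path to another tenant's namespace.
    if src_ns != dst_ns {
        return false;
    }
    // The sender must be allowed to send AND the receiver allowed to receive.
    policies_for(src_ns).egress_half && policies_for(dst_ns).ingress_half
}
```

Switching off either half, or placing the two pods in different namespaces, makes the call fail in this model.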
## Add-on connectivity
Add-ons (Postgres, Redis, OpenSearch) live in their own namespace
and get a per-app allow-app-{type} policy emitted by
build_addon_connect_policy(addon_type, addon_ns, app_ns):
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-postgres
  namespace: addon-postgres-…
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: paas-tenant-acme
      ports:
        - protocol: TCP
          port: 5432
```
Ports follow the addon type (postgres → 5432, redis → 6379,
opensearch → 9200); unknown types fall back to 80.
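The port mapping reads naturally as a single match. This is a hypothetical sketch assuming `addon_type` arrives as a lowercase string; `addon_port` and `addon_policy_name` are illustrative helpers, not the signature of `build_addon_connect_policy`.

```rust
/// Hypothetical helper mirroring the documented port table.
/// Assumes add-on types arrive as lowercase strings.
fn addon_port(addon_type: &str) -> u16 {
    match addon_type {
        "postgres" => 5432,
        "redis" => 6379,
        "opensearch" => 9200,
        // Unknown add-on types fall back to port 80.
        _ => 80,
    }
}

/// Hypothetical helper for the per-app policy name, `allow-app-{type}`.
fn addon_policy_name(addon_type: &str) -> String {
    format!("allow-app-{addon_type}")
}
```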
## Idempotency
ensure_tenant_network_policies uses server-side apply:
Patch::Apply(...) with PatchParams::apply(...).force() under the platform's field manager.
Consequences:
- Re-running on a tenant that already has the policies applies zero changes (Kubernetes diffs the desired state against the manager's last-applied set).
- An operator hand-edit on a managed field is reverted on the next reconcile (the platform owns the spec).
- Hand-edits on other fields (annotations, labels not in our set) are preserved — server-side apply only owns what it sets.
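The three consequences can be illustrated with a deliberately simplified Rust model that treats an object as a flat map of field paths. This is not the API server's algorithm: `apply_forced` is hypothetical, and it ignores `managedFields` bookkeeping (for example, real server-side apply also removes fields a manager stops setting).

```rust
use std::collections::BTreeMap;

/// Toy model of a forced server-side apply: the field manager owns exactly
/// the keys in `desired`. Returns true if the live object changed.
fn apply_forced(live: &mut BTreeMap<String, String>, desired: &BTreeMap<String, String>) -> bool {
    let mut changed = false;
    for (field, value) in desired {
        // Owned field: overwrite any drift (force = take over conflicting edits).
        if live.get(field) != Some(value) {
            live.insert(field.clone(), value.clone());
            changed = true;
        }
    }
    // Fields the manager does not set (e.g. an operator's annotation) are untouched.
    changed
}
```

A second apply with the same desired map reports no change, a hand-edit to an owned field is overwritten, and an extra annotation survives.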
## Drift detection
The control-plane helper
paas_control_plane::network_policy_helper::policy_names_for_tenant(tenant, ns)
returns the 7 expected names (cycle 2). A future cycle's drift-check
job will compare this against kubectl get networkpolicy -n <ns> and
flag missing / extra rows. Cycles 1+2 ship the contract; the
scheduled job is out of scope.
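A minimal sketch of the missing/extra comparison such a job could perform, assuming the expected names come from `policy_names_for_tenant` and the live names from `kubectl get networkpolicy -n <ns>`. `Drift` and `diff_policy_names` are hypothetical; the real job may report differently.

```rust
use std::collections::BTreeSet;

/// Result of comparing expected policy names with what is live in a namespace.
#[derive(Debug, PartialEq)]
struct Drift {
    /// Expected by the platform but not found in the cluster.
    missing: Vec<String>,
    /// Found in the cluster but not part of the expected set.
    extra: Vec<String>,
}

impl Drift {
    fn is_clean(&self) -> bool {
        self.missing.is_empty() && self.extra.is_empty()
    }
}

/// Hypothetical drift check: set difference in both directions, sorted by name.
fn diff_policy_names(expected: &[&str], live: &[&str]) -> Drift {
    let expected: BTreeSet<&str> = expected.iter().copied().collect();
    let live: BTreeSet<&str> = live.iter().copied().collect();
    Drift {
        missing: expected.difference(&live).map(|name| name.to_string()).collect(),
        extra: live.difference(&expected).map(|name| name.to_string()).collect(),
    }
}
```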
## Operator recipe: one-shot apply
```rust
use paas_deploy::network_policy::ensure_tenant_network_policies;

// Runs inside an async fn returning a Result that both error types convert into.
let client = kube::Client::try_default().await?;
ensure_tenant_network_policies(&client, "paas-tenant-acme", "acme").await?;
```
Applied at tenant-namespace creation (cycle 2 wires it into
pg_app_service::create_tenant_namespace); the drift-check job
will re-run it on a schedule once it ships.
## K8s NetworkPolicy semantic limit (rule 7)
Rule 7 looks like a "deny" but K8s NetworkPolicy is union-of-allows
only — there's no native deny primitive. A rule with
namespaceSelector.matchExpressions[NotIn paas-system] reads as
"allow egress to every namespace except paas-system", which:
- ✅ blocks ports not covered by any other allow (e.g. `:8080` to paas-system pods; verified LIVE in `bilans/ad33-cycle2-smoke-isolation.md`).
- ⚠️ does not block `:443` to paas-system, because rule 5 (`allow-https-egress` on `0.0.0.0/0:443`) already covers it and the union still allows it.
- ⚠️ adds permission for ports/namespaces previously not in any allow (e.g. `:80` cross-tenant): the smoke caught this for `web → server-b:80`.
Net: rule 7 is a partial backbone isolation. It's better than
nothing for non-443 traffic to paas-system, but it's not the
hard guarantee its name suggests.
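The bullets above fall out of a small Rust model of union-of-allows evaluation for a tenant pod's egress. It is a sketch under stated assumptions: `egress_allows` is hypothetical, protocols are ignored, only the egress side is modeled (the destination's ingress policies still apply in a real cluster), and rule 5's `0.0.0.0/0` block is assumed to match in-cluster pod IPs, as the smoke test observed.

```rust
/// Toy model of vanilla NetworkPolicy egress from a tenant pod: the flow is
/// allowed if ANY rule matches (union of allows); there is no deny primitive.
/// `dst_ns` is None for destinations outside the cluster.
/// Returns the names of the rules that admit the flow (empty = dropped).
fn egress_allows(src_ns: &str, dst_ns: Option<&str>, port: u16) -> Vec<&'static str> {
    let mut allows = Vec::new();
    if dst_ns == Some("kube-system") && port == 53 {
        allows.push("allow-dns-egress");
    }
    if dst_ns == Some("observability") && port == 4318 {
        allows.push("allow-otel-egress");
    }
    // Assumption: the 0.0.0.0/0 ipBlock also matches in-cluster pod IPs.
    if port == 443 {
        allows.push("allow-https-egress");
    }
    if dst_ns == Some(src_ns) {
        allows.push("allow-intra-tenant");
    }
    // Rule 7: "NotIn paas-system" ALLOWS every other namespace, on any port.
    if matches!(dst_ns, Some(ns) if ns != "paas-system") {
        allows.push("deny-paas-system-egress");
    }
    allows
}
```

In this model, `:8080` to paas-system matches nothing, `:443` to paas-system is admitted by `allow-https-egress` alone, and `:80` to another tenant is admitted by rule 7 itself.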
## Phase 2: Cilium Hubble audit + CiliumNetworkPolicy deny
The production cluster runs Cilium, which ships two pieces beyond vanilla K8s:
- Hubble: a flow log of every connection attempt, allowed and denied. `hubble observe --namespace paas-tenant-acme --verdict DROPPED --since 10m` gives the on-call a real-time view of which pod tried to reach what and got blocked. The platform doesn't ingest Hubble flows yet; cycle 3 will pipe a filtered subset to the operator dashboard.
- CiliumNetworkPolicy with explicit deny rules: the proper fix for rule 7. The rewritten 7th rule would look like:
```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: deny-paas-system
spec:
  endpointSelector: {}
  egressDeny:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: paas-system
```
egressDeny is evaluated before any allow, so it overrides
rule 5 (:443 external) and gives the hard guarantee. A
follow-up cycle will switch the platform's per-tenant policy
emitter to CNP when the cluster serves the cilium.io/v2 API.
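For contrast with the union-only model, a toy Rust sketch of deny-before-allow precedence. `EgressRule` and `cilium_egress_allowed` are hypothetical, and rule 5 is approximated as "any destination on 443" rather than a real `ipBlock`.

```rust
/// One egress rule; a `None` field means "any" (toy model).
struct EgressRule {
    to_namespace: Option<&'static str>,
    port: Option<u16>,
}

impl EgressRule {
    fn matches(&self, dst_ns: &str, port: u16) -> bool {
        self.to_namespace.map_or(true, |ns| ns == dst_ns) && self.port.map_or(true, |p| p == port)
    }
}

/// Toy verdict with Cilium-style precedence: any matching deny wins;
/// otherwise the flow needs at least one matching allow.
fn cilium_egress_allowed(denies: &[EgressRule], allows: &[EgressRule], dst_ns: &str, port: u16) -> bool {
    if denies.iter().any(|r| r.matches(dst_ns, port)) {
        return false;
    }
    allows.iter().any(|r| r.matches(dst_ns, port))
}
```

With the deny rule present, `:443` to paas-system is refused even though the approximated rule 5 matches; other namespaces on 443 are unaffected.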
## CNI requirement
NetworkPolicies are enforced only when the cluster's CNI supports them; Calico, Cilium and weave-net all do. Flannel, K3s's default CNI, does not enforce NetworkPolicies on its own (K3s pairs it with an embedded network-policy controller unless started with `--disable-network-policy`); the production cluster runs Cilium so the policies bite. The dev / sandbox cluster is fine to ship without enforcement; rules are declarative and become active the moment the CNI starts honoring them.
## Related
- Custom Domain: `allow-traefik-ingress` is what lets the public Ingress reach app pods
- TLS Auto: cert-manager's HTTP-01 challenge goes through the same `traefik` namespace
- Apps: the per-tenant namespace model the policies hang off