---
layout: :theme/post
title: "A single dashboard for your cluster network with Network Health"
description: "Network Health in NetObserv: built-in rules, alerts vs recording rules, custom PrometheusRules, and an Istio-based demo surfacing 5xx errors across services."
tags: network,health,observability,prometheus,alerts,recording,rules,dashboard,istio,bookinfo
authors: [lberetta]
---

Understanding the health of your cluster network is not always straightforward.

Issues like packet drops, DNS failures, or policy denials often require digging through multiple dashboards, metrics, or logs before you can even identify where the problem is.

**Network Health in NetObserv aims to simplify this** by surfacing these signals in a single, unified view.

NetObserv now features a dedicated **Network Health** section designed to provide a high-level overview of your cluster's networking status. This interface relies on a set of predefined health rules that automatically surface potential issues by analyzing NetObserv metrics.

Out of the box, assuming the corresponding eBPF agent features are enabled (for example, `DNSTracking` for the DNS-related rules), these rules monitor several key signals such as:
- **DNS errors and NXDOMAIN responses**
- **packet drops**
- **network policy denials**
- **latency trends**
- **ingress errors**

These built-in rules provide immediate diagnostic value without requiring users to write complex PromQL queries.

But in real-world environments, every application behaves differently. What is considered “healthy” for one workload might not apply to another.

This is where Network Health becomes particularly powerful: it allows you to define **custom health rules** tailored to the specific behavior and expectations of your applications.

The dashboard is organized by scope: **Global**, **Nodes**, **Namespaces**, and **Workloads**. The tab counts show how many items you have in each scope, so you know at a glance where to look.

You can find Network Health in the NetObserv console, either standalone or in the OpenShift console under **Observe > Network Traffic**.

The following images describe some health rules in two different scopes:

![Alert rule in Network Health](alert.png)
*Alert rule showing as pending or firing in the dashboard at the Namespace scope*
![Recording rule in Network Health](recording-rule.png)
*Recording rule continuously tracking metric values across severity thresholds at the Global scope*

## Understanding Health Rules: Alerts vs Recording Rules

Behind the scenes, the Network Health section is powered by **PrometheusRule** resources. NetObserv supports two different rule modes, each designed for a different monitoring strategy.
### Alert mode

**Alert rules** trigger when a metric exceeds a defined threshold.

For example: *Packet loss > 10%*

These rules are useful for detecting immediate issues that require action, and they integrate with the existing Prometheus and Alertmanager alerting pipeline. In the Network Health dashboard, alert rules appear when they are **pending** (the threshold has been crossed but not yet sustained for the duration configured in the rule's `for` field) or **actively firing**.
### Recording mode

**Recording rules** continuously compute and store metric values in Prometheus without generating alerts.

In the Network Health dashboard, these metrics become visible as soon as the value reaches the lowest configured severity threshold (for example the *info* level). As the value evolves, the rule may move between *info*, *warning*, and *critical* states according to the thresholds defined in its configuration.

Recording rules are particularly useful for:

- continuously monitoring health indicators
- tracking performance trends over time
- reducing alert fatigue

### When to use each

In practice:

- Use **alert rules** when you need to be notified of immediate issues
- Use **recording rules** when you want continuous visibility into how a metric evolves over time

For Network Health, recording rules are often a better fit, as they allow you to observe degradation trends before they become critical. Note that for a given rule in the FlowCollector configuration, you choose one mode or the other; if you want both behaviors for the same signal, you can keep the built-in alert and add your own recording rule carrying the `netobserv` label.

## Health in the topology

Network Health is also integrated with the **Topology** view.

When you select a node, namespace, or workload, the side panel can display a **Health** tab if there are active violations. This allows you to move seamlessly from a high-level signal (for example, “this namespace has DNS issues”) to a contextual view of the affected resources.

![Topology view with health violations](topology.png)
*Topology side panel showing health violations for a selected resource*

## Configuring custom health rules

Custom health rules can be integrated into the Network Health dashboard by creating a **PrometheusRule** resource.

You can define:

- **custom alert rules**, for event-driven detection
- **custom recording rules**, for continuous visibility
- or a combination of both

The way metadata is attached differs between alert and recording rules, as the CRD treats them differently.

### Custom alerts

Alert rules allow annotations directly on each rule. This is where you define:

- `summary`
- `description`
- optionally `netobserv_io_network_health`

The `netobserv_io_network_health` annotation contains a JSON string describing how the signal should appear in the dashboard (unit, thresholds, scope, etc.).
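To make this concrete, here is a minimal sketch of a custom alert rule. The alert name, the metric `my_ingress_error_percent`, and the label names are hypothetical placeholders; adapt them to your own metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-alert-rules
  namespace: netobserv
  labels:
    netobserv: "true"
spec:
  groups:
    - name: MyAlerts
      rules:
        # Hypothetical alert: fires after the expression stays true for 5 minutes
        - alert: HighIngressErrorRate
          expr: my_ingress_error_percent > 10
          for: 5m
          labels:
            netobserv: "true"
            severity: warning
          annotations:
            summary: "Ingress error rate above 10%"
            description: "Namespace {{ $labels.namespace }} has a high ingress error rate."
            netobserv_io_network_health: '{"namespaceLabels":["namespace"],"threshold":"10","unit":"%"}'
```

During the `for: 5m` window the alert shows as *pending* in the dashboard, then moves to *firing* once the threshold has been sustained.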

### Custom recording rules

Recording rules do not support annotations at the rule level. Instead, NetObserv requires a single annotation on the **PrometheusRule metadata**:

`netobserv.io/network-health`

This annotation is a JSON object that acts as a map:

- **keys** → metric names (matching the `record:` field)
- **values** → metadata (summary, description, thresholds, etc.)

Each recorded metric must have a corresponding entry in this map, as this is how Network Health associates metadata with the metric.

In both cases, you must include the label:

```yaml
netobserv: "true"
```

on both the `PrometheusRule` metadata and each rule's `labels`. Note that the value must be the quoted string `"true"`, not a YAML boolean.

### Example: custom recording rule

The following example defines a simple recording rule and shows it in the Global tab with custom thresholds:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-recording-rules
  namespace: netobserv
  labels:
    netobserv: "true"
  annotations:
    netobserv.io/network-health: |
      {
        "my_simple_number": {
          "summary": "Test metric (value: {{ $value }})",
          "description": "Numeric value to test thresholds.",
          "netobserv_io_network_health": "{\"unit\":\"\",\"upperBound\":\"100\",\"recordingThresholds\":{\"info\":\"10\",\"warning\":\"25\",\"critical\":\"50\"}}"
        }
      }
spec:
  groups:
    - name: SimpleNumber
      interval: 30s
      rules:
        - record: my_simple_number
          expr: vector(25)
          labels:
            netobserv: "true"
```

In the embedded `netobserv_io_network_health` JSON, `unit` is the display unit shown in the dashboard (empty here, since this metric is a plain number), `upperBound` sets the upper end of the displayed gauge, and `recordingThresholds` maps each severity level to the value at which it is reached. All values in this JSON are passed as strings, including numbers.

While this example is intentionally simple, the same mechanism applies to more complex metrics, including real application signals.

So far, we've looked at how Network Health works and how to extend it.

Let’s now put this into practice with a concrete example.

## Demo: Surfacing service failures with Network Health

Let’s walk through a realistic scenario.

Imagine you're running a microservices application (bookinfo) in your cluster using a service mesh like Istio. Everything looks healthy at first glance, but suddenly users start reporting that some parts of the application are failing intermittently.

Now the question becomes:

> *How do you make this visible at a glance for cluster administrators, without digging into Prometheus queries?*

This is exactly where **Network Health** comes into play.

### Step 1 — Define the health signal

We want to continuously track the **percentage of 5xx errors** affecting services in the application, and surface it directly in the **Network Health dashboard**.

Since we are running with Istio, we can rely on the standard metric:

`istio_requests_total`

This metric is emitted by the **Envoy sidecar proxies**, which means it captures traffic *at the network layer*, independently of the application itself.

In this example, we compute the error rate using the **`reporter="source"`** perspective.

This is an important detail:

- With Istio, metrics can be reported from the **source** or the **destination**
- Using `reporter="source"` ensures we capture **failed requests even when they are not successfully handled by the destination workload** (for example, connection failures, early aborts, or fault injections)
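Before wiring this expression into a rule, it is worth validating the PromQL on its own, for example in the Prometheus or OpenShift console query UI. A sketch of the ratio we want, assuming the Istio metrics are scraped by your Prometheus instance:

```promql
sum(rate(istio_requests_total{reporter="source", response_code=~"5.."}[5m]))
  by (destination_service, destination_service_name, destination_service_namespace)
/
sum(rate(istio_requests_total{reporter="source"}[5m]))
  by (destination_service, destination_service_name, destination_service_namespace)
* 100
```

Once this returns sensible per-service percentages, the expression can be embedded in the recording rule.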

We use the following **recording rule**:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: bookinfo-service-5xx-network-health
  namespace: bookinfo
  labels:
    netobserv: "true"
  annotations:
    netobserv.io/network-health: |
      {
        "bookinfo_service_5xx_rate_percent": {
          "summary": "Service {{ $labels.destination_service_name }} is generating {{ $value }}% of 5xx errors",
          "description": "Percentage of HTTP 5xx server errors for requests to the {{ $labels.destination_service_name }} service, measured from source reporter over a 5-minute window.",
          "netobserv_io_network_health": "{\"unit\":\"%\",\"upperBound\":\"100\",\"namespaceLabels\":[\"destination_service_namespace\"],\"workloadLabels\":[\"destination_service_name\"],\"recordingThresholds\":{\"info\":\"1\",\"warning\":\"25\",\"critical\":\"90\"}}"
        }
      }
spec:
  groups:
    - name: bookinfo-service-5xx
      interval: 30s
      rules:
        - record: bookinfo_service_5xx_rate_percent
          expr: |
            (
              sum(rate(istio_requests_total{reporter="source", response_code=~"5.."}[5m])) by (destination_service, destination_service_name, destination_service_namespace)
              /
              sum(rate(istio_requests_total{reporter="source"}[5m])) by (destination_service, destination_service_name, destination_service_namespace)
              * 100
            )
          labels:
            netobserv: "true"
```
Unlike a service-specific rule, this version does not filter on a single destination.
Instead, it captures 5xx errors across all services, allowing Network Health to surface multiple affected workloads.

### Step 2 — Simulate a real failure

To reproduce the issue, we inject a fault using Istio.

In this case, we force **100% of requests to the reviews service** to return HTTP 500 errors:

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: reviews-fault-500
  namespace: bookinfo
spec:
  hosts:
    - reviews
  http:
    - fault:
        abort:
          percentage:
            value: 100
          httpStatus: 500
      route:
        - destination:
            host: reviews
```

Now the application is effectively broken from the user’s perspective.

### Step 3 — Generate traffic

To observe the effect, we generate traffic through the application:

```bash
for i in {1..100}; do curl -s http://<bookinfo-url>/productpage > /dev/null; done
```

At this point:

- Requests are flowing through the Istio data plane
- The Envoy proxies are emitting metrics
- All calls to reviews are failing

### Step 4 — Observe Network Health

After a short delay (typically 1–2 minutes), the recording rule is evaluated.

Now, head to Network Health:

You should see:

- The bookinfo namespace marked as critical
- A health indicator showing the 5xx error rate
- The issue surfaced automatically, without querying Prometheus

![Network Health showing 5xx errors](reviews-5xx.png)
*The **bookinfo** namespace marked as critical in Network Health, surfacing a high (up to 100%) percentage of HTTP 5xx errors across services without requiring manual queries.*

### Step 5 — Drill down into the issue

From here, you can:

- Navigate to **Topology** and select the `reviews` service
- Inspect the health signal in context

![Topology view showing health issue](reviews-5xx-topology.png)
*From Network Health to Topology: selecting the **bookinfo** namespace reveals the same critical 5xx error signal in context.*

This allows you to go from:

> “Something is wrong in this namespace”

to:

> “The affected service can be quickly identified as generating 5xx errors”

in just a few clicks.

## Wrapping it up

We've seen:

- What the Network Health dashboard is and how it surfaces built-in rules (DNS, packet drops, latency, ingress errors, and more).
- The difference between **alert** and **recording** rules, and when to use each.
- How to configure custom health rules (alerts and recording rules) so they appear in the dashboard.
- A **BookInfo** walkthrough: **`PrometheusRule`** with Istio metrics plus **VirtualService** fault injection (**100% / HTTP 500** on **reviews**); **Network Health → Namespaces** marks **bookinfo** as **critical** showing the HTTP 5xx error rate.

Ultimately, Network Health helps bridge the gap between raw metrics and actionable insights, making it easier to understand and troubleshoot network behavior in real time.

As always, you can reach out to the development team on Slack (#netobserv-project on [slack.cncf.io](https://slack.cncf.io/)) or via our [discussion pages](https://github.com/netobserv/netobserv-operator/discussions).