Grafana Image Renderer

Overview

Grafana Image Renderer is a separate service that renders dashboard panels as PNG images using a headless Chromium browser. It powers features like PDF exports, shared panel snapshots, and — most importantly — image attachments on alert notifications.

This post covers enabling the renderer via the Grafana subchart inside kube-prometheus-stack, wiring it into Grafana Unified Alerting, and shipping Slack alerts with inline dashboard screenshots — all within an air-gapped cluster.

Why Image Renderer

Text-only alerts tell you what broke; an attached dashboard image tells you what it looks like right now, without forcing a dashboard round-trip in the middle of triage.

Grafana has no built-in rendering; it delegates to a separate grafana-image-renderer service.

Architecture

Grafana Image Renderer architecture — Alert Rule fires in Grafana, Grafana calls the Image Renderer Pod over ClusterIP to capture a panel screenshot, then delivers the image to a Slack channel via the Slack Web API. The only external traffic is Grafana to Slack API.

| Step | Detail |
| --- | --- |
| Grafana evaluates alert rule | On fire, takes a screenshot of the attached dashboardUid + panelId |
| Renderer receives HTTP request | URL contains target dashboard URL, timeout, dimensions |
| Chromium loads Grafana page | Uses internal Service DNS (grafana.monitoring:80) |
| Renderer returns PNG bytes | Grafana attaches them to the notification payload |
| Grafana calls Slack Web API | Bot token uploads the file via the 2-step upload API |

Key insight: the renderer never talks to the outside world. It calls Grafana via ClusterIP, which means it works fine in air-gapped clusters. The only external traffic is the final Slack API call — and that comes from the Grafana Pod, not the renderer.
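To make step 2 concrete, here is roughly what the render request looks like if you reproduce it by hand. The Service names, dashboard UID, and panel id below are illustrative, and the real request Grafana sends also carries a short-lived renderKey in the url parameter so Chromium can authenticate back to Grafana:

# Manual render call approximating what Grafana sends (names illustrative;
# nested query string shown unencoded for readability; "-" is the default shared token)
curl -s -H "X-Auth-Token: -" \
  "http://kube-prometheus-stack-grafana-image-renderer.monitoring:8081/render?timeout=30&width=1000&height=500&url=http://kube-prometheus-stack-grafana.monitoring:80/d-solo/<dashboardUid>?panelId=2" \
  -o panel.png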

Why a Bot Token (not Incoming Webhook)

Slack exposes two alerting paths, and only one supports file uploads:

| Method | Auth | Text message | Image file upload |
| --- | --- | --- | --- |
| Incoming Webhook | URL itself (hooks.slack.com/...) | OK | Not supported |
| Bot Token | Authorization: Bearer xoxb-... | OK (chat.postMessage) | OK (files.getUploadURLExternal + files.completeUploadExternal) |

In an air-gapped environment you cannot serve a public image URL for webhooks to reference, so the bot token path is the only option that actually delivers images. Required Slack Bot Token Scopes: chat:write (post messages) and files:write (upload files).

The bot must also be invited to the target channel (/invite @<bot-name>). The file upload API otherwise rejects with not_in_channel, even if chat:write.public is granted.
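For reference, the two API calls Grafana makes on upload look roughly like this; the token, file id, size, and channel ID are placeholders:

# 1. Reserve an upload URL (requires files:write)
curl -s -X POST https://slack.com/api/files.getUploadURLExternal \
  -H "Authorization: Bearer xoxb-..." \
  -d filename=panel.png -d length=34567

# (POST the PNG bytes to the returned upload_url, then finalize)

# 2. Finalize and share into the channel (bot must be a member)
curl -s -X POST https://slack.com/api/files.completeUploadExternal \
  -H "Authorization: Bearer xoxb-..." \
  -H "Content-Type: application/json" \
  -d '{"files":[{"id":"F_PLACEHOLDER","title":"panel"}],"channel_id":"C_PLACEHOLDER"}'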

Enabling Image Renderer

All config lives under the Grafana subchart in kube-prometheus-stack values.yaml.

kube-prometheus-stack:
  grafana:
    # existing grafana config above ...

    imageRenderer:
      enabled: true
      replicas: 1

      image:
        repository: grafana/grafana-image-renderer
        tag: v5.8.2
        pullPolicy: IfNotPresent

      resources:
        requests:
          cpu: 100m
          memory: 512Mi
        limits:
          memory: 1Gi

      serviceMonitor:
        enabled: true

Setting imageRenderer.enabled: true alone is sufficient — the Grafana chart auto-wires the renderer URL into the Grafana Pod's environment:

GF_RENDERING_SERVER_URL=http://<release>-grafana-image-renderer.<ns>:8081/render
GF_RENDERING_CALLBACK_URL=http://<release>-grafana.<ns>:80/

No manual URL plumbing needed.
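If you ever do need to wire it by hand (for example, a renderer shared from another namespace), the equivalent grafana.ini keys are below; the URLs are illustrative:

grafana:
  grafana.ini:
    rendering:
      server_url: http://shared-image-renderer.tools:8081/render   # illustrative
      callback_url: http://kube-prometheus-stack-grafana.monitoring:80/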

Tag naming convention

Image Renderer v5+ tags are prefixed with v: v5.8.2, not 5.8.2. This differs from the Grafana server image itself (grafana/grafana:11.4.0 has no v). Pulling 5.8.2 will fail.

Healthcheck path

v5+ serves /healthz; older versions serve /. The Grafana chart defaults to /healthz, which aligns with v5+ naturally. No override needed when you pin a v5+ tag.
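A throwaway curl Pod confirms which path your pinned version actually serves; the Service name follows the <release>-grafana-image-renderer pattern above:

# Expect HTTP 200 from /healthz on v5+
kubectl -n monitoring run curl-test --rm -it --restart=Never \
  --image=curlimages/curl --command -- \
  curl -si http://kube-prometheus-stack-grafana-image-renderer:8081/healthz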

Enabling Screenshot Capture

Grafana doesn't call the renderer on fire by default. Three grafana.ini keys under unified_alerting.screenshots control it:

grafana:
  grafana.ini:
    unified_alerting.screenshots:
      capture: true
      capture_timeout: 30s
      upload_external_image_storage: false

| Key | Role |
| --- | --- |
| capture: true | Request a PNG on every transition to alerting |
| capture_timeout: 30s | Max time the renderer gets to return (the 10s default is too short for heavy dashboards) |
| upload_external_image_storage: false | Skip S3/GCS upload; keep images internal for Slack API delivery |

Why 30s

The default 10s fails on complex dashboards (many panels, long-range Mimir queries, slow data sources). Symptom in Grafana logs:

level=error msg="Failed to send request to remote rendering service"
error="...: context deadline exceeded"
level=warn msg="Failed to take an image"
reason="transition to alerting"
error="failed to take screenshot: [rendering.serverTimeout] "

And the matching renderer-side log:

status=408 status_text="Request Timeout" duration=10.033s

The fix is purely a Grafana config — the timeout URL parameter sent to the renderer is controlled by capture_timeout. The renderer itself has no default cap.

For dashboards that still time out at 30s, the real fix is query optimization or pointing the alert at a lighter, purpose-built panel instead of a complex overview.

Slack Contact Point Provisioning

What is a Contact Point

In Grafana Unified Alerting, a Contact Point is the destination an alert gets delivered to. It's the object that holds "how do I reach this channel": receiver type (slack, email, pagerduty, webhook, teams, ...), authentication (token, URL, integration key), and optional message formatting overrides (title, text).

A Contact Point does not decide which alerts are sent to it — that's the job of the Notification Policy, which matches alert labels and routes them to a contact point by name. Keeping the two concerns separate means a single contact point definition (e.g. slack-aws-major-alarm) can be reused by any number of alert rules simply by labeling them appropriately.

| Concept | Role | Analogy |
| --- | --- | --- |
| Alert Rule | "When to fire" — condition + labels + annotations | Event source |
| Notification Policy | "Where to send" — label matchers → contact point by name | Router |
| Contact Point | "How to deliver" — destination + auth + format | Destination + delivery config |
| Notification Template | "What the message looks like" — reusable Go templates | Message formatter |

One contact point = one delivery pipeline. Each receiver inside a contact point is a physical send target; most contact points have a single receiver, but a contact point can fan out to multiple receivers (e.g. Slack and PagerDuty together) when you always want dual delivery for a class of alerts.
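A sketch of such a fan-out contact point, with the PagerDuty settings purely illustrative:

contactPoints:
  - orgId: 1
    name: emergency-dual
    receivers:
      - uid: emergency-slack
        type: slack
        settings:
          token: xoxb-...
          recipient: "#emergency"
      - uid: emergency-pagerduty
        type: pagerduty
        settings:
          integrationKey: <pagerduty-routing-key>   # illustrative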

Relationship between Alert Rule, Contact Point, Notification Template, and Slack — An Alert Rule fires and is routed to a Contact Point via Notification Policy label matchers. The Contact Point references a Notification Template for message formatting and delivers the rendered message to Slack using the bot token.

Provisioning as code

Define contact points via file-based provisioning so they live in git. The chart mounts contactpoints.yaml as a Secret (not a ConfigMap) when it's declared under alerting.contactpoints.yaml.secret:, keeping the bot token out of ConfigMap plaintext.

grafana:
  alerting:
    contactpoints.yaml:
      secret:
        apiVersion: 1
        contactPoints:
          - orgId: 1
            name: slack-hook-test
            receivers:
              - uid: slack-hook-test
                type: slack
                settings:
                  token: xoxb-...
                  recipient: "#hook-test"
                  title: '{{ `{{ template "slack.title" . }}` }}'
                  text: '{{ `{{ template "slack.body" . }}` }}'
          - orgId: 1
            name: slack-aws-major-alarm
            receivers:
              - uid: slack-aws-major-alarm
                type: slack
                settings:
                  token: xoxb-...
                  recipient: "#aws-major-alarm"
                  title: '{{ `{{ template "slack.title" . }}` }}'
                  text: '{{ `{{ template "slack.body" . }}` }}'

Long term, the token should move to an external secret (ESO / Vault). Hardcoded tokens in values.yaml are a stopgap.
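One way to get there without restructuring the file: Grafana expands ${ENV_VAR} references in provisioning files, and the chart can inject a Secret as env vars via envFromSecret. The Secret name and key below are illustrative, assuming ESO materializes them:

grafana:
  envFromSecret: grafana-slack-bot          # Secret synced by ESO/Vault (illustrative)
  alerting:
    contactpoints.yaml:
      secret:
        apiVersion: 1
        contactPoints:
          - orgId: 1
            name: slack-hook-test
            receivers:
              - uid: slack-hook-test
                type: slack
                settings:
                  token: ${SLACK_BOT_TOKEN}   # expanded by Grafana at provisioning load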

One bot, many channels

A single bot token can drive any number of contact points — recipient scopes each to a channel. Slack bot scopes are workspace-level, so one install covers them all. Just remember to invite the bot into every target channel.

Notification Template

Contact points get message format from notification templates, defined in templates.yaml. Templates are referenced by name, so multiple contact points can share one template for consistent formatting.

grafana:
  alerting:
    templates.yaml:
      apiVersion: 1
      templates:
        - orgId: 1
          name: slack_common
          template: |
            {{ `{{ define "slack.title" -}}
            {{ if eq .Status "firing" }}🚨 [FIRING]{{ else }}✅ [RESOLVED]{{ end }} {{ .CommonLabels.alertname }}
            {{- end }}

            {{ define "slack.body" -}}
            {{ range .Alerts -}}
            *Severity:* {{ if eq .Labels.severity "emergency" }}🚨🚨 emergency{{ else if eq .Labels.severity "critical" }}🔴 critical{{ else if eq .Labels.severity "warning" }}🟡 warning{{ else if eq .Labels.severity "info" }}🔵 info{{ else if .Labels.severity }}⚪ {{ .Labels.severity }}{{ else }}⚪ unknown{{ end }}
            *Summary:* {{ .Annotations.summary }}
            *Description:* {{ .Annotations.description }}
            {{ if .Annotations.usage }}*Value:* {{ .Annotations.usage }}
            {{ end }}{{ if .Labels.destination_service_name }}*Service:* {{ .Labels.destination_service_name }}
            {{ end }}{{ if .Labels.host }}*Host:* {{ .Labels.host }}
            {{ end }}{{ end -}}
            {{- end }}` }}

Escaping Helm tpl

The Grafana subchart runs values through Helm's tpl function, which collides with Grafana's own {{ ... }} template syntax. Wrap the whole block in Helm backtick literals ({{ `...` }}) so the inner {{ }} passes through verbatim to Grafana.

Without the backticks, Helm attempts to evaluate {{ define "slack.title" }} as a Helm function and fails:

error calling tpl: ... template: gotpl: unexpected "\\" in define clause
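Side by side for a single settings key:

# Helm tpl evaluates the inner braces and errors out:
title: '{{ template "slack.title" . }}'

# Backtick literal: braces pass through to Grafana verbatim:
title: '{{ `{{ template "slack.title" . }}` }}'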

Severity-based emoji

Title stays minimal ([FIRING] / [RESOLVED]); severity differentiation moves to the body field so the Slack preview list stays scannable:

| severity | body rendering |
| --- | --- |
| emergency | 🚨🚨 emergency |
| critical | 🔴 critical |
| warning | 🟡 warning |
| info | 🔵 info |
| non-standard value | ⚪ <value> (passthrough) |
| missing / empty | ⚪ unknown (fallback) |

Optional fields with if guards

Fields that only appear on specific alert types (the usage annotation, the destination_service_name label from Istio metrics, the host label from node alerts) are wrapped in {{ if }} guards so they're emitted only when present. Missing fields render as empty without erroring, but an empty *Value:* line looks sloppy — the guard suppresses the whole line.

{{ if .Annotations.usage }}*Value:* {{ .Annotations.usage }}
{{ end }}

Severity Convention

A consistent severity label set is a prerequisite for clean routing and templating. Four-level model:

| severity | Criteria | Response |
| --- | --- | --- |
| emergency | Total outage, direct revenue loss, security incident | Page on-call (PagerDuty) — wake someone up |
| critical | Partial outage, some customers affected, high SLO burn rate | Slack critical channel with mention, immediate during business hours |
| warning | Trend anomaly, resource pressure, about-to-be-critical | Slack warning channel, investigate during business hours |
| info | Informational, auto-recovery, deploy/scale events | Slack info channel or digest |

Judgement question: "Would this wake someone at 3 AM?"

This maps cleanly to Notification Policy matchers with continue: true for dual routing (e.g., emergency → Slack + PagerDuty).
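A file-provisioned policy tree implementing that dual route might look like the sketch below; receiver names reference the contact points above, and pagerduty-oncall is illustrative:

grafana:
  alerting:
    policies.yaml:
      apiVersion: 1
      policies:
        - orgId: 1
          receiver: slack-hook-test          # catch-all default
          routes:
            - receiver: pagerduty-oncall     # illustrative contact point
              object_matchers:
                - ["severity", "=", "emergency"]
              continue: true                 # keep matching so the Slack route below also fires
            - receiver: slack-aws-major-alarm
              object_matchers:
                - ["severity", "=~", "emergency|critical"]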

Verification

After deploy, confirm the wiring in order:

# 1. Renderer Pod healthy
kubectl -n monitoring get pods -l app.kubernetes.io/name=grafana-image-renderer

# 2. Grafana picked up env vars
kubectl -n monitoring exec deploy/kube-prometheus-stack-grafana -c grafana -- \
  env | grep GF_RENDERING

# 3. Provisioned contact points + templates loaded
kubectl -n monitoring logs deployment/kube-prometheus-stack-grafana -c grafana \
  | grep -E "template definitions loaded|ngalert.notifier"

# 4. Secret contains contactpoints.yaml
kubectl -n monitoring get secret kube-prometheus-stack-grafana-config-secret \
  -o jsonpath='{.data.contactpoints\.yaml}' | base64 -d

# 5. Trigger a real alert (Contact Point "Test" button skips screenshots,
#    synthetic alert has no dashboardUid/panelId attached)

Gotchas

not_in_channel with bot token

body="{\"ok\":false,\"error\":\"not_in_channel\"}"
msg="Failed to upload image" err="failed to finalize upload: ... not_in_channel"

The bot successfully authenticated, but file sharing requires channel membership regardless of scopes. /invite @<bot-name> resolves it. chat:write.public does not — that only permits text messages, not file uploads.

Contact Point Test button has no image

Grafana's Test button creates a synthetic alert with no dashboardUid/panelId bound. The renderer is never invoked. To verify image attachment, force a real alert rule to fire (temporarily lower a threshold, or create a dummy 1 > 0 rule pointing at any panel).
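A minimal always-firing rule can also be provisioned as a file. This is a sketch: the datasource UID, dashboard UID, and panel id are placeholders you must point at something real, since the screenshot is taken from exactly that dashboard/panel pair:

grafana:
  alerting:
    rules.yaml:
      apiVersion: 1
      groups:
        - orgId: 1
          name: screenshot-smoke-test
          folder: test
          interval: 1m
          rules:
            - uid: screenshot-smoke-test
              title: screenshot-smoke-test
              condition: C
              for: 0s
              labels:
                severity: info
              annotations:
                summary: Always-firing rule to exercise the renderer
                __dashboardUid__: <dashboard-uid>   # placeholder
                __panelId__: "2"                    # placeholder
              data:
                - refId: A
                  datasourceUid: <prometheus-uid>   # placeholder
                  relativeTimeRange:
                    from: 300
                    to: 0
                  model:
                    refId: A
                    expr: vector(1)
                    instant: true
                - refId: C
                  datasourceUid: __expr__           # built-in expressions pseudo-datasource
                  relativeTimeRange:
                    from: 0
                    to: 0
                  model:
                    refId: C
                    type: threshold
                    expression: A
                    conditions:
                      - evaluator:
                          type: gt
                          params: [0]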

Stale Firing after condition clears

Seeing [FIRING] notifications when the query value has dropped below threshold is usually one of:

- the Reduce expression aggregating with mean/max over a window that still contains the old spike,
- a long evaluation interval, so the rule simply hasn't re-evaluated yet, or
- a non-zero Keep firing for duration holding the alert in Firing after the condition clears.

Fix by preferring Reduce: last and a 1-minute evaluation interval for rules where recency matters, and setting Keep firing for: 0s unless flap protection is genuinely needed.

Rendering timeout on heavy dashboards

Symptoms in logs: status=408 Request Timeout, duration=10.033s from the renderer. The fix is capture_timeout: 30s as shown above. If 30s still fails, the dashboard itself is the problem — point the alert at a simpler, dedicated panel instead.

Takeaways