Monitoring

The shoot-rsyslog-relp extension exposes metrics for the rsyslog service running on a Shoot's nodes so that they can be easily viewed by cluster owners and operators in the Shoot's Prometheus and Plutono instances. The exposed monitoring data offers valuable insights into the operation of the rsyslog service and can be used to detect and debug ongoing issues. This guide describes the various metrics, alerts and logs available to cluster owners and operators.

Metrics

Metrics for the rsyslog service originate from its impstats module. These include the number of messages in the various queues, the number of ingested messages, the number of messages processed by the configured actions, the system resources used by the rsyslog service, and others. More information about them can be found in the impstats documentation and the statistics counter documentation. The metrics are exposed via the node-exporter running on each Shoot node and are scraped by the Shoot's Prometheus instance.

These metrics can also be viewed in a dedicated dashboard named Rsyslog Stats in the Shoot's Plutono instance. You can select the node for which you wish the metrics to be displayed from the Node dropdown menu (by default metrics are summed over all nodes).

The following is a list of all exposed rsyslog metrics. The name and origin labels indicate whether a metric belongs to a queue, an action, a plugin, or system stats; the node label identifies the node the metric originates from:

rsyslog_pstat_submitted

Number of messages that were submitted to the rsyslog service from its input. Currently rsyslog uses the /run/systemd/journal/syslog socket as input.

Type: Counter
Labels: name node origin

rsyslog_pstat_processed

Number of messages that are successfully processed by an action and sent to the target server.

Type: Counter
Labels: name node origin

rsyslog_pstat_failed

Number of messages that could not be processed by an action or sent to the target server.

Type: Counter
Labels: name node origin

rsyslog_pstat_suspended

Total number of times an action suspended itself. Note that this counts the number of times the action transitioned from active to suspended state. The counter is no indication of how long the action was suspended or how often it was retried.

Type: Counter
Labels: name node origin

rsyslog_pstat_suspended_duration

The total number of seconds this action was disabled.

Type: Counter
Labels: name node origin

rsyslog_pstat_resumed

The total number of times this action resumed itself. A resumption occurs after the action has detected that a failure condition no longer exists.

Type: Counter
Labels: name node origin

rsyslog_pstat_utime

User time used in microseconds.

Type: Counter
Labels: name node origin

rsyslog_pstat_stime

System time used in microseconds.

Type: Counter
Labels: name node origin

rsyslog_pstat_maxrss

Maximum resident set size.

Type: Gauge
Labels: name node origin

rsyslog_pstat_minflt

Total number of minor page faults the task has made, that is, faults that did not require loading a memory page from disk.

Type: Counter
Labels: name node origin

rsyslog_pstat_majflt

Total number of major page faults the task has made, that is, faults that required loading a memory page from disk.

Type: Counter
Labels: name node origin

rsyslog_pstat_inblock

Filesystem input operations.

Type: Counter
Labels: name node origin

rsyslog_pstat_oublock

Filesystem output operations.

Type: Counter
Labels: name node origin

rsyslog_pstat_nvcsw

Voluntary context switches.

Type: Counter
Labels: name node origin

rsyslog_pstat_nivcsw

Involuntary context switches.

Type: Counter
Labels: name node origin

rsyslog_pstat_openfiles

Number of open files.

Type: Counter
Labels: name node origin

rsyslog_pstat_size

Messages currently in queue.

Type: Gauge
Labels: name node origin

rsyslog_pstat_enqueued

Total messages enqueued.

Type: Counter
Labels: name node origin

rsyslog_pstat_full

Times queue was full.

Type: Counter
Labels: name node origin

rsyslog_pstat_discarded_full

Messages discarded due to queue being full.

Type: Counter
Labels: name node origin

rsyslog_pstat_discarded_nf

Messages discarded when queue not full.

Type: Counter
Labels: name node origin

rsyslog_pstat_maxqsize

Maximum size queue has reached.

Type: Gauge
Labels: name node origin

rsyslog_augenrules_load_success

Shows whether the augenrules --load command was executed successfully or not on the node.

Type: Gauge
Labels: node
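The way the name, node and origin labels identify a metric's source can be sketched in Python. This is a minimal illustration, not part of the extension: the sample lines, node name and label values are hypothetical, and apart from core.action (which appears in the alert expressions of this guide) the origin values are assumptions based on typical impstats output.

```python
import re

# Hypothetical samples in Prometheus text exposition format.
# Label values are illustrative assumptions, not captured from a real node.
SAMPLE = '''\
rsyslog_pstat_submitted{name="imuxsock",node="node-a",origin="imuxsock"} 1200
rsyslog_pstat_processed{name="rsyslog-relp",node="node-a",origin="core.action"} 1150
rsyslog_pstat_size{name="main Q",node="node-a",origin="core.queue"} 3
'''

LINE_RE = re.compile(
    r'^(?P<metric>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric, labels, value) tuples for labelled sample lines."""
    samples = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip comments, blank lines and unlabelled samples
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("metric"), labels, float(match.group("value"))))
    return samples


def classify(labels):
    """Map the origin label to the kind of rsyslog object (assumed origin values)."""
    origin = labels.get("origin", "")
    if origin == "core.queue":
        return "queue"
    if origin == "core.action":
        return "action"
    if origin == "impstats":
        return "system stats"
    return "plugin"


for metric, labels, value in parse_samples(SAMPLE):
    print(labels.get("node"), metric, labels.get("name"), classify(labels), value)
```

The parser only handles simple label values without escaped quotes, which is enough for illustration but not a full implementation of the exposition format.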

Alerts

There are three alerts defined for the rsyslog service in the Shoot's Prometheus instance:

RsyslogTooManyRelpActionFailures

This indicates that the cumulative failure rate in processing relp action messages is greater than 2%. In other words, it compares the rate of processed relp action messages to the rate of failed relp action messages and fires an alert when the following expression evaluates to true:

sum(rate(rsyslog_pstat_failed{origin="core.action",name="rsyslog-relp"}[5m])) / sum(rate(rsyslog_pstat_processed{origin="core.action",name="rsyslog-relp"}[5m])) > bool 0.02
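The arithmetic behind this threshold can be illustrated with a small Python sketch. It is a simplification under stated assumptions: per_second_rate is a hypothetical stand-in for PromQL's rate() that divides the counter increase by the window length and ignores extrapolation and proper counter-reset handling, and the counter values are made up.

```python
def per_second_rate(first, last, window_seconds):
    """Simplified stand-in for rate(): counter increase divided by the window length."""
    return max(last - first, 0.0) / window_seconds


def relp_failure_ratio(failed, processed, window_seconds=300):
    """Ratio of summed failed rate to summed processed rate.

    `failed` and `processed` are lists of (first, last) counter readings,
    one pair per node, taken at the start and end of the window.
    """
    failed_rate = sum(per_second_rate(a, b, window_seconds) for a, b in failed)
    processed_rate = sum(per_second_rate(a, b, window_seconds) for a, b in processed)
    if processed_rate == 0:
        # Mirror floating-point division by zero: inf if anything failed, NaN otherwise.
        return float("inf") if failed_rate > 0 else float("nan")
    return failed_rate / processed_rate


# Hypothetical readings for two nodes over a 5 minute window.
ratio = relp_failure_ratio(
    failed=[(10, 16), (0, 3)],              # 9 failed messages in total
    processed=[(1000, 1200), (500, 600)],   # 300 processed messages in total
)
print(ratio > 0.02)  # 9 / 300 = 0.03, so the threshold is exceeded
```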

RsyslogRelpActionProcessingRateIsZero

This indicates that no messages are being sent to the upstream rsyslog target by the relp action. An alert is fired when the following expression evaluates to true:

rate(rsyslog_pstat_processed{origin="core.action",name="rsyslog-relp"}[5m]) == 0

RsyslogRelpAuditRulesNotLoadedSuccessfully

This indicates that augenrules --load was not executed successfully when called to load the configured audit rules. You should check if the auditd configuration you provided is valid. An alert is fired when the following expression evaluates to true:

absent(rsyslog_augenrules_load_success == 1)
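How this expression behaves can be sketched in Python. The function below is a hypothetical model of the expression as literally written, assuming it is evaluated over all series at once: the comparison keeps only series whose value is 1, and absent() yields a result only when nothing is left. The node names are made up.

```python
def audit_rules_alert_fires(load_success_by_node):
    """Model absent(rsyslog_augenrules_load_success == 1).

    `load_success_by_node` maps a node name to the current gauge value.
    The alert condition holds when no series has the value 1, which
    includes the case where the metric is not reported at all.
    """
    return not any(value == 1 for value in load_success_by_node.values())


print(audit_rules_alert_fires({}))                          # metric missing entirely
print(audit_rules_alert_fires({"node-a": 0}))               # load failed on the only node
print(audit_rules_alert_fires({"node-a": 1}))               # load succeeded
print(audit_rules_alert_fires({"node-a": 1, "node-b": 0}))  # at least one series equals 1
```

Under this model, a single node reporting 1 is enough to keep the condition from holding, even if another node reports 0.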

Users can subscribe to these alerts by following the Gardener alerting guide.

Logging

There are two ways to view the logs of the rsyslog service running on the Shoot's nodes: using the Explore tab of the Shoot's Plutono instance, or connecting directly to a node via SSH.

To view logs in Plutono, navigate to the Explore tab and select vali from the Explore dropdown menu. Afterwards enter the following vali query:

{nodename="<name-of-node>"} |~ "\"unit\":\"rsyslog.service\""

Note that you cannot use the unit label to filter for the rsyslog.service unit logs. Instead, you have to grep for the service as displayed in the example above.

To view logs after connecting to a node in the Shoot cluster via SSH, use either of the following commands on the node:
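If you need this query for several nodes, the string can be assembled programmatically. The helper below is a hypothetical convenience, not part of Gardener or vali; it only reproduces the quoting used in the query above and assumes the node name contains no characters that need escaping.

```python
def rsyslog_log_query(node_name, unit="rsyslog.service"):
    """Build the vali query that greps the log line for the given systemd unit."""
    # The inner quotes are backslash-escaped because the filter itself
    # is wrapped in double quotes in the query.
    line_filter = '\\"unit\\":\\"{}\\"'.format(unit)
    return '{{nodename="{}"}} |~ "{}"'.format(node_name, line_filter)


# "node-a" is a placeholder for the name of a node in your Shoot cluster.
print(rsyslog_log_query("node-a"))
```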

systemctl status rsyslog

journalctl -u rsyslog
