Monitoring
The shoot-rsyslog-relp extension exposes metrics for the rsyslog service running on a Shoot's nodes so that they can be easily viewed by cluster owners and operators in the Shoot's Prometheus and Plutono instances. The exposed monitoring data offers valuable insights into the operation of the rsyslog service and can be used to detect and debug ongoing issues. This guide describes the various metrics, alerts and logs available to cluster owners and operators.
Metrics
Metrics for the rsyslog service originate from its impstats module. They include the number of messages in the various queues, the number of ingested messages, the number of messages processed by the configured actions, the system resources used by the rsyslog service, and others. More information can be found in the impstats documentation and the statistics counter documentation. The metrics are exposed via the node-exporter running on each Shoot node and are scraped by the Shoot's Prometheus instance.
These metrics can also be viewed in a dedicated dashboard named Rsyslog Stats in the Shoot's Plutono instance. You can select the node for which you wish the metrics to be displayed from the Node dropdown menu (by default metrics are summed over all nodes).
The following is a list of all exposed rsyslog metrics. The name and origin labels indicate whether a metric belongs to a queue, an action, a plugin, or the system statistics; the node label identifies the node the metric originates from:
rsyslog_pstat_submitted
Number of messages that were submitted to the rsyslog service from its input. Currently rsyslog uses the /run/systemd/journal/syslog socket as input.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_processed
Number of messages that were successfully processed by an action and sent to the target server.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_failed
Number of messages that could not be processed by an action or sent to the target server.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_suspended
Total number of times an action suspended itself. Note that this counts the number of times the action transitioned from the active to the suspended state. The counter does not indicate how long the action was suspended or how often it was retried.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_suspended_duration
The total number of seconds this action was disabled.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_resumed
The total number of times this action resumed itself. A resumption occurs after the action has detected that a failure condition no longer exists.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_utime
User time used in microseconds.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_stime
System time used in microseconds.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_maxrss
Maximum resident set size.
- Type: Gauge
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_minflt
Total number of minor page faults the task has made, that is, faults that did not require loading a memory page from disk.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_majflt
Total number of major page faults the task has made, that is, faults that required loading a memory page from disk.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_inblock
Filesystem input operations.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_oublock
Filesystem output operations.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_nvcsw
Voluntary context switches.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_nivcsw
Involuntary context switches.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_openfiles
Number of open files.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_size
Messages currently in queue.
- Type: Gauge
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_enqueued
Total messages enqueued.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_full
Number of times the queue was full.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_discarded_full
Messages discarded due to queue being full.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_discarded_nf
Messages discarded when queue not full.
- Type: Counter
- Labels:
  - name
  - node
  - origin
rsyslog_pstat_maxqsize
Maximum size the queue has reached.
- Type: Gauge
- Labels:
  - name
  - node
  - origin
rsyslog_augenrules_load_success
Shows whether the augenrules --load command was executed successfully or not on the node.
- Type: Gauge
- Labels:
  - node
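As a rough illustration of how the name, origin and node labels separate the series, the following Python sketch parses a few lines in the Prometheus text exposition format and groups the metric names by origin. The sample lines, the node name, and every label value other than origin="core.action" and name="rsyslog-relp" (which appear in the alert expressions of this guide) are hypothetical; check the output of your own node-exporter for the actual values.

```python
import re
from collections import defaultdict

# Hypothetical samples in the Prometheus text exposition format.
# Label values such as "imuxsock", "main Q" and "core.queue" are assumptions.
SAMPLES = """\
rsyslog_pstat_submitted{name="imuxsock",node="node-a",origin="imuxsock"} 120
rsyslog_pstat_processed{name="rsyslog-relp",node="node-a",origin="core.action"} 118
rsyslog_pstat_failed{name="rsyslog-relp",node="node-a",origin="core.action"} 2
rsyslog_pstat_size{name="main Q",node="node-a",origin="core.queue"} 0
"""

LINE_RE = re.compile(r'^(?P<metric>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric, labels dict, float value) tuples."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match.group("labels")))
        parsed.append((match.group("metric"), labels, float(match.group("value"))))
    return parsed


def group_by_origin(samples):
    """Map each origin label value to the metric names reported for it."""
    groups = defaultdict(list)
    for metric, labels, _value in samples:
        groups[labels.get("origin", "")].append(metric)
    return dict(groups)


groups = group_by_origin(parse_samples(SAMPLES))
for origin, metrics in sorted(groups.items()):
    print(origin, metrics)
```

The parser is deliberately minimal and ignores escaping rules and timestamps; a full client library would be the better choice for anything beyond a quick inspection.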
Alerts
There are three alerts defined for the rsyslog service in the Shoot's Prometheus instance:
RsyslogTooManyRelpActionFailures
This indicates that the cumulative failure rate in processing relp action messages is greater than 2%. In other words, it compares the rate of failed relp action messages to the rate of processed relp action messages and fires an alert when the following expression evaluates to true:
sum(rate(rsyslog_pstat_failed{origin="core.action",name="rsyslog-relp"}[5m])) / sum(rate(rsyslog_pstat_processed{origin="core.action",name="rsyslog-relp"}[5m])) > bool 0.02
RsyslogRelpActionProcessingRateIsZero
This indicates that no messages are being sent to the upstream rsyslog target by the relp action. An alert is fired when the following expression evaluates to true:
rate(rsyslog_pstat_processed{origin="core.action",name="rsyslog-relp"}[5m]) == 0
RsyslogRelpAuditRulesNotLoadedSuccessfully
This indicates that augenrules --load was not executed successfully when called to load the configured audit rules. You should check if the auditd configuration you provided is valid. An alert is fired when the following expression evaluates to true:
absent(rsyslog_augenrules_load_success == 1)
Users can subscribe to these alerts by following the Gardener alerting guide.
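To make the arithmetic behind the first two alert expressions more concrete, here is a minimal Python sketch that approximates them from two readings of each counter taken at the start and end of a 5-minute window. It is only an approximation: the function names are hypothetical, and the real Prometheus rate() function additionally handles counter resets and extrapolation, which this sketch does not.

```python
def per_second_rate(first, last, window_seconds):
    """Average per-second increase of a counter between two readings.

    Simplification: a decrease (counter reset) is treated as zero increase.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return max(last - first, 0.0) / window_seconds


def too_many_failures(failed, processed, window_seconds=300, threshold=0.02):
    """Approximate RsyslogTooManyRelpActionFailures.

    `failed` and `processed` are (first, last) counter readings taken at
    the start and end of the window.
    """
    failed_rate = per_second_rate(*failed, window_seconds)
    processed_rate = per_second_rate(*processed, window_seconds)
    if processed_rate == 0:
        # PromQL float division yields NaN for 0/0 (comparison is false)
        # and +Inf for x/0 with x > 0 (comparison is true).
        return failed_rate > 0
    return failed_rate / processed_rate > threshold


def processing_rate_is_zero(processed, window_seconds=300):
    """Approximate RsyslogRelpActionProcessingRateIsZero."""
    return per_second_rate(*processed, window_seconds) == 0


# 6 failures against 200 processed messages in the window: ratio 3%.
print(too_many_failures(failed=(10, 16), processed=(1000, 1200)))
# 2 failures against 200 processed messages in the window: ratio 1%.
print(too_many_failures(failed=(10, 12), processed=(1000, 1200)))
# The processed counter did not move during the window.
print(processing_rate_is_zero(processed=(1200, 1200)))
```

Note that the ratio divides failed by processed messages, so a window with failures but no processed messages at all also trips the first check.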
Logging
There are two ways to view the logs of the rsyslog service running on the Shoot's nodes: using the Explore tab of the Shoot's Plutono instance, or connecting to a node directly via SSH.
To view logs in Plutono, navigate to the Explore tab and select vali from the Explore dropdown menu. Afterwards, enter the following vali query:
{nodename="<name-of-node>"} |~ "\"unit\":\"rsyslog.service\""
Note that you cannot use the unit label to filter for the rsyslog.service unit logs. Instead, you have to filter the log lines for the service with a regular expression, as shown in the example above.
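The effect of the line filter in the query above can be sketched in Python: each raw log line is kept only if it contains a match for the regular expression. The sample log lines and their msg field are hypothetical and only assume that each line is a JSON document with a unit field, as the query implies.

```python
import json
import re

# Same pattern as the line filter in the query above. The unescaped "."
# matches any character, exactly as it does in the query.
UNIT_FILTER = re.compile(r'"unit":"rsyslog.service"')

# Hypothetical raw log lines; only the "unit" field is implied by the query.
LOG_LINES = [
    '{"unit":"rsyslog.service","msg":"action suspended"}',
    '{"unit":"kubelet.service","msg":"node ready"}',
    '{"unit":"rsyslog.service","msg":"action resumed"}',
]


def rsyslog_lines(lines):
    """Keep only the lines that the regular expression matches."""
    return [line for line in lines if UNIT_FILTER.search(line)]


matches = rsyslog_lines(LOG_LINES)
for line in matches:
    print(json.loads(line)["msg"])
```

Because the filter works on the raw text of each line, it depends on the exact serialization, for example on there being no whitespace around the colon.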
To view logs directly on a node in the Shoot cluster, connect to the node via SSH and use either of the following commands:
systemctl status rsyslog
journalctl -u rsyslog