==================== Metrics & monitoring ==================== In order to support monitoring of replication relationships, VolSync exports a number of metrics that can be scraped with Prometheus. These metrics permit monitoring whether volumes are "in sync" and how long the synchronization iterations take. Available metrics ================= The following metrics are provided by VolSync for each replication object (source or destination): volsync_missed_intervals_total This is a count of the number of times that a replication iteration failed to complete before the next scheduled start. This metric is only valid for objects that have a schedule (``.spec.trigger.schedule``) specified. For example, when using the rsync mover with a schedule on the source but not on the destination, only the metric for the source side is meaningful. volsync_sync_duration_seconds This is a summary of the time required for each sync iteration. By monitoring this value it is possible to determine how much "slack" exists in the synchronization schedule (i.e., how much less is the sync duration than the schedule frequency). volsync_volume_out_of_sync This is a gauge that has the value of either "0" or "1", with a "1" indicating that the volumes are not currently synchronized. This may be due to an error that is preventing synchronization or because the most recent synchronization iteration failed to complete prior to when the next should have started. This metric also requires a schedule to be defined. Each of the above metrics include the following labels to assist with monitoring and alerting: obj_name This is the name of the VolSync CustomResource obj_namespace This is the Kubernetes Namespace that contains the CustomResource role This contains the value of either "source" or "destination" depending on whether the CR is a ReplicationSource or a ReplicationDestination. method This indicates the synchronization method being used. Currently, "rsync", "rclone", or "kopia". As an example, the below raw data comes from a single rsync-based relationship that is replicating data using the ReplicationSource ``dsrc`` in the ``srcns`` namespace to the ReplicationDestination ``dest`` in the ``dstns`` namespace. .. code-block:: none :caption: Example raw metrics data $ curl -s http://127.0.0.1:8080/metrics | grep volsync # HELP volsync_missed_intervals_total The number of times a synchronization failed to complete before the next scheduled start # TYPE volsync_missed_intervals_total counter volsync_missed_intervals_total{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0 volsync_missed_intervals_total{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0 # HELP volsync_sync_duration_seconds Duration of the synchronization interval in seconds # TYPE volsync_sync_duration_seconds summary volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.5"} 179.725047058 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.9"} 544.86628289 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.99"} 544.86628289 volsync_sync_duration_seconds_sum{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 828.711667153 volsync_sync_duration_seconds_count{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 3 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.5"} 11.547060835 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.9"} 12.013468222 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.99"} 12.013468222 volsync_sync_duration_seconds_sum{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 33.317039014 volsync_sync_duration_seconds_count{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 3 # HELP volsync_volume_out_of_sync Set to 1 if the volume is not properly synchronized # TYPE volsync_volume_out_of_sync gauge volsync_volume_out_of_sync{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0 volsync_volume_out_of_sync{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0 Obtaining metrics ================= The above metrics can be collected by Prometheus. If the cluster does not already have a running instance set to scrape metrics, one will need to be started. Configuring Prometheus ---------------------- .. tabs:: .. tab:: Kubernetes The following steps start a simple Prometheus instance to scrape metrics from VolSync. Some platforms may already have a running Prometheus operator or instance, making these steps unnecessary. Start the Prometheus operator: .. code-block:: none $ kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.46.0/bundle.yaml Start Prometheus by applying the following block of yaml via: .. code-block:: none $ kubectl create ns volsync-system $ kubectl -n volsync-system apply -f - .. code-block:: yaml apiVersion: v1 kind: ServiceAccount metadata: name: prometheus --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: name: prometheus rules: - apiGroups: [""] resources: - nodes - services - endpoints - pods verbs: ["get", "list", "watch"] - apiGroups: [""] resources: - configmaps verbs: ["get"] - nonResourceURLs: ["/metrics"] verbs: ["get"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRoleBinding metadata: name: prometheus roleRef: apiGroup: rbac.authorization.k8s.io kind: ClusterRole name: prometheus subjects: - kind: ServiceAccount name: prometheus namespace: volsync-system # Change if necessary! --- apiVersion: monitoring.coreos.com/v1 kind: Prometheus metadata: name: prometheus spec: serviceAccountName: prometheus serviceMonitorSelector: matchLabels: control-plane: volsync-controller resources: requests: memory: 400Mi .. tab:: OpenShift If necessary, `create a monitoring configuration `_ in the ``openshift-user-workload-monitoring`` namespace and `enable user workload monitoring `_: .. code-block:: yaml :caption: Example user workload monitoring configuration --- apiVersion: v1 kind: ConfigMap metadata: name: user-workload-monitoring-config namespace: openshift-user-workload-monitoring data: config.yaml: | # Allocate persistent storage for user Prometheus prometheus: volumeClaimTemplate: spec: resources: requests: storage: 40Gi # Allocate persistent storage for user Thanos Ruler thanosRuler: volumeClaimTemplate: spec: resources: requests: storage: 40Gi .. code-block:: yaml :caption: Enabling user workload monitoring --- apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | # Allocate persistent storage for alertmanager alertmanagerMain: volumeClaimTemplate: spec: resources: requests: storage: 40Gi # Enable user workload monitoring stack enableUserWorkload: true # Allocate persistent storage for cluster prometheus prometheusK8s: volumeClaimTemplate: spec: resources: requests: storage: 40Gi Monitoring VolSync ------------------ The metrics port for VolSync is (by default) `protected via kube-auth-proxy `_. In order to grant Prometheus the ability to scrape the metrics, its ServiceAccount must be granted access to the ``volsync-metrics-reader`` ClusterRole. This can be accomplished by (substitute in the namespace & SA name of the Prometheus server): .. code-block:: none $ kubectl create clusterrolebinding metrics --clusterrole=volsync-metrics-reader --serviceaccount=: Optionally, authentication of the metrics port can be disabled by setting the Helm chart value ``metrics.disableAuth`` to ``false`` when deploying VolSync. A ServiceMonitor needs to be defined in order to scrape metrics. If the ServiceMonitor CRD was defined in the cluster when the VolSync chart was deployed, this has already been added. If not, apply the following into the namespace where VolSync is deployed. Note that the ``control-plane`` labels may need to be adjusted. .. code-block:: yaml :caption: VolSync ServiceMonitor --- apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: volsync-monitor namespace: volsync-system labels: control-plane: volsync-controller spec: endpoints: - interval: 30s path: /metrics port: https scheme: https tlsConfig: # Using self-signed cert for connection insecureSkipVerify: true selector: matchLabels: control-plane: volsync-controller Kopia-specific metrics ====================== In addition to the standard VolSync metrics above, the Kopia mover provides comprehensive metrics for monitoring backup and restore operations, repository health, and performance characteristics. These metrics use the ``volsync_kopia`` namespace to distinguish them from general VolSync metrics. Common Kopia labels ------------------- All Kopia-specific metrics include these additional labels beyond the standard VolSync labels: repository Name of the Kopia repository operation Type of operation being performed (backup, restore, maintenance) Kopia metrics categories ------------------------ Backup/Restore Performance Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ volsync_kopia_operation_duration_seconds **Type:** Summary Duration of Kopia operations (backup, restore, maintenance) in seconds. Use for monitoring operation performance trends, setting SLA alerts for backup/restore times, and identifying performance degradation. volsync_kopia_data_processed_bytes **Type:** Summary Amount of data processed during Kopia operations in bytes. Use for tracking data growth over time, capacity planning for storage backends, and correlating data size with operation duration. volsync_kopia_data_transfer_rate_bytes_per_second **Type:** Summary Data transfer rate during Kopia operations in bytes per second. Use for monitoring network performance, detecting bandwidth constraints, and comparing performance across different repositories. volsync_kopia_compression_ratio **Type:** Summary Compression ratio achieved during backup operations (compressed_size/original_size). Use for monitoring compression effectiveness, optimizing compression settings, and estimating storage savings. volsync_kopia_operation_success_total **Type:** Counter Total number of successful Kopia operations. Use for calculating success rates, tracking operation volume, and generating availability reports. volsync_kopia_operation_failure_total **Type:** Counter Total number of failed Kopia operations with additional ``failure_reason`` label indicating the cause: * ``prerequisites_failed``: Repository connectivity or configuration issues * ``job_creation_failed``: Kubernetes job creation failure * ``job_execution_failed``: Job runtime failure Repository Health Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~ volsync_kopia_repository_connectivity **Type:** Gauge Repository connectivity status (1 if connected, 0 if not). Use for alerting on repository connectivity issues, monitoring repository availability, and tracking uptime statistics. volsync_kopia_maintenance_operations_total **Type:** Counter Total number of repository maintenance operations performed with ``maintenance_type`` label (scheduled, manual). Use for tracking maintenance operation frequency, verifying maintenance scheduling, and monitoring repository health activities. volsync_kopia_maintenance_duration_seconds **Type:** Summary Duration of repository maintenance operations in seconds. Use for monitoring maintenance performance, planning maintenance windows, and detecting maintenance issues. volsync_kopia_repository_size_bytes **Type:** Gauge Total size of the Kopia repository in bytes. Use for monitoring repository growth, capacity planning, and cost optimization for cloud storage. volsync_kopia_repository_objects_total **Type:** Gauge Total number of objects in the Kopia repository. Use for tracking repository complexity, monitoring deduplication effectiveness, and performance correlation analysis. Snapshot Management Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ volsync_kopia_snapshot_count **Type:** Gauge Current number of snapshots in the repository. Use for monitoring snapshot accumulation, verifying retention policy effectiveness, and capacity planning. volsync_kopia_snapshot_creation_success_total **Type:** Counter Total number of successful snapshot creations. Use for tracking backup success rates, generating backup reports, and monitoring backup reliability. volsync_kopia_snapshot_creation_failure_total **Type:** Counter Total number of failed snapshot creations with ``failure_reason`` label. Use for alerting on backup failures, troubleshooting backup issues, and tracking backup reliability trends. volsync_kopia_snapshot_size_bytes **Type:** Summary Size of individual snapshots in bytes. Use for monitoring snapshot size distribution, identifying data growth patterns, and optimizing backup strategies. volsync_kopia_deduplication_ratio **Type:** Summary Deduplication efficiency ratio (deduplicated_size/original_size). Use for monitoring deduplication effectiveness, optimizing storage efficiency, and calculating storage savings. volsync_kopia_retention_compliance **Type:** Gauge Retention policy compliance status (1 if compliant, 0 if not) with ``retention_type`` label (hourly, daily, weekly, monthly, yearly). Use for monitoring retention policy compliance, alerting on retention violations, and auditing backup retention. Cache and Performance Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ volsync_kopia_cache_hit_rate **Type:** Gauge Cache hit rate as a percentage (0-100). Use for monitoring cache effectiveness, optimizing cache configuration, and performance troubleshooting. volsync_kopia_cache_size_bytes **Type:** Gauge Current size of the Kopia cache in bytes. Use for monitoring cache utilization, optimizing cache capacity allocation, and tracking cache growth. volsync_kopia_parallel_operations_active **Type:** Gauge Number of currently active parallel operations. Use for monitoring parallelism utilization, detecting resource contention, and optimizing parallelism settings. volsync_kopia_job_retries_total **Type:** Counter Total number of job retries due to failures with ``retry_reason`` label (``job_pod_failure``). Use for monitoring job reliability, identifying transient vs persistent issues, and optimizing retry strategies. volsync_kopia_queue_depth **Type:** Gauge Current depth of the operation queue. Use for monitoring operation queuing, detecting processing bottlenecks, and optimizing queue processing. Policy and Configuration Metrics ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ volsync_kopia_policy_compliance **Type:** Gauge Policy compliance status (1 if compliant, 0 if not) with ``policy_type`` label (global, retention, compression). Use for monitoring policy compliance, alerting on policy violations, and auditing configuration compliance. volsync_kopia_configuration_errors_total **Type:** Counter Total number of configuration errors encountered with ``error_type`` label: * ``repository_validation_failed``: Repository secret validation failure * ``custom_ca_validation_failed``: Custom CA configuration failure * ``policy_config_validation_failed``: Policy configuration failure volsync_kopia_custom_actions_executed_total **Type:** Counter Total number of custom actions executed with ``action_type`` label (before_snapshot, after_snapshot). Use for monitoring custom action usage, tracking action execution frequency, and verifying action configuration. volsync_kopia_custom_actions_duration_seconds **Type:** Summary Duration of custom action execution in seconds. Use for monitoring action performance, optimizing action scripts, and detecting action timeouts. Kopia alerting recommendations ------------------------------ Critical Alerts ~~~~~~~~~~~~~~~~ * ``volsync_kopia_repository_connectivity == 0`` - Repository unreachable * ``rate(volsync_kopia_operation_failure_total[5m]) > 0.1`` - High failure rate * ``volsync_kopia_retention_compliance == 0`` - Retention policy violation Warning Alerts ~~~~~~~~~~~~~~~ * ``volsync_kopia_operation_duration_seconds{quantile="0.9"} > 3600`` - Slow operations (>1 hour) * ``rate(volsync_kopia_job_retries_total[10m]) > 0.05`` - Increased retry rate * ``volsync_kopia_cache_hit_rate < 50`` - Poor cache performance Informational Alerts ~~~~~~~~~~~~~~~~~~~~~ * ``increase(volsync_kopia_maintenance_operations_total[24h]) == 0`` - No maintenance in 24h * ``volsync_kopia_repository_size_bytes > 1e12`` - Repository size > 1TB