Metrics & monitoring

In order to support monitoring of replication relationships, VolSync exports a number of metrics that can be scraped with Prometheus. These metrics permit monitoring whether volumes are “in sync” and how long the synchronization iterations take.

Available metrics

The following metrics are provided by VolSync for each replication object (source or destination):

volsync_missed_intervals_total: This is a count of the number of times that a replication iteration failed to complete before the next scheduled start. This metric is only valid for objects that have a schedule (.spec.trigger.schedule) specified. For example, when using the rsync mover with a schedule on the source but not on the destination, only the metric for the source side is meaningful.
volsync_sync_duration_seconds: This is a summary of the time required for each sync iteration. By monitoring this value it is possible to determine how much “slack” exists in the synchronization schedule (i.e., how much less is the sync duration than the schedule frequency).
volsync_volume_out_of_sync: This is a gauge that has the value of either “0” or “1”, with a “1” indicating that the volumes are not currently synchronized. This may be due to an error that is preventing synchronization or because the most recent synchronization iteration failed to complete prior to when the next should have started. This metric also requires a schedule to be defined.

Each of the above metrics include the following labels to assist with monitoring and alerting:

obj_name: This is the name of the VolSync CustomResource
obj_namespace: This is the Kubernetes Namespace that contains the CustomResource
role: This contains the value of either “source” or “destination” depending on whether the CR is a ReplicationSource or a ReplicationDestination.
method: This indicates the synchronization method being used. Currently, “rsync”, “rclone”, or “kopia”.

As an example, the below raw data comes from a single rsync-based relationship that is replicating data using the ReplicationSource dsrc in the srcns namespace to the ReplicationDestination dest in the dstns namespace.

Example raw metrics data

 $ curl -s http://127.0.0.1:8080/metrics | grep volsync

 # HELP volsync_missed_intervals_total The number of times a synchronization failed to complete before the next scheduled start
 # TYPE volsync_missed_intervals_total counter
 volsync_missed_intervals_total{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
 volsync_missed_intervals_total{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0
 # HELP volsync_sync_duration_seconds Duration of the synchronization interval in seconds
 # TYPE volsync_sync_duration_seconds summary
 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.5"} 179.725047058
 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.9"} 544.86628289
 volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.99"} 544.86628289
 volsync_sync_duration_seconds_sum{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 828.711667153
 volsync_sync_duration_seconds_count{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 3
 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.5"} 11.547060835
 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.9"} 12.013468222
 volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.99"} 12.013468222
 volsync_sync_duration_seconds_sum{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 33.317039014
 volsync_sync_duration_seconds_count{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 3
 # HELP volsync_volume_out_of_sync Set to 1 if the volume is not properly synchronized
 # TYPE volsync_volume_out_of_sync gauge
 volsync_volume_out_of_sync{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
 volsync_volume_out_of_sync{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0

Obtaining metrics

The above metrics can be collected by Prometheus. If the cluster does not already have a running instance set to scrape metrics, one will need to be started.

Configuring Prometheus

The following steps start a simple Prometheus instance to scrape metrics from VolSync. Some platforms may already have a running Prometheus operator or instance, making these steps unnecessary.

Start the Prometheus operator:

$ kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.46.0/bundle.yaml

Start Prometheus by applying the following block of yaml via:

$ kubectl create ns volsync-system
$ kubectl -n volsync-system apply -f -

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: volsync-system  # Change if necessary!
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      control-plane: volsync-controller
  resources:
    requests:
      memory: 400Mi

If necessary, create a monitoring configuration in the openshift-user-workload-monitoring namespace and enable user workload monitoring:

Example user workload monitoring configuration

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for user Prometheus
    prometheus:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Allocate persistent storage for user Thanos Ruler
    thanosRuler:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi

Enabling user workload monitoring

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for alertmanager
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Enable user workload monitoring stack
    enableUserWorkload: true
    # Allocate persistent storage for cluster prometheus
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi

Monitoring VolSync

The metrics port for VolSync is (by default) protected via kube-auth-proxy. In order to grant Prometheus the ability to scrape the metrics, its ServiceAccount must be granted access to the volsync-metrics-reader ClusterRole. This can be accomplished by (substitute in the namespace & SA name of the Prometheus server):

$ kubectl create clusterrolebinding metrics --clusterrole=volsync-metrics-reader --serviceaccount=<namespace>:<service-account-name>

Optionally, authentication of the metrics port can be disabled by setting the Helm chart value metrics.disableAuth to false when deploying VolSync.

A ServiceMonitor needs to be defined in order to scrape metrics. If the ServiceMonitor CRD was defined in the cluster when the VolSync chart was deployed, this has already been added. If not, apply the following into the namespace where VolSync is deployed. Note that the control-plane labels may need to be adjusted.

VolSync ServiceMonitor

---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: volsync-monitor
  namespace: volsync-system
  labels:
    control-plane: volsync-controller
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: https
      scheme: https
      tlsConfig:
        # Using self-signed cert for connection
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: volsync-controller

Kopia-specific metrics

In addition to the standard VolSync metrics above, the Kopia mover provides comprehensive metrics for monitoring backup and restore operations, repository health, and performance characteristics. These metrics use the volsync_kopia namespace to distinguish them from general VolSync metrics.

Common Kopia labels

All Kopia-specific metrics include these additional labels beyond the standard VolSync labels:

repository: Name of the Kopia repository
operation: Type of operation being performed (backup, restore, maintenance)

Kopia metrics categories

Backup/Restore Performance Metrics

volsync_kopia_operation_duration_seconds

Type: Summary

Duration of Kopia operations (backup, restore, maintenance) in seconds. Use for monitoring operation performance trends, setting SLA alerts for backup/restore times, and identifying performance degradation.

volsync_kopia_data_processed_bytes

Type: Summary

Amount of data processed during Kopia operations in bytes. Use for tracking data growth over time, capacity planning for storage backends, and correlating data size with operation duration.

volsync_kopia_data_transfer_rate_bytes_per_second

Type: Summary

Data transfer rate during Kopia operations in bytes per second. Use for monitoring network performance, detecting bandwidth constraints, and comparing performance across different repositories.

volsync_kopia_compression_ratio

Type: Summary

Compression ratio achieved during backup operations (compressed_size/original_size). Use for monitoring compression effectiveness, optimizing compression settings, and estimating storage savings.

volsync_kopia_operation_success_total

Type: Counter

Total number of successful Kopia operations. Use for calculating success rates, tracking operation volume, and generating availability reports.

volsync_kopia_operation_failure_total

Type: Counter

Total number of failed Kopia operations with additional failure_reason label indicating the cause:

prerequisites_failed: Repository connectivity or configuration issues
job_creation_failed: Kubernetes job creation failure
job_execution_failed: Job runtime failure

Repository Health Metrics

volsync_kopia_repository_connectivity

Type: Gauge

Repository connectivity status (1 if connected, 0 if not). Use for alerting on repository connectivity issues, monitoring repository availability, and tracking uptime statistics.

volsync_kopia_maintenance_operations_total

Type: Counter

Total number of repository maintenance operations performed with maintenance_type label (scheduled, manual). Use for tracking maintenance operation frequency, verifying maintenance scheduling, and monitoring repository health activities.

volsync_kopia_maintenance_duration_seconds

Type: Summary

Duration of repository maintenance operations in seconds. Use for monitoring maintenance performance, planning maintenance windows, and detecting maintenance issues.

volsync_kopia_repository_size_bytes

Type: Gauge

Total size of the Kopia repository in bytes. Use for monitoring repository growth, capacity planning, and cost optimization for cloud storage.

volsync_kopia_repository_objects_total

Type: Gauge

Total number of objects in the Kopia repository. Use for tracking repository complexity, monitoring deduplication effectiveness, and performance correlation analysis.

Snapshot Management Metrics

volsync_kopia_snapshot_count

Type: Gauge

Current number of snapshots in the repository. Use for monitoring snapshot accumulation, verifying retention policy effectiveness, and capacity planning.

volsync_kopia_snapshot_creation_success_total

Type: Counter

Total number of successful snapshot creations. Use for tracking backup success rates, generating backup reports, and monitoring backup reliability.

volsync_kopia_snapshot_creation_failure_total

Type: Counter

Total number of failed snapshot creations with failure_reason label. Use for alerting on backup failures, troubleshooting backup issues, and tracking backup reliability trends.

volsync_kopia_snapshot_size_bytes

Type: Summary

Size of individual snapshots in bytes. Use for monitoring snapshot size distribution, identifying data growth patterns, and optimizing backup strategies.

volsync_kopia_deduplication_ratio

Type: Summary

Deduplication efficiency ratio (deduplicated_size/original_size). Use for monitoring deduplication effectiveness, optimizing storage efficiency, and calculating storage savings.

volsync_kopia_retention_compliance

Type: Gauge

Retention policy compliance status (1 if compliant, 0 if not) with retention_type label (hourly, daily, weekly, monthly, yearly). Use for monitoring retention policy compliance, alerting on retention violations, and auditing backup retention.

Cache and Performance Metrics

volsync_kopia_cache_hit_rate

Type: Gauge

Cache hit rate as a percentage (0-100). Use for monitoring cache effectiveness, optimizing cache configuration, and performance troubleshooting.

volsync_kopia_cache_size_bytes

Type: Gauge

Current size of the Kopia cache in bytes. Use for monitoring cache utilization, optimizing cache capacity allocation, and tracking cache growth.

volsync_kopia_parallel_operations_active

Type: Gauge

Number of currently active parallel operations. Use for monitoring parallelism utilization, detecting resource contention, and optimizing parallelism settings.

volsync_kopia_job_retries_total

Type: Counter

Total number of job retries due to failures with retry_reason label (job_pod_failure). Use for monitoring job reliability, identifying transient vs persistent issues, and optimizing retry strategies.

volsync_kopia_queue_depth

Type: Gauge

Current depth of the operation queue. Use for monitoring operation queuing, detecting processing bottlenecks, and optimizing queue processing.

Policy and Configuration Metrics

volsync_kopia_policy_compliance

Type: Gauge

Policy compliance status (1 if compliant, 0 if not) with policy_type label (global, retention, compression). Use for monitoring policy compliance, alerting on policy violations, and auditing configuration compliance.

volsync_kopia_configuration_errors_total

Type: Counter

Total number of configuration errors encountered with error_type label:

repository_validation_failed: Repository secret validation failure
custom_ca_validation_failed: Custom CA configuration failure
policy_config_validation_failed: Policy configuration failure

volsync_kopia_custom_actions_executed_total

Type: Counter

Total number of custom actions executed with action_type label (before_snapshot, after_snapshot). Use for monitoring custom action usage, tracking action execution frequency, and verifying action configuration.

volsync_kopia_custom_actions_duration_seconds

Type: Summary

Duration of custom action execution in seconds. Use for monitoring action performance, optimizing action scripts, and detecting action timeouts.

Kopia alerting recommendations

Critical Alerts

volsync_kopia_repository_connectivity == 0 - Repository unreachable
rate(volsync_kopia_operation_failure_total[5m]) > 0.1 - High failure rate
volsync_kopia_retention_compliance == 0 - Retention policy violation

Warning Alerts

volsync_kopia_operation_duration_seconds{quantile="0.9"} > 3600 - Slow operations (>1 hour)
rate(volsync_kopia_job_retries_total[10m]) > 0.05 - Increased retry rate
volsync_kopia_cache_hit_rate < 50 - Poor cache performance

Informational Alerts

increase(volsync_kopia_maintenance_operations_total[24h]) == 0 - No maintenance in 24h
volsync_kopia_repository_size_bytes > 1e12 - Repository size > 1TB