Metrics & monitoring
In order to support monitoring of replication relationships, VolSync exports a number of metrics that can be scraped with Prometheus. These metrics permit monitoring whether volumes are “in sync” and how long the synchronization iterations take.
Available metrics
The following metrics are provided by VolSync for each replication object (source or destination):
- volsync_missed_intervals_total
This is a count of the number of times that a replication iteration failed to complete before the next scheduled start. This metric is only valid for objects that have a schedule (.spec.trigger.schedule) specified. For example, when using the rsync mover with a schedule on the source but not on the destination, only the metric for the source side is meaningful.
- volsync_sync_duration_seconds
This is a summary of the time required for each sync iteration. By monitoring this value it is possible to determine how much “slack” exists in the synchronization schedule (i.e., how much shorter the sync duration is than the interval between scheduled starts).
- volsync_volume_out_of_sync
This is a gauge that has the value of either “0” or “1”, with a “1” indicating that the volumes are not currently synchronized. This may be due to an error that is preventing synchronization or because the most recent synchronization iteration failed to complete prior to when the next should have started. This metric also requires a schedule to be defined.
Each of the above metrics includes the following labels to assist with monitoring and alerting:
- obj_name
This is the name of the VolSync CustomResource
- obj_namespace
This is the Kubernetes Namespace that contains the CustomResource
- role
This contains the value of either “source” or “destination” depending on whether the CR is a ReplicationSource or a ReplicationDestination.
- method
This indicates the synchronization method being used. Currently, “rsync”, “rclone”, or “kopia”.
As an example, the raw data below comes from a single rsync-based relationship that is replicating data using the ReplicationSource dsrc in the srcns namespace to the ReplicationDestination dest in the dstns namespace.
$ curl -s http://127.0.0.1:8080/metrics | grep volsync
# HELP volsync_missed_intervals_total The number of times a synchronization failed to complete before the next scheduled start
# TYPE volsync_missed_intervals_total counter
volsync_missed_intervals_total{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
volsync_missed_intervals_total{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0
# HELP volsync_sync_duration_seconds Duration of the synchronization interval in seconds
# TYPE volsync_sync_duration_seconds summary
volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.5"} 179.725047058
volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.9"} 544.86628289
volsync_sync_duration_seconds{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination",quantile="0.99"} 544.86628289
volsync_sync_duration_seconds_sum{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 828.711667153
volsync_sync_duration_seconds_count{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 3
volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.5"} 11.547060835
volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.9"} 12.013468222
volsync_sync_duration_seconds{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source",quantile="0.99"} 12.013468222
volsync_sync_duration_seconds_sum{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 33.317039014
volsync_sync_duration_seconds_count{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 3
# HELP volsync_volume_out_of_sync Set to 1 if the volume is not properly synchronized
# TYPE volsync_volume_out_of_sync gauge
volsync_volume_out_of_sync{method="rsync",obj_name="dest",obj_namespace="dstns",role="destination"} 0
volsync_volume_out_of_sync{method="rsync",obj_name="dsrc",obj_namespace="srcns",role="source"} 0
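Series like the ones above can drive alerting. The following is a minimal sketch of a PrometheusRule (a Prometheus operator resource), not something shipped by VolSync. The rule name, namespace, "for" durations, severity labels, and the 900-second figure (which assumes a 15-minute schedule) are illustrative assumptions. Whether a given Prometheus instance loads the rule depends on how its rule selection is configured.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-replication-alerts   # hypothetical name
  namespace: volsync-system          # assumption: adjust to your monitoring setup
spec:
  groups:
    - name: volsync.replication
      rules:
        # The gauge is 1 while a scheduled object is not synchronized.
        - alert: VolSyncVolumeOutOfSync
          expr: volsync_volume_out_of_sync == 1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.obj_namespace }}/{{ $labels.obj_name }} ({{ $labels.role }}) is out of sync"
        # The counter grows each time an iteration overruns the next scheduled start.
        - alert: VolSyncMissedInterval
          expr: increase(volsync_missed_intervals_total[1h]) > 0
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.obj_namespace }}/{{ $labels.obj_name }} missed a scheduled sync in the last hour"
        # Little slack left: the p90 sync time exceeds 80% of the assumed
        # 15-minute (900 s) interval between scheduled starts.
        - alert: VolSyncSyncDurationHigh
          expr: volsync_sync_duration_seconds{quantile="0.9"} > 0.8 * 900
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.obj_namespace }}/{{ $labels.obj_name }} syncs are close to the schedule interval"
```

The first two rules are only meaningful for objects that define a schedule, as noted in the metric descriptions.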
Obtaining metrics
The above metrics can be collected by Prometheus. If the cluster does not already have a running instance set to scrape metrics, one will need to be started.
Configuring Prometheus
The following steps start a simple Prometheus instance to scrape metrics from VolSync. Some platforms may already have a running Prometheus operator or instance, making these steps unnecessary.
Start the Prometheus operator:
$ kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.46.0/bundle.yaml
Start Prometheus by applying the following block of yaml via:
$ kubectl create ns volsync-system
$ kubectl -n volsync-system apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: volsync-system  # Change if necessary!
---
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      control-plane: volsync-controller
  resources:
    requests:
      memory: 400Mi
On OpenShift, if necessary, create a monitoring configuration in the openshift-user-workload-monitoring namespace and enable user workload monitoring:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for user Prometheus
    prometheus:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Allocate persistent storage for user Thanos Ruler
    thanosRuler:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Allocate persistent storage for alertmanager
    alertmanagerMain:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
    # Enable user workload monitoring stack
    enableUserWorkload: true
    # Allocate persistent storage for cluster prometheus
    prometheusK8s:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 40Gi
Monitoring VolSync
The metrics port for VolSync is (by default) protected via kube-auth-proxy. In order to grant Prometheus the ability to scrape the metrics, its ServiceAccount must be granted access to the volsync-metrics-reader ClusterRole. This can be accomplished by (substitute in the namespace & SA name of the Prometheus server):
$ kubectl create clusterrolebinding metrics --clusterrole=volsync-metrics-reader --serviceaccount=<namespace>:<service-account-name>
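The same binding can also be expressed declaratively, which may be convenient when cluster configuration is kept in version control. This sketch assumes the prometheus ServiceAccount in the volsync-system namespace from the Prometheus setup above; substitute the namespace and ServiceAccount name of your own Prometheus server.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: volsync-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus            # assumption: your Prometheus ServiceAccount
    namespace: volsync-system   # assumption: namespace of that ServiceAccount
```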
Optionally, authentication of the metrics port can be disabled by setting the Helm chart value metrics.disableAuth to true when deploying VolSync.
A ServiceMonitor needs to be defined in order to scrape metrics. If the ServiceMonitor CRD was defined in the cluster when the VolSync chart was deployed, this has already been added. If not, apply the following into the namespace where VolSync is deployed. Note that the control-plane labels may need to be adjusted.
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: volsync-monitor
  namespace: volsync-system
  labels:
    control-plane: volsync-controller
spec:
  endpoints:
    - interval: 30s
      path: /metrics
      port: https
      scheme: https
      tlsConfig:
        # Using self-signed cert for connection
        insecureSkipVerify: true
  selector:
    matchLabels:
      control-plane: volsync-controller
Kopia-specific metrics
In addition to the standard VolSync metrics above, the Kopia mover provides
comprehensive metrics for monitoring backup and restore operations, repository
health, and performance characteristics. These metrics use the volsync_kopia namespace to distinguish them from general VolSync metrics.
Common Kopia labels
All Kopia-specific metrics include these additional labels beyond the standard VolSync labels:
- repository
Name of the Kopia repository
- operation
Type of operation being performed (backup, restore, maintenance)
Kopia metrics categories
Backup/Restore Performance Metrics
- volsync_kopia_operation_duration_seconds
Type: Summary
Duration of Kopia operations (backup, restore, maintenance) in seconds. Use for monitoring operation performance trends, setting SLA alerts for backup/restore times, and identifying performance degradation.
- volsync_kopia_data_processed_bytes
Type: Summary
Amount of data processed during Kopia operations in bytes. Use for tracking data growth over time, capacity planning for storage backends, and correlating data size with operation duration.
- volsync_kopia_data_transfer_rate_bytes_per_second
Type: Summary
Data transfer rate during Kopia operations in bytes per second. Use for monitoring network performance, detecting bandwidth constraints, and comparing performance across different repositories.
- volsync_kopia_compression_ratio
Type: Summary
Compression ratio achieved during backup operations (compressed_size/original_size). Use for monitoring compression effectiveness, optimizing compression settings, and estimating storage savings.
- volsync_kopia_operation_success_total
Type: Counter
Total number of successful Kopia operations. Use for calculating success rates, tracking operation volume, and generating availability reports.
- volsync_kopia_operation_failure_total
Type: Counter
Total number of failed Kopia operations, with an additional failure_reason label indicating the cause:
  - prerequisites_failed: Repository connectivity or configuration issues
  - job_creation_failed: Kubernetes job creation failure
  - job_execution_failed: Job runtime failure
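The success and failure counters above can be combined into a success ratio. The following recording rule is a rough sketch under stated assumptions: the resource name, namespace, recorded series name, and 1-hour window are hypothetical choices, and the aggregation sums away the failure_reason label so that both counters share the same label set.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-kopia-recording   # hypothetical name
  namespace: volsync-system       # assumption: adjust to your monitoring setup
spec:
  groups:
    - name: volsync.kopia.recording
      rules:
        # Fraction of operations that succeeded over the last hour, per object
        # and operation type. The "or ... * 0" branch substitutes a zero
        # failure rate when no failure series exists yet. The result is NaN
        # when no operations ran in the window.
        - record: volsync_kopia:operation_success_ratio:rate1h
          expr: |
            sum by (obj_namespace, obj_name, operation) (rate(volsync_kopia_operation_success_total[1h]))
            /
            (
              sum by (obj_namespace, obj_name, operation) (rate(volsync_kopia_operation_success_total[1h]))
              +
              (
                sum by (obj_namespace, obj_name, operation) (rate(volsync_kopia_operation_failure_total[1h]))
                or
                sum by (obj_namespace, obj_name, operation) (rate(volsync_kopia_operation_success_total[1h])) * 0
              )
            )
```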
Repository Health Metrics
- volsync_kopia_repository_connectivity
Type: Gauge
Repository connectivity status (1 if connected, 0 if not). Use for alerting on repository connectivity issues, monitoring repository availability, and tracking uptime statistics.
- volsync_kopia_maintenance_operations_total
Type: Counter
Total number of repository maintenance operations performed with maintenance_type label (scheduled, manual). Use for tracking maintenance operation frequency, verifying maintenance scheduling, and monitoring repository health activities.
- volsync_kopia_maintenance_duration_seconds
Type: Summary
Duration of repository maintenance operations in seconds. Use for monitoring maintenance performance, planning maintenance windows, and detecting maintenance issues.
- volsync_kopia_repository_size_bytes
Type: Gauge
Total size of the Kopia repository in bytes. Use for monitoring repository growth, capacity planning, and cost optimization for cloud storage.
- volsync_kopia_repository_objects_total
Type: Gauge
Total number of objects in the Kopia repository. Use for tracking repository complexity, monitoring deduplication effectiveness, and performance correlation analysis.
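For capacity planning, the repository size gauge can be extrapolated rather than compared against a fixed value only. The rule below is an illustrative sketch: the alert name, namespace, 7-day lookback, 30-day horizon, and the 1e12-byte limit are all assumptions to be replaced with values that match the storage backend in use.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-kopia-capacity   # hypothetical name
  namespace: volsync-system      # assumption: adjust to your monitoring setup
spec:
  groups:
    - name: volsync.kopia.capacity
      rules:
        # Linear extrapolation of the last 7 days of growth, 30 days ahead.
        # Fires when the projected size exceeds the assumed 1e12-byte limit.
        - alert: VolSyncKopiaRepositoryGrowth
          expr: predict_linear(volsync_kopia_repository_size_bytes[7d], 30 * 24 * 3600) > 1e12
          for: 1h
          labels:
            severity: info
          annotations:
            summary: "Repository {{ $labels.repository }} is projected to exceed the size limit within 30 days"
```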
Snapshot Management Metrics
- volsync_kopia_snapshot_count
Type: Gauge
Current number of snapshots in the repository. Use for monitoring snapshot accumulation, verifying retention policy effectiveness, and capacity planning.
- volsync_kopia_snapshot_creation_success_total
Type: Counter
Total number of successful snapshot creations. Use for tracking backup success rates, generating backup reports, and monitoring backup reliability.
- volsync_kopia_snapshot_creation_failure_total
Type: Counter
Total number of failed snapshot creations with failure_reason label. Use for alerting on backup failures, troubleshooting backup issues, and tracking backup reliability trends.
- volsync_kopia_snapshot_size_bytes
Type: Summary
Size of individual snapshots in bytes. Use for monitoring snapshot size distribution, identifying data growth patterns, and optimizing backup strategies.
- volsync_kopia_deduplication_ratio
Type: Summary
Deduplication efficiency ratio (deduplicated_size/original_size). Use for monitoring deduplication effectiveness, optimizing storage efficiency, and calculating storage savings.
- volsync_kopia_retention_compliance
Type: Gauge
Retention policy compliance status (1 if compliant, 0 if not) with retention_type label (hourly, daily, weekly, monthly, yearly). Use for monitoring retention policy compliance, alerting on retention violations, and auditing backup retention.
Cache and Performance Metrics
- volsync_kopia_cache_hit_rate
Type: Gauge
Cache hit rate as a percentage (0-100). Use for monitoring cache effectiveness, optimizing cache configuration, and performance troubleshooting.
- volsync_kopia_cache_size_bytes
Type: Gauge
Current size of the Kopia cache in bytes. Use for monitoring cache utilization, optimizing cache capacity allocation, and tracking cache growth.
- volsync_kopia_parallel_operations_active
Type: Gauge
Number of currently active parallel operations. Use for monitoring parallelism utilization, detecting resource contention, and optimizing parallelism settings.
- volsync_kopia_job_retries_total
Type: Counter
Total number of job retries due to failures with retry_reason label (job_pod_failure). Use for monitoring job reliability, identifying transient vs persistent issues, and optimizing retry strategies.
- volsync_kopia_queue_depth
Type: Gauge
Current depth of the operation queue. Use for monitoring operation queuing, detecting processing bottlenecks, and optimizing queue processing.
Policy and Configuration Metrics
- volsync_kopia_policy_compliance
Type: Gauge
Policy compliance status (1 if compliant, 0 if not) with policy_type label (global, retention, compression). Use for monitoring policy compliance, alerting on policy violations, and auditing configuration compliance.
- volsync_kopia_configuration_errors_total
Type: Counter
Total number of configuration errors encountered with error_type label:
  - repository_validation_failed: Repository secret validation failure
  - custom_ca_validation_failed: Custom CA configuration failure
  - policy_config_validation_failed: Policy configuration failure
- volsync_kopia_custom_actions_executed_total
Type: Counter
Total number of custom actions executed with action_type label (before_snapshot, after_snapshot). Use for monitoring custom action usage, tracking action execution frequency, and verifying action configuration.
- volsync_kopia_custom_actions_duration_seconds
Type: Summary
Duration of custom action execution in seconds. Use for monitoring action performance, optimizing action scripts, and detecting action timeouts.
Kopia alerting recommendations
Critical Alerts
- volsync_kopia_repository_connectivity == 0 - Repository unreachable
- rate(volsync_kopia_operation_failure_total[5m]) > 0.1 - High failure rate
- volsync_kopia_retention_compliance == 0 - Retention policy violation
Warning Alerts
- volsync_kopia_operation_duration_seconds{quantile="0.9"} > 3600 - Slow operations (>1 hour)
- rate(volsync_kopia_job_retries_total[10m]) > 0.05 - Increased retry rate
- volsync_kopia_cache_hit_rate < 50 - Poor cache performance
Informational Alerts
- increase(volsync_kopia_maintenance_operations_total[24h]) == 0 - No maintenance in 24h
- volsync_kopia_repository_size_bytes > 1e12 - Repository size > 1TB
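The recommended expressions can be packaged as a PrometheusRule for clusters that run the Prometheus operator. This is a sketch rather than a shipped manifest: the resource name, namespace, alert names, "for" durations, and severity labels are assumptions, while the expressions are taken unchanged from the lists above. Note that rate() yields a per-second value, so the 0.1 and 0.05 thresholds may need tuning for workloads that run only a few operations per hour.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: volsync-kopia-alerts   # hypothetical name
  namespace: volsync-system    # assumption: adjust to your monitoring setup
spec:
  groups:
    - name: volsync.kopia.critical
      rules:
        - alert: KopiaRepositoryUnreachable
          expr: volsync_kopia_repository_connectivity == 0
          for: 5m
          labels:
            severity: critical
        - alert: KopiaHighFailureRate
          expr: rate(volsync_kopia_operation_failure_total[5m]) > 0.1
          labels:
            severity: critical
        - alert: KopiaRetentionViolation
          expr: volsync_kopia_retention_compliance == 0
          for: 15m
          labels:
            severity: critical
    - name: volsync.kopia.warning
      rules:
        - alert: KopiaSlowOperations
          expr: volsync_kopia_operation_duration_seconds{quantile="0.9"} > 3600
          labels:
            severity: warning
        - alert: KopiaIncreasedRetryRate
          expr: rate(volsync_kopia_job_retries_total[10m]) > 0.05
          labels:
            severity: warning
        - alert: KopiaPoorCachePerformance
          expr: volsync_kopia_cache_hit_rate < 50
          for: 30m
          labels:
            severity: warning
    - name: volsync.kopia.info
      rules:
        # Only evaluates for series that already exist; a repository that has
        # never reported a maintenance operation produces no result here.
        - alert: KopiaNoRecentMaintenance
          expr: increase(volsync_kopia_maintenance_operations_total[24h]) == 0
          labels:
            severity: info
        - alert: KopiaRepositoryLarge
          expr: volsync_kopia_repository_size_bytes > 1e12
          labels:
            severity: info
```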