Repository Organization and Deduplication

This guide explains how to organize your Kopia repositories for optimal deduplication and storage efficiency. Understanding Kopia’s deduplication capabilities will help you make informed decisions about repository structure that can significantly reduce storage costs.

Understanding Kopia’s Deduplication 

Kopia combines content-addressable storage with content-defined chunking, so identical chunks of data are stored only once, regardless of which PVC they come from or where they appear in the file tree.

How Deduplication Works 

When Kopia backs up your data:

Content Chunking: Files are split into variable-sized chunks using a rolling hash algorithm. Because chunk boundaries depend on the content itself rather than on fixed offsets, inserting data at the beginning of a file changes only the chunks around the insertion; the remaining chunks are unchanged and are not stored again.
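Content Addressing: Each chunk is identified by a cryptographic hash of its contents (BLAKE2B-256-128 by default in upstream Kopia; the algorithm is chosen when the repository is created). If the same data appears anywhere in any backup, it has the same hash and is stored only once.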
Repository-Wide: Deduplication happens across the entire repository. All PVCs backing up to the same repository share the same chunk storage pool.
Automatic: You don’t need to configure anything special. Deduplication happens automatically whenever duplicate data is detected.
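
The chunking and addressing steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Kopia's implementation: the window size, the boundary mask, and the helper names chunk and store are hypothetical, the window hash is recomputed at every position instead of being rolled incrementally, and SHA-256 from hashlib simply stands in for whatever hash the repository uses.

```python
import hashlib
import random

WINDOW = 16          # bytes of trailing context that decide a boundary (assumed)
MASK = (1 << 6) - 1  # cut when the low 6 bits are zero: ~64-byte average chunks


def chunk(data: bytes) -> list[bytes]:
    """Split data where a hash of the trailing WINDOW bytes matches MASK."""
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i + 1 - WINDOW): i + 1]
        digest = hashlib.blake2b(window, digest_size=4).digest()
        if int.from_bytes(digest, "big") & MASK == 0:
            chunks.append(data[start: i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks


def store(repo: dict[str, bytes], data: bytes) -> int:
    """Add the chunks of data to repo; return how many were not already there."""
    added = 0
    for piece in chunk(data):
        key = hashlib.sha256(piece).hexdigest()  # content address of the chunk
        if key not in repo:
            repo[key] = piece
            added += 1
    return added


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"0123456789" + original  # insert ten bytes at the very beginning

repo: dict[str, bytes] = {}
first = store(repo, original)
second = store(repo, edited)
print(f"first backup stored {first} chunks; the edited file added {second}")
```

Only the chunks near the insertion are new. Every later boundary falls in the same place relative to the content, so those chunks hash to keys that are already in the store and are skipped.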

What Gets Deduplicated 

Kopia’s deduplication is particularly effective for:

Common System Files

Operating system libraries and binaries shared across containers
Base container image layers (Alpine, Ubuntu, etc.)
Standard application frameworks and dependencies

Application Data

Configuration files that are similar across environments
Log files with repeated patterns
Database dumps with common schemas
Repeated documents or templates

Incremental Backups

Only changed blocks are stored between snapshots
File moves and renames are handled efficiently
Small changes to large files only store the changed chunks

Real-World Example

Imagine you have three PVCs, each containing a WordPress installation:

Without deduplication (separate repositories): Each backup stores the complete WordPress files, PHP libraries, and plugins independently. Total: ~150MB × 3 = 450MB
With deduplication (shared repository): Common WordPress core files, PHP libraries, and identical plugins are stored once. Only unique themes, uploads, and configuration differ. Total: ~150MB + 30MB + 30MB = 210MB

Storage savings: 53% - and this is before compression!
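
The arithmetic behind that figure, as a small Python check. The helper name savings_percent is hypothetical and the sizes are the illustrative ones from the example, not measurements.

```python
def savings_percent(separate_mb: float, shared_mb: float) -> float:
    """Percentage of storage saved by the shared repository."""
    return 100 * (separate_mb - shared_mb) / separate_mb


separate = 3 * 150        # three independent copies of ~150 MB
shared = 150 + 30 + 30    # one full copy plus ~30 MB unique to each other site
print(f"{separate} MB vs {shared} MB: {savings_percent(separate, shared):.0f}% saved")
# prints "450 MB vs 210 MB: 53% saved"
```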

The Single Repository Advantage 

Using a single Kopia repository for all your PVCs is the recommended approach: it maximizes deduplication while keeping each PVC’s snapshot history separate.

Recommended Configuration 

Single S3 Bucket, No Prefixes

apiVersion: v1
kind: Secret
metadata:
  name: kopia-shared-repo
  namespace: production  # same namespace as the ReplicationSources that reference it
type: Opaque
stringData:
  # Single repository for ALL PVCs - no path prefix
  KOPIA_REPOSITORY: s3://company-backups
  KOPIA_PASSWORD: secure-repository-password
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  # For MinIO or other S3-compatible storage
  AWS_S3_ENDPOINT: https://s3.example.com

Important

Use the bucket root without any path prefixes. This is the key to maximum deduplication.

Multiple PVCs Using the Same Repository

# Application 1 - Web Frontend
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: webapp-data
  namespace: production
spec:
  sourcePVC: webapp-storage
  trigger:
    schedule: "0 2 * * *"
  kopia:
    repository: kopia-shared-repo
    # Automatic identity: webapp-data@production
---
# Application 2 - Database
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: database-backup
  namespace: production
spec:
  sourcePVC: postgres-data
  trigger:
    schedule: "0 3 * * *"
  kopia:
    repository: kopia-shared-repo
    # Automatic identity: database-backup@production
---
# Application 3 - File Storage
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: shared-files
  namespace: production
spec:
  sourcePVC: nfs-storage
  trigger:
    schedule: "0 4 * * *"
  kopia:
    repository: kopia-shared-repo
    # Automatic identity: shared-files@production

All three backups share the same repository and benefit from deduplication, yet each maintains its own snapshot history and identity.

How Identity Ensures Isolation 

Even though all PVCs share the same repository, Kopia maintains complete isolation between them through unique identities:

Automatic Identity Generation

Each ReplicationSource automatically gets a unique identity based on:

Username: Derived from the ReplicationSource name (e.g., webapp-data)
Hostname: Set to the namespace (e.g., production)
Combined Identity: webapp-data@production

Security Guarantees

Separate snapshots: Each identity has its own snapshot history
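No accidental cross-access: Each client works with the snapshots of its own identity by default. The identity is a label rather than a credential, so anyone holding the repository password and storage credentials can still read every snapshot in the repository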
Independent retention: Each identity can have different retention policies
Concurrent access: Multiple clients can write to the repository simultaneously
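
One way to picture this layout is a shared chunk pool with a separate snapshot list per identity. The sketch below is a toy model only: the class ToyRepository and its fields are hypothetical and do not reflect Kopia's on-disk format, and whole files stand in for chunks.

```python
import hashlib
from collections import defaultdict


class ToyRepository:
    """Shared chunk pool plus per-identity snapshot lists (illustrative only)."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}
        self.snapshots: dict[str, list[list[str]]] = defaultdict(list)

    def backup(self, name: str, namespace: str, files: list[bytes]) -> str:
        identity = f"{name}@{namespace}"  # username@hostname, as described above
        manifest = []
        for content in files:
            key = hashlib.sha256(content).hexdigest()
            self.chunks.setdefault(key, content)  # stored once, whoever sends it
            manifest.append(key)
        self.snapshots[identity].append(manifest)
        return identity


repo = ToyRepository()
shared_lib = b"libssl contents"
web = repo.backup("webapp-data", "production", [shared_lib, b"web uploads"])
db = repo.backup("database-backup", "production", [shared_lib, b"table data"])

print(len(repo.chunks))        # 3 unique blobs, although 4 were submitted
print(sorted(repo.snapshots))  # ['database-backup@production', 'webapp-data@production']
```

Both identities reference the same stored copy of the shared library, yet each keeps its own list of snapshots.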

For detailed information about identity management, see Multi-tenancy and Shared Repositories and Kopia Hostname Design Explained.

Benefits Summary 

Using a single repository provides:

Storage Efficiency

50-80% reduction in storage usage is common for similar workloads
Deduplication across all PVCs, not just within each PVC
Lower cloud storage costs
Reduced backup windows due to less data transfer

Operational Simplicity

One repository to monitor and maintain
Single maintenance schedule for the entire backup infrastructure
Unified repository policies
Simplified capacity planning

Performance

Kopia efficiently handles thousands of clients in a single repository
Shared cache benefits all backup operations
Concurrent access without conflicts
No synchronization overhead between repositories

Cost Optimization

For a real-world example with 10 PVCs, each containing similar application stacks:

Separate repositories: 10 × 100GB = 1000GB
Single repository with deduplication: ~400GB (60% savings)
Monthly savings (at $0.023/GB for S3): $13.80/month
Annual savings: $165.60/year

The savings scale with the number of PVCs and similarity of data.

When to Use Alternative Organizations 

While a single repository is recommended, there are valid scenarios where you might need different repository structures.

Using S3 Prefixes 

If organizational requirements dictate logical separation within a single bucket, you can use S3 path prefixes. However, this comes with trade-offs.

When Prefixes Make Sense

Compliance requirements: Regulations mandate separation (HIPAA, PCI-DSS, GDPR)
Organizational boundaries: Different departments with separate budgets
Billing separation: Need to track storage costs per team or project
Access control: Different teams need different IAM or bucket-policy rules, scoped by prefix
Gradual migration: Transitioning from separate repositories

Prefix Configuration 

Important: Prefix Format

Kopia treats an S3 prefix as a directory only when it ends with a trailing slash. VolSync automatically normalizes prefixes so that they always end with exactly one trailing slash, and collapses multiple consecutive slashes for consistency.

You can specify prefixes in any format - VolSync will normalize them:

s3://bucket/finance → normalized to s3://bucket/finance/
s3://bucket/finance/ → already correct, unchanged
s3://bucket/finance// → normalized to s3://bucket/finance/
s3://bucket/a//b///c → normalized to s3://bucket/a/b/c/
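
The normalization rules listed above can be expressed as a short Python function. This is a sketch of the described behavior, not VolSync's code: the function name normalize_repository is hypothetical, and leaving a bare bucket URL unchanged is an assumption.

```python
def normalize_repository(url: str) -> str:
    """Collapse repeated slashes in the prefix and end it with one slash."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a repository URL: {url!r}")
    bucket, _, prefix = rest.partition("/")
    parts = [p for p in prefix.split("/") if p]  # drop empty path segments
    if not parts:
        return f"{scheme}://{bucket}"  # assumption: bucket root stays as-is
    return f"{scheme}://{bucket}/" + "/".join(parts) + "/"


for raw in ("s3://bucket/finance", "s3://bucket/finance/",
            "s3://bucket/finance//", "s3://bucket/a//b///c"):
    print(f"{raw} -> {normalize_repository(raw)}")
```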

Prefix Configurations (all formats work)

# Application 1 - Finance Department
apiVersion: v1
kind: Secret
metadata:
  name: kopia-finance
  namespace: finance
type: Opaque
stringData:
  # Any format works - trailing slash is added automatically
  KOPIA_REPOSITORY: s3://company-backups/finance
  # Or explicitly with slash: s3://company-backups/finance/
  KOPIA_PASSWORD: finance-repo-password
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
---
# Application 2 - HR Department
apiVersion: v1
kind: Secret
metadata:
  name: kopia-hr
  namespace: hr
type: Opaque
stringData:
  # Different prefix, same bucket
  KOPIA_REPOSITORY: s3://company-backups/hr
  KOPIA_PASSWORD: hr-repo-password
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Important

Automatic Slash Normalization

VolSync automatically ensures all S3 prefixes have a trailing slash to treat them as directories. Per Kopia’s documentation: “Put trailing slash (/) if you want to use prefix as directory.”

Without the trailing slash, Kopia would concatenate the prefix with repository files, creating incorrect paths like financekopia.repository instead of finance/kopia.repository.

This normalization happens automatically - you don’t need to worry about trailing slashes in your configuration. The system handles it for you and logs the normalization during repository connection.

Common Prefix Patterns

# By department
KOPIA_REPOSITORY: s3://backups/finance
KOPIA_REPOSITORY: s3://backups/engineering
KOPIA_REPOSITORY: s3://backups/operations

# By environment
KOPIA_REPOSITORY: s3://backups/production
KOPIA_REPOSITORY: s3://backups/staging
KOPIA_REPOSITORY: s3://backups/development

# By application
KOPIA_REPOSITORY: s3://backups/webapp
KOPIA_REPOSITORY: s3://backups/database
KOPIA_REPOSITORY: s3://backups/cache

# Nested structure
KOPIA_REPOSITORY: s3://backups/production/webapp
KOPIA_REPOSITORY: s3://backups/production/database
KOPIA_REPOSITORY: s3://backups/staging/webapp

Understanding the Trade-Offs 

What You Lose

Using S3 prefixes creates separate repositories, which means:

No cross-prefix deduplication: Duplicate data between s3://bucket/app1 and s3://bucket/app2 is stored twice
Higher storage costs: Each prefix stores its own complete chunk pool
Multiple maintenance operations: Each prefix requires separate maintenance
No shared cache: Benefits of shared repository cache are lost

What You Gain

Clear organizational boundaries: Easy to see which team owns which data
Independent lifecycle: Delete or archive one department’s data without affecting others
Billing clarity: S3 cost reports can break down storage by prefix
Access control: Apply different IAM policies to different prefixes
Compliance: Meet separation requirements for regulated data

Quantifying the Cost

Consider two applications with 80% data overlap:

Single repository: 100GB (base) + 20GB (unique) = 120GB
Separate prefixes: 100GB + 100GB = 200GB
Extra cost: 80GB × $0.023/GB = $1.84/month per duplicate application

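For 10 similar applications of this size, that is 9 × 80GB = 720GB of duplicated data, or about $16.56/month. The extra cost grows linearly with both the number of applications and the amount of data per application, so multi-terabyte workloads can reach hundreds of dollars per month in unnecessary storage costs.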
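
How much the duplicated data costs depends strongly on data volume. The sketch below extends the two-application example to any number of applications, assuming (hypothetically) that every additional application shares the same fraction of its data with the first one; the function name extra_monthly_cost is illustrative.

```python
PRICE_PER_GB_MONTH = 0.023  # S3 price used in the example above


def extra_monthly_cost(apps: int, size_gb: float, overlap: float) -> float:
    """Monthly cost of storing `apps` similar applications in separate
    repositories instead of one deduplicated repository."""
    separate = apps * size_gb
    shared = size_gb + (apps - 1) * size_gb * (1 - overlap)
    return (separate - shared) * PRICE_PER_GB_MONTH


print(f"{extra_monthly_cost(2, 100, 0.8):.2f}")    # 1.84
print(f"{extra_monthly_cost(10, 100, 0.8):.2f}")   # 16.56
print(f"{extra_monthly_cost(10, 2000, 0.8):.2f}")  # 331.20
```

At 100GB per application the penalty is modest; at 2TB per application the same overlap costs a few hundred dollars per month.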

Common Mistakes to Avoid 

Mistake 1: Adding Prefixes “Just in Case”

# Don't do this without a good reason!
KOPIA_REPOSITORY: s3://backups/app1
KOPIA_REPOSITORY: s3://backups/app2

If you don’t have a specific compliance or organizational requirement, use the bucket root:

# Better - maximizes deduplication
KOPIA_REPOSITORY: s3://backups

Mistake 2: Per-PVC Prefixes

# This destroys deduplication benefits
KOPIA_REPOSITORY: s3://backups/webapp-pvc
KOPIA_REPOSITORY: s3://backups/database-pvc
KOPIA_REPOSITORY: s3://backups/cache-pvc

Each PVC already has a unique identity in Kopia. Prefixes are unnecessary and costly.

Mistake 3: Inconsistent Prefix Usage

# Mixing prefixes and non-prefixed repositories
KOPIA_REPOSITORY: s3://backups          # App 1
KOPIA_REPOSITORY: s3://backups/special  # App 2
KOPIA_REPOSITORY: s3://backups/test     # App 3

This creates confusion and reduces deduplication. Choose one approach and be consistent.

Using Multiple Buckets 

In some cases, you might use completely separate S3 buckets:

When Multiple Buckets Make Sense

Geographic distribution: US bucket, EU bucket, APAC bucket for data residency
Security levels: High-security data in a locked-down bucket, general data in standard bucket
Storage tiers: Hot data in one bucket, cold archive data in a bucket that uses an archival storage class (for example S3 Glacier)
Different cloud providers: AWS bucket, Azure container, GCS bucket

Configuration Example

# US Production Data
apiVersion: v1
kind: Secret
metadata:
  name: kopia-us-production
stringData:
  KOPIA_REPOSITORY: s3://backups-us-prod
  AWS_REGION: us-east-1
  # ... credentials
---
# EU Production Data (GDPR compliance)
apiVersion: v1
kind: Secret
metadata:
  name: kopia-eu-production
stringData:
  KOPIA_REPOSITORY: s3://backups-eu-prod
  AWS_REGION: eu-west-1
  # ... credentials

This approach has the same deduplication limitations as using prefixes, but may be necessary for regulatory or architectural reasons.

Migration Scenarios 

Moving Between Repository Structures 

Migrating from Prefixes to Single Repository

If you started with prefixes and want to consolidate for better deduplication:

Create the new single repository secret

apiVersion: v1
kind: Secret
metadata:
  name: kopia-unified
stringData:
  KOPIA_REPOSITORY: s3://new-unified-backups
  KOPIA_PASSWORD: new-repo-password
  # ... credentials

Update ReplicationSources gradually

Start with non-critical PVCs to verify the configuration:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: test-app
spec:
  kopia:
    repository: kopia-unified  # Changed from kopia-finance
    # Identity is automatically maintained

Run initial backup

The first backup to the new repository will be a full backup, but subsequent backups will deduplicate against all other data in the unified repository.

Monitor storage usage

# Watch repository grow and observe deduplication
kubectl logs -f <mover-pod-name>

Migrate remaining PVCs

Once confident, update all ReplicationSources to use the unified repository.

Clean up old repositories

After verifying backups and performing test restores from the new repository, you can delete the old prefixed repositories.

Warning

Data Migration Note

VolSync does not provide an automatic way to migrate existing snapshots from one repository to another while preserving snapshot history. When you change repositories, backups start with a fresh history. Plan for:

Initial full backups to the new repository
Retention of old repository until retention periods expire
Higher storage usage temporarily while both repositories exist

Migrating from Single Repository to Prefixes

If compliance requirements force you to separate data:

Create new prefix-based secrets

# One secret per prefix/department
apiVersion: v1
kind: Secret
metadata:
  name: kopia-finance-only
stringData:
  KOPIA_REPOSITORY: s3://backups/finance
  KOPIA_PASSWORD: finance-password
  # ... credentials

Update affected ReplicationSources

# Change repository reference
spec:
  kopia:
    repository: kopia-finance-only  # Changed from kopia-shared

Accept storage cost increase

The first backup to each prefixed repository will be a full backup. Monitor storage costs carefully as deduplication benefits are lost.

Best Practices Summary 

Repository Organization Decision Tree 

Work through these questions in order to decide on your repository structure:

1. Do you have compliance requirements for data separation?
- Yes → Use separate buckets or prefixes per compliance boundary
- No → Continue to question 2
2. Do you need separate billing or cost tracking?
- Yes → Use S3 prefixes with cost allocation tags
- No → Continue to question 3
3. Do different teams need different access controls?
- Yes → Use S3 prefixes with IAM policies
- No → Continue to question 4
4. Are you backing up similar workloads?
- Yes → Use a single repository (maximum deduplication)
- No → Single repository still works, but benefits are smaller

Recommended Default: Single Repository

Unless you answered “yes” to any of questions 1-3, use a single S3 bucket without prefixes.
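
The same decision logic, written as a small Python function for readers who prefer code to prose. The function name choose_layout and its parameter names are hypothetical; the returned strings repeat the recommendations above.

```python
def choose_layout(compliance: bool, separate_billing: bool,
                  separate_access: bool, similar_workloads: bool) -> str:
    """Return the recommended repository layout; earlier questions win."""
    if compliance:
        return "separate buckets or prefixes per compliance boundary"
    if separate_billing:
        return "S3 prefixes with cost allocation tags"
    if separate_access:
        return "S3 prefixes with IAM policies"
    if similar_workloads:
        return "single repository (maximum deduplication)"
    return "single repository (smaller deduplication benefit)"


print(choose_layout(False, False, False, True))
# prints "single repository (maximum deduplication)"
```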

Configuration Checklist 

For Single Repository (Recommended)

✓ Use bucket root: s3://my-backups
✓ No path prefixes
✓ Same secret contents copied into each namespace that uses the repository (Secrets are namespace-scoped)
✓ Let automatic identity generation handle isolation
✓ Single maintenance schedule for the repository

For Prefixed Repositories (When Required)

✓ Specify prefixes in any format (slashes added automatically)
✓ Consistent prefix naming scheme
✓ Document the reason for separation
✓ Separate maintenance schedules per prefix
✓ Monitor storage costs per prefix
✗ Don't use per-PVC prefixes
✗ Don't mix prefixed and non-prefixed in same bucket

For Multiple Buckets (When Necessary)

✓ Geographic or regulatory reasons documented
✓ Separate secrets per bucket
✓ Clear naming convention
✓ Independent maintenance schedules
✓ Cost tracking per bucket

Monitoring and Verification 

Check Deduplication Effectiveness

While Kopia doesn’t expose per-client deduplication stats, you can monitor overall repository efficiency:

# Enable debug logging to see deduplication in action
# Add to your repository secret:
# KOPIA_LOG_LEVEL: "debug"

# Watch backup logs
kubectl logs -f <replicationsource-pod>

# Look for messages like:
# "Hashing file example.txt"
# "Stored 50 blocks (1.2 MB)"
# "Deduplicated 150 blocks (3.8 MB)"

Monitor Storage Growth

Track your S3 bucket size over time:

# AWS CLI
aws s3 ls s3://my-backups --recursive --summarize --human-readable

# Check storage growth rate
# Initial backup: 500GB
# After 10 similar PVCs: 800GB (instead of 5000GB without deduplication)
# Storage savings: (5000 - 800) / 5000 = 84%

Verify Repository Health

Regular maintenance keeps the repository optimized:

# Configure KopiaMaintenance for repository optimization
apiVersion: volsync.backube/v1alpha1
kind: KopiaMaintenance
metadata:
  name: repo-maintenance
spec:
  repository: kopia-shared-repo
  trigger:
    schedule: "0 0 * * 0"  # Weekly on Sunday
  cachePVC: kopia-cache

See KopiaMaintenance CRD Reference for detailed maintenance configuration.

Additional Resources 

Storage Backends - Detailed S3 and other storage backend configuration
Multi-tenancy and Shared Repositories - Understanding identity management in shared repositories
Kopia Hostname Design Explained - How VolSync ensures unique identities
Backup Configuration - Complete backup configuration options
KopiaMaintenance CRD Reference - Repository maintenance and optimization
Troubleshooting Guide - Debugging repository connection and backup issues

For questions about Kopia’s deduplication algorithm and performance characteristics, see the official Kopia documentation.