Repository Organization and Deduplication

This guide explains how to organize your Kopia repositories for optimal deduplication and storage efficiency. Understanding Kopia’s deduplication capabilities will help you make informed decisions about repository structure that can significantly reduce storage costs.

Understanding Kopia’s Deduplication

Kopia uses content-addressable storage with content-defined chunking to achieve excellent deduplication. This means that identical data blocks are stored only once, regardless of which PVC they come from or where they appear in the file tree.

How Deduplication Works

When Kopia backs up your data:

  1. Content Chunking: Files are split into variable-sized chunks using a rolling hash algorithm. This means that if you insert data at the beginning of a file, only the new chunks need to be stored - the rest remain unchanged.

  2. Content Addressing: Each chunk is identified by a cryptographic hash of its contents (Kopia defaults to BLAKE2B-256-128 and also offers HMAC-SHA256 variants). If the same data appears anywhere in any backup, it has the same hash and is stored only once.

  3. Repository-Wide: Deduplication happens across the entire repository. All PVCs backing up to the same repository share the same chunk storage pool.

  4. Automatic: You don’t need to configure anything special. Deduplication happens automatically whenever duplicate data is detected.
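
The steps above can be illustrated with a toy sketch. It is not Kopia's splitter: the boundary test rehashes a trailing window instead of maintaining a true rolling hash, SHA-256 stands in for whichever hash the repository is configured with, and the names split_chunks and ChunkStore are hypothetical.

```python
import hashlib
import random

WINDOW = 16  # bytes examined when deciding on a chunk boundary
MASK = 0x3F  # cut when the low 6 bits are zero (about 64-byte chunks)


def split_chunks(data: bytes) -> list[bytes]:
    """Cut data wherever a hash of the trailing WINDOW bytes matches MASK."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        window = data[end - WINDOW:end]
        fingerprint = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if fingerprint & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


class ChunkStore:
    """Content-addressed pool: each chunk is kept once, keyed by its hash."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> tuple[list[str], int]:
        """Store data; return its chunk IDs and how many chunks were new."""
        ids, new = [], 0
        for chunk in split_chunks(data):
            chunk_id = hashlib.sha256(chunk).hexdigest()
            if chunk_id not in self.blobs:
                self.blobs[chunk_id] = chunk
                new += 1
            ids.append(chunk_id)
        return ids, new


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"inserted header" + original  # insertion at the start of the file

store = ChunkStore()
original_ids, new_first = store.put(original)
edited_ids, new_second = store.put(edited)
print(f"first backup:  {new_first} new chunks of {len(original_ids)}")
print(f"second backup: {new_second} new chunks of {len(edited_ids)}")
```

Because each boundary depends only on nearby bytes, every chunk after the first boundary of the original file reappears unchanged in the edited file. The second backup therefore adds only the chunks that cover the inserted header and the first original chunk.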

What Gets Deduplicated

Kopia’s deduplication is particularly effective for:

Common System Files

  • Operating system libraries and binaries shared across containers

  • Base container image layers (Alpine, Ubuntu, etc.)

  • Standard application frameworks and dependencies

Application Data

  • Configuration files that are similar across environments

  • Log files with repeated patterns

  • Database dumps with common schemas

  • Repeated documents or templates

Incremental Backups

  • Only changed blocks are stored between snapshots

  • File moves and renames are handled efficiently

  • Small changes to large files only store the changed chunks

Real-World Example

Imagine you have three PVCs, each containing a WordPress installation:

  • Without deduplication (separate repositories): Each backup stores the complete WordPress files, PHP libraries, and plugins independently. Total: ~150MB × 3 = 450MB

  • With deduplication (shared repository): Common WordPress core files, PHP libraries, and identical plugins are stored once. Only unique themes, uploads, and configuration differ. Total: ~150MB + 30MB + 30MB = 210MB

Storage savings: (450MB - 210MB) / 450MB ≈ 53%, and this is before compression.

The Single Repository Advantage

Using a single Kopia repository for all your PVCs is the recommended approach that maximizes deduplication benefits while maintaining security and isolation.

How Identity Ensures Isolation

Even though all PVCs share the same repository, Kopia maintains complete isolation between them through unique identities:

Automatic Identity Generation

Each ReplicationSource automatically gets a unique identity based on:

  • Username: Derived from the ReplicationSource name (e.g., webapp-data)

  • Hostname: Set to the namespace (e.g., production)

  • Combined Identity: webapp-data@production
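
As a minimal sketch of the naming rule above, the helper below joins the two parts into a user@host string. The function name kopia_identity is hypothetical, and the sketch assumes the names are used verbatim; any sanitization VolSync applies is not modeled.

```python
def kopia_identity(source_name: str, namespace: str) -> str:
    """Combine username and hostname into a user@host identity string."""
    return f"{source_name}@{namespace}"


# The same ReplicationSource name in two namespaces yields two identities,
# so their snapshot histories stay separate inside one repository.
production = kopia_identity("webapp-data", "production")
staging = kopia_identity("webapp-data", "staging")
print(production)
print(staging)
```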

Security Guarantees

  • Separate snapshots: Each identity has its own snapshot history

  • No data leakage: One client cannot see or restore another client’s snapshots

  • Independent retention: Each identity can have different retention policies

  • Concurrent access: Multiple clients can write to the repository simultaneously

For detailed information about identity management, see Multi-tenancy and Shared Repositories and Kopia Hostname Design Explained.

Benefits Summary

Using a single repository provides:

Storage Efficiency

  • 50-80% reduction in storage usage is common for similar workloads

  • Deduplication across all PVCs, not just within each PVC

  • Lower cloud storage costs

  • Reduced backup windows due to less data transfer

Operational Simplicity

  • One repository to monitor and maintain

  • Single maintenance schedule for the entire backup infrastructure

  • Unified repository policies

  • Simplified capacity planning

Performance

  • Kopia efficiently handles thousands of clients in a single repository

  • Shared cache benefits all backup operations

  • Concurrent access without conflicts

  • No synchronization overhead between repositories

Cost Optimization

For a real-world example with 10 PVCs, each containing similar application stacks:

  • Separate repositories: 10 × 100GB = 1000GB

  • Single repository with deduplication: ~400GB (60% savings)

  • Monthly savings (at $0.023/GB for S3): $13.80/month

  • Annual savings: $165.60/year

The savings scale with the number of PVCs and similarity of data.
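
The arithmetic behind these figures can be reproduced with a short sketch. The function name monthly_savings is hypothetical, and the price is the $0.023/GB figure used in the example; substitute your provider's actual rate.

```python
PRICE_PER_GB_MONTH = 0.023  # S3 price used in the example above


def monthly_savings(separate_gb: float, shared_gb: float,
                    price: float = PRICE_PER_GB_MONTH) -> float:
    """Monthly cost avoided by storing shared_gb instead of separate_gb."""
    return round((separate_gb - shared_gb) * price, 2)


separate_gb = 10 * 100  # ten repositories of 100GB each
shared_gb = 400         # one deduplicated repository
saved_fraction = 1 - shared_gb / separate_gb
per_month = monthly_savings(separate_gb, shared_gb)
print(f"{saved_fraction:.0%} less storage, "
      f"${per_month:.2f}/month, ${per_month * 12:.2f}/year")
```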

When to Use Alternative Organizations

While a single repository is recommended, there are valid scenarios where you might need different repository structures.

Using S3 Prefixes

If organizational requirements dictate logical separation within a single bucket, you can use S3 path prefixes. However, this comes with trade-offs.

When Prefixes Make Sense

  • Compliance requirements: Regulations mandate separation (HIPAA, PCI-DSS, GDPR)

  • Organizational boundaries: Different departments with separate budgets

  • Billing separation: Need to track storage costs per team or project

  • Access control: Different teams need different S3 bucket policies

  • Gradual migration: Transitioning from separate repositories

Prefix Configuration

Important: Prefix Format

Kopia treats an S3 prefix as a directory only when it ends with a trailing slash. VolSync automatically normalizes prefixes so that they always end with exactly one trailing slash, and it collapses consecutive slashes for consistency.

You can specify prefixes in any format - VolSync will normalize them:

  • s3://bucket/finance → normalized to s3://bucket/finance/

  • s3://bucket/finance/ → already correct, unchanged

  • s3://bucket/finance// → normalized to s3://bucket/finance/

  • s3://bucket/a//b///c → normalized to s3://bucket/a/b/c/
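
The normalization rule can be sketched as follows. This re-implements the behavior described above for illustration only and is not VolSync's actual code; the function name normalize_repository is hypothetical, and leaving a bare bucket root without a trailing slash is an assumption.

```python
import re


def normalize_repository(url: str) -> str:
    """Collapse repeated slashes in the prefix and ensure one trailing slash."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"expected a URL such as s3://bucket/prefix, got {url!r}")
    bucket, _, prefix = rest.partition("/")
    prefix = re.sub(r"/+", "/", prefix).strip("/")
    if not prefix:
        return f"{scheme}://{bucket}"  # bucket root: nothing to normalize
    return f"{scheme}://{bucket}/{prefix}/"


for raw in ("s3://bucket/finance", "s3://bucket/finance/",
            "s3://bucket/finance//", "s3://bucket/a//b///c"):
    print(f"{raw} -> {normalize_repository(raw)}")
```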

Prefix Configurations (all formats work)

# Application 1 - Finance Department
apiVersion: v1
kind: Secret
metadata:
  name: kopia-finance
  namespace: finance
type: Opaque
stringData:
  # Any format works - trailing slash is added automatically
  KOPIA_REPOSITORY: s3://company-backups/finance
  # Or explicitly with slash: s3://company-backups/finance/
  KOPIA_PASSWORD: finance-repo-password
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
---
# Application 2 - HR Department
apiVersion: v1
kind: Secret
metadata:
  name: kopia-hr
  namespace: hr
type: Opaque
stringData:
  # Different prefix, same bucket
  KOPIA_REPOSITORY: s3://company-backups/hr
  KOPIA_PASSWORD: hr-repo-password
  AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
  AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Important

Automatic Slash Normalization

VolSync automatically ensures all S3 prefixes have a trailing slash to treat them as directories. Per Kopia’s documentation: “Put trailing slash (/) if you want to use prefix as directory.”

Without the trailing slash, Kopia would concatenate the prefix with repository files, creating incorrect paths like financekopia.repository instead of finance/kopia.repository.

This normalization happens automatically - you don’t need to worry about trailing slashes in your configuration. The system handles it for you and logs the normalization during repository connection.

Common Prefix Patterns

# By department
KOPIA_REPOSITORY: s3://backups/finance
KOPIA_REPOSITORY: s3://backups/engineering
KOPIA_REPOSITORY: s3://backups/operations

# By environment
KOPIA_REPOSITORY: s3://backups/production
KOPIA_REPOSITORY: s3://backups/staging
KOPIA_REPOSITORY: s3://backups/development

# By application
KOPIA_REPOSITORY: s3://backups/webapp
KOPIA_REPOSITORY: s3://backups/database
KOPIA_REPOSITORY: s3://backups/cache

# Nested structure
KOPIA_REPOSITORY: s3://backups/production/webapp
KOPIA_REPOSITORY: s3://backups/production/database
KOPIA_REPOSITORY: s3://backups/staging/webapp

Understanding the Trade-Offs

What You Lose

Using S3 prefixes creates separate repositories, which means:

  • No cross-prefix deduplication: Duplicate data between s3://bucket/app1 and s3://bucket/app2 is stored twice

  • Higher storage costs: Each prefix stores its own complete chunk pool

  • Multiple maintenance operations: Each prefix requires separate maintenance

  • No shared cache: Benefits of shared repository cache are lost

What You Gain

  • Clear organizational boundaries: Easy to see which team owns which data

  • Independent lifecycle: Delete or archive one department’s data without affecting others

  • Billing clarity: S3 cost reports can break down storage by prefix

  • Access control: Apply different IAM policies to different prefixes

  • Compliance: Meet separation requirements for regulated data

Quantifying the Cost

Consider two applications with 80% data overlap:

  • Single repository: 100GB (base) + 20GB (unique) = 120GB

  • Separate prefixes: 100GB + 100GB = 200GB

  • Extra cost: 80GB × $0.023/GB = $1.84/month per duplicate application

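For 10 similar applications that share the same 80GB, the duplicated data grows to 9 × 80GB = 720GB, or about $16.56/month at this price. The cost scales linearly with data size, so multi-terabyte workloads can reach hundreds of dollars per month in unnecessary storage.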
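
A small model makes the scaling explicit. It assumes every application holds the same 80GB of shared data plus 20GB of its own, which matches the two-application example; the function name storage_gb is hypothetical.

```python
PRICE_PER_GB_MONTH = 0.023  # S3 price used in the example above


def storage_gb(apps: int, shared_gb: float, unique_gb: float) -> tuple[float, float]:
    """Return (single_repository_gb, separate_prefixes_gb) for similar apps."""
    single = shared_gb + apps * unique_gb      # shared data stored once
    separate = apps * (shared_gb + unique_gb)  # shared data stored per prefix
    return single, separate


for apps in (2, 10):
    single, separate = storage_gb(apps, shared_gb=80, unique_gb=20)
    extra_cost = round((separate - single) * PRICE_PER_GB_MONTH, 2)
    print(f"{apps} apps: {single}GB shared vs {separate}GB separate, "
          f"extra ${extra_cost:.2f}/month")
```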

Common Mistakes to Avoid

Mistake 1: Adding Prefixes “Just in Case”

# Don't do this without a good reason!
KOPIA_REPOSITORY: s3://backups/app1
KOPIA_REPOSITORY: s3://backups/app2

If you don’t have a specific compliance or organizational requirement, use the bucket root:

# Better - maximizes deduplication
KOPIA_REPOSITORY: s3://backups

Mistake 2: Per-PVC Prefixes

# This destroys deduplication benefits
KOPIA_REPOSITORY: s3://backups/webapp-pvc
KOPIA_REPOSITORY: s3://backups/database-pvc
KOPIA_REPOSITORY: s3://backups/cache-pvc

Each PVC already has a unique identity in Kopia. Prefixes are unnecessary and costly.

Mistake 3: Inconsistent Prefix Usage

# Mixing prefixes and non-prefixed repositories
KOPIA_REPOSITORY: s3://backups          # App 1
KOPIA_REPOSITORY: s3://backups/special  # App 2
KOPIA_REPOSITORY: s3://backups/test     # App 3

This creates confusion and reduces deduplication. Choose one approach and be consistent.
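
A check along these lines could flag the mixed layout from Mistake 3 before it is deployed. It is a sketch over a plain list of KOPIA_REPOSITORY values; the function name mixed_buckets is hypothetical, and the input is assumed to consist of well-formed scheme://bucket/prefix URLs.

```python
def mixed_buckets(repositories: list[str]) -> list[str]:
    """Return buckets that are used both at the root and with a prefix."""
    roots, prefixed = set(), set()
    for url in repositories:
        bucket, _, prefix = url.split("://", 1)[1].partition("/")
        if prefix.strip("/"):
            prefixed.add(bucket)
        else:
            roots.add(bucket)
    return sorted(roots & prefixed)


configured = ["s3://backups", "s3://backups/special", "s3://backups/test"]
print(mixed_buckets(configured))
```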

Using Multiple Buckets

In some cases, you might use completely separate S3 buckets:

When Multiple Buckets Make Sense

  • Geographic distribution: US bucket, EU bucket, APAC bucket for data residency

  • Security levels: High-security data in a locked-down bucket, general data in standard bucket

  • Storage tiers: Hot data in one bucket, cold archive data in a bucket that uses a Glacier storage class

  • Different cloud providers: AWS bucket, Azure container, GCS bucket

Configuration Example

# US Production Data
apiVersion: v1
kind: Secret
metadata:
  name: kopia-us-production
stringData:
  KOPIA_REPOSITORY: s3://backups-us-prod
  AWS_REGION: us-east-1
  # ... credentials
---
# EU Production Data (GDPR compliance)
apiVersion: v1
kind: Secret
metadata:
  name: kopia-eu-production
stringData:
  KOPIA_REPOSITORY: s3://backups-eu-prod
  AWS_REGION: eu-west-1
  # ... credentials

This approach has the same deduplication limitations as using prefixes, but may be necessary for regulatory or architectural reasons.

Migration Scenarios

Moving Between Repository Structures

Migrating from Prefixes to Single Repository

If you started with prefixes and want to consolidate for better deduplication:

  1. Create the new single repository secret

apiVersion: v1
kind: Secret
metadata:
  name: kopia-unified
stringData:
  KOPIA_REPOSITORY: s3://new-unified-backups
  KOPIA_PASSWORD: new-repo-password
  # ... credentials

  2. Update ReplicationSources gradually

Start with non-critical PVCs to verify the configuration:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: test-app
spec:
  kopia:
    repository: kopia-unified  # Changed from kopia-finance
    # Identity is automatically maintained

  3. Run initial backup

The first backup to the new repository will be a full backup, but subsequent backups will deduplicate against all other data in the unified repository.

  4. Monitor storage usage

# Watch repository grow and observe deduplication
kubectl logs -f <mover-pod-name>

  5. Migrate remaining PVCs

Once confident, update all ReplicationSources to use the unified repository.

  6. Clean up old repositories

After verifying backups and performing test restores from the new repository, you can delete the old prefixed repositories.

Warning

Data Migration Note

There is no automatic way to migrate existing snapshots from one repository to another while preserving snapshot history. When you change repositories, you start with a fresh backup history. Plan for:

  • Initial full backups to the new repository

  • Retention of old repository until retention periods expire

  • Higher storage usage temporarily while both repositories exist

Migrating from Single Repository to Prefixes

If compliance requirements force you to separate data:

  1. Create new prefix-based secrets

# One secret per prefix/department
apiVersion: v1
kind: Secret
metadata:
  name: kopia-finance-only
stringData:
  KOPIA_REPOSITORY: s3://backups/finance
  KOPIA_PASSWORD: finance-password
  # ... credentials

  2. Update affected ReplicationSources

# Change repository reference
spec:
  kopia:
    repository: kopia-finance-only  # Changed from kopia-shared

  3. Accept storage cost increase

The first backup to each prefixed repository will be a full backup. Monitor storage costs carefully as deduplication benefits are lost.

Best Practices Summary

Repository Organization Decision Tree

Work through these questions in order to decide on your repository structure:

  1. Do you have compliance requirements for data separation?

    • Yes → Use separate buckets or prefixes per compliance boundary

    • No → Continue to question 2

  2. Do you need separate billing or cost tracking?

    • Yes → Use S3 prefixes with cost allocation tags

    • No → Continue to question 3

  3. Do different teams need different access controls?

    • Yes → Use S3 prefixes with IAM policies

    • No → Continue to question 4

  4. Are you backing up similar workloads?

    • Yes → Use a single repository (maximum deduplication)

    • No → Single repository still works, but benefits are smaller

Recommended Default: Single Repository

Unless you answered “yes” to any of questions 1-3, use a single S3 bucket without prefixes.
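
The four questions can also be written as a small function, which may be handy when reviewing many applications at once. This is only a restatement of the list above; the function and parameter names are hypothetical.

```python
def recommend_layout(compliance: bool, separate_billing: bool,
                     separate_access: bool, similar_workloads: bool) -> str:
    """Walk the four questions in order and return the first match."""
    if compliance:
        return "separate buckets or prefixes per compliance boundary"
    if separate_billing:
        return "S3 prefixes with cost allocation tags"
    if separate_access:
        return "S3 prefixes with IAM policies"
    if similar_workloads:
        return "single repository (maximum deduplication)"
    return "single repository (smaller deduplication benefit)"


print(recommend_layout(compliance=False, separate_billing=False,
                       separate_access=False, similar_workloads=True))
```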

Configuration Checklist

For Single Repository (Recommended)

✓ Use bucket root: s3://my-backups
✓ No path prefixes
✓ Same secret shared across namespaces (if appropriate)
✓ Let automatic identity generation handle isolation
✓ Single maintenance schedule for the repository

For Prefixed Repositories (When Required)

✓ Specify prefixes in any format (slashes added automatically)
✓ Consistent prefix naming scheme
✓ Document the reason for separation
✓ Separate maintenance schedules per prefix
✓ Monitor storage costs per prefix
✗ Don't use per-PVC prefixes
✗ Don't mix prefixed and non-prefixed in same bucket

For Multiple Buckets (When Necessary)

✓ Geographic or regulatory reasons documented
✓ Separate secrets per bucket
✓ Clear naming convention
✓ Independent maintenance schedules
✓ Cost tracking per bucket

Monitoring and Verification

Check Deduplication Effectiveness

While Kopia doesn’t expose per-client deduplication stats, you can monitor overall repository efficiency:

# Enable debug logging to see deduplication in action
# Add to your repository secret:
# KOPIA_LOG_LEVEL: "debug"

# Watch backup logs
kubectl logs -f <replicationsource-pod>

# Look for messages like:
# "Hashing file example.txt"
# "Stored 50 blocks (1.2 MB)"
# "Deduplicated 150 blocks (3.8 MB)"

Monitor Storage Growth

Track your S3 bucket size over time:

# AWS CLI
aws s3 ls s3://my-backups --recursive --summarize --human-readable

# Check storage growth rate
# Initial backup: 500GB
# After 10 similar PVCs: 800GB (instead of 5000GB without deduplication)
# Storage savings: 1 - 800/5000 = 84%

Verify Repository Health

Regular maintenance keeps the repository optimized:

# Configure KopiaMaintenance for repository optimization
apiVersion: volsync.backube/v1alpha1
kind: KopiaMaintenance
metadata:
  name: repo-maintenance
spec:
  repository: kopia-shared-repo
  trigger:
    schedule: "0 0 * * 0"  # Weekly on Sunday
  cachePVC: kopia-cache

See KopiaMaintenance CRD Reference for detailed maintenance configuration.

Additional Resources

For questions about Kopia’s deduplication algorithm and performance characteristics, see the official Kopia documentation.