======================================= Repository Organization and Deduplication ======================================= .. contents:: Repository Organization Guide :local: This guide explains how to organize your Kopia repositories for optimal deduplication and storage efficiency. Understanding Kopia's deduplication capabilities will help you make informed decisions about repository structure that can significantly reduce storage costs. Understanding Kopia's Deduplication ==================================== Kopia uses content-addressable storage with content-defined chunking to achieve excellent deduplication. This means that identical data blocks are stored only once, regardless of which PVC they come from or where they appear in the file tree. How Deduplication Works ----------------------- When Kopia backs up your data: 1. **Content Chunking**: Files are split into variable-sized chunks using a rolling hash algorithm. This means that if you insert data at the beginning of a file, only the new chunks need to be stored - the rest remain unchanged. 2. **Content Addressing**: Each chunk is identified by its cryptographic hash (SHA-256). If the same data appears anywhere in any backup, it has the same hash and is stored only once. 3. **Repository-Wide**: Deduplication happens across the entire repository. All PVCs backing up to the same repository share the same chunk storage pool. 4. **Automatic**: You don't need to configure anything special. Deduplication happens automatically whenever duplicate data is detected. What Gets Deduplicated ----------------------- Kopia's deduplication is particularly effective for: **Common System Files** - Operating system libraries and binaries shared across containers - Base container image layers (Alpine, Ubuntu, etc.) - Standard application frameworks and dependencies **Application Data** - Configuration files that are similar across environments - Log files with repeated patterns - Database dumps with common schemas - Repeated documents or templates **Incremental Backups** - Only changed blocks are stored between snapshots - File moves and renames are handled efficiently - Small changes to large files only store the changed chunks **Real-World Example** Imagine you have three PVCs, each containing a WordPress installation: - **Without deduplication** (separate repositories): Each backup stores the complete WordPress files, PHP libraries, and plugins independently. Total: ~150MB × 3 = 450MB - **With deduplication** (shared repository): Common WordPress core files, PHP libraries, and identical plugins are stored once. Only unique themes, uploads, and configuration differ. Total: ~150MB + 30MB + 30MB = 210MB **Storage savings: 53%** - and this is before compression! The Single Repository Advantage ================================ Using a single Kopia repository for all your PVCs is the recommended approach that maximizes deduplication benefits while maintaining security and isolation. Recommended Configuration ------------------------- **Single S3 Bucket, No Prefixes** .. code-block:: yaml apiVersion: v1 kind: Secret metadata: name: kopia-shared-repo namespace: backup-system type: Opaque stringData: # Single repository for ALL PVCs - no path prefix KOPIA_REPOSITORY: s3://company-backups KOPIA_PASSWORD: secure-repository-password AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY # For MinIO or other S3-compatible storage AWS_S3_ENDPOINT: https://s3.example.com .. important:: Use the **bucket root** without any path prefixes. This is the key to maximum deduplication. **Multiple PVCs Using the Same Repository** .. code-block:: yaml # Application 1 - Web Frontend apiVersion: volsync.backube/v1alpha1 kind: ReplicationSource metadata: name: webapp-data namespace: production spec: sourcePVC: webapp-storage trigger: schedule: "0 2 * * *" kopia: repository: kopia-shared-repo # Automatic identity: webapp-data@production --- # Application 2 - Database apiVersion: volsync.backube/v1alpha1 kind: ReplicationSource metadata: name: database-backup namespace: production spec: sourcePVC: postgres-data trigger: schedule: "0 3 * * *" kopia: repository: kopia-shared-repo # Automatic identity: database-backup@production --- # Application 3 - File Storage apiVersion: volsync.backube/v1alpha1 kind: ReplicationSource metadata: name: shared-files namespace: production spec: sourcePVC: nfs-storage trigger: schedule: "0 4 * * *" kopia: repository: kopia-shared-repo # Automatic identity: shared-files@production All three backups share the same repository and benefit from deduplication, yet each maintains its own snapshot history and identity. How Identity Ensures Isolation ------------------------------- Even though all PVCs share the same repository, Kopia maintains complete isolation between them through unique identities: **Automatic Identity Generation** Each ReplicationSource automatically gets a unique identity based on: - **Username**: Derived from the ReplicationSource name (e.g., ``webapp-data``) - **Hostname**: Set to the namespace (e.g., ``production``) - **Combined Identity**: ``webapp-data@production`` **Security Guarantees** - **Separate snapshots**: Each identity has its own snapshot history - **No data leakage**: One client cannot see or restore another client's snapshots - **Independent retention**: Each identity can have different retention policies - **Concurrent access**: Multiple clients can write to the repository simultaneously For detailed information about identity management, see :doc:`multi-tenancy` and :doc:`hostname-design`. Benefits Summary ---------------- Using a single repository provides: **Storage Efficiency** - **50-80% reduction** in storage usage is common for similar workloads - Deduplication across all PVCs, not just within each PVC - Lower cloud storage costs - Reduced backup windows due to less data transfer **Operational Simplicity** - One repository to monitor and maintain - Single maintenance schedule for the entire backup infrastructure - Unified repository policies - Simplified capacity planning **Performance** - Kopia efficiently handles thousands of clients in a single repository - Shared cache benefits all backup operations - Concurrent access without conflicts - No synchronization overhead between repositories **Cost Optimization** For a real-world example with 10 PVCs, each containing similar application stacks: - **Separate repositories**: 10 × 100GB = 1000GB - **Single repository with deduplication**: ~400GB (60% savings) - **Monthly savings** (at $0.023/GB for S3): $13.80/month - **Annual savings**: $165.60/year The savings scale with the number of PVCs and similarity of data. When to Use Alternative Organizations ====================================== While a single repository is recommended, there are valid scenarios where you might need different repository structures. Using S3 Prefixes ----------------- If organizational requirements dictate logical separation within a single bucket, you can use S3 path prefixes. However, this comes with trade-offs. **When Prefixes Make Sense** - **Compliance requirements**: Regulations mandate separation (HIPAA, PCI-DSS, GDPR) - **Organizational boundaries**: Different departments with separate budgets - **Billing separation**: Need to track storage costs per team or project - **Access control**: Different teams need different S3 bucket policies - **Gradual migration**: Transitioning from separate repositories Prefix Configuration -------------------- **Important: Prefix Format** Kopia requires S3 prefixes to be treated as directories, which requires a trailing slash. VolSync automatically normalizes prefixes to ensure they always have a trailing slash, and collapses multiple consecutive slashes for consistency. **You can specify prefixes in any format** - VolSync will normalize them: - ``s3://bucket/finance`` → normalized to ``s3://bucket/finance/`` - ``s3://bucket/finance/`` → already correct, unchanged - ``s3://bucket/finance//`` → normalized to ``s3://bucket/finance/`` - ``s3://bucket/a//b///c`` → normalized to ``s3://bucket/a/b/c/`` **Prefix Configurations (all formats work)** .. code-block:: yaml # Application 1 - Finance Department apiVersion: v1 kind: Secret metadata: name: kopia-finance namespace: finance type: Opaque stringData: # Any format works - trailing slash is added automatically KOPIA_REPOSITORY: s3://company-backups/finance # Or explicitly with slash: s3://company-backups/finance/ KOPIA_PASSWORD: finance-repo-password AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY --- # Application 2 - HR Department apiVersion: v1 kind: Secret metadata: name: kopia-hr namespace: hr type: Opaque stringData: # Different prefix, same bucket KOPIA_REPOSITORY: s3://company-backups/hr KOPIA_PASSWORD: hr-repo-password AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE AWS_SECRET_ACCESS_KEY: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY .. important:: **Automatic Slash Normalization** VolSync automatically ensures all S3 prefixes have a trailing slash to treat them as directories. Per Kopia's documentation: *"Put trailing slash (/) if you want to use prefix as directory."* Without the trailing slash, Kopia would concatenate the prefix with repository files, creating incorrect paths like ``financekopia.repository`` instead of ``finance/kopia.repository``. **This normalization happens automatically** - you don't need to worry about trailing slashes in your configuration. The system handles it for you and logs the normalization during repository connection. **Common Prefix Patterns** .. code-block:: yaml # By department KOPIA_REPOSITORY: s3://backups/finance KOPIA_REPOSITORY: s3://backups/engineering KOPIA_REPOSITORY: s3://backups/operations # By environment KOPIA_REPOSITORY: s3://backups/production KOPIA_REPOSITORY: s3://backups/staging KOPIA_REPOSITORY: s3://backups/development # By application KOPIA_REPOSITORY: s3://backups/webapp KOPIA_REPOSITORY: s3://backups/database KOPIA_REPOSITORY: s3://backups/cache # Nested structure KOPIA_REPOSITORY: s3://backups/production/webapp KOPIA_REPOSITORY: s3://backups/production/database KOPIA_REPOSITORY: s3://backups/staging/webapp Understanding the Trade-Offs ----------------------------- **What You Lose** Using S3 prefixes creates separate repositories, which means: - **No cross-prefix deduplication**: Duplicate data between ``s3://bucket/app1`` and ``s3://bucket/app2`` is stored twice - **Higher storage costs**: Each prefix stores its own complete chunk pool - **Multiple maintenance operations**: Each prefix requires separate maintenance - **No shared cache**: Benefits of shared repository cache are lost **What You Gain** - **Clear organizational boundaries**: Easy to see which team owns which data - **Independent lifecycle**: Delete or archive one department's data without affecting others - **Billing clarity**: S3 cost reports can break down storage by prefix - **Access control**: Apply different IAM policies to different prefixes - **Compliance**: Meet separation requirements for regulated data **Quantifying the Cost** Consider two applications with 80% data overlap: - **Single repository**: 100GB (base) + 20GB (unique) = 120GB - **Separate prefixes**: 100GB + 100GB = 200GB - **Extra cost**: 80GB × $0.023/GB = $1.84/month per duplicate application For 10 similar applications, this could mean hundreds of dollars per month in unnecessary storage costs. Common Mistakes to Avoid ------------------------ **Mistake 1: Adding Prefixes "Just in Case"** .. code-block:: yaml # Don't do this without a good reason! KOPIA_REPOSITORY: s3://backups/app1 KOPIA_REPOSITORY: s3://backups/app2 If you don't have a specific compliance or organizational requirement, use the bucket root: .. code-block:: yaml # Better - maximizes deduplication KOPIA_REPOSITORY: s3://backups **Mistake 2: Per-PVC Prefixes** .. code-block:: yaml # This destroys deduplication benefits KOPIA_REPOSITORY: s3://backups/webapp-pvc KOPIA_REPOSITORY: s3://backups/database-pvc KOPIA_REPOSITORY: s3://backups/cache-pvc Each PVC already has a unique identity in Kopia. Prefixes are unnecessary and costly. **Mistake 3: Inconsistent Prefix Usage** .. code-block:: yaml # Mixing prefixes and non-prefixed repositories KOPIA_REPOSITORY: s3://backups # App 1 KOPIA_REPOSITORY: s3://backups/special # App 2 KOPIA_REPOSITORY: s3://backups/test # App 3 This creates confusion and reduces deduplication. Choose one approach and be consistent. Using Multiple Buckets ----------------------- In some cases, you might use completely separate S3 buckets: **When Multiple Buckets Make Sense** - **Geographic distribution**: US bucket, EU bucket, APAC bucket for data residency - **Security levels**: High-security data in a locked-down bucket, general data in standard bucket - **Storage tiers**: Hot data in one bucket, cold archive data in Glacier bucket - **Different cloud providers**: AWS bucket, Azure container, GCS bucket **Configuration Example** .. code-block:: yaml # US Production Data apiVersion: v1 kind: Secret metadata: name: kopia-us-production stringData: KOPIA_REPOSITORY: s3://backups-us-prod AWS_REGION: us-east-1 # ... credentials --- # EU Production Data (GDPR compliance) apiVersion: v1 kind: Secret metadata: name: kopia-eu-production stringData: KOPIA_REPOSITORY: s3://backups-eu-prod AWS_REGION: eu-west-1 # ... credentials This approach has the same deduplication limitations as using prefixes, but may be necessary for regulatory or architectural reasons. Migration Scenarios =================== Moving Between Repository Structures ------------------------------------- **Migrating from Prefixes to Single Repository** If you started with prefixes and want to consolidate for better deduplication: 1. **Create the new single repository secret** .. code-block:: yaml apiVersion: v1 kind: Secret metadata: name: kopia-unified stringData: KOPIA_REPOSITORY: s3://new-unified-backups KOPIA_PASSWORD: new-repo-password # ... credentials 2. **Update ReplicationSources gradually** Start with non-critical PVCs to verify the configuration: .. code-block:: yaml apiVersion: volsync.backube/v1alpha1 kind: ReplicationSource metadata: name: test-app spec: kopia: repository: kopia-unified # Changed from kopia-finance # Identity is automatically maintained 3. **Run initial backup** The first backup to the new repository will be a full backup, but subsequent backups will deduplicate against all other data in the unified repository. 4. **Monitor storage usage** .. code-block:: bash # Watch repository grow and observe deduplication kubectl logs -f 5. **Migrate remaining PVCs** Once confident, update all ReplicationSources to use the unified repository. 6. **Clean up old repositories** After verifying backups and performing test restores from the new repository, you can delete the old prefixed repositories. .. warning:: **Data Migration Note** There is no automatic way to migrate existing snapshots from one repository to another while preserving snapshot history. When you change repositories, you start with a fresh backup history. Plan for: - Initial full backups to the new repository - Retention of old repository until retention periods expire - Higher storage usage temporarily while both repositories exist **Migrating from Single Repository to Prefixes** If compliance requirements force you to separate data: 1. **Create new prefix-based secrets** .. code-block:: yaml # One secret per prefix/department apiVersion: v1 kind: Secret metadata: name: kopia-finance-only stringData: KOPIA_REPOSITORY: s3://backups/finance KOPIA_PASSWORD: finance-password # ... credentials 2. **Update affected ReplicationSources** .. code-block:: yaml # Change repository reference spec: kopia: repository: kopia-finance-only # Changed from kopia-shared 3. **Accept storage cost increase** The first backup to each prefixed repository will be a full backup. Monitor storage costs carefully as deduplication benefits are lost. Best Practices Summary ====================== Repository Organization Decision Tree -------------------------------------- Use this flowchart to decide on your repository structure: 1. **Do you have compliance requirements for data separation?** - **Yes** → Use separate buckets or prefixes per compliance boundary - **No** → Continue to question 2 2. **Do you need separate billing or cost tracking?** - **Yes** → Use S3 prefixes with cost allocation tags - **No** → Continue to question 3 3. **Do different teams need different access controls?** - **Yes** → Use S3 prefixes with IAM policies - **No** → Continue to question 4 4. **Are you backing up similar workloads?** - **Yes** → Use a single repository (maximum deduplication) - **No** → Single repository still works, but benefits are smaller **Recommended Default: Single Repository** Unless you answered "yes" to questions 1-3, use a single S3 bucket without prefixes. Configuration Checklist ----------------------- **For Single Repository (Recommended)** .. code-block:: yaml ✓ Use bucket root: s3://my-backups ✓ No path prefixes ✓ Same secret shared across namespaces (if appropriate) ✓ Let automatic identity generation handle isolation ✓ Single maintenance schedule for the repository **For Prefixed Repositories (When Required)** .. code-block:: yaml ✓ Specify prefixes in any format (slashes added automatically) ✓ Consistent prefix naming scheme ✓ Document the reason for separation ✓ Separate maintenance schedules per prefix ✓ Monitor storage costs per prefix ✗ Don't use per-PVC prefixes ✗ Don't mix prefixed and non-prefixed in same bucket **For Multiple Buckets (When Necessary)** .. code-block:: yaml ✓ Geographic or regulatory reasons documented ✓ Separate secrets per bucket ✓ Clear naming convention ✓ Independent maintenance schedules ✓ Cost tracking per bucket Monitoring and Verification ---------------------------- **Check Deduplication Effectiveness** While Kopia doesn't expose per-client deduplication stats, you can monitor overall repository efficiency: .. code-block:: bash # Enable debug logging to see deduplication in action # Add to your repository secret: # KOPIA_LOG_LEVEL: "debug" # Watch backup logs kubectl logs -f # Look for messages like: # "Hashing file example.txt" # "Stored 50 blocks (1.2 MB)" # "Deduplicated 150 blocks (3.8 MB)" **Monitor Storage Growth** Track your S3 bucket size over time: .. code-block:: bash # AWS CLI aws s3 ls s3://my-backups --recursive --summarize --human-readable # Check storage growth rate # Initial backup: 500GB # After 10 similar PVCs: 800GB (instead of 5000GB without deduplication) # Deduplication ratio: 84% **Verify Repository Health** Regular maintenance keeps the repository optimized: .. code-block:: yaml # Configure KopiaMaintenance for repository optimization apiVersion: volsync.backube/v1alpha1 kind: KopiaMaintenance metadata: name: repo-maintenance spec: repository: kopia-shared-repo trigger: schedule: "0 0 * * 0" # Weekly on Sunday cachePVC: kopia-cache See :doc:`kopiamaintenance` for detailed maintenance configuration. Additional Resources ==================== - :doc:`backends` - Detailed S3 and other storage backend configuration - :doc:`multi-tenancy` - Understanding identity management in shared repositories - :doc:`hostname-design` - How VolSync ensures unique identities - :doc:`backup-configuration` - Complete backup configuration options - :doc:`kopiamaintenance` - Repository maintenance and optimization - :doc:`troubleshooting` - Debugging repository connection and backup issues For questions about Kopia's deduplication algorithm and performance characteristics, see the `official Kopia documentation `_.