Data Management

This project uses DVC (Data Version Control) to manage and version large data files. DVC allows us to version control our data alongside our code while keeping the data files themselves out of Git.

Tracked Data Locations

Currently, the following data directories are tracked with DVC:

data/processed/ - Contains processed datasets
data/interim/ - Contains preprocessed records for every dataset
data/embeddings/ - Contains precomputed embeddings for every dataset in several states and configs
data/raw-zips/ - Contains downloaded raw zip versions of the datasets

Remote Storage

AWS S3 serves as a remote data store for this project (the default remote is described under Remote Storage Configuration). The S3 data is stored at:

s3://fhnw-artifacts/data/dvc/

AWS Authentication

Before using DVC with our S3 remote storage, you need to configure AWS credentials. The easiest way is to use the AWS CLI:

Install the AWS CLI if you haven't already:
```
pip install awscli
```
Configure your AWS credentials:
```
aws configure
```
You will be prompted for:
AWS Access Key ID
AWS Secret Access Key
Default region
Default output format (press Enter for None)

Contact your project administrator if you need AWS credentials.

Verify your configuration:
```
aws sts get-caller-identity
```
If your credentials are configured correctly, this prints your AWS account ID, user ID, and ARN.
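The verification step can also be wrapped in a small guard, so that a command that talks to S3 fails early with a readable message. This is a minimal sketch; the function name `require_aws_identity` is hypothetical and not part of the project.

```shell
# Hypothetical helper: succeed only when the AWS CLI can resolve a valid identity.
require_aws_identity() {
    if aws sts get-caller-identity >/dev/null 2>&1; then
        echo "AWS credentials OK"
    else
        echo "AWS credentials missing or invalid; run 'aws configure' first" >&2
        return 1
    fi
}
```

It could be chained in front of a DVC command, for example `require_aws_identity && dvc pull`.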

Common DVC Commands

Pulling Data from Remote

To get the latest version of the data from the remote storage:

```
dvc pull
```

This downloads any DVC-tracked files that are missing or out of date locally and checks them out into your workspace.

Adding New Data

To start tracking a new folder or file with DVC:

```
# For a folder
dvc add data/new_folder/

# For a single file
dvc add data/new_folder/data.csv
```

After running these commands:

DVC will create a corresponding .dvc file that should be committed to Git
The actual data will be stored in DVC's cache
Remember to push your changes to the remote storage using dvc push

Remote Storage Configuration

The project is configured to work with two remote storage options:

Hetzner Storage Box (Default)

```
# Currently configured as default remote
dvc pull  # Will pull from Hetzner by default
```

AWS S3

```
# To use AWS S3 storage instead
dvc pull -r ipole-aws
```

Both configurations are already set up in the .dvc/config file. The Hetzner Storage Box is set as the default remote, but the AWS S3 bucket is available as an alternative. Access to either storage requires appropriate permissions from the team.
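Since two remotes are available, a pull can fall back to the second one when the default is unreachable. The following is a sketch under that assumption; `pull_with_fallback` is a hypothetical name, and `ipole-aws` is the remote name used above.

```shell
# Sketch: try the default remote first, then fall back to the AWS S3 remote.
# Any extra arguments (for example a target path) are passed through to dvc pull.
pull_with_fallback() {
    if dvc pull "$@"; then
        echo "pulled from default remote"
    elif dvc pull -r ipole-aws "$@"; then
        echo "pulled from ipole-aws"
    else
        echo "dvc pull failed on both remotes" >&2
        return 1
    fi
}
```

Whether a fallback is appropriate depends on both remotes holding the same data, which this sketch does not verify.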

Best Practices

Always pull the latest data before starting work:
```
dvc pull
```

After adding new data:

dvc add data/new_folder/
git add data/new_folder.dvc data/.gitignore
git commit -m "Add new dataset"
dvc push
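The four steps above can be bundled into one helper so that none of them is forgotten. This is a minimal sketch; `track_and_push` is a hypothetical name, and it assumes `dvc add` writes its `.gitignore` entry next to the target, which is the default behavior.

```shell
# Hypothetical helper: track a file or folder with DVC, commit the pointer, push the data.
# $1 = path to track (a trailing slash is allowed), $2 = Git commit message
track_and_push() {
    target="${1%/}"   # data/new_folder/ -> data/new_folder
    dvc add "$target" &&
        # Stage the .dvc pointer file and the .gitignore that dvc add updates beside it
        git add "$target.dvc" "$(dirname "$target")/.gitignore" &&
        git commit -m "$2" &&
        dvc push
}
```

A call such as `track_and_push data/new_folder/ "Add new dataset"` stops at the first failing step, so nothing is pushed if the commit fails.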

Managing Local Cache

Over time, your local DVC cache may accumulate unused data. Use dvc gc (garbage collection) to clean it up:

# View what would be removed without actually deleting
dvc gc --workspace --dry

# Keep only files referenced in the current workspace; remove everything else from the cache
dvc gc --workspace

# Keep files from all branches and tags
dvc gc -aT

# Also clean remote storage (be careful!)
dvc gc --workspace --cloud

Important options for dvc gc:

-w, --workspace - keep only files referenced in current workspace
-a, --all-branches - keep files referenced in all Git branches
-T, --all-tags - keep files referenced in all Git tags
-c, --cloud - also remove files from remote storage (use with caution!)
--dry - show what would be removed without actually deleting
-f, --force - skip confirmation prompt

⚠️ Warning: Using --cloud will permanently delete data from remote storage. Make sure you have backups if needed.
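To reduce the risk of an accidental remote deletion, `dvc gc` can be called through a wrapper that refuses `--cloud` unless it is explicitly allowed. This is only a sketch; the function name `safe_gc` and the variable `ALLOW_CLOUD_GC` are hypothetical and not part of DVC.

```shell
# Hypothetical guard: block dvc gc --cloud / -c unless ALLOW_CLOUD_GC=1 is set.
safe_gc() {
    for arg in "$@"; do
        case "$arg" in
            # Long form, or a short-flag cluster containing c (e.g. -c, -wc, -cf)
            --cloud|-c*|-[!-]*c*)
                if [ "${ALLOW_CLOUD_GC:-0}" != "1" ]; then
                    echo "refusing --cloud; set ALLOW_CLOUD_GC=1 to override" >&2
                    return 1
                fi
                ;;
        esac
    done
    dvc gc "$@"
}
```

The short-flag pattern is deliberately broad, so it may also reject harmless arguments that happen to contain a `c`; erring on that side seems preferable for a destructive option.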

Troubleshooting

If you encounter issues:

Ensure you have valid credentials for the remote you are using (for AWS S3, check with aws sts get-caller-identity)
Check if the remote storage is correctly configured:
```
dvc remote list
```
Verify that all .dvc files are tracked in Git
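The last check can be done with plain Git. The sketch below lists `.dvc` pointer files that exist on disk but have not been added to Git yet; the function name is hypothetical, and it is assumed to run from the repository root.

```shell
# Sketch: print .dvc pointer files that are neither tracked nor ignored by Git.
# Empty output means every pointer file is already tracked.
untracked_dvc_files() {
    git ls-files --others --exclude-standard -- '*.dvc'
}
```

Any path it prints is a candidate for `git add` followed by a commit.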