# Data Management
This project uses DVC (Data Version Control) to manage and version large data files. DVC allows us to version control our data alongside our code while keeping the data files themselves out of Git.
## Tracked Data Locations
Currently, the following data directories are tracked with DVC:
- `data/processed/` - Contains processed datasets
- `data/interim/` - Contains preprocessed records for every dataset
- `data/embeddings/` - Contains precomputed embeddings for every dataset in several states and configurations
- `data/raw-zips/` - Contains downloaded raw zip versions of the datasets
## Remote Storage
We use AWS S3 as our remote data store. The data is stored at:
## AWS Authentication
Before using DVC with our S3 remote storage, you need to configure AWS credentials. The easiest way is using the AWS CLI:
1. Install the AWS CLI if you haven't already.

2. Configure your AWS credentials. You will be prompted for:

    - AWS Access Key ID
    - AWS Secret Access Key
    - Default region
    - Default output format (press Enter for None)

    Contact your project administrator if you need AWS credentials.

3. Verify your configuration. This should show your AWS account information if configured correctly.
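The steps above can be run as follows. This is a sketch: the `pip install` line assumes a Python environment with pip available; AWS also provides platform-specific installers if you prefer those.

```shell
# Install the AWS CLI (assumes pip; official installers are also available)
pip install awscli

# Configure credentials interactively (prompts for key ID, secret key,
# region, and output format)
aws configure

# Verify the configuration by printing the active account identity
aws sts get-caller-identity
```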
## Common DVC Commands
### Pulling Data from Remote
To get the latest version of the data from the remote storage:
This will download all DVC-tracked files that are not present in your local workspace.
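A minimal sketch of that step (the remote name `myremote` is a placeholder; see `.dvc/config` for the actual remote names):

```shell
# Download all DVC-tracked files referenced in the current workspace
dvc pull

# Or pull from a specific remote instead of the default
dvc pull -r myremote
```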
### Adding New Data
To start tracking a new folder or file with DVC:
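For example, to track a new directory (the path `data/new-dataset` is a placeholder for your actual data):

```shell
# Start tracking the directory; this creates data/new-dataset.dvc
dvc add data/new-dataset

# Commit the pointer file (and the updated .gitignore) to Git
git add data/new-dataset.dvc data/.gitignore
git commit -m "Track new dataset with DVC"
```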
After running these commands:

- DVC will create a corresponding `.dvc` file that should be committed to Git
- The actual data will be stored in DVC's cache
- Remember to push your changes to the remote storage using `dvc push`
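The upload step above, plus a way to confirm it worked, might look like:

```shell
# Upload DVC-tracked data to the default remote
dvc push

# Check whether the local cache and the remote are in sync
dvc status --cloud
```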
## Remote Storage Configuration
The project is configured to work with two remote storage options:
1. Hetzner Storage Box (default)
2. AWS S3
Both configurations are already set up in the .dvc/config file. The Hetzner Storage Box is set as the default remote, but the AWS S3 bucket is available as an alternative. Access to either storage requires appropriate permissions from the team.
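You can inspect and switch remotes from the command line. The remote name `s3remote` below is a placeholder; run `dvc remote list` or check `.dvc/config` for the names actually configured in this project.

```shell
# List the configured remotes and their URLs
dvc remote list

# Use a specific remote for a single command
dvc pull -r s3remote

# Change the project's default remote
dvc remote default s3remote
```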
## Best Practices
1. Always pull the latest data before starting work.

2. After adding new data, push your changes to the remote storage.

3. Manage your local cache. Over time, your local DVC cache may accumulate unused data. Use `dvc gc` (garbage collection) to clean it up:
```bash
# View what would be removed without actually deleting
dvc gc --workspace --dry

# Remove files only referenced in the workspace
dvc gc --workspace

# Keep files from all branches and tags
dvc gc -aT

# Also clean remote storage (be careful!)
dvc gc --workspace --cloud
```
Important options for `dvc gc`:

- `-w`, `--workspace` - keep only files referenced in the current workspace
- `-a`, `--all-branches` - keep files referenced in all Git branches
- `-T`, `--all-tags` - keep files referenced in all Git tags
- `-c`, `--cloud` - also remove files from remote storage (use with caution!)
- `--dry` - show what would be removed without actually deleting
- `-f`, `--force` - skip the confirmation prompt
⚠️ Warning: Using `--cloud` will permanently delete data from remote storage. Make sure you have backups if needed.
## Troubleshooting
If you encounter issues:
- Ensure you have proper AWS credentials configured
- Check that the remote storage is correctly configured
- Verify that all `.dvc` files are tracked in Git
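A few commands that help with each of these checks (a sketch; the exact output depends on your setup):

```shell
# Confirm your AWS credentials are valid
aws sts get-caller-identity

# Show the configured DVC remotes
dvc remote list

# List all .dvc pointer files known to Git
git ls-files '*.dvc'

# Re-run a failing DVC command with verbose output for more detail
dvc pull -v
```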