Use git lfs to pull only specific folders or files

Git, Mar 04, 2024

Some Git LFS repositories hold files so large that you often want to download only a few folders or files from them.

Take this project as an example: https://huggingface.co/datasets/Wenetspeech4TTS/WenetSpeech4TTS/tree/main

1. Clone with GIT_LFS_SKIP_SMUDGE=1

To clone the repository without the large files, keeping just their pointers, run:

GIT_LFS_SKIP_SMUDGE=1 git clone git@hf.co:datasets/Wenetspeech4TTS/WenetSpeech4TTS

The directory structure is as follows:

WenetSpeech4TTS
            |
            |____ Premium
            |          |__ Premium_md5check.txt
            |          |__ WenetSpeech4TTS_Premium_0.tar.gz
            |          |__ ...
            |
            |____ Standard
            |          |__ Standard_md5check.txt
            |          |__ WenetSpeech4TTS_Standard_0.tar.gz
            |          |__ ...
            |
            |____ Basic
            |          |__ Basic_md5check.txt
            |          |__ WenetSpeech4TTS_Basic_0.tar.gz
            |          |__ WenetSpeech4TTS_Basic_1.tar.gz
            |          |__ ...
            |
            |____ Rest
            |          |__ Rest_md5check.txt
            |          |__ WenetSpeech4TTS_Rest_0.tar.gz
            |          |__ ...
            |
            |____ Filelists
            |          |__ Premium.lst
            |          |__ ...
            |
            |____ DNSMOS
            |          |__ Premium_DNSMOS.lst
            |          |__ ...
            |
            |____ Testset
            |          |__...
            |
            |____ README.md
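
After a clone with GIT_LFS_SKIP_SMUDGE=1, each large file in the tree above should be a small text pointer rather than the real archive. Here is a minimal sketch for checking that, assuming the standard pointer format whose first line is "version https://git-lfs.github.com/spec/v1". The function name is_lfs_pointer is made up for this example.

```shell
# Hypothetical helper: succeeds if the file is still a Git LFS pointer,
# i.e. it starts with the LFS spec version header.
is_lfs_pointer() {
    [ -f "$1" ] || return 1
    # Compare only the first 42 bytes, the length of the version header.
    # tr drops NUL bytes so binary files do not trigger shell warnings.
    header=$(head -c 42 "$1" 2>/dev/null | tr -d '\0')
    [ "$header" = "version https://git-lfs.github.com/spec/v1" ]
}

# Example check, run from the repository root.
f="Standard/WenetSpeech4TTS_Standard_0.tar.gz"
if is_lfs_pointer "$f"; then
    echo "$f is a pointer (not downloaded yet)"
else
    echo "$f is missing or already holds real content"
fi
```

git lfs ls-files gives a similar overview for the whole repository: it marks files that are still pointers with "-" and downloaded files with "*".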

2. Pull data with --include option

According to the git-lfs help document:

$ git lfs pull -h
git lfs pull [options] [<remote>]

Download Git LFS objects for the currently checked out ref, and update
the working copy with the downloaded content if required.

This is equivalent to running the following 2 commands:

git lfs fetch [options] [<remote>]
git lfs checkout

Options:

* -I <paths> --include=<paths>:
  Specify lfs.fetchinclude just for this invocation; see "Include and exclude"

* -X <paths> --exclude=<paths>:
  Specify lfs.fetchexclude just for this invocation; see "Include and exclude"

You can pass the --include flag, or its short form -I, to restrict the pull to specific files or path patterns.

For example, to pull only the file "WenetSpeech4TTS_Standard_0.tar.gz" from the "Standard" folder, run:

git lfs pull --include "Standard/WenetSpeech4TTS_Standard_0.tar.gz"


Or, to pull only files with the ".tar.gz" extension, run:

git lfs pull --include "*.tar.gz"
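
The --include value is a comma-separated list of patterns, so several files or folders can be requested in one pull. Here is a small sketch: join_patterns is a helper invented for this example, and the paths are just picks from the directory tree above. A bare folder name such as Premium is meant to match everything under that folder. The command is only printed, so nothing is downloaded until you run it yourself.

```shell
# Hypothetical helper: join its arguments with commas.
# The subshell body keeps the IFS change local to the function.
join_patterns() (
    IFS=,
    printf '%s\n' "$*"
)

patterns=$(join_patterns "Premium" "Standard/WenetSpeech4TTS_Standard_0.tar.gz")

# Print the command instead of running it (dry run).
echo "git lfs pull --include \"$patterns\""
```

Copy the printed line into the clone to start the actual download.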

When --include or --exclude is given, LFS pulls only the files that match an include pattern and do not match an exclude pattern. For more details on these filters, see the "Include and exclude" section that the git-lfs help text above refers to.
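
The help text above says that --include sets lfs.fetchinclude "just for this invocation", which suggests the filter can also be stored in the repository configuration. Here is a hedged sketch, assuming you want every later pull in this clone limited to the Standard folder. The function name set_fetch_include is hypothetical.

```shell
# Hypothetical helper: store an include filter in a repository's config
# so that later "git lfs pull" runs reuse it.
# $1 = path to the clone, $2 = comma-separated patterns.
set_fetch_include() {
    git -C "$1" config lfs.fetchinclude "$2"
}

# Example: apply it to the current directory if it is a git work tree.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    set_fetch_include . "Standard"
    # Show the stored value.
    git config --get lfs.fetchinclude
else
    echo "run this from inside the WenetSpeech4TTS clone"
fi
```

After that, a plain git lfs pull should fetch only the matching objects, and git config --unset lfs.fetchinclude removes the filter again.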

Updated Jun 05, 2024