Persistent version control in Google Colab with Git
In all of my projects, I usually follow a local-first approach. The code should run effectively on my laptop. And if it does not, it is time to think about performance adjustements. Because why waste precious cloud compute resources and cause CO2 emossions for something that could easily run a laptop? Obviously, this strategy sometimes does not work. This is especially true when working with Large Language Models (LLMs). Fine-tuning large models is simply unrealistic on a small laptop GPU. Thus, I tend to use cloud GPU resources for such tasks.
There are plenty of options available like Lightning AI, University-internal resources or Google Colab. The latter (Colab) is the overall cheapest and most flexible option when I quickly want to check if a workflow runs without issues.
The only issue with Colab is usually that I need to upload all of my files manually, run my workflow and make needed adaptions, and then download the modified files again. This is not only tedious but sometimes also leads to versioning issues. A much better way would be to use Git – which is generally possible in Colab. However, Colab storage is usually temporary and the notebook itself still cannot be synced via Git when working this way.
On a mission to solve this issue, I found a post on StackOverflow by Gonzalo Polo who proposes a solution for persistent version control using Google Drive which I adapted for my needs. It utilises Personal Access Tokens (PATs) for private repositories on Gitlab/GitHub combined with persistant Google Drive storage.
Obtain a Personal Access Token
First of all, you need a Personal Access Token (PAT) from your Git provider.
- For Gitlab, go to your profile, Preferences, Access tokens and create a new token with
read_repository
andwrite_repository
privileges. - For GitHub, go to Settings, Developer Settings, Personal access tokens, Tokens, and then generate a new token with full
repo
privileges.
Each of these steps will give you a PAT which you should store somewhere locally. It can only be viewed once.
Cloning a Repository
In theory, your PAT would suffice to act with your Git repository on the temporary storage. You can now clone and subsequently interact with Git repositories on Colab (and any other cloud provider) using the HTTPS link of the repository. Simply copy the HTTPS URL of the repository (e.g. https://github.com/hanny-bal/hanny-bal.github.io.git) and adapt it as follows:
git clone https://<username>:<Personal_Access_Token>@github.com/hanny-bal/hanny-bal.github.io.git
Alternatively, you can also use the following code:
git init
git remote add origin https://<username>:<Personal_Access_Token>@github.com/hanny-bal/hanny-bal.github.io.git
git pull origin master
Essentially, you need use a combination of username and PAT before the main URL. Git commands should then work as intended.
Persistant Storage using Google Drive
On a normal cloud server with persistent storage, the tutorial would stop here. However, we are on Colab. First thing you need to do is create a folder for your repository on Google Drive. Within this folder (which is not initialised yet), create a notebook called git_interaction.ipynb
and open it with cpu
compute in Colab.
This notebook can be used to interact with Git and is basically an alternative to command line usage. Within this notebook, we can implement the following cells/functionalities.
We start with some imports and variable setups.
import os
from pathlib import Path
from google.colab import drive
drive.mount('/content/drive')
# Repository URL
GIT_HTTPS_URL: str = 'https://github.com/hanny-bal/hanny-bal.github.io.git' # example
# Path to your repository on Google Drive
REPOSITORY_PATH: str = 'content/drive/repos/myrepo' # example
# Personal Access Token and Git username
GIT_USERNAME: str = '<your-username>'
GIT_PATH: str = '<your-personal-access-token>'
We can then build the full modified Git URL.
insert_string: str = f'{GIT_USERNAME}:{GIT_PATH}@'
GIT_URL: str = GIT_HTTPS_URL.replace('https://', f"https://{insert_string}")
In case no .git
folder exists, we initialise the repository.
if not os.path.isdir(f'{REPOSITORY_PATH}/.git'):
!git config --global init.defaultBranch <branch>
!git init $REPOSITORY_PATH
!cd $REPOSITORY_PATH; git config --local user.email <your-user-email@yourmail.com>
!cd $REPOSITORY_PATH; git config --local user.name <your-username>
!cd $REPOSITORY_PATH; git remote add origin $GIT_URL
!cd $REPOSITORY_PATH; git pull origin <branch>
!cd $REPOSITORY_PATH; !printf '\ngit_interaction.ipynb' >> '.gitignore' # ignore the git interaction file
To print the repository status, we can use:
print('Curent status:')
!cd $REPOSITORY_PATH; git status
print('\n\nGit Log:')
!cd $REPOSITORY_PATH; git log
Lastly, interaction with Git is possible as normal. Just make sure to cd
into the correct directory before doing anything.
# Add files
!cd $REPOSITORY_PATH; git add <file>
# Make commits
!cd $REPOSITORY_PATH; git commit -am "My commit message"
# Push
!cd $REPOSITORY_PATH; git push <origin> <master/main>
# Switch branches
!cd $REPOSITORY_PATH; git checkout <branch>
Done!
The interaction file is now your personal Git interaction tool for Google Colab with persistent storage on Google Drive. In addition, the script can be used on all kinds of online services like Lightning AI, vast.ai, Paperspace and many more. Have fun coding!
Cheers, David