Downloading hundreds of git repos

I wanted to run some machine learning over a few hundred large git repos. I tried a few methods and eventually discovered the simplest answer was really the best.

Initially I thought I would write a quick Node script that used nodegit and through2-concurrent to download the repos. Something like this:

const nodegit = require("nodegit");
const through2Concurrent = require("through2-concurrent");

// ... a stream of "owner/repo" lines is piped in, then:
.pipe(through2Concurrent.obj(
  {maxConcurrency: 100},
  function (line, enc, done) {
    const repo_git_address = "git@github.com:" + line;
    const repo_name = line.split("/")[1]; // clone into a directory named after the repo
    nodegit.Clone(repo_git_address, "./" + repo_name, {})
      .then(() => done(), done); // signal completion (or failure) to the stream
  // ...
},

I found two immediate problems here:

  1. Memory consumption in Node can grow quickly.

  2. nodegit was much slower than command-line git.

What I really needed for my research were the commits, and git itself is great at downloading commits. So I created a bare repo, added hundreds of remote addresses, and let git figure out how to download them all. Easy!

git init --bare all_repos
cd all_repos
git remote add repo1 repo1-address
git remote add repo2 repo2-address
git remote add repo3 repo3-address
# Etc....
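
Typing out a few hundred of these by hand isn't realistic, so in practice I'd generate them. A minimal sketch, assuming a hypothetical repos.txt file with one "owner/name" entry per line:

# repos.txt is a hypothetical list of "owner/name" entries
while read -r line; do
  # keep remote names simple by replacing the slash
  name=$(echo "$line" | tr '/' '_')
  git remote add "$name" "git@github.com:$line"
done < repos.txt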

git fetch --all

This downloaded all my commits, but only one repo at a time. A full download would have taken far longer than I was willing to wait.

Well, it turns out git recently added support for fetching from multiple remotes in parallel:

git fetch --all --multiple --jobs=100
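
With everything in one bare repo, pulling the commits back out for analysis is just more git. For example, counting and then listing everything that was fetched:

# count every commit across all remote-tracking refs
git rev-list --all --count

# one line of metadata per commit: hash, author, subject
git log --all --format='%H|%an|%s'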

Previously I would have had to create hundreds of separate repositories and run a script to manage each one's fetch. This is much easier.

Brian Graham