The Jupyter+git problem is now solved

hhhhm

December 14, 2023

Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which produce clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter.

The Jupyter+git problem

Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interactively exploring data and code, writing programs, and documenting the results as dashboards, books, or blogs.

But when collaborating with others, this ideal environment goes up in smoke. That’s because tools such as git, which are the most popular approaches for asynchronous collaboration, make notebooks unusable. Literally. If you and a colleague both modify a notebook cell (including, in many cases, simply executing a cell without changing it) and then try to open that notebook later, Jupyter fails to open it.

The reason for this stems from a fundamental incompatibility between the format Jupyter notebooks use (JSON) and the format that git conflict markers assume by default (plain lines of text). This is what it looks like when git adds its conflict markers to a notebook:

   "source": [
<<<<<<< HEAD
    "z=3\n",
=======
    "z=2\n",
>>>>>>> a7ec1b0bfb8e23b05fd0a2e6cafcb41cd0fb1c35
    "z"
   ]

That’s not valid JSON, and therefore Jupyter can’t open it. Conflicts are particularly common in notebooks, because Jupyter changes the following every time you run a notebook:

Every cell includes a number indicating what order it was run in. If you and a colleague run the cells in different orders, you’ll have a conflict in every single cell! This would take a very long time to fix manually
For every figure, such as a plot, Jupyter includes not only the image itself in the notebook, but also a plain text description that includes the id (like a memory address) of the object, such as <matplotlib.axes._subplots.AxesSubplot at 0x7fbc113dbe90>. This changes every time you execute a notebook, and therefore will create a conflict every time two people execute this cell
Some outputs may be non-deterministic, such as a notebook that uses random numbers, or that interacts with a service that provides different outputs over time (such as a weather service)
Jupyter adds metadata to the notebook describing the environment it was last run in, such as the name of the kernel. This often varies across installations, and therefore two people saving a notebook (even without any other changes) will often end up with a conflict in the metadata.
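The parse failure described before this list is easy to reproduce. The following is a minimal sketch using only Python’s standard json module; the fragment is a hand-written stand-in for a conflicted notebook (with a shortened commit hash), not the output of a real merge, and is_valid_json is a helper name introduced here for illustration.

```python
import json

# A notebook-like JSON fragment after git has inserted line-based conflict markers.
conflicted = '''{
 "source": [
<<<<<<< HEAD
  "z=3\\n",
=======
  "z=2\\n",
>>>>>>> a7ec1b0
  "z"
 ]
}'''

# The same fragment with only the HEAD side of the conflict kept.
resolved = '''{
 "source": [
  "z=3\\n",
  "z"
 ]
}'''


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(conflicted))
print(is_valid_json(resolved))
```

The marker lines are not JSON tokens, so the parser rejects the whole file, even though everything outside the conflict is intact.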

All these changes to notebook files also make git diffs of notebooks very verbose. This can make code reviews a challenge, and make git repos more bulky than necessary.

The result of these problems is that many Jupyter users feel that collaborating with notebooks is a clunky, error-prone, and frustrating experience. (We’ve even seen people on social media describe Jupyter’s notebook format as “stupid” or “terrible”, despite otherwise professing their love for the software!)

It turns out, however, that Jupyter and git can work together extremely well, with none of the above problems at all. All that’s needed is a bit of special software…

The solution

Jupyter and git are both well-designed software systems that provide many powerful extensibility mechanisms. It turns out that we can use these to fully and automatically solve the Jupyter+git problem. We identified two categories of problems in the previous section:

git conflicts lead to broken notebooks
Unnecessary conflicts due to metadata and outputs.

In our newly released nbdev2, an open source Jupyter-based development platform, we’ve solved each of these problems:

A new merge driver for git provides “notebook-native” conflict markers, resulting in notebooks that can be opened directly in Jupyter, even when there are git conflicts
A new save hook for Jupyter automatically removes all unnecessary metadata and non-deterministic cell output.

With nbdev’s merge driver, a conflict looks quite different in Jupyter: the local and remote changes are each clearly displayed as separate cells in the notebook, allowing you to simply delete the version you don’t want to keep, or combine the two cells as needed.

The techniques used to make the merge driver work are quite interesting – let’s dive into the details!

The nbdev2 git merge driver

We provide here a summary of the git merge driver – for full details and source code see the nbdev.merge docs. Amazingly enough, the entire implementation is just 58 lines of code!

The basic idea is to first “undo” the original git merge which created the conflict, and then “redo” it at a cell level (instead of a line level), looking only at cell source (not outputs or metadata). The “undoing” is straightforward: just create two copies of the conflicted file (representing the local and remote versions of the file), go through each git conflict marker, and replace the conflict section with either the local or remote version of the code.
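As a rough illustration of the “undo” step, here is a minimal sketch. It is not nbdev’s code: split_conflict is a name made up for this example, and it assumes git’s default two-way conflict style (no diff3 base sections).

```python
def split_conflict(text):
    """Rebuild the local and remote versions of a file from its conflicted text."""
    local, remote = [], []
    state = "both"  # "both": outside a conflict; "local"/"remote": inside one side
    for line in text.splitlines(keepends=True):
        if line.startswith("<<<<<<<"):
            state = "local"
        elif line.startswith("=======") and state == "local":
            state = "remote"
        elif line.startswith(">>>>>>>") and state == "remote":
            state = "both"
        elif state == "local":
            local.append(line)
        elif state == "remote":
            remote.append(line)
        else:
            # Lines outside any conflict belong to both versions.
            local.append(line)
            remote.append(line)
    return "".join(local), "".join(remote)


conflicted = (
    '   "source": [\n'
    '<<<<<<< HEAD\n'
    '    "z=3\\n",\n'
    '=======\n'
    '    "z=2\\n",\n'
    '>>>>>>> a7ec1b0\n'
    '    "z"\n'
    '   ]\n'
)
local, remote = split_conflict(conflicted)
print(local)
print(remote)
```

Each returned string contains no markers, so if the original files were valid JSON, both reconstructed versions are too.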

Now that we’ve got the original local and remote notebooks, we can load the JSON using execnb.nbio, which will then give us an array of cells for each notebook. Now we’re up to the interesting bit – creating cell-level diffs based only on the cell source.
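To show what “an array of cells” means in practice, the sketch below reads a notebook with the standard json module rather than execnb.nbio, so that it runs without extra dependencies. The function name cell_sources and the tiny notebook are assumptions made for illustration.

```python
import json


def cell_sources(nb_json):
    """Return the source of every cell in a notebook JSON string, in order."""
    nb = json.loads(nb_json)
    sources = []
    for cell in nb.get("cells", []):
        src = cell.get("source", "")
        # The notebook format allows source to be a string or a list of line strings.
        sources.append("".join(src) if isinstance(src, list) else src)
    return sources


nb_text = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["z=3\n", "z"], "outputs": [],
         "execution_count": 7, "metadata": {}},
        {"cell_type": "markdown", "source": "Some notes", "metadata": {}},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
})
print(cell_sources(nb_text))
```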

The Python standard library contains a very flexible and effective implementation of a diff algorithm in the difflib module. In particular, the SequenceMatcher class provides the fundamental building blocks for implementing your own conflict resolution system. We pass the two sets of cells (remote and local) to SequenceMatcher(...).get_matching_blocks(), and it returns a list of each section of cells that match (i.e. have no conflicts/differences). We can then go through each matching section and copy it into the final notebook, and go through each non-matching section and copy in both the remote and local cells (adding cells between them to mark the conflicts).
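The walk over matching and non-matching sections can be sketched as follows. This is a simplified illustration rather than nbdev’s implementation: cells are represented as plain strings, and merge_cells and the marker strings are hypothetical stand-ins for the marker cells described above.

```python
from difflib import SequenceMatcher


def merge_cells(local, remote):
    """Merge two cell lists, keeping matches once and both sides of each difference."""
    merged = []
    i = j = 0  # positions already copied from local and remote
    matcher = SequenceMatcher(None, local, remote, autojunk=False)
    # get_matching_blocks() ends with a zero-size block at (len(local), len(remote)),
    # so trailing differences are handled by the same loop.
    for block in matcher.get_matching_blocks():
        if block.a > i or block.b > j:
            merged.append("<<<<<<< local")
            merged.extend(local[i:block.a])
            merged.append("=======")
            merged.extend(remote[j:block.b])
            merged.append(">>>>>>> remote")
        merged.extend(local[block.a:block.a + block.size])
        i, j = block.a + block.size, block.b + block.size
    return merged


local = ["import math", "z=3", "z"]
remote = ["import math", "z=2", "z"]
for cell in merge_cells(local, remote):
    print(cell)
```

Because the markers are inserted as whole items in the cell list rather than as lines in the file, the result can still be serialised as valid notebook JSON.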

Making SequenceMatcher work with notebook cells (represented in nbdev by the NbCell class) requires only adding __hash__ and __eq__ methods to NbCell. In each case, these methods are defined to look only at the actual source code, and not at any metadata or outputs. As a result, SequenceMatcher will only show differences in source code, and will ignore differences in everything else.
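The effect of source-only equality can be seen with a small stand-in class. The Cell class below is hypothetical and much simpler than nbdev’s NbCell; it only illustrates the idea of defining __eq__ and __hash__ on the source alone.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


@dataclass(eq=False)
class Cell:
    """Hypothetical stand-in for a notebook cell, compared and hashed by source only."""
    source: str
    execution_count: Optional[int] = None
    outputs: list = field(default_factory=list)

    def __eq__(self, other):
        return isinstance(other, Cell) and self.source == other.source

    def __hash__(self):
        return hash(self.source)


# Same first cell with different execution counts; second cell differs in source.
local = [Cell("x=1", 7, ["1"]), Cell("z=3", 8)]
remote = [Cell("x=1", 5, ["1"]), Cell("z=2", 6)]

blocks = SequenceMatcher(None, local, remote, autojunk=False).get_matching_blocks()
matched = [(b.a, b.b, b.size) for b in blocks]
print(matched)
```

The first cells match despite their different execution counts, so only the cell whose source actually changed is reported as a difference.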

With a single line of configuration, we can ask git to call our Python script, instead of its default line-based implementation, any time it’s merging changes. nbdev_install_hooks sets up this configuration automatically, so after running it, git conflicts become much less common, and never result in broken notebooks.
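For readers unfamiliar with custom merge drivers, here is a minimal sketch of the contract git expects. It is not nbdev’s driver: the script name, the driver name nb-sketch, and the whole-file merge logic are assumptions chosen to keep the example short. Git passes the driver temporary file paths for the common ancestor (%O), the current version (%A), and the other version (%B); the driver writes its result over the %A file and exits with zero only if the merge was clean.

```python
import sys
from pathlib import Path

# Hypothetical wiring (names are illustrative only):
#   .gitattributes:  *.ipynb merge=nb-sketch
#   git config:      [merge "nb-sketch"]
#                        driver = python nb_merge_sketch.py %O %A %B


def merge_driver(base_path, ours_path, theirs_path):
    """Whole-file three-way merge; returns the exit status git expects."""
    base = Path(base_path).read_text()
    ours = Path(ours_path).read_text()
    theirs = Path(theirs_path).read_text()
    if ours == theirs or theirs == base:
        return 0  # nothing to take from the other side
    if ours == base:
        Path(ours_path).write_text(theirs)  # only the other side changed
        return 0
    return 1  # both sides changed: report a conflict to git


# Only act as a driver when called with the three paths git supplies.
if __name__ == "__main__" and len(sys.argv) == 4:
    sys.exit(merge_driver(*sys.argv[1:4]))
```

A notebook-aware driver would replace the final “both sides changed” branch with a cell-level merge, writing a notebook that still opens in Jupyter.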

The nbdev2 Jupyter save hook

Solving git merges locally is extremely helpful, but we need to solve them remotely as well. For instance, if a contributor submits a pull request (PR), and then someone else commits to the same notebook before the PR is merged, the PR might now have a conflict like this:

   "outputs": [
    {
<<<<<<< HEAD
     "execution_count": 7,
=======
     "execution_count": 5,
>>>>>>> a7ec1b0bfb8e23b05fd0a2e6cafcb41cd0fb1c35
     "metadata": {},

This conflict shows that the two contributors have run cells in different orders (or perhaps one added a couple of cells above in the notebook), so their commits have conflicting execution counts. GitHub will refuse to allow this PR to be merged until this conflict is fixed.

But of course we don’t really care about the conflict at all – it doesn’t matter what, if any, execution count is stored in the notebook. So we’d really prefer to ignore this difference entirely!

Thankfully, Jupyter provides a “pre-save” hook which allows code to be run every time a notebook is saved. nbdev uses this to set up a hook which removes all unnecessary metadata (including execution_count) on saving. That means there are no pointless conflicts like the one above, because no commits will have this information stored in the first place.
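A pre-save hook is just a function that receives the notebook model before it is written to disk. The sketch below is modelled on the general shape of Jupyter’s pre-save hook interface and only clears execution counts; it is an illustration, not nbdev’s hook, and the function name and the registration line in the comment are assumptions to check against your Jupyter version.

```python
def clean_notebook_pre_save(model, **kwargs):
    """Clear execution counts in a notebook model before it is saved."""
    if model.get("type") != "notebook":
        return  # leave plain files and directories alone
    for cell in model["content"]["cells"]:
        if cell.get("cell_type") != "code":
            continue
        # Set to None rather than deleting: the notebook format expects the key.
        cell["execution_count"] = None
        for output in cell.get("outputs", []):
            if "execution_count" in output:
                output["execution_count"] = None


# In a Jupyter config file, a hook like this would typically be registered with
# something along the lines of:
#   c.FileContentsManager.pre_save_hook = clean_notebook_pre_save
```

With the counts normalised on every save, two contributors who ran the same cells in different orders produce identical JSON for those fields.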

Background

Here at fast.ai we use Jupyter for everything. All our tests, documentation, and module source code for all of our many libraries are entirely developed in notebooks (using nbdev, of course!) And we use git for all our libraries too. Some of our repositories have many hundreds of contributors. Therefore solving the Jupyter+git problem has been critical for us. The solution presented here is the result of years of work by many people.

Our first approach, developed by Stas Bekman and me, was to use git “smudge” and “clean” filters that automatically rewrote all notebook json to remove unneeded metadata when committing. This helped a bit, but git quite often ended up in an odd state where it was impossible to merge.

In nbdev v1 Sylvain Gugger created an amazing tool called nbdev_fix_merge which used very clever custom logic to manually fix merge conflicts in notebooks, to ensure that they could be opened in Jupyter. For nbdev v2 I did a from-scratch rewrite of every part of the library, and I realised that we could replace the custom logic with the SequenceMatcher approach described above.

None of these steps fully resolved the Jupyter+git problem, since we were getting frequent merge errors caused by the smudge/clean git filters, and conflicts required manually running nbdev_fix_merge. Wasim Lorgat realised that we could resolve the smudge/clean issue by moving that logic into an nbdev save hook, and avoid the manual fix step by moving that logic into a git merge driver. This resolved the final remaining issues! (I was actually quite stunned that Wasim went from our first discussion of the outstanding problems, to figuring out how to solve all of them, in the space of about two days…)

The result

The new tools in nbdev2, which we’ve been using internally for the last few months, have been transformational to our workflow. The Jupyter+git problem has been totally solved. I’ve seen no unnecessary conflicts, cell-level merges have worked like magic, and on the few occasions where I’ve changed the source in the same cell as a collaborator, fixing the conflict in Jupyter has been straightforward and convenient.

Postscript: other Jupyter+git tools

ReviewNB

There is one other tool which we’ve found very helpful in using Jupyter with git, which is ReviewNB. ReviewNB solves the problem of doing pull requests with notebooks. GitHub’s code review GUI only works well for line-based file formats, such as plain python scripts. This works fine with the Python modules that nbdev exports, and I often do reviews directly on the Python files, instead of the source notebooks.

However, much of the time I’d rather do reviews on the source notebooks, because:

I want to review the documentation and tests, not just the implementation
I want to see the changes to cell outputs, such as charts and tables, not just the code.

For this purpose, ReviewNB is perfect. Just like nbdev makes git merges and commits Jupyter-friendly, ReviewNB makes code reviews Jupyter-friendly. A picture is worth a thousand words, so rather than trying to explain, I’ll just show this picture from the ReviewNB website of what PRs look like in their interface:

An alternative solution: Jupytext

Another potential solution to the Jupyter+git problem might be to use Jupytext. Jupytext saves notebooks in a line-based format, instead of in JSON. This means that all the usual git machinery, such as merges and PRs, works fine. Jupytext can even use Quarto’s format, qmd, as a format for saving notebooks, which then can be used to generate a website.

Jupytext can be a bit tricky to manage when you want to save your cell outputs (which I generally want to do, since many of my notebooks take a long time to run – e.g. training deep learning models.) Whilst Jupytext can save outputs in a linked ipynb file, managing this linkage gets complex, and ends up with the Jupyter+git problem all over again! If you don’t need to save outputs, then you might find Jupytext sufficient – although of course you’ll miss out on the cell-based code reviews of ReviewNB and your users won’t be able to read your notebooks properly when they’re browsing GitHub.

nbdime

There’s also an interesting project called nbdime which has its own git drivers and filters. Since they’re not really compatible with nbdev (partly because they tackle some of the same problems in different ways) I haven’t used them much, so haven’t got an informed opinion about them. However I do use nbdime’s Jupyter extension sometimes, which provides a view similar to ReviewNB, but for local changes instead of PRs.

If you want to try it yourself, follow the directions on Git-friendly Jupyter to get started.
