Git as a Document Format



by Wil Shipley

Aug 10 2015

Traditional Cocoa file formats each have specific advantages and pitfalls. Wil Shipley, co-founder of The Omni Group & Delicious Monster, explores their history, and sketches out a utopian alternative. Looking at the traditional way of adding undo and redo to Cocoa apps, he explains how Git — the library, not the command-line tools — magically solves all those problems, and basically gives you everything you ever wanted, for free. It may also come with a pony. 🐴

Background: Desired Features of Documents (0:00)

Before unveiling the Git approach, we’ll look at the features we want from a document format, and at how Cocoa’s answers to them have progressed over the last 26 years. Those features are the ability to load, save, autosave, undo, redo, and back up.

Loading and saving usually involve one or more “control” files that change frequently and describe the structure and content of the document. These are often JSON, XML, or plists, and are not difficult to load and save today. For example, in an RTF document the control file would be the actual text of the document. Within OmniGraffle, the control file includes the coordinates of the lines and boxes. In Delicious Library, it would be the different titles.

Documents also include “blobs”: images, movies, and large chunks of content that typically don’t change. I wanted to introduce these terms because there’s a really interesting issue here: the blobs tend to be really big, and the control files tend to grow as the user does more and more stuff. Those are separate issues, because the blobs tend not to change over the lifetime of a document. Normally, when you drag an image into something like an RTF file, you’re not going to edit that image much, but the RTF is going to be updated continually. So when you first start, you mostly worry about writing the blobs efficiently; later on the control file itself might get big, and its write speed becomes something the end user notices. Reading and writing the control files must be really fast, and reading and writing the large blobs must happen rarely for good performance.

At the same time, you want atomic saves. This means that every time the user saves the document, you create a new directory and populate it with all this stuff, so that if anything goes wrong during the save, you’ve still got the previously saved version as a backup.
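The directory-swap idea can be sketched in a few lines of current Swift. This is a minimal illustration, not what any Cocoa class does internally: the function name `saveAtomically(files:to:)` and the hidden `.staging-` and `.backup-` directory names are made up for the example, and the two-rename swap leaves a brief window that a production implementation would need to close.

```swift
import Foundation

/// Hypothetical helper: writes `files` (name -> contents) into a fresh
/// sibling directory, then swaps that directory into place.
func saveAtomically(files: [String: Data], to documentURL: URL) throws {
    let fm = FileManager.default
    let parent = documentURL.deletingLastPathComponent()
    let staging = parent.appendingPathComponent(".staging-" + UUID().uuidString)
    let backup = parent.appendingPathComponent(".backup-" + UUID().uuidString)

    // 1. Build the complete new version off to the side.
    try fm.createDirectory(at: staging, withIntermediateDirectories: true)
    do {
        for (name, contents) in files {
            try contents.write(to: staging.appendingPathComponent(name))
        }
    } catch {
        try? fm.removeItem(at: staging)   // the old document was never touched
        throw error
    }

    // 2. Swap: set the old version aside, move the new one in, drop the old.
    let hadOldVersion = fm.fileExists(atPath: documentURL.path)
    if hadOldVersion { try fm.moveItem(at: documentURL, to: backup) }
    do {
        try fm.moveItem(at: staging, to: documentURL)
    } catch {
        if hadOldVersion { try? fm.moveItem(at: backup, to: documentURL) }
        throw error
    }
    if hadOldVersion { try? fm.removeItem(at: backup) }
}

// Usage: save twice to the same (temporary) document package.
let documentURL = FileManager.default.temporaryDirectory
    .appendingPathComponent("Demo-" + UUID().uuidString + ".mydoc")
try saveAtomically(files: ["control.json": Data("{\"title\":\"v1\"}".utf8)], to: documentURL)
try saveAtomically(files: ["control.json": Data("{\"title\":\"v2\"}".utf8)], to: documentURL)
let saved = try String(contentsOf: documentURL.appendingPathComponent("control.json"), encoding: .utf8)
```

Note that this sketch rewrites every file on every save, blobs included, which is exactly the cost the later approaches try to avoid.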

Autosave is a wonderful, wonderful thing that we all love, except for when it actually live-saves over your actual document without you asking it to, and then you’ve modified your document without knowing it. For instance, I sometimes accidentally change this image I downloaded from the Net, which I didn’t mean to do, and I can’t undo the change.

Undo and redo are self-explanatory. The big thing to think about with undo and redo is that you really don’t ever want them to corrupt the document. The current system of undo and redo is really, really open to corruption, and if you’ve ever used Xcode, you’ve probably run into it.

Regarding backups — basically no one ever does backups, unless they happen to them. Time Machine was a godsend to backups.

History of Documents in Cocoa

1988: NeXTstep 0.8 - `TypedStreams` (4:33)

Back in 1988, we started off with something called TypedStreams. We still have a version of this in existence today. It doesn’t even have an “NS” in front of it; it was just TypedStreams. To use it, we were told to load and save with TypedStreams. The great thing was that all you had to do was override two little methods: in one you’d “read read read” and in the other you’d “write write write”, and suddenly you had loading and saving. This was a great improvement, because before that we were writing straight C files and we had to make up a file format, parse it, and more.


However, as soon as the next version came around, we were really upset, because they had encoded the class names, the positions of the variables, and pretty much everything else in the TypedStreams. There was absolutely no compatibility. It was nigh impossible to read old TypedStreams or to update them, so we fled from that pretty quickly. Furthermore, the files created by TypedStreams were entirely opaque; they look like garbage if you open them. That turns out to be really bad for documents; in every app I’ve written, I try to make it so that if the user ever looks at the file in a text editor, it’ll make some sense to them. By doing so, everything gets better and easier: people start writing their own parsers for your stuff, they fix their own files if they get corrupted, and you can help fix corruption yourself.

1990: NeXTstep 2 — Text Files (6:04)

Next, we switched to text files. At The Omni Group, we made Diagram!, which you now know as OmniGraffle, and Concurrence, which you now know as Keynote, and gave them their own text file formats that we created. They were usable and user-modifiable, but as files grew, they became extremely slow: parsing a control file that contained 20,000 objects, as some Diagram! documents did, took a very long time. And we still hadn’t touched on autosave, undo and redo, or backups; those were just ignored in those days.

1992: NeXTSTEP 3 - Text “Property Lists” (6:58)

In 1992, we got a new version of NeXTSTEP. They gave us text property lists: the plist that you probably know and love today. We actually redid Concurrence to use property lists, because it was the new standard hotness. It had a lot of the same good and bad properties as the text files, with the only real difference being that it was a standard format. Therefore, you could parse it with one line of code in Cocoa, if you were writing a program to read Concurrence files for some reason.

1992: NeXTSTEP 3 - `NSUndoManager` (7:45)

With NeXTSTEP 3, we got our first undo manager. If you’ve ever done undo, you’ve probably used exactly this class and exactly these methods, as it hasn’t changed in the last 23 years. The basic model is that you say, “Hey Undo Manager, I’m about to do something. Would you remember this other thing for me, without doing it now? Then, if the user ever undoes, do it.” In other words, you tell it what you’re about to do and exactly how to undo it. It keeps a stack of these actions, or “invocations”; as you undo, it pops them off the stack and applies them. When you redo, it goes the opposite way. This is super easy to use: if you have a blue shape and you want to change it to red, you first hand the Undo Manager an invocation that sets the color back to blue, and then you change the color. When you undo, it runs that invocation and the shape goes back to its original blue.
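To make the invocation-stack model concrete, here is a toy version in current Swift. It is a stand-in written for this article, not NSUndoManager itself: `ToyUndoStack`, `Shape`, and `setColor(_:)` are hypothetical names, and closures play the role of invocations.

```swift
// A toy stand-in for the "stack of invocations" model. Not NSUndoManager.
final class ToyUndoStack {
    private enum Mode { case normal, undoing, redoing }
    private var mode = Mode.normal
    private var undoActions: [() -> Void] = []
    private var redoActions: [() -> Void] = []

    /// Remember how to reverse the change that is about to happen.
    func register(_ inverse: @escaping () -> Void) {
        switch mode {
        case .normal:
            undoActions.append(inverse)
            redoActions.removeAll()      // a fresh edit invalidates redo
        case .undoing:
            redoActions.append(inverse)  // registered while undoing: it is a redo
        case .redoing:
            undoActions.append(inverse)
        }
    }

    func undo() {
        guard let action = undoActions.popLast() else { return }
        mode = .undoing
        action()
        mode = .normal
    }

    func redo() {
        guard let action = redoActions.popLast() else { return }
        mode = .redoing
        action()
        mode = .normal
    }
}

final class Shape {
    var color = "blue"
    let undoStack: ToyUndoStack
    init(undoStack: ToyUndoStack) { self.undoStack = undoStack }

    func setColor(_ newColor: String) {
        let oldColor = color
        // The closure holds a strong reference to the live object.
        undoStack.register { self.setColor(oldColor) }
        color = newColor
    }
}

let undoStack = ToyUndoStack()
let shape = Shape(undoStack: undoStack)
shape.setColor("red")
undoStack.undo()
let colorAfterUndo = shape.color
undoStack.redo()
let colorAfterRedo = shape.color
```

Every registered closure refers to the real `Shape`, and each one is only correct if no state change was skipped along the way.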

The bad news is, it’s super easy to screw up. If you miss any state changes along there (like Xcode does), everything gets completely corrupted, and you start applying changes to this corrupted model, making it more and more corrupted. The other bad thing about it is that it requires that every object you ever create stay active forever, since it references the real objects. You must keep these live in memory, and thus the longer you’re using a document, the more objects you build up in the undo cache. It’s also bad because it’s not persistent, which wasn’t really a thing that anyone was thinking about in ‘92. But, later on when people started doing persistent undo, it became an issue.

2001: OS X 10.0 — `NSFileWrapper` / `NSBundle` (9:57)

In 2001, we got our first OS X release. We got a new class called NSFileWrapper, which has a cool solution to our problem of saving the control files without rewriting the blobs every time. It does so by keeping references to all these things and noticing which parts have changed. When a file is written, it only writes the parts that have changed. As a result, changes to control files would be saved on any change, but the images and blobs wouldn’t get rewritten every time. It’s actually a really cool class, and still used all over the place. It even takes care of atomic saves for you by making a new directory, temporarily linking the blobs across, and removing the old saved copy once all goes well, so a failed save doesn’t corrupt the document.

Although it’s a great solution, it doesn’t address what your control file should look like, whether it’s a plist or a text; it’s a solution at a different level from just the file format.

2001: OS X 10.0 — `NSKeyedArchiver` and Binary Property Lists (11:15)

Then we got NSKeyedArchiver, which was a future-proof version of the object archiving we saw earlier with TypedStreams. Although the future-proof part was good, it still had all the problems of its predecessor: it’s hard to edit, and if you stick blobs in the middle of it, you get these gigantic, monolithic files.

Then we got property lists in a binary form, instead of text, which was way more compact. However, it still doesn’t solve the issue with keeping large blobs: you don’t want to be writing enormous, two-gigabyte images to the middle of your property lists, because every time you save your control file you’ll save this two-gigabyte image in the middle of it. Also, that has implications for backup: whenever you do a backup, if one byte in a three-gigabyte file changes, you back up all three gigabytes, even if two gigabytes of it was an image. So it’s really not friendly to have your blobs included in a single giant file, like this Keynote presentation has.

2004: OS X 10.4 — Core Data (12:20)

Core Data then came along. One of the amazing things about it was that it was incredibly fast: you can store essentially unlimited items, and it’s still just perfectly fast. The slowness of saving is not at all related to the size of your database. We have users of Delicious Library who have 20,000, 30,000, 40,000 items, and when writing XML control files, these saves would take over 4 minutes. When we switched to Core Data, it became a hundredth of a second to make a change and save to SQL. It’s still a neat solution today.

One bad thing, until OS X 10.9, came from the way hard disks don’t like to actually do syncs. SQLite tried to be atomic, but it turned out that hardware controllers, in order to optimize for disk speed, had made it so that when you asked to sync the file system to the disk, it just didn’t do it at all. It’s stunning because literally, and I’m not making this up, they added a second system call, called really really sync. I mean, that’s not what it’s actually called, but I’m not making up that there’s a second call. It exists because so many people had no-op’d the sync call, despite the fact that a database will be corrupted if the machine just loses power. Then, apparently, some hardware guys worked around really really sync too. In response, the guy who did SQLite, who’s one of the biggest geniuses in the world and whom I admire the heck out of, made a journal solution that doesn’t require the disk to be hardware-synced. This solution is included in 10.9, 10.10, and 10.11, so this is now safe, but I really like that story so I decided to tell it to you.

For the first time we can think about autosaving, because it’s just so fast. There’s no reason not to autosave as frequently as possible, since it takes no time with Core Data.

Undo and redo were sort of part of Core Data, but they tried to do it at the lowest level, where any time you changed the database, it just registered an undo event. This turned out to be a horrible idea. If you do any kind of housekeeping, such as storing the user’s window position in the database, that suddenly becomes undoable. And if you have any auxiliary objects that get created, those become undoable too. It’s not a good way to do undo.

For backup, it was also a nightmare, because you have this giant SQLite database sitting there as a single file on the disk, and we have users who have three-gigabyte databases. You’re backing up this three-gigabyte file, and every time they even open it, the time stamp changes, and it backs up the three-gigabyte file again. Well, that’s going to really suck down your Time Machine space.

2007: OS X 10.5 — Time Machine (15:32)

Next, we got Time Machine. I’m the biggest fan of Time Machine in the world, especially now that it works. It’s automatic, it’s magic, and it’s the best thing ever. It’s great.

The only bad thing I can think to say about it concerns the versions stuff: I haven’t met anyone who’s ever used it. I don’t know if anybody has, but I thought it was really cool.

2010: OS X 10.7 — “Versions” (16:19)

Versions provided a type of an autosave as well, and it was kind of not clear what it was. I actually still don’t even really fully understand, because they sort of do a local backup to a hidden directory, somewhere in the top level. I don’t know when they go away, I don’t know when they appear, and I don’t know if those are backed up to Time Machine eventually or not. It’s sort of integrated to Time Machine, but maybe not. This type of voodoo frustrates both users and programmers, as a terribly implemented idea. That’s unfortunate, as the basic idea was really cool.

The Ideal Document Format: Our Perfect World (17:03)

In our perfect world of documents, we want fast loads. We don’t want to be forced to load all our data at once. We want fast saves. Fast is going to be one of the themes of our perfect world. Again, we don’t want to re-save large blobs, and we’d love to have an editable format. We want autosaves to be instant, but not to blow away the previous states of the document like Preview does. Undo and redo also need to be instant and never be corrupted by bad coding. We’d like them to be persistent, because that’s really cool, and we’d like the ability to prune them.

For security reasons, you don’t necessarily want every version of your document in the document; we’ve learned that from the Microsoft Word exploits. The backup of our perfect file format should play nicely with Time Machine, which means that it shouldn’t have one giant monolithic file that changes by one byte and restarts the backup process. And, it should be incredibly easy to implement, because this is our perfect world, so why not? Also, I’d like a pony. 🐴

2013: Git as a Document Format (18:39)

So, it’s 2013 and I’m writing my first new app in nine years. I want persistent undo in my app in order to make demo videos simple: just play the undo stack from the bottom up. My friend Sean O’Brien suggested using Git! Initially I was skeptical, as at the time I thought of Git as nothing but an ugly SCM. I now appreciate the beauty of Git, even though command-line Git is the worst thing ever.

Git, at its core, is like a row of trees. If you think of your control file as the top of a tree, with the text of your RTF file underneath and the images dangling off that, then your document structure is a tree. If you’ve got a row of such trees, one per saved state, then what you have is a persistent history of the entire document. And so you can implement undo and redo by reloading the entire document, rather than by recording state changes. In our example, instead of creating one invocation that turns the object blue to go back, and another that turns it red to go forward again, we just reload the document from scratch every time. Ten years ago this would’ve been completely idiotic, because of how slow it would be. Nowadays, since object creation is really fast, it’s a viable method, and I tested it.

Git: Saving (21:41)

Next I will detail how I used Git to save documents. You create a new local Git repo for each document. Every image and every control file is then treated as a blob, which is great. Then you stick them all in a tree, give them all UUIDs as names, write the tree, and commit. Point HEAD at the new commit, and it’s done! It’s literally about seven lines of code to do a save in an existing file. You obviously have to write your objects into the blobs, so you have to come up with some format for them. If it’s an image, the blob format will be the image itself, so that’s really easy. If it’s a control file, you still need to come up with a control file format, and decide whether to use RTF, plist, JSON, or anything else.

Git is really cool because it runs a hash on each of these objects and checks for changes. If an object hasn’t changed, it writes nothing and just “hard links” to the previous version. If it has changed, it writes a new version and a new tree. It keeps objects as separate files for a while, and then, when the repository gets big enough, it starts compressing them and throwing away extra data. It compresses when it needs to. I love that kind of thing. That’s the kind of magic I like: I can just throw blobs at it and ask for the blobs back. Sometimes they’ll be compressed and sometimes they won’t, but it doesn’t matter, because you always get them back the right way. That’s good magic.
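The “hash it, and write nothing if it is already there” behavior can be illustrated with a toy content-addressed store in current Swift. Everything here is an assumption made for the sketch: `ToyObjectStore` is an in-memory dictionary rather than Git’s on-disk object database, and the FNV-1a hash is only a small stand-in for Git’s real object hash.

```swift
/// FNV-1a: a tiny non-cryptographic hash, standing in for Git's object hash.
func toyHash(_ bytes: [UInt8]) -> String {
    var hash: UInt64 = 0xcbf29ce484222325
    for byte in bytes {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3
    }
    return String(hash, radix: 16)
}

struct ToyObjectStore {
    private(set) var objects: [String: [UInt8]] = [:]
    private(set) var writeCount = 0

    /// Stores `bytes` under their hash; identical content is never written twice.
    mutating func store(_ bytes: [UInt8]) -> String {
        let key = toyHash(bytes)
        if objects[key] == nil {
            objects[key] = bytes
            writeCount += 1
        }
        return key
    }

    /// A "tree" maps entry names to object keys and is stored as an object too.
    mutating func storeTree(_ entries: [String: String]) -> String {
        let listing = entries.sorted { $0.key < $1.key }
            .map { "\($0.key) \($0.value)" }
            .joined(separator: "\n")
        return store(Array(listing.utf8))
    }
}

var store = ToyObjectStore()
let image = [UInt8](repeating: 0xFF, count: 1_000)  // pretend this is a big photo

// First save: photo, control file, and tree are all new.
let photoKey = store.store(image)
let controlV1 = store.store(Array("title: Draft".utf8))
let firstTree = store.storeTree(["photo": photoKey, "control": controlV1])
let writesAfterFirstSave = store.writeCount

// Second save: only the control file changed, so the photo is not rewritten.
let photoKeyAgain = store.store(image)
let controlV2 = store.store(Array("title: Final".utf8))
let secondTree = store.storeTree(["photo": photoKeyAgain, "control": controlV2])
let writesForSecondSave = store.writeCount - writesAfterFirstSave
```

The second save costs two writes (the new control file and the new tree) no matter how large the unchanged photo is, which is the property the talk relies on for fast saves.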

Git: Loading (23:31)

To load, we simply reverse the process. Read the HEAD commit, get its tree, read all the blobs in from the tree, and create your objects from the blobs. We have to make sure to store each blob’s name, its UUID, in the object it creates, so that when the object gets modified and saved again, it is written back under the exact same name. Otherwise you’ll be writing a completely new tree every time, and you’ll get none of the advantages of Git’s unique-ifying.

There’s one implementation detail that exists only because NSUndoManager sucks. When you load a file, because of the way NSUndoManager works, it’s really hard to get it to play along without it undoing things itself. My solution is to create a fake stack of undo actions, one per state change. Each action I register just calls me back when the Undo Manager wants to switch versions, and I handle the switch myself through Git. Therefore, when you first load the file, you play out the whole history to the Undo Manager, as if someone were typing really fast, to reconstruct its invocation stack. I did some timing tests for this where I created 10,000 changes in a file and played them back, and this ugly step still took under a tenth of a second, so it worked.

Git: Autosave, Undo, Redo, and Backup (25:05)

Autosave: you just always save after a top-level undo group is closed. When you’re doing undo, you say, “begin a group of changes”, make some changes, and then say, “end group”. Groups can nest, and there’s a top-level group; when that one is closed, that’s usually your undo unit. You can tie into that, and the Undo Manager will call you when it closes a top-level group, which is exactly when you want to checkpoint the file. Then, you get another entry in your row of trees. In my test, it took about a hundredth of a second to do this autosave, because again, it’s only saving stuff that has changed.
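The “checkpoint when the outermost group closes” rule can be sketched with a toy grouping counter in current Swift. `ToyGroupingTracker` and its callback are hypothetical names for this illustration; with the real NSUndoManager, the analogous hook would be observing NSUndoManagerDidCloseUndoGroupNotification and consulting its groupingLevel, which this sketch does not attempt to reproduce exactly.

```swift
// A toy model of nested undo groups. Not NSUndoManager.
final class ToyGroupingTracker {
    private(set) var groupingLevel = 0
    /// Called each time the outermost group is closed.
    var onTopLevelGroupClosed: (() -> Void)?

    func beginGroup() {
        groupingLevel += 1
    }

    func endGroup() {
        precondition(groupingLevel > 0, "endGroup() without a matching beginGroup()")
        groupingLevel -= 1
        if groupingLevel == 0 {
            onTopLevelGroupClosed?()
        }
    }
}

let tracker = ToyGroupingTracker()
var checkpoints: [String] = []
var pendingChange = ""

// In a real document this is where the commit (the autosave) would happen.
tracker.onTopLevelGroupClosed = { checkpoints.append(pendingChange) }

tracker.beginGroup()          // the user starts dragging a wall
pendingChange = "move wall"
tracker.beginGroup()          // a nested group, e.g. snapping to a grid
tracker.endGroup()            // level is 1 again: no checkpoint yet
tracker.endGroup()            // level is 0: one checkpoint for the whole drag
```

Nested groups collapse into a single checkpoint, so one user-visible action becomes one commit.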

Undo: Easy, you move the head back one commit, reload, done.

Redo: move the head forward a commit, reload, done.
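As a rough model of undo and redo as “move the head, then reload”, here is a toy history in current Swift. `ToyHistory` is a hypothetical in-memory stand-in for the commit chain, with whole snapshots playing the role of trees; discarding the redo states after a new commit is an assumption of this sketch, not something the talk specifies.

```swift
// A toy commit chain: undo and redo only move `head`; the caller reloads.
struct ToyHistory<Snapshot> {
    private var commits: [Snapshot]
    private(set) var head: Int

    init(initial: Snapshot) {
        commits = [initial]
        head = 0
    }

    /// The snapshot the document should be (re)loaded from.
    var current: Snapshot { commits[head] }

    mutating func commit(_ snapshot: Snapshot) {
        commits.removeSubrange((head + 1)...)  // assumption: a new edit drops redo states
        commits.append(snapshot)
        head = commits.count - 1
    }

    @discardableResult
    mutating func undo() -> Snapshot {
        if head > 0 { head -= 1 }
        return current
    }

    @discardableResult
    mutating func redo() -> Snapshot {
        if head < commits.count - 1 { head += 1 }
        return current
    }
}

var history = ToyHistory(initial: ["wall A"])
history.commit(["wall A", "wall B"])
history.commit(["wall A", "wall B", "wall C"])

// Undo: move the head back one commit and reload the model from it.
var walls = history.undo()
let wallsAfterUndo = walls

// Redo: move the head forward one commit and reload again.
walls = history.redo()
let wallsAfterRedo = walls
```

Nothing here knows how a wall was added or removed; the model is simply rebuilt from whichever snapshot the head points at, so a missed state change cannot corrupt the history.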

Backup also works great: because you have all these files that do not necessarily belong in a single blob, modifying one little thing shouldn’t disturb the whole thing. Also, because it’s actually a standard Git file, you can push it to a remote repository and publish your actual documents on GitHub.

You can prune these files because you have, much like Time Machine, a whole bunch of interlinked trees; you can actually delete any number of trees you want, except for the head one, which is the current one, and all the other ones are still perfectly valid. They all contain links to the resources they need, and when you delete the last one that links to a resource, by definition that resource isn’t needed anymore, so it works just perfectly. You could delete everything but the first and the last tree, and you could still undo to the first or last. You could also implement branching, although that would be more complex.
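The pruning argument, that any state except the head can be deleted and a resource disappears only when its last reference does, can be modeled in a few lines of current Swift. `ToyRepository` and `prune(keeping:)` are hypothetical names for this sketch; a real Git repository reaches the same result through history rewriting and garbage collection, which this does not attempt to show.

```swift
// A toy model of pruning: drop commits, then sweep unreferenced blobs.
struct ToyRepository {
    var blobs: [String: String] = [:]                       // blob key -> contents
    var commits: [(id: Int, blobKeys: Set<String>)] = []    // oldest first; last is the head

    /// Deletes every commit except the head and those listed in `keep`,
    /// then removes blobs that no remaining commit refers to.
    mutating func prune(keeping keep: Set<Int>) {
        guard let headID = commits.last?.id else { return }
        commits.removeAll { !keep.contains($0.id) && $0.id != headID }

        let stillReferenced = commits.reduce(into: Set<String>()) { keys, commit in
            keys.formUnion(commit.blobKeys)
        }
        blobs = blobs.filter { stillReferenced.contains($0.key) }
    }
}

var repo = ToyRepository()
repo.blobs = ["photo": "photo bytes", "draft1": "a", "draft2": "b", "final": "c"]
repo.commits = [
    (id: 1, blobKeys: ["photo", "draft1"]),
    (id: 2, blobKeys: ["photo", "draft2"]),
    (id: 3, blobKeys: ["photo", "final"]),
]

// Keep only the first state and the head, as in the talk's example.
repo.prune(keeping: [1])
let remainingCommitIDs = repo.commits.map { $0.id }
let remainingBlobKeys = repo.blobs.keys.sorted()
```

The shared photo survives because both remaining commits refer to it, while the blob that only the deleted middle commit used is swept away.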

Git: Not a File Format for Control Files (27:05)

As I said, you still need to decide on the file format for the control files; Git isn’t a file format that you would parse and turn into a series of drawing commands, it just gives you the files and their blobs really easily. One thing I did is I actually decided, in a drawing program, that every single line was going to be its own file, because files are super cheap in there. That way, if I change one line, it actually only changes four bytes in Git.

If we look back to our perfect world, Git covers all of our ideal features! Well, everything but the pony. 🐴

Git as a Document Format: Demo (27:54)

[In this demo, Wil uses a drawing program that implements its document format using Git, and shows how easily he can undo, redo, save, load, and prune work.]

Implementing Git as a Document Format: The Tools and Swiftly Written Sample Code (30:06)

To do this yourself, use the following libraries: libgit2 & objective-git. These include complete Objective-C bindings, such as the GTTree and GTBlob I referred to earlier.

Here is some sample code to get you started on each of our features, hastily ported to Swift. This code is based on my real, very old Objective-C code, but won’t compile as-is & has missing pieces.

import Cocoa
import ObjectiveGit

class GitDocument : NSDocument {

  static let endOfUndoReferenceName = "refs/heads/master", headReferenceName = "HEAD"
  // MARK: properties
  var repository: GTRepository! 
  var currentCommit: GTCommit! 
  var walls = [Wall]()

  // MARK: NSDocument
  override func readFromURL(url: NSURL, ofType typeName: String) throws {
    // File loading
    do {
      repository = try GTRepository(URL: url)
    } catch { // no repository there yet: create a new file
      do {
        try GTRepository.initializeEmptyRepositoryAtURL(url)
        repository = try GTRepository(URL: url)
      } catch { // could not create the repository either
        throw NSError(domain: NSCocoaErrorDomain, code: NSFileReadUnknownError, userInfo: nil) // MISSING I'm too lazy to make a better error here, you do it
      }

      // MISSING: checkpointFileWithMessage(_:) commits the current state; it is not shown here
      checkpointFileWithMessage(NSLocalizedString("Empty Nest", comment: "commit name for an empty file"))
    }

    // MISSING: here we'd set up the NSUndoManager to have the old undos from the file, and also set "currentCommit"

    try loadTreeFromCurrentCommit()
  }

  // autosavesInPlace() is a class method on NSDocument
  override class func autosavesInPlace() -> Bool { return false } // we’ll handle this ourselves, thanks
  override func updateChangeCount(change: NSDocumentChangeType) { } // Nope! We save down at the model level, after every event, so we'll just ignore these messages, so we don't get prompted to save changes and we don't get a dirty window

  // MARK: private methods
  private func loadTreeFromCurrentCommit() throws {
    // Look up the subtree that holds one blob per wall
    let wallsTree = try GTTree.objectWithTreeEntry(currentCommit.tree, entryWithName: "connections")

    walls = []
    for entryIndex in 0..<wallsTree.entryCount {
      let wallEntry = wallsTree.entryAtIndex(entryIndex)
      let wallBlob = try GTBlob.objectWithTreeEntry(wallEntry)
      let wall = Wall(dataRepresentation: wallBlob.data)
      // Remember the entry name so the wall is saved back under the same name
      wall.nameUUID = NSUUID(UUIDString: wallEntry.name)

      walls.append(wall)
    }
  }
}

In readFromURL(_:ofType:), the loading method, I open the existing GTRepository and create a new one if opening fails (which is likely not what you’d want to do in production since you’ll be opening an existing file…).

When creating a file, you would create a checkpoint immediately after creating the repo so that the user could undo all the way to the beginning. This method is not implemented here, but is as simple as you think.

Furthermore, you’d want to turn off autosavesInPlace() by overriding it to return false, since we’ll be saving with every change instead of every 30 seconds. This way we won’t have extra undo events.

We also don’t want to use updateChangeCount(_:) since that triggers the red close button being lit up and prompting the user to save before exiting. Again, since we save to disk with every change using Git, we don’t need this.

The final sample load method is close to actual shipping code. I start by looking up the current commit, which represents one version of the document, and get its tree. I load that tree, loop through its entries, and use each entry’s name and blob to create a new object. This way I reconstruct the document from the commit.

To implement undo, redo, and more, refer to the previous slides and look into the libgit2 & objective-git as mentioned before.

About the content

This talk was delivered live in June 2015 at AltConf. The video was recorded, produced, and transcribed by Realm, and is published here with the permission of the conference organizers.

Wil Shipley
