Reverse Engineering Code Completion

by JP Simard

Nov 23 2015

In this talk, JP Simard walks through the process he took to reverse-engineer the way Xcode generates auto-complete options for Swift, and to port that functionality over to Atom, GitHub’s hackable text editor.

Hacking Xcode (0:00)

I want to show you something I’ve been hacking on for a few hours. It’s always fun to reverse engineer stuff, so I wanted to extract away parts of Xcode so you can re-access them through other editors.

We all know that Xcode has been growing in complexity and really exploding over the last few years. Right now, it’s a 3D editor, with Interface Builder; you’ve got LLDB; you’ve got so much functionality that’s baked in there, and sometimes, it’s just too much. It would sometimes be nice to take some of that functionality and put it elsewhere, like in your favorite text editor.

SourceKit Logging Flag (1:05)

One thing I wanted to look at was code completion in Swift, so I want to show you some of the steps that I took to try and figure out how Xcode gets its code completion information for Swift files, and try to see if we can duplicate that in another text editor.

To start off, there are all sorts of really neat environment variables you can pass to Xcode to make it do really cool things. One of those is a SourceKit logging flag. If you set it to three, the highest possible value, and then launch Xcode, you’ll see a bunch of log output in the terminal. That log is pretty useful. If we go to Xcode, we’ll open up this playground file that I just created. It’s an empty playground. There’s nothing in here, but we’re already getting a bunch of log information. SourceKit is showing us exactly what it’s doing.

We see it initializing, and we see Xcode sending some requests, then we see an editor.open request for this playground file. It does some mapping to figure out what these calls mean. Whenever you see request_sync-after, those are responses from SourceKit. Further down, it’s asked to parse, and we don’t get anything back because it’s an empty file.

Then we get to a really weird part: Xcode sends a bunch of requests over and over again. It opens the document and then closes it. It opens it again, but this time with compiler arguments, and then it parses, but there’s still nothing, so it just sends these requests over and over again. It even sends the request to replace the text, even though nothing’s changed here. It’s kind of funny. If I were to guess, this is probably Xcode working around some funny issues, most likely an implementation detail of how SourceKit is actually built.

A Single Character (2:51)

Let’s just wipe that away, and see what happens when we type a single character. We actually got a bunch more information here! Let’s see what we have. The next request that came in was this replacetext with the character “0”, exactly what we want, and then the response: we actually get some syntax information. You see that we have a number literal at x offset and x length, so this is pretty neat.

Now, we’ll see that Xcode already sent a code completion request with all of our compiler arguments, and the new sourcetext of just the number “0”. When that came back, we have this response from SourceKit with all these results. This is actually wrong, because this is the code completion information that would come back if you had typed nothing, so I find that a little funny. Let’s wipe that away again, and type something else.

Let’s type a ., and boom, all of a sudden, you have all of these different options for code completion. I’ll quit Xcode and see what this response actually looks like. We replacetext, we add our ., and we have our new parsing response. We have this number literal and then we have a few diagnostics, but the most interesting part is that when we sent this new code completion request, we actually got back some real code completion results here.

In this result, we have a bunch of options, a few hundred in all. You can call advancedBy, call other functions, or get properties on this… this is the information that we want to get to! We want to be able to reproduce this information in our own text editor.

Obtaining Documentation (4:46)

We know that there’s a code completion request that’s sent via XPC to the SourceKit service, and that the response brings back all this information. You get the name of the code completion item and the sourcetext, which actually has placeholders: if you tab through, you’ll see that you can enter these placeholder values. You also get a description, the typename, even a few usrs, which, let’s not get into what that is. You can also see which module it’s coming from.
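To make the shape of one result concrete, here is a minimal Swift sketch of the fields listed above. The type, its property names, and the sample values are hypothetical, chosen to mirror the talk rather than SourceKit's actual keys or a real response. The only outside fact it relies on is that Xcode writes placeholders as <#label#>.

```swift
import Foundation

// Hypothetical model of one completion result. Field names mirror the talk,
// not necessarily the key names SourceKit uses.
struct CompletionItem {
    let name: String          // what the completion list displays
    let sourcetext: String    // text to insert, placeholders included
    let description: String
    let typeName: String
    let moduleName: String
    let docBrief: String?     // only some results carry documentation
}

// Lists the labels of Xcode-style placeholders (<#label#>) in a source text.
func placeholders(in sourcetext: String) -> [String] {
    // Everything after each "<#" up to the next "#>" is one label.
    return sourcetext
        .components(separatedBy: "<#")
        .dropFirst()
        .compactMap { (piece: String) -> String? in
            let parts = piece.components(separatedBy: "#>")
            return parts.count > 1 ? parts[0] : nil
        }
}

// Illustrative values only, not copied from a real SourceKit response.
let item = CompletionItem(
    name: "advancedBy(n:)",
    sourcetext: "advancedBy(<#n: Int#>)",
    description: "advancedBy(n: Int)",
    typeName: "Int",
    moduleName: "Swift",
    docBrief: nil
)
print(placeholders(in: item.sourcetext))
```

An editor plugin would typically turn each placeholder label into a tab stop, which is what tabbing through the completion in Xcode does.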

This is neat, but a very interesting part is that for some of these responses, you actually get documentation. This is extremely useful, and it’s a way to expose the documentation for these auto-completion calls. We should, theoretically, be able to access this in our own code completion service that we want to run later. So let’s take a look.

If we actually send this XPC request to SourceKit ourselves, we get back this binary data. This is where the code completion information lives. It proves a little difficult to parse, but if we scroll down a bit, we’ll actually see our code completion option in strings here, in UTF-8 strings that this text editor is detecting.

The Parsing Struggle is Real (6:14)

How would we actually go about parsing this information? We see that these are variable length strings, so how can we determine what their boundaries are? We could check wherever there are null bytes, but with Swift supporting Unicode, these null bytes could really come at any point. They can be part of a larger grapheme cluster. We’re kind of back to this garbage data that’s at the beginning of this binary data anyway.
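A minimal sketch of the naive null-byte approach shows why it falls short. The buffer here is made up for illustration (a few pretend header bytes followed by two C-style strings) and is not taken from a real SourceKit response.

```swift
// Naive approach: treat every zero byte as a string terminator.
func splitOnNullBytes(_ bytes: [UInt8]) -> [String] {
    return bytes
        .split(separator: 0)   // empty runs between zeros are dropped
        .map { String(decoding: $0, as: UTF8.self) }
}

// Made-up buffer: a small binary header, then two null-terminated strings.
var sample: [UInt8] = [0x2D, 0x00, 0x00, 0x00]   // pretend header bytes
sample.append(contentsOf: "abs".utf8)
sample.append(0)
sample.append(contentsOf: "advancedBy".utf8)
sample.append(0)

print(splitOnNullBytes(sample))
// prints ["-", "abs", "advancedBy"]
```

The header byte 0x2D comes back as the string "-", so splitting alone cannot tell binary fields from text. That is the "garbage data" problem, and it is why some knowledge of the layout is needed.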

However, you can kind of make out some sort of diagonal pattern of information. If we try just adjusting it a bit, you actually get patterns that are a lot more obvious: there are repetitive segments. If we see how long they are, we find that they’re in multiples of 45 bytes. This is a pretty good hint that every 45 bytes, we have some repeated information to figure out.
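In the talk the 45-byte period is found by eye, by resizing a hex view until the pattern lines up. As a rough programmatic stand-in (not the speaker's method), one could score each candidate width by how often a byte equals the byte that many positions later. The function names and the synthetic blob below are assumptions made for this sketch.

```swift
// Fraction of positions where a byte equals the byte `shift` positions later.
func matchFraction(_ bytes: [UInt8], shift: Int) -> Double {
    guard shift > 0, bytes.count > shift else { return 0 }
    let comparisons = bytes.count - shift
    var matches = 0
    for i in 0..<comparisons where bytes[i] == bytes[i + shift] {
        matches += 1
    }
    return Double(matches) / Double(comparisons)
}

// Returns the candidate width with the highest self-similarity, if any.
func likelyRecordWidth(_ bytes: [UInt8], candidates: ClosedRange<Int>) -> Int? {
    var bestWidth: Int? = nil
    var bestScore = 0.0
    for width in candidates {
        let score = matchFraction(bytes, shift: width)
        if score > bestScore {
            bestScore = score
            bestWidth = width
        }
    }
    return bestWidth
}

// Synthetic data: 20 records of 45 bytes. One byte changes per record and
// the other 44 stand in for structure that repeats from record to record.
var blob: [UInt8] = []
for record in 0..<20 {
    blob.append(UInt8(record))
    blob.append(contentsOf: (1..<45).map { UInt8($0) })
}

print(likelyRecordWidth(blob, candidates: 2...64) ?? -1)
// prints 45
```

Multiples of the true width score about as well, so the candidate range should stay small, or the smallest high-scoring width should be preferred.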

From there, we can do more extrapolation. It’s likely that the offsets to these strings are written in the preliminary data, so you can parse this original data and then see what offsets to read for the name, the documentation, and the type that come further down. From here on out, it’s really just trying to infer patterns in the data to find what kind of repetitive information is actually involved here.

SourceKitten 🐱 (8:11)

If we open up SourceKitten, a little framework that helps wrap calls to SourceKit, we should be able to see what parsing this information actually looks like. Let’s fast forward and go through all the time that was just spent trying to map numbers to other numbers in this binary file, to eventually reach something like this.

You figure out that, at the eighth byte, for a length of eight, you can actually see what the maximum range is. This lets you see how large the file is. You get to infer these numbers just by changing the inputs and the outputs, seeing what’s in common, counting the number of results that come back, and checking whether that number matches anywhere in the binary blob. In my example, we figured out that at the eighth byte, for a length of eight, you can read how much of the rest of the binary blob to parse.
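Reading that field could look like the sketch below. It assumes the eight bytes form a little-endian unsigned integer. The talk does not state the byte order, so treat that as a guess, and the sample header is invented.

```swift
// Reads `length` bytes starting at `offset` as an unsigned integer.
// Little-endian is an assumption; the talk does not state the byte order.
func readLittleEndian(_ bytes: [UInt8], at offset: Int, length: Int = 8) -> UInt64? {
    guard offset >= 0, length > 0, length <= 8,
          offset + length <= bytes.count else { return nil }
    var value: UInt64 = 0
    for i in 0..<length {
        // Byte i contributes bits 8*i through 8*i + 7.
        value |= UInt64(bytes[offset + i]) << (8 * i)
    }
    return value
}

// Invented 16-byte header: 8 unknown bytes, then a length field holding 90.
var sampleHeader = [UInt8](repeating: 0, count: 8)
sampleHeader += [0x5A, 0, 0, 0, 0, 0, 0, 0]

print(readLittleEndian(sampleHeader, at: 8) ?? 0)
// prints 90
```

Returning nil on a short buffer keeps a truncated response from crashing the parser.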

We’ve already established that we have patterns that repeat every 45 bytes. After reading this length here, we’re at byte 16. This is where Swift’s syntax is actually pretty nice. It looks a little funny when it’s shrunken up on the screen here with this huge text, but using the stride function, you can parse these chunks in a very readable and easy-to-write fashion. We’re parsing chunks of 45 bytes in the data, starting at offset 16, and we can check to see where the offsets for all of these variable length strings should be in our file.
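A minimal sketch of that chunking step follows, written against current Swift, where stride is a free function. The 16-byte header and 45-byte record width come from the talk; the function name, the regionLength parameter, and the all-zero demo buffer are assumptions made for illustration.

```swift
// Splits the region after the header into fixed-width records.
// headerSize and recordSize follow the numbers quoted in the talk;
// regionLength stands in for the value read from the length field.
func recordChunks(in bytes: [UInt8],
                  headerSize: Int = 16,
                  recordSize: Int = 45,
                  regionLength: Int) -> [ArraySlice<UInt8>] {
    let end = min(bytes.count, headerSize + regionLength)
    var chunks: [ArraySlice<UInt8>] = []
    // Only keep chunks that fit entirely inside the region.
    for start in stride(from: headerSize, to: end, by: recordSize)
        where start + recordSize <= end {
        chunks.append(bytes[start..<(start + recordSize)])
    }
    return chunks
}

// Demo buffer: header, three records, then 10 bytes of unrelated data.
let demo = [UInt8](repeating: 0, count: 16 + 3 * 45 + 10)
let chunks = recordChunks(in: demo, regionLength: 3 * 45)
print(chunks.map { $0.startIndex })
// prints [16, 61, 106]
```

Because each chunk is an ArraySlice, it keeps the indices of the original buffer, which makes it easy to cross-check positions against a hex dump.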

We can flatten that out, extract the UTF-8 string information from the binary data, and then continue enumerating over these chunks of 45 bytes. We end up with things like the name, the descriptionKey, and the sourcetext. This is mapping the offset to the binary data that’s in the file. Once we run this through, we can extract all of the same information that we had via Xcode, but here, the result of all this parsing is that we now have our own process. You get the exact same information for a code completion request, including this nice documentation.
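The last step, turning an offset into a string, might be sketched like this. It assumes each string runs from its offset to the next zero byte, which the talk does not confirm. The string table, field names, and offsets are invented for the example.

```swift
// Reads a UTF-8 string that starts at `offset` and runs to the next zero
// byte (or the end of the buffer). Null termination is an assumption.
func readString(_ bytes: [UInt8], at offset: Int) -> String? {
    guard offset >= 0, offset < bytes.count else { return nil }
    let tail = bytes[offset...]
    let end = tail.firstIndex(of: 0) ?? tail.endIndex
    return String(decoding: tail[..<end], as: UTF8.self)
}

// Invented string table: "advancedBy" at offset 0, "Int" at offset 11.
var table: [UInt8] = []
table.append(contentsOf: "advancedBy".utf8)
table.append(0)
table.append(contentsOf: "Int".utf8)
table.append(0)

// Hypothetical field-to-offset map, as a parsed 45-byte record might give.
let fieldOffsets = ["name": 0, "typeName": 11]
let fields = fieldOffsets.compactMapValues { readString(table, at: $0) }
print(fields["name"] ?? "", fields["typeName"] ?? "")
// prints advancedBy Int
```

Unlike splitting the whole blob on zero bytes, this only decodes text at positions the record says hold a string, so binary fields are never misread as text.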

Putting Our Hack to Work (10:54)

We successfully reproduced some of the functionality of Xcode. Now let’s put it to use. The editor Atom is supposed to be a hackable text editor. It’s written in JavaScript running on V8, with this kind of web interface, so its big claim to fame is that it should be really easy to hack. If we want to add our auto-completion to a different editor, then Atom might be a good start.

I’ve extracted this parsing information into a binary and into some commands that you can send to it, such as something like “complete x source text at x offset.” I also simply forked Atom’s auto-completion library. Every time there’s this getSuggestions callback that’s triggered, we get the Text from the editor, we get the characterIndex, and then we just pass that to SourceKit, which does all the heavy lifting. We have this makeSuggestion that’s just parsing JSON, and then it puts it back into the data structure that Atom is expecting to be able to parse to provide these auto-completion examples.
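One detail that is easy to trip over when handing an editor position to a completion backend is the unit of the offset. The sketch below assumes the backend wants a UTF-8 byte offset while the editor reports a character index. Both are assumptions about the setup rather than something shown in the talk, and the function name is hypothetical.

```swift
// Converts a character index into a UTF-8 byte offset within `text`.
// Returns nil when the index lies outside the text.
func utf8Offset(in text: String, characterIndex: Int) -> Int? {
    guard characterIndex >= 0,
          let index = text.index(text.startIndex,
                                 offsetBy: characterIndex,
                                 limitedBy: text.endIndex)
    else { return nil }
    // Count the bytes of everything before the cursor.
    return text[..<index].utf8.count
}

print(utf8Offset(in: "0.", characterIndex: 2) ?? -1)
// prints 2
print(utf8Offset(in: "\u{1F431}.", characterIndex: 1) ?? -1)
// prints 4
```

For plain ASCII source the two numbers agree, which is why a mismatch tends to show up only once a file contains emoji or accented characters.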

If we run this on some Swift code and start writing, we see all the same information that we would get through Xcode, as well as all the documentation. That is just a little look at what reverse engineering some of Xcode’s functionality would look like to try to port it over to some other editor.

About the content

This content has been published here with the express permission of the author.

JP Simard

JP works at Realm on the Objective-C & Swift bindings, is the creator of jazzy (the documentation tool Apple forgot to release), and enjoys hacking on Swift tooling.
