Improving Swift Tools with libSyntax

by Harlan Haskins

Oct 16 2017

The open source Swift compiler has gained a new library, libSyntax, that will transform how we write Swift tools. Learn how libSyntax is structured, the design decisions involved with it, and how to make use of it to analyze, generate, and transform Swift code.

Introduction

My name is Harlan Haskins, and I’m here to tell you about the awesome world of tools with libSyntax. We’re going to cover compiler basics, introduce libSyntax, and go through some of its design choices and the APIs available. Lastly, I’ll cover how to write tools by writing a small Swift formatter using libSyntax.

Compiler Basics

Compilers are usually split into two pieces: the front end, and the back end. Each is a pipeline that operates on your code.

First, you’ll start with the Lexer, which takes your code and breaks it into a stream of tokens. Tokens are the smallest bit of syntactic information in the program.

For a declaration like struct Cat {}, we’ll have four tokens:

The struct keyword
The Cat identifier
The left brace
The right brace
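
As a rough illustration of what a lexer does, here is a toy sketch in Swift. The Token enum and the lex function are hypothetical names made up for this example; the real Swift lexer handles far more than identifiers, one keyword, and braces.

```swift
// A toy lexer, not the Swift compiler's: it only knows identifiers,
// the `struct` keyword, and braces.
enum Token: Equatable {
    case keyword(String)
    case identifier(String)
    case leftBrace
    case rightBrace
}

func lex(_ source: String) -> [Token] {
    var tokens: [Token] = []
    var word = ""

    // Turn the pending run of letters and digits into a keyword or identifier.
    func flushWord() {
        guard !word.isEmpty else { return }
        tokens.append(word == "struct" ? .keyword(word) : .identifier(word))
        word = ""
    }

    for character in source {
        if character.isLetter || character.isNumber {
            word.append(character)
        } else {
            flushWord()
            if character == "{" { tokens.append(.leftBrace) }
            if character == "}" { tokens.append(.rightBrace) }
            // Whitespace and anything else is dropped in this sketch.
        }
    }
    flushWord()
    return tokens
}

let catTokens = lex("struct Cat {}")
print(catTokens)
```

Note that this sketch throws the whitespace away, which is exactly the information libSyntax is designed to keep.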

Once we’ve broken down the code, we put it back together by passing it to the parser. The parser’s job is to take that lexed set of tokens and turn it into a structured representation of the program.

Once you have a structured representation of the code, you pass it through various semantic analysis passes. At the beginning of semantic analysis, Swift does name binding and then runs a constraint solver to figure out types, resolve overloads, type check, and make sure everything is okay before passing it to the back end.

The back end is more complex and involves more steps. You start by generating an underlying code representation. Swift takes your code and turns it into a representation called SIL, Swift Intermediate Language.

From there, we run some more analysis passes over SIL, for example definite initialization. Definite initialization is the pass in the compiler that makes sure that you’ve initialized all your variables, or that you’ve initialized all your properties when you’re inside an initializer.

Then, you run it through the optimizer, which generates more code, which will be analyzed and re-optimized, resulting in LLVM intermediate representation (LLVM IR). LLVM IR is a lower-level representation of the code that sits between SIL and machine code. LLVM will run hundreds of optimization, analysis, and verification passes before lowering to machine code.

Eventually, you settle on one binary file. To quote Scott Owens at ICFP 2017, “a compiler is a big function that takes a string and returns a big Int”.

Swift Abstract Syntax Tree

The Swift Abstract Syntax Tree (AST) is the structured representation of your code that the parser produces. Let’s look at this struct called Cat.

struct Cat {
  let name: String
}

It has one let property called name. If we look at a simplified version of the AST for this code, you can see all the pieces represented: the name Cat, and one property, name, that’s a String. This is essentially what the Swift compiler thinks about when it reasons about your code.

Note that there are things missing, such as the colons, spaces, and braces. All those get discarded when we do this abstract representation. Throwing this away gets rid of some important bits that we could use later.
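
To make that loss concrete, here is a minimal sketch of an abstract representation. StructDecl, PropertyDecl, and prettyPrint are hypothetical names for this example and are far simpler than the compiler's real AST. Once the source is reduced to these values, the original spacing and comments cannot be recovered; the best we can do is print the code in some fixed style.

```swift
// Hypothetical, heavily simplified AST types: names and types only,
// no tokens, whitespace, or comments.
struct PropertyDecl {
    let name: String
    let typeName: String
}

struct StructDecl {
    let name: String
    let properties: [PropertyDecl]
}

// Regenerates source in one fixed style, because the AST does not
// remember how the original was laid out.
func prettyPrint(_ decl: StructDecl) -> String {
    var lines = ["struct \(decl.name) {"]
    for property in decl.properties {
        lines.append("    let \(property.name): \(property.typeName)")
    }
    lines.append("}")
    return lines.joined(separator: "\n")
}

let original = "struct Cat {\n  let name: String // the cat's name\n}"
let ast = StructDecl(
    name: "Cat",
    properties: [PropertyDecl(name: "name", typeName: "String")])
let regenerated = prettyPrint(ast)
print(regenerated)
```

The regenerated text differs from the original: the two-space indentation and the comment are gone.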

libSyntax

The Swift compiler gained a new library called libSyntax. It aims to represent the full contents of the source file, and it provides Swift and C++ APIs for reading files, generating other Swift files, and analyzing them.

It works by keeping track of every one of those tokens in the tree itself, along with trivia. Trivia are all the invisible bits of data: spaces, new lines, comments - everything that normally gets thrown out at the beginning of parsing the code. The main goal is to be able to print the source file right back out from the structured representation.

Let’s look at a simple example of what libSyntax produces.

struct Cat {}

This struct turns into a tree in which all the tokens are still available. We have a tree structure, but the tokens are still present. Even more, this tree keeps track of the spaces after struct and Cat. This design is crucial to one of libSyntax’s main strengths: syntactic transformation.

Suppose we want to rename this struct from Cat to Feline.

struct Feline {}

Recall how we parsed the tree. We have each token and the formatting and spaces around it. If we want to rename it, we create a new identifier: Feline. By making this new identifier, we’ve created a different tree that represents a different source file.

If the nodes haven’t been changed, they’re shared with the old source file. As such, we can avoid creating all new nodes. This reinforces the notion in libSyntax, where the entire tree of your source code is only the sum of its parts.
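
The sharing described here can be sketched with a small immutable tree. Node and replacingChild(at:with:) are hypothetical names for this illustration, not libSyntax types; Node is a class only so that === can show which nodes are reused.

```swift
// Hypothetical immutable tree node; a class so `===` can show sharing.
final class Node {
    let text: String      // token text plus trailing trivia; empty for interior nodes
    let children: [Node]

    init(text: String = "", children: [Node] = []) {
        self.text = text
        self.children = children
    }

    // The source text of this subtree, printed back out.
    var sourceText: String {
        return text + children.map { $0.sourceText }.joined()
    }

    // Returns a new node with one child swapped; the other children are reused.
    func replacingChild(at index: Int, with newChild: Node) -> Node {
        var newChildren = children
        newChildren[index] = newChild
        return Node(text: text, children: newChildren)
    }
}

let catTree = Node(children: [
    Node(text: "struct "), Node(text: "Cat "), Node(text: "{"), Node(text: "}"),
])
let felineTree = catTree.replacingChild(at: 1, with: Node(text: "Feline "))

print(catTree.sourceText)
print(felineTree.sourceText)
```

Only the identifier and the root are new objects. The struct keyword and both braces are the very same nodes in both trees, and the old tree is left untouched.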

Gritty Details

libSyntax splits up the notion of the syntax tree into three constituent parts: RawSyntax, SyntaxData, and the syntax nodes.

Raw syntax

Raw syntax is the core of the syntax tree.


indirect enum RawSyntax {
  case node(SyntaxKind, [RawSyntax], SourcePresence)
  case token(TokenKind, Trivia, Trivia, SourcePresence)
}

It’s structured as an enum with two cases. One is a node, which is an abstract idea for a piece of syntax that has children. It has an array of children along with a presence. That presence says whether that piece of syntax actually existed in the source. libSyntax needs to be able to represent invalid or in-progress code. If you start writing a struct declaration but leave off the last brace, we should still be able to recover from that and give you a tree you can work with.

The other case is a token, which carries its kind, two pieces of trivia (leading and trailing), and a presence.
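
Here is a minimal runnable sketch of that enum. The supporting types are assumptions made for this example: SyntaxKind and TokenKind are cut down to a handful of cases, Trivia is modeled as a plain String, the two Trivia values are taken to be leading then trailing, and sourceText is a hypothetical helper rather than a libSyntax API.

```swift
// Assumed, simplified supporting types; the real library has many more cases.
enum SyntaxKind { case structDecl, unknown }

enum TokenKind {
    case structKeyword, leftBrace, rightBrace
    case identifier(String)

    var text: String {
        switch self {
        case .structKeyword: return "struct"
        case .leftBrace: return "{"
        case .rightBrace: return "}"
        case .identifier(let name): return name
        }
    }
}

enum SourcePresence { case present, missing }
typealias Trivia = String   // assumption: real trivia is a list of pieces

indirect enum RawSyntax {
    case node(SyntaxKind, [RawSyntax], SourcePresence)
    case token(TokenKind, Trivia, Trivia, SourcePresence)

    // Hypothetical helper: the exact source text this subtree stands for.
    // Missing pieces are part of the tree but print nothing.
    var sourceText: String {
        switch self {
        case .node(_, let children, let presence):
            guard presence == .present else { return "" }
            return children.map { $0.sourceText }.joined()
        case .token(let kind, let leading, let trailing, let presence):
            guard presence == .present else { return "" }
            return leading + kind.text + trailing
        }
    }
}

// In-progress code: the closing brace has not been typed yet.
let unfinishedCat = RawSyntax.node(.structDecl, [
    .token(.structKeyword, "", " ", .present),
    .token(.identifier("Cat"), "", " ", .present),
    .token(.leftBrace, "", "", .present),
    .token(.rightBrace, "", "", .missing),
], .present)

print(unfinishedCat.sourceText)
```

The tree still has a slot for the right brace, so a tool can see the full shape of the declaration even though the brace is marked missing.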

Trivia Rules

Leading Trivia, Trailing Trivia

We came up with two specific rules for when certain trivia applies to certain tokens.

Trailing trivia:

A token owns the trivia after it, up to the first new line

Leading trivia:

A token owns the trivia before it, starting with the first new line.

Let’s look at an example:

if x == 6 {

}

Here’s an empty if statement. Let’s go token by token, and figure out which trivia each piece owns.

The if token. Rule one says that it owns the trivia after it, up to the first new line. It owns the one space after it, but because we then run up against the x token, we stop. Rule two says it owns all trivia before it, starting from the first new line. This if statement doesn’t have any trivia before it, so it ends up with just one space of trailing trivia.

x is the same: one space after it, and no trivia before it. The same goes for == and 6. Most tokens in the tree are only going to have trailing trivia. The only time they’re going to have leading trivia is if they come after a new line.

6 has one space after it, and the left brace has a new line directly after it. Rule one says the brace owns all the trivia after it, up to that new line, so it has no leading or trailing trivia. The right brace after it has two new lines before it, so it is the owner of those two new lines.

This is how the libSyntax Lexer breaks up your code and assigns trivia to each token that you can then mess around with when you’re transforming the code.
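
The two rules can be sketched as a small function. This is not the libSyntax lexer: TriviaToken and lexWithTrivia are hypothetical names, trivia is limited to whitespace, and tokens are assumed to be separated by whitespace, which holds for the example above but not for Swift in general.

```swift
struct TriviaToken: Equatable {
    let leadingTrivia: String
    let text: String
    let trailingTrivia: String
}

// Assumption: tokens are separated by whitespace (so `x==6` would be
// treated as a single token here).
func lexWithTrivia(_ source: String) -> [TriviaToken] {
    let characters = Array(source)
    var tokens: [TriviaToken] = []
    var index = 0

    // Consumes characters while `condition` holds and returns them.
    func take(while condition: (Character) -> Bool) -> String {
        var result = ""
        while index < characters.count, condition(characters[index]) {
            result.append(characters[index])
            index += 1
        }
        return result
    }

    while index < characters.count {
        // Leading trivia: whatever the previous token did not claim,
        // which always starts at a new line (or at the start of the file).
        let leading = take(while: { $0.isWhitespace })
        let text = take(while: { !$0.isWhitespace })
        // Trailing trivia: whitespace after the token, stopping before a new line.
        let trailing = take(while: { $0.isWhitespace && !$0.isNewline })
        if text.isEmpty { break }   // whitespace at end of file is dropped in this sketch
        tokens.append(TriviaToken(leadingTrivia: leading,
                                  text: text,
                                  trailingTrivia: trailing))
    }
    return tokens
}

let ifSource = "if x == 6 {\n\n}"
let ifTokens = lexWithTrivia(ifSource)
for token in ifTokens {
    print(token)
}
```

Concatenating leading trivia, text, and trailing trivia for every token gives back the original string, which is the property the rules are meant to guarantee.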

SyntaxData

SyntaxData is a class that wraps a RawSyntax node and provides storage for the children. You can think of it as a class that has an array inside. That array represents all the children of the RawSyntax node, but unlike RawSyntax, SyntaxData nodes have an identity.

A RawSyntax node can be shared across multiple places, because a 1 followed by a space is the same wherever it appears in your source code. But the SyntaxData for a 1 in your source code is the specific one that shows up at line 7, column 4.

SyntaxData is also responsible for lazily creating its children so that you only create nodes that you ask for and avoid doing all sorts of other allocations. Finally, SyntaxData points back to the parent so you can walk up the tree if you need to then visit neighboring nodes.
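
A minimal sketch of that idea follows. RawNode is a stand-in for RawSyntax, and the property and method names (indexInParent, child(at:)) are assumptions for this example rather than the library's actual declarations.

```swift
// Stand-in for RawSyntax: text for tokens, children for nodes.
final class RawNode {
    let text: String
    let children: [RawNode]

    init(_ text: String = "", children: [RawNode] = []) {
        self.text = text
        self.children = children
    }
}

final class SyntaxData {
    let raw: RawNode
    weak var parent: SyntaxData?   // lets us walk back up the tree
    let indexInParent: Int
    // One slot per raw child, filled in the first time it is asked for.
    private var childCache: [SyntaxData?]

    init(raw: RawNode, parent: SyntaxData? = nil, indexInParent: Int = 0) {
        self.raw = raw
        self.parent = parent
        self.indexInParent = indexInParent
        self.childCache = [SyntaxData?](repeating: nil, count: raw.children.count)
    }

    // Lazily creates the wrapper for one child and remembers it.
    func child(at index: Int) -> SyntaxData {
        if let cached = childCache[index] { return cached }
        let created = SyntaxData(raw: raw.children[index],
                                 parent: self,
                                 indexInParent: index)
        childCache[index] = created
        return created
    }
}

// The same raw `1 ` token is used twice in `1 + 1 `.
let one = RawNode("1 ")
let sum = RawNode(children: [one, RawNode("+ "), one])

let root = SyntaxData(raw: sum)
let first = root.child(at: 0)
let third = root.child(at: 2)
print(first.raw === third.raw, first === third)
```

Both wrappers point at the same raw token, but each one is a distinct object that knows its own position and parent. The wrapper for the middle child is never created because nothing asked for it.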

Syntax Nodes

The raw syntax and SyntaxData are hidden. When using the library, you will work with Syntax Nodes directly. There’s a different declaration for each node in the Swift grammar, and these provide a type-safe way to get at each of the children.

Syntax Creation APIs

libSyntax gives you three APIs for creating Syntax Nodes. The make APIs exist to create fully initialized nodes if you have all their constituent parts already. The with APIs exist to transform a node by replacing one of its children. Finally, the builders exist to let you incrementally build nodes as you determine what parts you need.

make APIs


let returnKeyword =
  SyntaxFactory.makeReturnKeyword(trailingTrivia: .spaces(1))
let three = SyntaxFactory.makeIntegerLiteralExpr(3)
let returnStmt =
  SyntaxFactory.makeReturnStmt(returnKeyword: returnKeyword,
                                expression: three)

Make APIs are for making nodes when you have all the necessary parts - they are static methods on an enum called SyntaxFactory, which serves to provide them with a namespace. There’s one make API per node, and you have to provide all of the required children of each node you want to construct.

If we want to make this return 3 statement, we first create a return keyword with one space of trailing trivia. Then we create the 3 expression, and to make the actual return statement, we call makeReturnStmt, passing in that return keyword and the 3.

with APIs


let returnHello = returnStmt.withExpression(
  SyntaxFactory.makeStringLiteralExpr("Hello"))

The with APIs are used to transform nodes into other nodes. Suppose instead of returning 3, we want the statement to return "Hello". We call the withExpression method and pass in the string literal.

Syntax Builders


let structKeyword = SyntaxFactory.makeStructKeyword(
                      trailingTrivia: .spaces(1))
let catID = SyntaxFactory.makeIdentifier("Cat",
              trailingTrivia: .spaces(1))
let catStruct = StructDeclSyntax { builder in
  builder.useStructKeyword(structKeyword)
  builder.useName(catID)
  builder.useLeftBrace(SyntaxFactory.makeLeftBraceToken())
  builder.useRightBrace(SyntaxFactory.makeRightBraceToken())
}

For each piece of syntax, there is a corresponding builder struct. These provide an incremental approach to building Syntax Nodes. If we want to build that Cat struct from before, we only need four tokens: the struct keyword, the Cat identifier, and the two braces.

Structs can have more associated with them, such as attributes, generics, and inheritance clauses, but we don’t need any of those here.

If we want to make this Cat, first we make the struct keyword with one space of trailing trivia, then make the Cat identifier with one space of trailing trivia. Then we call the StructDeclSyntax builder initializer. This API takes a closure and passes a builder into it, and you tell that builder each piece of syntax that you want to use.

Guiding Principles of libSyntax API Design

These APIs were developed under a set of guiding principles. You should always know what to do next if you’re trying to build a piece of syntax.

Users of libSyntax do not have to be Swift compiler experts. They should not have to know that a struct declaration is called StructDeclSyntax, for example.

The APIs are flexible, and they will always generate code that is structurally valid, but not necessarily syntactically valid. If we were generating that Cat struct and forgot the space of trivia after the struct keyword, then when we printed it back out to a file we would have structCat, which is not valid.
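
The sketch below imitates the closure-based builder style with toy types to show how that can happen. ToyToken, ToyStructDeclBuilder, and ToyStructDecl are hypothetical and only mimic the shape of the API; they are not the libSyntax declarations.

```swift
struct ToyToken {
    let text: String
    let trailingTrivia: String

    init(_ text: String, trailingTrivia: String = "") {
        self.text = text
        self.trailingTrivia = trailingTrivia
    }
}

// Collects the pieces one at a time; anything not supplied stays nil.
struct ToyStructDeclBuilder {
    var structKeyword: ToyToken?
    var name: ToyToken?
    var leftBrace: ToyToken?
    var rightBrace: ToyToken?

    mutating func useStructKeyword(_ token: ToyToken) { structKeyword = token }
    mutating func useName(_ token: ToyToken) { name = token }
    mutating func useLeftBrace(_ token: ToyToken) { leftBrace = token }
    mutating func useRightBrace(_ token: ToyToken) { rightBrace = token }
}

struct ToyStructDecl {
    let tokens: [ToyToken]

    // The closure receives a builder and fills in whatever it needs.
    init(_ build: (inout ToyStructDeclBuilder) -> Void) {
        var builder = ToyStructDeclBuilder()
        build(&builder)
        tokens = [builder.structKeyword, builder.name,
                  builder.leftBrace, builder.rightBrace].compactMap { $0 }
    }

    var sourceText: String {
        return tokens.map { $0.text + $0.trailingTrivia }.joined()
    }
}

let goodCat = ToyStructDecl { builder in
    builder.useStructKeyword(ToyToken("struct", trailingTrivia: " "))
    builder.useName(ToyToken("Cat", trailingTrivia: " "))
    builder.useLeftBrace(ToyToken("{"))
    builder.useRightBrace(ToyToken("}"))
}

// Same structure, but the trailing space after `struct` was forgotten.
let badCat = ToyStructDecl { builder in
    builder.useStructKeyword(ToyToken("struct"))
    builder.useName(ToyToken("Cat", trailingTrivia: " "))
    builder.useLeftBrace(ToyToken("{"))
    builder.useRightBrace(ToyToken("}"))
}

print(goodCat.sourceText)
print(badCat.sourceText)
```

Both values have the same four tokens in the same order, so both are structurally fine. Only the printed text reveals that the second one is not valid Swift.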

Because the APIs are all named similarly, they’re auto-complete friendly: you can start typing SyntaxFactory.make and get a listing of all the things that you’re able to make.

There’s Still Work to be Done

libSyntax is still heavily in development. The syntax trees you get from libSyntax are going to be very bare-bones. They don’t include all the structure. Mostly, they provide access to tokens with trivia, which is why a formatting pass only needs to deal with tokens. If this meets your needs for formatting, you can start using libSyntax without any issues.

Otherwise, you’re going to see nodes called UnknownSyntax. To get that full structure, we need to go through and update the Swift parser to produce libSyntax nodes.

More Information

You can get to the libSyntax documentation here. You can get to the Swift open source repo here.

About the content

This talk was delivered live in September 2017 at try! Swift NYC. The video was recorded, produced, and transcribed by Realm, and is published here with the permission of the conference organizers.

Harlan Haskins

Harlan is a Computer Science student at Rochester Institute of Technology. He’s previously worked at Apple as an intern on the Swift Quality Engineering team, where he contributed to LLVM, Swift, and the Swift Migrator. He’s also been working on Swift libraries to interface with LLVM and Clang, which he uses in his hobby compiler, Trill. He currently works as an iOS engineer at Bryx, Inc making apps for 911 and EMS responders.
