A Look Into Realm's Core DB Engine

by JP Simard

Oct 5 2015

Most of Realm is open-source, but the secret sauce behind Realm’s platform is the Core DB engine written from scratch in C++. In this talk, Realm’s own JP Simard provides a peek into Realm’s Core! JP explains the principles behind its design that make it so fast and efficient, as well as the reasons for writing it on our own as opposed to wrapping SQLite, as other mobile database libraries including Core Data have done. Learn about what makes an effective model layer and Realm’s advantages over other solutions. 🎯

Introduction (0:00)

Ever since we launched Realm just over a year ago, we’ve gotten a lot of questions about how things work. We’ve been pretty vocal about how you can use Realm and what its advantages are, but we’ve been shy about sharing the underlying complexity and power behind the core database engine.

I started thinking about sharing the rationale behind why we decided to build Realm entirely from scratch, rather than building an ORM on top of a preexisting, proven, robust database core like SQLite. Let’s take a look inside what makes Realm tick: the C++ core engine that’s common to all of our bindings at the moment.

All Objects, All the Time (1:33)

The whole point of Realm, or at least one of its very core ideas, is that it is objects all the way down. That was one of the driving principles that encouraged us to start fresh, rather than using an existing relational model. If you look at existing solutions that are currently out there, they tend to be ORMs. More often than not, there’s this conceptual object-oriented model that people are working with, which is really an abstraction of what’s going on underneath. Usually, these are records, tables with foreign keys, and primary keys. As soon as you start to have relationships, the abstraction starts to fall apart because you start needing expensive operations to be able to traverse these relationships.

Let’s face it, this is how everyone builds apps today, and it’s how people have been building apps since the beginning of the smartphone revolution. Unfortunately, this is really not how the existing database options were conceived to function. As a result, there is often an unnecessary layer of mapping and complexity that, when avoided, can result in big gains. Gains not only in performance, but in simplicity as well.

What is Realm? (3:17)

At its heart, Realm is really an embedded database, but it’s also a way to think about data. It’s a way to think about models and business logic for your mobile applications. One way we make that happen is by trying to minimize overhead. We try to keep it as fast as possible, so we’re always tweaking the performance numbers.

We also try to keep as many operations as zero-copy as possible. That’s why we go out and replace your object’s accessors as you’re using them. We reduce the need to ship things from the database, read it out, and put it in an instance variable so that you can use it. Instead, you just have raw access to the database. That way, you never copy anything out, and you never deserialize when you don’t mean to.

Realm is used by a lot of people. It’s ACID-compliant, like any good database should be, and it’s cross-platform.

What Does Realm Look Like? (4:17)

Swift

let company = Company() // Standalone Realm Object
company.name = "Realm" // etc...

let realm = Realm() // Default Realm
realm.write { // Transactions
  realm.add(company) // Persisted Realm Object
}

// Queries
let companies = realm.objects(Company) // Typesafe
companies[0].name // => Realm (generics)
// "Jack"s who work full time (lazily loaded & chainable)
let ftJacks = realm.objects(Employee).filter("name = 'Jack'")
                .filter("fullTime = true")

Objective-C

// Standalone Realm Object
Company *company = [[Company alloc] init];
company.name = @"Realm"; // etc...

// Transactions
RLMRealm *realm = [RLMRealm defaultRealm];
[realm transactionWithBlock:^{
    [realm addObject:company];
}];

RLMResults *companies = [Company allObjects];
// "Jack"s who work full time (lazily loaded & chainable)
RLMResults *ftJacks = [[Employee objectsWhere:@"name = 'Jack'"]
                                 objectsWhere:@"fullTime == YES"];

Java

Realm realm = Realm.getInstance(this.getContext()); // Default Realm
realm.beginTransaction(); // Transactions
Company company = realm.createObject(Company.class); // Persisted
company.setName("Realm"); // etc...
realm.commitTransaction();

// Queries
Company firstCompany = realm.where(Company.class).findFirst();
firstCompany.getName(); // => Realm
// "Jack"s who work full time (lazily loaded & chainable)
RealmResults<Employee> ftJacks = realm.where(Employee.class)
                                      .equalTo("name", "Jack")
                                      .equalTo("fullTime", true)
                                      .findAll();

You can create objects with Realm, just as you would any regular Swift or Objective-C object, and set their properties. The Realm object itself is this concept of a database connection. As soon as you instantiate a Realm, you’re already connected to the database. A database connection is not as expensive as you might think. In Realm, we’ll memory map whatever we can in order to minimize the amount of data that you have to keep in memory for all of this to happen. As soon as you add this company object to the Realm, it becomes an accessor. Once you start reading properties from it, you’re no longer accessing your ivars, you’re accessing the raw database values, with the benefit of cutting out four or five steps and a bunch of memory copies along the way.

Then we get to querying. In Swift, we use generics (much like Objective-C, as of Xcode 7), and that allows you to do things like property accesses on lists and results. This concept of zero-copy and lazy-loading is really shown when you’re doing a filtered query, for example trying to find all the Jacks in a company who work full-time. Even though we’re doing the same query operation as we did a few lines before to get all the companies, we’re not reading all the employees from disk. Instead, we’re compiling what this query object is. Even though we’re doing this composition by adding one filter after another, we’re not redoing all these queries, we’re essentially building a tree of what the result should look like. Even if you just access the first result out of this query, we’re not going to have to read all the properties for all the other objects, because we really try to keep it lazy. This is the kind of behavior that allows us to get some nice performance metrics.

However, it does come at a cost. Something we hear all the time is, “it would be great if I could use Realm with Swift structs.” Yes, that would be really cool, but the way the Swift language is designed at the moment, you would have to copy the entirety of all of your objects and your object graph into memory for this to work, and that’s exactly what we want to avoid. So yes, even though you can go out of your way to detach all these objects, put them in memory, and have all of these heavy but light-feeling objects that are completely detached from the database, it’s still counterproductive to the way that we are trying to encourage you to build apps. We want to help you move from being forced to work against your tools to really using your tools.

For Objective-C, the layout is similar. In Java, we have explicit query functions, but overall the layout is also very similar. In Cocoa, we’ll use NSPredicates.
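
To make the lazy, chainable behavior concrete, here is a minimal Java sketch. It is not Realm's implementation: `LazyQuery`, its `evaluations` counter, and the use of `java.util.function.Predicate` are all assumptions made for illustration. The point is only that chaining filters composes a condition, and work happens when a result is actually accessed.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of a lazy, chainable query; not Realm's implementation.
final class LazyQuery<T> {
    private final List<T> source;       // stands in for the stored objects
    private final Predicate<T> filter;  // composed condition, not yet evaluated
    int evaluations = 0;                // counts condition checks, for illustration only

    LazyQuery(List<T> source) {
        this(source, item -> true);
    }

    private LazyQuery(List<T> source, Predicate<T> filter) {
        this.source = source;
        this.filter = filter;
    }

    // Chaining only composes predicates; nothing is scanned here.
    LazyQuery<T> filter(Predicate<T> next) {
        return new LazyQuery<>(source, filter.and(next));
    }

    // Work happens on access, and stops at the first match.
    T first() {
        for (T item : source) {
            evaluations++;
            if (filter.test(item)) {
                return item;
            }
        }
        return null;
    }
}
```

Calling `filter` twice and then `first()` scans only as far as the first match, which is the kind of laziness the chained `filter` calls above rely on.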

Why Build Realm From Scratch? (7:37)

Why did we decide to design our own database engine rather than use something that was already out there? Why create something new when we had something that was, say, 15 years old, robust, solid, and battle-tested?

Part of that, as I mentioned, is avoiding the ORM, and the abstractions that come with it; by cutting them and having a binding that’s as thin as possible, we can cut out complexity. Yet another reason why we decided to build our own database engine, though, is that there have been a lot of developments and research into commercial databases.

There’s a chart in the video above that plots out the database innovation that has happened really since the late 1990s. Check it out in the video and slides above.

You notice a flurry of activity on the top of the graph, which represents server-side databases. There is something to be said for those, sure: every hot startup trying to make yet another database. There has been a lot of innovation here, especially as of 2007, the smartphone revolution. You’d think that you’d see something similar on the mobile scale, but the truth is you really don’t.

There’s a bottom half to this chart, and these are mobile databases. You’ve got SQLite, that came out in 2000-ish…and literally nothing else since then. You’ve had a lot of wrappers built around SQLite: Core Data, ORMLite, greenDAO. A lot of great products that are built on this underlying core technology, but nothing that ever came close to replacing that database engine at a core level. We looked at this, and saw that there were a number of opportunities to bring some of these newer technologies back to mobile, where they have a lot of great advantages, but without many of the constraints that you have on the server-side.

Server-side databases make a lot of compromises to fulfill server requirements, like being massively distributed, shardable across many instances, and with low latency across the Internet. The minute you’re designing for a mobile phone, you can put away many of those concerns and just focus on getting the best local experience.

MVCC: Multiversion Concurrency Control (10:41)

The goal is to take a lot of these innovations and bring them back to the mobile world. One of those is this concept of MVCC: Multiversion Concurrency Control. It’s the same design that powers source control systems like Git. You can think of Realm’s internal model a lot like Git, with concepts like branches and atomic commits. This means that you can be working on multiple branches without ever needing complete duplication of all that data. You get copy-on-write semantics, and you never have to worry about a write in another thread affecting you. You always have full isolation in your own transaction, and you can do this with very little overhead.

You can think of every transaction as a snapshot of the database when the transaction starts. This is why you can be making a bunch of changes with hundreds of threads concurrently accessing the database and not be screwed.

Another way that this is represented is the ability to perform a write transaction without blocking a read and without having to do a lot of bookkeeping. You can fork off in that write, and then continue reading in your read, so you can have multiple read transactions even while you have a write transaction going on. In a sense, it’s an immutable view of that data, a snapshot. The way Realm works currently, at the bindings level, allows you to always be able to modify information within a single transaction. However, with this underlying scheme, we’d be able to prevent any mutation from happening in that transaction, and then you get immutability for free. Basically, you get a lot of really great advantages at a core, fundamental level, without having to work against the underlying database store like a lot of these ORMs do to provide these same functions.
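
The snapshot idea can be sketched in a few lines of Java. This is an illustration under stated assumptions, not how the Core engine is written: `VersionedStore` is a hypothetical name, and it copies the whole map on every write, whereas a real copy-on-write tree would share all unchanged nodes between versions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of snapshot isolation; the names are illustrative only.
final class VersionedStore {
    // The "HEAD" pointer: always refers to a complete, immutable version.
    private final AtomicReference<Map<String, String>> head =
            new AtomicReference<>(Collections.<String, String>emptyMap());

    // A read transaction just pins whatever version is current right now.
    Map<String, String> beginRead() {
        return head.get();
    }

    // A write copies, modifies the copy, then publishes it atomically.
    // synchronized keeps this sketch to one writer at a time.
    synchronized void write(String key, String value) {
        Map<String, String> next = new HashMap<>(head.get());
        next.put(key, value);
        head.set(Collections.unmodifiableMap(next));
    }
}
```

A reader that called `beginRead()` before a write keeps seeing its own version, and the writer never has to wait for readers to finish.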

Native Links (13:18)

Links, links, links, links everywhere. The whole point of the file structure is that everything is a link. We like saying that it’s just B-trees all the way down, which is why queries on multiple links (relationships) are so fast in Realm. You don’t have a double abstraction from ORM to relational; you have raw links to the objects, right in the file format. It’s the same whether you’re querying an integer column, a relationship, a one-to-many, or a many-to-many.

It’s also great for object graph traversals, which happen to be most of the operations we tend to do when we design mobile apps in an object-oriented way. You just follow pointers.

There are a bunch of optimizations that we can make at the core level, such as native links at the file format level. That’s something we couldn’t have just submitted a patch to an existing database engine to add support for. These are fundamental changes that cannot be added easily as an extra feature on something that’s out there.

String & Int Optimizations (14:18)

Another element is optimizing how some of this data is stored. One thing we can do is convert repeated strings into enums. Say, for example, you have a dropdown of which country your users are from, and you want to represent those countries with their names. In the case of our staff, we would have Denmark, U.S.A., Canada, Australia, etc. You can have this giant list of countries, even hundreds of countries, but if you have even a few thousand entries in your database, then all of a sudden you run into a lot of duplication of those strings. What we can do is walk through your strings and turn them into enums, so they’re almost like tagged pointers in Objective-C, providing quick lookup.
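
As a rough illustration of the string-to-enum idea, here is a Java sketch of a dictionary-encoded column. The class and method names are hypothetical, and the actual on-disk representation in Core is not shown in this talk; the sketch only demonstrates storing each distinct string once and keeping a small index per row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dictionary-encoding a string column.
final class EnumStringColumn {
    private final List<String> dictionary = new ArrayList<>();    // each distinct string once
    private final Map<String, Integer> indexOf = new HashMap<>(); // string -> dictionary slot
    private final List<Integer> rows = new ArrayList<>();         // one small index per row

    void add(String value) {
        Integer index = indexOf.get(value);
        if (index == null) {
            index = dictionary.size();
            dictionary.add(value);
            indexOf.put(value, index);
        }
        rows.add(index);
    }

    String get(int row) {
        return dictionary.get(rows.get(row));
    }

    int distinctValues() {
        return dictionary.size();
    }
}
```

With thousands of rows and only a handful of countries, the dictionary stays tiny and equality checks can compare indexes instead of string contents.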

Integer packing is designed to have ints take as little space as possible. That’s why in Realm, it doesn’t really matter when you specify an 8, 16, 32, 64 in your model. Realm is still going to store it under the hood as an int in the most optimized way that it can, with little to no performance overhead because of the way it works.
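
One way to picture integer packing is to choose the storage width from the values actually present rather than from the declared type. The Java sketch below does exactly that; the candidate widths (8, 16, 32, 64 bits) and the `IntPacking` name are assumptions for illustration, not a description of Core's real encoding.

```java
// Hypothetical sketch: choose the narrowest width that fits every value in a column.
final class IntPacking {
    // Smallest of 8, 16, 32 or 64 bits that can hold every (signed) value.
    static int bitsNeeded(long[] column) {
        int bits = 8;
        for (long value : column) {
            // Widen until the value fits in the signed range for this width.
            while (bits < 64
                    && (value < -(1L << (bits - 1)) || value > (1L << (bits - 1)) - 1)) {
                bits *= 2;
            }
        }
        return bits;
    }
}
```

A column declared as 64-bit that only ever holds small counters would, under this scheme, still be stored at the narrowest width.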

Crash Safety (16:00)

Crash safety is a big one. If you have very light data needs, you can do things like just serialize to a binary plist, JSON file, or what have you, but one big risk you run is having your phone run out of battery halfway through your write. The next time you launch your phone, you’ll see the file and think “all good!” but when you try to read from it, you’ll find it to be completely corrupted.

Realm has a few semantics in place to rethink how data is stored in a way that can work around some operating system bugs and unforeseen crashes. It can really protect your users’ data in these cases. As I mentioned, Realm is similar to a gigantic B-tree, and at any point in time, you have the top-level commit (à la Git’s HEAD commit). As you’re making changes, copy-on-write behavior is happening, meaning you’re forking the B-tree and writing without modifying existing data, so if something goes wrong, that initial data is still there. Thankfully, by design, that top-level pointer is still pointing to the non-corrupt tree; you’re just writing elsewhere. What happens finally is that you’ve got this two-phase commit concept, where once we’ve confirmed that everything is synced to disk and safe, we’ll move that pointer over and we’ll say, “Okay, this is the new official version.” That means that the worst that can happen is that, in a write transaction, you’ll lose just the parts that you were working on rather than the whole thing.
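
The write-elsewhere-then-move-the-pointer sequence can be sketched with plain Java file I/O. Everything here is an assumption made for illustration: the `FlipCommitFile` name and the file layout (an 8-byte header holding the offset of the live record, followed by length-prefixed records) are invented, and a real engine needs more care to make the header update itself atomic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of an append-then-flip commit. Invented layout:
// bytes 0..7 hold the offset of the live record; records are [int length][bytes].
final class FlipCommitFile {
    private static final int HEADER_SIZE = Long.BYTES;

    static void commit(Path file, String payload) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            if (ch.size() < HEADER_SIZE) {
                // New file: a zeroed header means "nothing committed yet".
                writeFully(ch, ByteBuffer.allocate(HEADER_SIZE), 0);
            }
            byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
            long recordOffset = ch.size();

            // Phase 1: write the new record past all existing data, then sync.
            ByteBuffer record = ByteBuffer.allocate(Integer.BYTES + bytes.length);
            record.putInt(bytes.length).put(bytes).flip();
            writeFully(ch, record, recordOffset);
            ch.force(true);

            // Phase 2: repoint the header at the new record, then sync again.
            ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE);
            header.putLong(recordOffset).flip();
            writeFully(ch, header, 0);
            ch.force(true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the last committed payload, or null if nothing was ever committed.
    static String read(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            if (ch.size() < HEADER_SIZE) return null;
            ByteBuffer header = ByteBuffer.allocate(HEADER_SIZE);
            readFully(ch, header, 0);
            long offset = header.getLong(0);
            if (offset == 0) return null;
            ByteBuffer length = ByteBuffer.allocate(Integer.BYTES);
            readFully(ch, length, offset);
            ByteBuffer body = ByteBuffer.allocate(length.getInt(0));
            readFully(ch, body, offset + Integer.BYTES);
            return new String(body.array(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void writeFully(FileChannel ch, ByteBuffer buf, long at) throws IOException {
        while (buf.hasRemaining()) {
            at += ch.write(buf, at);
        }
    }

    private static void readFully(FileChannel ch, ByteBuffer buf, long at) throws IOException {
        while (buf.hasRemaining()) {
            int n = ch.read(buf, at);
            if (n < 0) throw new IOException("unexpected end of file");
            at += n;
        }
    }
}
```

If the process dies during phase 1, the header still names the previous record, so a reader sees the old, intact version and only the in-progress write is lost.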

Zero-Copy (18:06)

How do most ORMs deal with showing you data? Most of the time you’ve got data on disk. It’s persisted, sitting there, inert. Say that you have this NSManagedObject and you’re accessing a property from it. Core Data will translate that request into a set of SQL statements, create a database connection if it hasn’t been created, send that to disk, perform that query, read all of the data from the rows that match that query, and then bring all that into memory (that’s the memory allocation). However, at that point, you’ve got to deserialize the format that it was on disk to a format that you can represent in memory, which means aligning bits so that the CPU can deal with it. From there, you have to convert it into the language level type (say you’re reading a string, and even though you loaded the entire contents of the entire row just for one property, you’re going to go and convert that one string property). Then, finally, you return that object to the initial requester. There are lots of steps there that need to happen.

How does Realm deal with showing you data? Realm skips the entire copy. First off, the file is always memory-mapped. You access any offset in the file as if it was already in memory even though it’s not, it’s virtual memory. An important part and design consideration for the core file format was to make sure that the format on disk was readable in memory without having to do any deserialization. You skip that whole step. All you do is calculate the offset of the data to read in your memory-mapped memory, read that value from the offset to its length, then return that raw value from the property access. You’re skipping a bunch of steps there, making things much more efficient.
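
For a feel of what "calculate the offset and read" means, here is a small Java sketch using a memory-mapped file. It assumes a made-up file that is nothing but a fixed-width column of 64-bit integers; `MappedLongColumn` and its `write` helper are hypothetical, and Realm's real file format is more involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: a fixed-width column of 64-bit ints read straight from a mapping.
final class MappedLongColumn {
    private final MappedByteBuffer map;

    MappedLongColumn(Path file) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Offset arithmetic plus one absolute read: no query, no row object, no parsing.
    long get(int row) {
        return map.getLong(row * Long.BYTES);
    }

    // Helper for the example: writes the values in the same layout get() expects.
    static void write(Path file, long... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the on-disk layout matches what `get` reads, the operating system pages data in on demand and there is no separate deserialization step.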

True Lazy-Loading (20:33)

It’s entirely impossible, based off of how hard drives and solid state drives are built, to just read one bit. Say you just want a Boolean property off of an object: you’ve got to load the disk’s page size. You can’t read anything smaller than that, because the hardware access doesn’t give you that choice. Most databases tend to store things on a horizontal level which is why when you read a single property from SQLite, you do have to load that entire row. It’s stored contiguously in the file.

The way Realm stores things is that we try to keep properties contiguously linked at the vertical level, which you can think of as columns. This means that if you’ve got a bunch of mail items and you want to mark all of them as read, instead of having a special operation just for that, we’ll try to optimize the naive case, which is to just iterate over all of them and set their read property. Based off the way this lazy-loading works, you’ll actually be very efficient by doing that. Of course, there is still some expense because you have to create language-level accessors, and there are ways to work around that, but we really try to optimize the naive case. Most of the time, you really don’t have to over-optimize past that. It avoids disk roundtrips and reading unused properties.
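
The column-oriented layout and the "naive loop" can be sketched as follows. This is an in-memory illustration with hypothetical names (`MailTable`, `Mail`), not Core's storage code: each property lives in its own contiguous array, and an accessor is just a row index into those arrays.

```java
// Hypothetical sketch of column-oriented storage with lightweight accessors.
final class MailTable {
    private final String[] subjects;   // one contiguous column per property
    private final boolean[] readFlags; // the only column the "mark as read" loop touches

    MailTable(String... subjects) {
        this.subjects = subjects.clone();
        this.readFlags = new boolean[subjects.length];
    }

    int size() {
        return subjects.length;
    }

    Mail get(int row) {
        return new Mail(row);
    }

    // An accessor holds only a row index; each property reads or writes its own column.
    final class Mail {
        private final int row;

        private Mail(int row) {
            this.row = row;
        }

        String subject() {
            return subjects[row];
        }

        boolean isRead() {
            return readFlags[row];
        }

        void setRead(boolean value) {
            readFlags[row] = value;
        }
    }
}
```

Iterating over every item and calling `setRead(true)` walks a single contiguous array and never touches the subjects, which is why the naive loop stays cheap in a layout like this.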

Built-In Encryption (22:10)

Built-in encryption is another thing that is really hard to do with an existing database solution. You’ve got SQLCipher out there that basically hooks into the underlying engine and completely redoes a lot of the things that the engine itself has to do. With Realm, we can build this at a very low level because we have control over how we want to build the database core. The way encryption is built is very similar to how it is typically done in Linux. Since we memory map the whole file, we can protect parts of that memory. If anyone tries to read from this encrypted chunk, we can throw an access violation that we can then catch and say, “Oh, someone’s trying to access this bit of encrypted data. Decrypt only that chunk and pass it back to the user.” We can do this in a very efficient way while having very secure technology. We’re not hacking it on top of another product, where we would have to work against the way that it was initially built.

Multiprocess Support (23:30)

With iOS app extensions, it’s becoming more and more important to support multiple concurrent accesses from different processes. Imagine you have a keyboard, and that keyboard should access a dictionary, and you should be able to perform queries and writes in that dictionary at the same time as the app that came with the keyboard.

You need multiprocess support for this. This is where MVCC really makes things shine because it’s always append-only. It’s always copy-on-write. You’re always working in a fresh, isolated state for your write transaction. The way that it’s built under the hood is that we basically get the concurrency at the database level for free, and what we need on top of that is to notify the other process whenever things happen. We just use a named pipe for that, which is pretty low overhead.

Null Values (24:29)

This is something that the Core database supports but we haven’t fully shipped this at a binding level. We’ve had a PR up for I don’t know how long, and we’re just making sure that everything is perfect so we don’t corrupt any data accidentally because we didn’t test this enough. That’s another reason why things take a little longer for us to build: we really have to make sure that we’re not going to accidentally nuke your drive just because we’re trying to get a hot new feature out for you.

class Conference: Object {
    dynamic var name: String? = nil
}

This is what this would ideally look like. Swift optionals are a good example of this. Representing optionals at the file system level is also very important. It’s something that the file format and Core support now, across all queries and types that we have, and this will come really soon. You can even check out a PR if you’re interested in trying this out.

Compromises (25:24)

Writing a whole new database engine from scratch has not come without compromises. I do stand by Realm’s advantages, but ultimately there are a few things we’ve had to pay for because of our decision to write a new database engine.

Long Development Time (25:54): If we had decided to build something on top of SQLite, we probably could have provided a lot of the features that we have at the binding level with a lot less work. However, we wouldn’t be able to go much further than that. That’s where the difference lies. You can see the great features we’re able to work on thanks to a flexible base we can manipulate, and that would be very difficult for such a mature product like SQLite. But… that means that you have to wait a bit!

Pre-1.0 (26:36): The APIs are in flux. If you’re writing Swift, you’re already used to this, and you’re already rewriting your app every time there is a new Xcode version. But it is a downside. It does mean that you’ll have some backwards-incompatible changes. We’ve had a few deprecations since we launched a year ago. They were mostly just at the API naming level (never, as far as I can tell, a really massive underlying change in functionality), but it is still a consideration.

The file format could also change. In fact, it will change to support null. You’ll have to make sure you’re not running some sort of software that you want to maintain for the next 10 years without ever changing the file format. It’s still a consideration.

Fewer Features At First (27:26): We’re still playing catch-up with a lot of the features that Core Data has. We don’t have an equivalent for fine-grained notifications, for example, but that’s the kind of feature we’re working on.

Moving Forward (27:40)

We recently shipped KVO: you can now have fine-grained notifications per object. That is part of Realm Cocoa 0.95. You can see exactly what changed anytime your Realm file changes or if you’re observing a query. Null values are also pretty close. Further out is handover between threads. Those are some of the things that we’re actively working on, among many others.

Links & Resources (28:25)

Q&A (28:45)

Q: I understand this MVCC appends to the database all the time. Does all the deprecated data disappear or does it accumulate?

JP: That can be garbage collected, and based off of what is pointed to in the tree, we can know which nodes are unused and we can clean that up. Something you can do to enforce that is to copy the Realm file, which will just walk through the valid nodes and rewrite them to a fresh file. Sometimes you can gain quite a lot of space from that.

Q: So you have to do that yourself?

JP: The Core system will do it automatically, but if you want to enforce it at any point in time you can call this copy function.

Q: What about cloud sync? Any rumors on that?

JP: Keep an eye out!

Q: I was wondering, how do you balance providing new bindings through other platforms and other languages versus shipping these features and reaching the 1.0 and adapting the old bindings or the older bindings? There must be a bit of a trade off there.

JP: There’s definitely a trade-off in breadth versus depth. Really, the design of everything that I’ve just talked about is optimized for mobile. That’s why it made sense to focus on the predominant mobile platforms at first, and we’ve got a lot of exciting stuff in the works. We’re not resting on our laurels. How do you balance it? Well, you focus on the biggest platforms out there first, and then you move on to the others. There’s no more magic than that.

About the content

This talk was delivered live in July 2015 at MobileOptimized. The video was transcribed by Realm and is published here with the permission of the conference organizers.

JP Simard

JP works at Realm on the Objective-C & Swift bindings, is the creator of jazzy (the documentation tool Apple forgot to release), and enjoys hacking on Swift tooling.
