Pragmatic Core Data

by Florian Kugler

Feb 8 2016

Despite being the staple persistence framework for Cocoa platforms, Core Data often confuses people with its incredible flexibility to suit different uses. In this CocoaHeads Stockholm talk, Florian Kugler, co-author of the objc.io Core Data book, discusses the advantages and disadvantages of the many approaches to using the framework. He gives you the tools to reason about the best setup for yourself without relying on online authorities or Stack Overflow. By understanding the intricacies behind performance, concurrency, and save conflicts, you too can create the most suitable stack for your app.

Father of Core Data (0:00)

To begin, I want to bring up a small analogy. Imagine a son comes to his father. He says something like, “Dad, can I ask you for some advice? I have this great group of friends and we share a lot of common interests. Every time I’m together with them, we are doing all this cool stuff and we are having a great time. But then sometimes, I have the feeling that I have to be this fun guy to be with them, I’m not sure if I can always really be myself with them.”

“But then I also, I have this other small circle of friends, and I really like them because we can have these great, deep conversations about all these meaningful things in our lives. But then sometimes, I really notice, actually we don’t have so many interests in common and I feel this obligation to find things for us to do together, and what do you think? With whom should I spend more time, this one group of friends or the other group of friends?”

In this analogy, I don’t want to be the parent who tells his son, “well son, these people are not good for you, you should go to these people, and you should hang out with them. They will bring you onto a much better path.” I would want to be the father who talks to his son about the values that are important to me, to have meaningful, long term deep friendships. Then, my son is equipped to decide himself what he wants in a friendship and what is more important to him. So that he can make the decision and not want my advice on the result. So with this, to stick with this analogy, I will discuss Core Data.

I don’t want to be the guy to tell you what’s the right way to use Core Data. I don’t want to be the guy who tells you, you have to set your stack up like this, because I can’t tell you that. Because I don’t know your specifics. I don’t know your constraints. I don’t know your preferences, even. But what I want to do, is I want to encourage you and I want to show you how you can reason about these things for yourself. Without referring to anybody you might assign authority to in this field, be it me, be it some guy on Stack Overflow, be it other prominent figures, whomever.

I really want to encourage you to think this through for yourself, and to make the best choice for yourself, instead of looking for somebody who tells you how to do it. This is at least my intention tonight: I will try to resist the urge of telling you that you have to do it a certain way, and instead I will try to explain things to you and show you how you can find all these things out yourself. I will explain how Core Data works, so you can make the best decision in the end.

There is no canonical right way of using Core Data. Core Data is a super flexible framework and you can do a lot of different things with it. Different setups, different approaches. They have different trade-offs.

Reasoning About Core Data (5:34)

I’m in the process of finishing up this book about Core Data. I’m writing this together with Daniel, so this process of figuring stuff out and finding out how Core Data works and how it behaves and what it does under the hood, is very fresh in my mind still, and I want to try to communicate some of that to you. With Core Data, I think there are two big topics that come up again and again. Those are performance and concurrency.

Performance comes up because it's a persistence framework, and we are dealing with disk I/O, which is inherently slow, incredibly slow compared to everything you do in memory. So you run into performance issues sometimes, and since you do, you use concurrency to solve them and then, as the saying goes, you have another problem. I want to talk a bit about both of those aspects with Core Data. But as I said, I can't tell you what's the right way to get great performance, or to do concurrency. But I want to show you how to reason about it for yourself.

In order to reason about it, it's really important to understand the framework. When you understand the framework, when you really know what happens in response to whatever you do, like executing a fetch request, or saving a child context, or whatever it is, then it becomes pretty straightforward to reason about whether what you are doing is a good idea or a bad idea.

Performance (6:43)

Let’s jump in about some things related to performance. When we talk about performance, it’s super important to look at the architecture of Core Data. You have the managed object context at the top, you have the persistent store coordinator in the middle, and then you have SQLite at the bottom. I’m just going to assume here that it’s always SQLite for now, just for simplicity’s sake. You can think of these three layers of Core Data as the three performance tiers of Core Data. So you have the context tier at the top, the coordinator tier, and the SQLite tier.

Whenever you have to drop down another tier for some operation, you are going to incur a performance hit. So I will talk about each of these tiers a bit more in detail.

The context tier (7:47)

The context tier works entirely in memory. It's not thread safe, and that is actually a design feature. This top layer of Core Data, the managed object context with its managed objects inside, is designed to be not thread safe so that it can be really, really fast, and so that it can have predictable performance characteristics. It's lock free. You cannot get into the situation where you make some call on a context and you run into contention because somebody else is working with this context. Since it's not thread safe, this cannot happen.

Of course the flip side of this is that you are responsible for using it only from one queue. So that's what makes the top tier the fastest one. It's in memory, it's not thread safe. So what can you do within the context tier? For example, you can call APIs to test whether an object is a so-called fault or not, that is, whether the object has already been populated with data or not. You can test that, and this will not drop down into the lower layers. You can access properties on objects that are populated with data; this is entirely an in-memory operation.

You can traverse to-one relationships, and you can traverse to-many relationships if you have accessed those before. The second time, everything happens only in the context tier and in memory. You can also fault in objects with a specific ID. When you call objectWithID on the context, you are going to get back an object, mostly a fault, and this also happens entirely in the context tier and therefore is very, very fast. You can get objects that are registered in the context already, or check if they are registered in the context, with the corresponding API. You can make any kind of change: you can set properties on your objects, you can change relationships. All this happens entirely in the context.

When you save a child context, this is also in memory only. However, you still can run into contention issues if your parent context runs on a different queue.
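The context-tier operations listed above can be sketched in a few lines. This is only an illustration, assuming the current Swift names for the Core Data API (registeredObject(for:), object(with:), isFault); the function name describeWithoutFaulting is made up for this example, and it has to be called on the context's own queue.

```swift
import CoreData

/// Inspects an object using only context-tier calls.
/// None of these calls executes a fetch request or fires a fault.
/// (Hypothetical helper name, for illustration only.)
func describeWithoutFaulting(_ objectID: NSManagedObjectID,
                             in context: NSManagedObjectContext) -> String {
    // registeredObject(for:) returns nil if the context does not
    // currently know about an object with this ID.
    if let registered = context.registeredObject(for: objectID) {
        // isFault only checks in-memory state; it does not load data.
        return "registered, isFault=\(registered.isFault)"
    }
    // object(with:) always hands back an object, usually a fault,
    // without going down to the coordinator or the store.
    let placeholder = context.object(with: objectID)
    return "not registered, isFault=\(placeholder.isFault)"
}
```

Accessing a property on the returned fault is what would leave the context tier, which is exactly the distinction the three tiers are about.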

The coordinator tier (10:19)

Next, let's examine the coordinator tier. The coordinator tier is also entirely in memory. But the coordinator tier is thread safe, and therefore you can run into potential lock contention there. The coordinator is the building block you can connect multiple contexts to, so it has to be thread safe. If you don't run into contention, accessing the coordinator tier is very fast. It's still only in memory, so the performance hit you incur is not very big. But it gets harder to reason about. If you have some background activity working with the same coordinator, you can't know up front how fast a certain call will be, because you might run into a contended coordinator.

So when do you drop down into the coordinator tier and only into the coordinator tier? There's basically just one case, though I may have missed something: you access a property on an object which is a fault, but whose data is already in the coordinator's row cache. In the persistent store of the coordinator tier there is this row cache where Core Data caches all the raw data it loads from SQLite for all the objects you're working with. So if you have an object which is not populated with data, and then you access one of its properties, Core Data can go into the coordinator layer and retrieve this data from the row cache. This is a pretty fast operation because it's in memory only. And, maybe as a side note, it's good to know the behavior of the row cache. It's pretty opaque and not documented very well, but it's completely predictable.

This is not a cache which will just start to discard stuff when there are more than 1,000 objects in it or something like that. This thing behaves completely deterministically. When you have loaded data from SQLite, the data goes into the row cache, and it will stay there as long as a managed object which references it is alive. So the stuff is reference counted. If the last managed object which refers to a cache entry goes away, the cache entry will go away.

The SQLite tier (12:50)

Then we come to the lowest level, the SQLite tier. There we have to do disk I/O, so it's inherently slow, and it's inherently difficult to reason about. You might have a small database, and after the first time you access it, maybe most of it is in SQLite's page cache and every subsequent access will be very fast, but you don't really know that. You might have a big database, and you will have to go back to disk a lot. If you have to go back to disk, you don't know what else is going on in the system, so you don't really know up front how fast your disk access will be. So this becomes more unpredictable, but generally you have to assume that every time you go down there, it will be slower. Just because disk I/O is insanely slow compared to what you can do in memory.

It’s thread safe, because you can use SQLite concurrently from multiple threads, so you can also have potential contention there, although SQLite is pretty good at handling that but we will come back to that later.

So, when do you have to drop down to the SQLite tier? The main thing is when you execute fetch requests. And fetch requests by API contract go down to the SQLite store or to whatever store you have. This always executes this complete round trip down to the store, fetches the data and comes back. There are a bunch of configuration properties on the fetch requests, which determine what exactly is going to happen there. I’m not going into all of them, you can read this stuff up in the API documentation, or from other places.

Executing a fetch request (14:34)

Let's take a quick look at this process of executing a fetch request. You call execute fetch request on the context; this will forward your request to the coordinator, which will forward the request to the persistent store, and the persistent store will translate your request into an SQL query and send this query to SQLite. SQLite will execute the query and give back the raw data to your persistent store. Your persistent store takes this raw data and stores it in the row cache. Then the persistent store uses the reference it has to the context, instantiates all the managed objects you have asked for, and gives these object faults back to the context. The context then checks if it has any pending, unsaved changes which also fit your query, accounts for them, and then gives you back the result.

This is for standard fetch requests, with no special configuration options set. Now you can start modifying this behavior with all these different flags. For example, there is includesPropertyValues. You can set this flag to false on the fetch request, which will cause Core Data to execute a query in SQLite which doesn't include the actual data. It will only load the object ID and some other things, but it will not load the actual data. This can be useful sometimes; for example, if you use fetchBatchSize on a fetch request, this internally uses something like that.

I'd urge you to try out the options documented in the NSFetchRequest class. Test whether they are doing what you think they do or should do. It's super helpful to just set up small test projects. Just create one or two entities, put a few objects in, just a couple of lines of code. Enable the debug flags, use a lot of print statements, and figure this out. I have dozens of these projects lying around from the process of figuring this stuff out.
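For such a test project, the two options mentioned above could be set up roughly like this. It's a sketch using the current Swift spelling of NSFetchRequest; the function names, the entity name, and the sort key are placeholders you would replace with your own.

```swift
import CoreData

/// A request that asks the store for object IDs only; the returned objects
/// are faults and their property data is not loaded up front.
/// (Hypothetical helper name.)
func makeLightweightRequest(entityName: String) -> NSFetchRequest<NSManagedObject> {
    let request = NSFetchRequest<NSManagedObject>(entityName: entityName)
    request.includesPropertyValues = false
    return request
}

/// A request that loads the full data in batches of `batchSize` objects
/// as the result array is accessed. (Hypothetical helper name.)
func makeBatchedRequest(entityName: String,
                        sortKey: String,
                        batchSize: Int) -> NSFetchRequest<NSManagedObject> {
    let request = NSFetchRequest<NSManagedObject>(entityName: entityName)
    request.sortDescriptors = [NSSortDescriptor(key: sortKey, ascending: true)]
    request.fetchBatchSize = batchSize
    return request
}
```

Executing each variant with the SQL debug output enabled, and comparing the logged queries, is a quick way to check whether the flags do what you expect.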

Working with the SQLite tier (18:05)

When do you have to drop down to the SQLite tier other than executing fetch requests? Well, saving, of course. When you save a context that is connected to the coordinator, you have to go to the store. This is a synchronous API, so you are going down to SQLite to save the data to disk. The other case is when you access a property of an object which is a fault, so it's not populated with its data yet, and the data is not in the row cache. Then Core Data has to go all the way down to the store and come back up to the context with the data, just for this one property access. If this happens a lot, you can run into performance issues.

Profiling Core Data (18:53)

The nice thing is you don't have to take my word for any of that. You can use a bunch of tools to try this out and test your specific use case, like the Core Data instruments. There are four of them. Watch out: the default Core Data template in Instruments only includes three of them. There is one more very useful instrument in the Instruments library, I think it's the faults instrument. So you get four instruments: one fetch request instrument, one save request instrument, one fault instrument, and one cache misses instrument.

You get a lot of information there. You can really see when you are accessing a fault and Core Data has to drop down to a lower level to get the data. Then you can check the cache misses track and see when accessing a fault also incurred a cache miss, so that you had to go down to SQLite. You can see these kinds of things in a very nice way, to find out whether you might have performance issues on some level. You can also enable the SQL debug launch argument, which prints the SQL queries to the console:

-com.apple.CoreData.SQLDebug 1

The logs are super handy to figure out what Core Data is actually doing. For example, they tell you when a fault is fulfilled from the database for some object. That tells you that you have accessed a property on an object which wasn't materialized, and the data wasn't in the row cache, so Core Data had to go down and execute the logged query to fetch the data. If you see lots of those, you might or might not be in trouble, depending on how often this happens.

Performance TL;DR (20:53)

Now to the ultimate question: how do I get the best performance with Core Data? That is a question I can't answer in the abstract. I think your best bet is to look at the architecture, to understand it, to really see how these different layers work. What are the performance characteristics of these layers? Then reason about your app's behavior with this background knowledge. When you do that, you can make the right choice for your specific project. If I try to make these recommendations in the abstract, and tell you that you should use as few fetch requests as possible, well, first of all, what does that even mean? Theoretically I can maybe get away with one, for one root object, and then I only traverse relationships. But I might get terrible performance with that. So this general advice is super difficult to give. You really have to see what your specific problem is, what your use case is, what your constraints are. Reason with this specific information, based on your understanding of what happens down inside Core Data.

Concurrency (22:23)

The other very hot topic is concurrency. With concurrency, there is this big question floating around on the interwebs: to nest or not to nest. It's about the different setups you can make with Core Data. Should you use nested contexts, or should you go down the more traditional route of using parallel contexts?

In the book we are writing, we recommend pretty strongly that you think through very carefully whether the nested setup is the right one for you. It doesn't mean that you shouldn't ever use it. We list use cases for nested contexts. But it's a really good example of the problem with giving these general recommendations. We write in the abstract, for a broad audience, and we don't know who this audience will be, what kinds of problems you're facing, or what kinds of use cases you have to account for. It's really hard to say anything specific, to say you should do it like that.

If we say we recommend you do this or that, don't just believe it and do what we recommend. Check whether it holds up when you think it through in the context of your specific circumstances. So, let's just stack up the evidence for the different kinds of concurrent Core Data setups.

Parallel contexts setup (24:42)

Let’s start out with this traditional set up, parallel contexts. It’s a simple stack: you have one coordinator and then you have these two contexts on top, main and private. The main queue context is where you do all your UI work. The private queue context is where you do whatever background work you have to do, usually some importing from web service data or something like that.

So the advantages of such a setup are that when you execute requests in one of these contexts, they don't block the other context's queue. If you execute two requests basically at the same time in both contexts, you will run into contention at the common coordinator. But if you just execute a lot of requests on, for example, your background context, your main queue will be free; it will not be blocked. And to reconcile the things you have changed in both contexts, you merge the save notifications from one context into the other.

This merging process can ignore all the changes that are not of interest to the other context. If you make updates to objects that are not even registered in the other context, Core Data can just skip them and doesn’t need to consider those changes. This is a performance win when reconciling changes.

Disadvantages of the setup are, for example, that saves on the UI context do their I/O on the main queue. Say you have to save a lot of data from the UI: you have pasted a huge document into something and then you hit save. Save is a synchronous operation, so you are doing I/O on the main thread.

If this is a problem, I don’t know. Only you can know for your use. For a lot of things, small things, iOS apps, you create a couple of records, this is not an issue. It might be an issue for a complicated app. You have to check that.
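A parallel contexts stack along these lines could be wired up as follows. This is a minimal sketch and not the only way to do it: the class name ParallelStack is invented, it assumes an SQLite store at a URL you provide, and it only merges in one direction, from the background context into the main context.

```swift
import CoreData

/// One coordinator with two sibling contexts on top (illustrative type).
final class ParallelStack {
    let coordinator: NSPersistentStoreCoordinator
    let mainContext: NSManagedObjectContext
    let backgroundContext: NSManagedObjectContext
    private var observer: NSObjectProtocol?

    init(model: NSManagedObjectModel, storeURL: URL) throws {
        let psc = NSPersistentStoreCoordinator(managedObjectModel: model)
        _ = try psc.addPersistentStore(ofType: NSSQLiteStoreType,
                                       configurationName: nil,
                                       at: storeURL,
                                       options: nil)
        let main = NSManagedObjectContext(concurrencyType: .mainQueueConcurrencyType)
        main.persistentStoreCoordinator = psc
        let background = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
        background.persistentStoreCoordinator = psc

        coordinator = psc
        mainContext = main
        backgroundContext = background

        // Reconcile background saves into the UI context. Changes to objects
        // that are not registered in `main` can be skipped by Core Data.
        observer = NotificationCenter.default.addObserver(
            forName: .NSManagedObjectContextDidSave,
            object: background,
            queue: nil
        ) { note in
            main.perform { main.mergeChanges(fromContextDidSave: note) }
        }
    }

    deinit {
        if let observer = observer {
            NotificationCenter.default.removeObserver(observer)
        }
    }
}
```

Note that a save on mainContext in this sketch still goes to disk on the main queue, which is the trade-off described above.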

Multiple coordinators setup (26:54)

Another contender in this area is the setup with multiple coordinators. It's something I think Apple has also talked about a couple of times over the last years. You have the possibility to set up two completely separate stacks that share an SQLite database at the bottom. This gives you two persistent store coordinators with one context on top of each.

The advantages of such a setup are that now the context and the coordinator tiers are separated. They are completely independent of each other, so you can work in both contexts at the same time without running into contention at the coordinator level. Now you have pushed the point of contention down to SQLite. And SQLite is better at handling concurrency because it can handle multiple reads and a single write at the same time. The coordinator, on the other hand, is simple: if one context uses the coordinator, it takes a lock, and that's it. It doesn't matter whether it reads or writes.

This can be a performance win for heavy background work. The disadvantages, however, are that those stacks don't share the row cache any more. If you do a lot of background work in the background stack, and then you save, take the save notification, and merge it into your main stack, none of the data you have worked with or imported is in the main stack's row cache. So when you access one of these objects, it has to go to disk every time to get its data, because the row cache is not shared. This is something to watch out for. You can work around that; you can take measures to mitigate this issue. For example, maybe you have to refetch some things when you merge the save notifications, something like that, but it's a disadvantage to be aware of. Another minor disadvantage is that you can't share object IDs directly between these two stacks, because object IDs are bound to a persistent store, so you have to use their URI representation and jump through some hoops to get back to the object.
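The two separate stacks and the URI hoop-jumping could look roughly like this. It's a sketch under a few assumptions: both stacks open the same SQLite file, the object ID is a permanent one (the object has been saved), and the function names are made up for this example.

```swift
import CoreData

/// Builds a complete stack (coordinator plus one context) on the given store.
/// Calling this twice with the same URL gives two independent stacks.
func makeIndependentStack(model: NSManagedObjectModel,
                          storeURL: URL,
                          concurrencyType: NSManagedObjectContextConcurrencyType) throws -> NSManagedObjectContext {
    let coordinator = NSPersistentStoreCoordinator(managedObjectModel: model)
    _ = try coordinator.addPersistentStore(ofType: NSSQLiteStoreType,
                                           configurationName: nil,
                                           at: storeURL,
                                           options: nil)
    let context = NSManagedObjectContext(concurrencyType: concurrencyType)
    context.persistentStoreCoordinator = coordinator
    return context
}

/// Resolves an object ID from one stack in a context of the other stack
/// by going through its URI representation. Call on `otherContext`'s queue.
func object(matching objectID: NSManagedObjectID,
            in otherContext: NSManagedObjectContext) -> NSManagedObject? {
    let uri = objectID.uriRepresentation()
    guard let coordinator = otherContext.persistentStoreCoordinator,
          let localID = coordinator.managedObjectID(forURIRepresentation: uri) else {
        return nil
    }
    // Typically a fault; its data has to come from SQLite because the
    // row cache of the other stack is not shared.
    return otherContext.object(with: localID)
}
```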

Nested contexts (29:18)

Let's get to our friend, the nested context. With this setup, you have different contexts stacked on top of each other. This is the nested context API Apple introduced with iOS 5. In this case we have the persistent store coordinator, we put a private context on top, then the UI context, then another private context, or multiple of them, to do our background work. Our background work would be done in these worker contexts at the top.

The advantages of the setup are that if you save stuff from the main context, as in the scenario I mentioned before where you paste tons of data into your app and then save, the save doesn't block the main thread any more. The save of your UI context only pushes all this stuff down into the private context below. The private context can save it to disk later. It's one of the major advantages, and one of the major reasons nested contexts were introduced with iOS 5, along with the iCloud syncing stuff.

Another advantage, kind of an esoteric one, is something people who have worked with Core Data may have run into. When you work with multiple contexts and you delete an object, there is this time slice where you can run into a race condition: you access the object that was just deleted on another thread, and then you just flat out crash. It's of course possible to handle this issue. You can account for that, but it's a little bit of work to do this right. You don't have to worry about this in the nested setup, because this window of opportunity where you can crash the app by accessing this object just doesn't exist there.

The disadvantages here are that every request in one of the worker contexts on top will block your main queue. So every fetch request you execute, every save request you execute, has to go through your main context. Those are synchronous APIs. So your main queue is blocked for the time the background context at the top executes fetch requests. You have to check whether that’s a problem or not depending on your use case.

Another disadvantage is that saving a child context pushes all changes down into the parent context. Before, I said that when you merge save notifications, Core Data can discard stuff because it knows it is not of interest to the other context. Now, with nested contexts, Core Data has to channel all this data down to the store, so it has to push all of it in. If you insert 10,000 objects in the child context, Core Data has to instantiate 10,000 objects in the parent context, even if you don't care about them there. Related to that, every save has to go through the main context. So every save of stuff you do in the background context has to be pushed through the UI context. It's a very valid thing to consider whether you want that, whether you want to push all the stuff you are importing in the background right into the UI context. Your UI might become unusable because of that. There are cases where you want to defer that, and want to import in the background without having the UI update all the time, for example. That becomes less straightforward in this setup.

Another disadvantage is that you have no merge policies. When you work with nested contexts, you save the child context and all the changes are pushed down into the parent context with brute force. There is no consideration of conflict resolution; this is just a brute force push down into the parent context. It just overwrites. Another pretty esoteric issue is that once you work with nested contexts, you have to be aware that there are weird things going on with temporary and permanent object IDs, and you can run into weird situations where you can break stuff like the uniquing guarantees that Core Data usually has. I'm not going to expand on that; it's a weird, complicated topic with all kinds of edge cases. Just be aware that there are some dangers.
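The nested arrangement described in this section could be sketched like this. The type name NestedStack and its method names are invented for illustration, error handling is reduced to printing, and the coordinator is assumed to have its store added already.

```swift
import CoreData

/// coordinator <- saving context (private) <- main context <- worker contexts
final class NestedStack {
    let savingContext: NSManagedObjectContext
    let mainContext: NSManagedObjectContext

    init(coordinator: NSPersistentStoreCoordinator) {
        let saving = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
        saving.persistentStoreCoordinator = coordinator
        let main = NSManagedObjectContext(concurrencyType: .mainQueueConcurrencyType)
        main.parent = saving
        savingContext = saving
        mainContext = main
    }

    /// Worker contexts sit on top of the UI context, so every fetch and save
    /// they perform is routed through the main context.
    func makeWorkerContext() -> NSManagedObjectContext {
        let worker = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
        worker.parent = mainContext
        return worker
    }

    /// Saving the UI context only pushes changes into its parent in memory;
    /// the disk I/O happens afterwards on the saving context's private queue.
    func saveMainContext() {
        mainContext.perform {
            do {
                try self.mainContext.save()
            } catch {
                print("Saving the main context failed: \(error)")
                return
            }
            self.savingContext.perform {
                do {
                    try self.savingContext.save()
                } catch {
                    print("Saving to the store failed: \(error)")
                }
            }
        }
    }
}
```

Data saved in a worker context only reaches disk after the main context and then the saving context have been saved as well, which is where the "everything goes through the UI context" cost comes from.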

Which Stack Should I Use? (34:09)

Well by now, you probably know the answer I’m going to give you: I can’t tell you. What I can do, and what I try to do in a very short compressed form is I can tell you the facts. I can tell you what it does. I can tell you how it operates. But if these things I listed as advantages and disadvantages are even of interest to you or if they don’t matter for your project, I don’t know. If they do matter, I don’t know how to weigh them. So I cannot say this is the best set up for everything, I can’t just say.

There are all these different possibilities, and based on Core Data’s architecture, they come with certain tradeoffs. I think it’s really important to understand these tradeoffs, to reason about them for your specific project and to make a choice based on that and to not do what we all like to do sometimes and just google something, find it on Stack Overflow, my book, or any resource, and copy paste. You will not get the best thing for your specific use.

Handling Save Conflicts (35:38)

A related topic where I also think it's really important to understand what Core Data does is save conflicts. This is the weird thing which we all like to ignore, or in Swift we just write try! save. Once you have multiple contexts, of course, you can run into conflicts where data cannot be reconciled when you save it. Core Data might say, “hey, I cannot save that like that because things have changed in the meantime.” That's something you have to handle as soon as you do concurrent stuff with Core Data. That's not Core Data's fault, it's just concurrency, and concurrency is hard.

What Core Data does is two-step optimistic locking. First, when it saves something, it compares the snapshot which is attached to a managed object with the data in the persistent store's row cache. If those two things don't match up any more, Core Data knows something has changed in the meantime. Second, it checks whether the persistent store's row cache matches the data in SQLite. If it doesn't, Core Data again knows somebody else has changed something in the SQLite store in the meantime, and it cannot just save the data.

If you look into an SQLite file of Core Data, you will see a column in the tables that is called Z_OPT, which is the optimistic locking identifier. It's just an integer which gets incremented each time you make a change. Those things are compared when you save: between the context and the row cache, and between the row cache and SQLite. Why does Core Data do this two times? Well, that's because this thing is so damn flexible that you can put it together in so many different ways. If you have a setup with one coordinator and two contexts, a conflict shows up in the first check, between the context and the shared row cache. If you have separate coordinators on the same store, the row caches are not shared, so a conflict would only show up in the second check, between the row cache and SQLite.

Then there's the topic of merge policies. You can use the standard system merge policies or define your own. That's a bit beyond the scope of our discussion, but it's really important to understand how Core Data handles conflicts, because as soon as you do concurrency, you have to worry about conflicts. They can happen; you can't ignore them, and you shouldn't just close your eyes and hope for the best.
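As an alternative to try! save, a save can at least report conflicts instead of crashing. The following is one possible sketch: SaveOutcome and saveReportingConflicts are names made up for this example, and it assumes the context uses the default merge policy, which reports conflicts as an error.

```swift
import CoreData

/// Result of a save attempt (illustrative type).
enum SaveOutcome {
    case saved
    case conflicted([NSMergeConflict])
    case failed(Error)
}

/// Saves the context and surfaces optimistic locking conflicts to the caller.
/// Call on the context's own queue.
func saveReportingConflicts(_ context: NSManagedObjectContext) -> SaveOutcome {
    guard context.hasChanges else { return .saved }
    do {
        try context.save()
        return .saved
    } catch {
        // With the default merge policy, a failed optimistic locking check
        // puts the list of conflicts into the error's userInfo.
        let userInfo = (error as NSError).userInfo
        if let conflicts = userInfo[NSPersistentStoreSaveConflictsErrorKey] as? [NSMergeConflict] {
            return .conflicted(conflicts)
        }
        return .failed(error)
    }
}
```

What to do with the conflicts (retry, refresh the objects, or configure one of the system merge policies such as NSMergeByPropertyObjectTrumpMergePolicy up front) depends on your project, which is the point of this section.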

Beyond Core Data (40:16)

Beyond these more technical details about Core Data, I just want to reiterate the main point I'm trying to make here. The main point is that I really think we shouldn't go out there and look for recipes. What we should do, if we are trying to make choices like that or if we try to understand why Core Data is not performing well, is look for explanations, think these explanations through, and really work through them in our minds and understand them.

Once we do that we don't need recipes any more. We don't need anybody any more to tell us that's the best setup. We don't need anybody to tell us, I don't know, you have to configure your fetch request like that to get the best performance for your table. Because if you understand what's going on below, you can make much better choices for yourself, because you can apply this knowledge to your specific circumstances. Every one of us is in a unique position to make the best choices for ourselves, because we have this special knowledge about ourselves, about our preferences, about our prerequisites, about the specifics of our project, our constraints, our requirements.

If you combine an understanding of what's happening with the knowledge about your specific circumstances, and reason about this for yourself, I think you can really make a better choice than anybody like me, or some guy on Stack Overflow, or whoever, would make for you. That's the first reason why I stress this, why I really say that, in the end, there's no alternative to thinking it through yourself.

The second reason why I stress this is that once you do that, you build these little islands of knowledge in certain areas, and I think this is a very empowering thing to do. You build these little pieces of knowledge, even if it's just in niche areas like the weird persistency framework on one platform, where you can say with confidence: now I have a good reason to have a strong opinion on my choice, here in my project. Not because you just did what Florian Kugler said in his book, but because you thought it through with an understanding of the framework and the requirements of your project.

That is something that's very important to me. For me this is just this beautiful ability we have to reason stuff out for ourselves, without having to refer to authorities. Not having to assign authority to somebody at all, but being able to think this through for ourselves. I think this is a beautiful ability we have, and in my experience, every time I use it, it is a very empowering feeling. So I want to end on this very non-technical note and encourage you to do this more often, even if you all have the urge to copy stuff from Stack Overflow from time to time.

About the content

This content has been published here with the express permission of the author.

Florian Kugler

Florian Kugler is a co-founder of objc.io, & a software developer and technical author in the iOS and Mac ecosystems. After the success of his book “Functional Programming in Swift,” he is working on a new book, this time about Core Data.
