Tales from the Dark Side: Developing SDKs at Scale

by Kenneth Geisshirt

Aug 8 2017

I’ve been here at Realm for a while. I’m not an Android developer, but the Realm SDK is used by many Android developers and in many Android apps. Because of that, we see some of the dark sides of Android through the bugs we encounter - bugs that are not created by you and your team but by other people and other projects, and that hit you because your code runs everywhere. First we discuss the statistics of Android devices, and then we dive into four different cases of bugs in detail.

A Bit About Android Statistics

Android is an interesting operating system. It’s an open source project - everybody can have access to it. The downside is that people don’t upgrade often enough. 30% of active devices run a pre-5.0 version. People are using devices with Android versions that are no longer supported by Google. Android 4.4.4 only gets some security bug fixes - they don’t fix all bugs, only security bugs. And even though the security fixes are published, the probability that phone vendors will ship them to the devices is low. You have to remember, as an app developer, that your users are using old devices. Because of that, you receive reports of old bugs that have been fixed upstream but are still present on these old devices. We often buy old devices and run tests to check if we are hit by any of these bugs.

The bugs I’m going to discuss today were never discovered by any of our unit tests. They were in things that we had released to the public and that people had been using before they got hit by them.

Cannot Load `.so` File

Realm is written in C++. The C++ code is compiled and linked into a shared object, an .so file, that is put onto the device. You have to load that using the dynamic linker. Android has a call named loadLibrary, and Realm initializes itself by loading that .so file with it. The .so files for all the platforms you want to support are stored in the APK. When a user installs an app, the installer will install the .so file for their particular platform or architecture. When Realm starts up, it will load this .so file. It sounds very simple… but it’s not simple to load .so files.

In October 2015, we had a bug report: people were having issues with the loading. This is from the stack trace from a user:

Caused by: java.lang.UnsatisfiedLinkError: Couldn't load realm-jni: findLibrary returned null
at java.lang.Runtime.loadLibrary(Runtime.java:365)
at java.lang.System.loadLibrary(System.java:535)
at io.realm.internal.RealmCore.loadLibrary(RealmCore.java:114)

He has an unsatisfied link error - the dynamic linker couldn’t load this .so file and his app was crashing.

When we looked into it, we found that Android’s package manager does not always install the .so files during installation of the app. No one has figured out why, but in certain cases it doesn’t happen. It’s annoying when you rely on having an .so file. Luckily, another company, KeepSafe, had been hit by the same bug. They investigated carefully, figured out that the file was not installed, and put out a solution.

Since the APK file is downloaded to the phone when the app is installed, and since it’s a .zip file, you can unzip it, find the .so file there and copy it manually. That’s what they are doing in a project called ReLinker. If the loadLibrary call fails and cannot find the .so file, they unzip the APK file, find the .so file, copy it into place by hand and then load it.

Google has never responded to them and it’s not their problem. Facebook also has an advanced mechanism for loading .so files. They go a step further: if they cannot find the .so file on the device, they will download it from their server and put it into the app. After we started using ReLinker, we have never seen any of these issues. ReLinker is the best workaround for that bug that has never been solved by Google.

You may think that loading .so files is now simple. But you can also have other issues with .so files.

More `.so` Issues

In October 2015, the ARM 64 devices came out, and we decided to add that as an architecture. We started shipping; there were two or three models supporting it, and it worked.

But if your app uses two libraries that each have .so files, and one of them is 32 bit only while the other one also ships a 64 bit version, the installer will install the 64 bit version on your 64 bit device. Then, when the app has loaded the 32 bit library and it’s time to load the 64 bit one, the linker will say, “these are two different architectures”. Then the app will crash.

If you have multiple .so files, and one of them is 32 bit only, you have to exclude the 64 bit architecture. We saw that with the VLC client, because they were only distributing a 32 bit version. Excluding 64 bit is the way to go if you see this problem. We have identified a number of closed-source libraries that are only 32 bit. If you are using Parallel Space (popular in China), or you use RenderScript or Unity3D (I heard that some game developers are using that), you need to be careful because of this. You might have to exclude the 64 bit version of Realm. If you do this in your packaging options, you will have no problems.

Loading an .so file sounds easy, it’s a dynamic linker, it can’t go wrong. But it does.

We try to keep a list of libraries that we know are troublemakers. I added Parallel Space because we have users that were using it. Parallel Space lets multiple users log in to your phone and keeps them isolated from each other. It’s fairly popular in China, and it’s made by a Chinese company. They confirmed that this is a problem, and agreed that this is the way to solve it. It’s written in our documentation. If you’re using Crashlytics, they’re also loading .so files. But they don’t have this issue.

Encryption

Realm supports encryption of the database, which is important for many customers, especially banks and healthcare applications. We are using the OpenSSL libraries, which have encryption functions. It’s very simple - we encrypt data transparently so the user doesn’t see that it’s encrypted.

Back in April 2015, we had an issue where people were seeing a segmentation fault when they were using encryption. It was only affecting certain versions of Android, and it was always when they were using it together with the cookie manager. If you had about 15 classes with each class having about 15 fields, then it would crash. It was apparently a size problem, and this was the simplest case we could come up with.

So we had this problem. It’s a fairly simple application - it opens up Realm and then it crashes. To begin with, we couldn’t reproduce it. But then we realized what was wrong. The first encryption implementation in Realm was built on Unix signals. Signals are notifications that processes can send to each other, saying I want you to know something. The kernel will also send them.

When we’re writing to and reading from Realm, we have to do that in an encrypted fashion. A Realm file is read in pages, so when we accessed a page we would trigger a page fault, and on that page fault the handler would decrypt or encrypt the page. We were relying on signal handlers, and most operating systems support signal handling because it is required for POSIX compliance. Most operating systems today are POSIX-compliant, so we were relying on these signals.

Signal handling discipline

But using signals requires discipline. You set up a signal handler and that takes care of the signal. But, when you receive a signal, other processes or other parts of your application might have also registered a signal handler for that signal.

In order for your application to work if this is the case, you have to resend the signal, or pass it on to the next signal handler. You have a queue or a list of signal handlers, and every single one of them has to look at the signal, take care of it, and pass it on. If you forget, the ones after you in the queue will not get that signal, and they will not process it.
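The "pass it on" discipline can be sketched in C with sigaction(). This is a minimal illustration, not Realm's or WebView's code: the handler names and the two flags are hypothetical, and a real handler must restrict itself to async-signal-safe work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* The action that was registered before ours; kept so we can pass the signal on. */
static struct sigaction previous_action;

/* Flags used only to observe which handlers ran (hypothetical). */
static volatile sig_atomic_t ours_ran = 0;
static volatile sig_atomic_t earlier_ran = 0;

/* Stand-in for a handler that some other library registered first. */
static void earlier_handler(int signo)
{
    (void)signo;
    earlier_ran = 1;
}

/* Our handler: do our own work, then forward to the previously registered one. */
static void chaining_handler(int signo, siginfo_t* info, void* context)
{
    ours_ran = 1;
    if (previous_action.sa_flags & SA_SIGINFO) {
        if (previous_action.sa_sigaction != NULL)
            previous_action.sa_sigaction(signo, info, context);
    }
    else if (previous_action.sa_handler != SIG_DFL &&
             previous_action.sa_handler != SIG_IGN) {
        previous_action.sa_handler(signo);
    }
}

/* Register the stand-in for the other library's handler. Returns 0 on success. */
static int install_earlier_handler(int signo)
{
    struct sigaction action;
    memset(&action, 0, sizeof action);
    action.sa_handler = earlier_handler;
    sigemptyset(&action.sa_mask);
    return sigaction(signo, &action, NULL);
}

/* Register our handler and remember whatever was there before. Returns 0 on success. */
static int install_chaining_handler(int signo)
{
    struct sigaction action;
    memset(&action, 0, sizeof action);
    action.sa_sigaction = chaining_handler;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    return sigaction(signo, &action, &previous_action);
}
```

If chaining_handler skipped the forwarding step, earlier_handler would never run, which is the kind of omission described above.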

What exposed this bug for us was that Google introduced a version of WebView that forgot to pass on signals. At some point we realized there was a problem - when we moved back one version of WebView, it was working; when we went up to the newest version, it was crashing.

It was probably that version that was the issue. We started reading the source code, and we wrote a small sample and sent it to Google. Google was very nice and fixed it within a month. But still, since we didn’t get the signal, the encryption and decryption didn’t work for us, and the memory ended up being corrupt. That’s why it was crashing: the C++ pointers were pointing at the wrong places.

If Google could forget to do the signal handling correctly in WebView, which they update every six months, the probability that other people could forget that too was high.

The fix

Our workaround was to rewrite our encryption layers. We don’t use signals anymore because we don’t trust that other people will set up the signal handling correctly. Crashlytics does it correctly, and they state in the documentation: if you’re doing C++ on Android, don’t use signals because we are using signals and it might not work.

If you have three handlers and signal handler number 2 is not resending its signals, then handler 3 will not be activated and might malfunction. We ended up rewriting our encryption layer because we didn’t want to see this issue again. Now we have to keep a mapping of all the pages to track which of them have to be decrypted, or encrypted before they are written to disk. We have a complete file mapping layer for pages within our system. All the platforms we are supporting are using the same file mapping system for the encryption.

Even WebView can introduce crashes for you, because people were upgrading to the newest version. If they used old devices, that would have been better in this case.

Cannot Find App’s Directory

When you have an Android app, it’s isolated, running in the sandbox, and you can ask for a directory where you can write files. There is the getFilesDir method, which returns the application’s area for writing files. In April, we had crashes reported where users were unable to open a Realm path. getFilesDir sometimes returns null for an app.

According to documentation, it cannot return null.

But sometimes this happens anyway. We figured out that there’s a bug - it used to be a race condition in how directories and caches are created when a process starts up.

When an app or process starts up in Android, there’s a race condition, which led to the right directories (the ones the app can read from and write to) not being created. It was fixed in Android 4.4 in 2010, but we are still hit by it seven years after the fix was issued.

Since it’s a race condition, it’s a matter of waiting until the process has settled. We try to create the directory for the app a number of times for up to 200 milliseconds. If we cannot create this directory within 200 milliseconds (which is 12 frames at 60 frames per second, so the app has not been responding for quite a while), we give up.

We had to issue a workaround for a bug that was fixed seven years ago because we have users (30% of our active devices) that have this problem. Sometimes they might not be able to start the app, not because of you, but because there’s a little old bug in Android, and users are not upgrading. We ended up fixing it, and it’s a very simple fix. There’s a loop going through and waiting and trying to create that directory and nothing else.
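The shape of such a bounded retry loop can be sketched in C with POSIX mkdir(). This is an illustration of the idea only, not Realm's workaround (which has to deal with getFilesDir on the Java side); the function names, the 0700 mode and the polling interval are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

/* Returns true if `path` exists and is a directory. */
static bool is_directory(const char* path)
{
    struct stat info;
    return stat(path, &info) == 0 && S_ISDIR(info.st_mode);
}

/* Keep trying to create `path`, pausing `step_ms` between attempts, until
 * roughly `budget_ms` has been spent. Returns true as soon as the directory
 * exists, false if the budget ran out. */
static bool ensure_directory(const char* path, long budget_ms, long step_ms)
{
    if (step_ms <= 0)
        step_ms = 1; /* avoid a loop that never advances */

    struct timespec pause;
    pause.tv_sec = step_ms / 1000;
    pause.tv_nsec = (step_ms % 1000) * 1000000L;

    for (long waited = 0; waited <= budget_ms; waited += step_ms) {
        /* Someone else may have created it in the meantime, so check first. */
        if (is_directory(path))
            return true;
        if (mkdir(path, 0700) == 0 && is_directory(path))
            return true;
        nanosleep(&pause, NULL);
    }
    return false;
}
```

A caller would use something like ensure_directory(path, 200, 10) and treat a false result as "give up", which matches the 200 millisecond budget mentioned above.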

Random Crashes

A native crash

In October last year we got a bug report. There was a segmentation fault in a method called ArrayString::set(), which is deep down inside the core of our database. This is one of the most used methods ever; it is called whenever you want to store a string in our database.

One could say that if there’s a bug in that method, we should have seen it. We are installed more than a billion times, on 32 bit and 64 bit. We have many bindings and products that are using it. This is one of the first methods that was written within our database. We should have known if there was a bug; otherwise, everything would break every single time you tried to open an app. The address is 176.

Reproducing the crash

We could not ever reproduce it on an emulator. I have a OnePlus One, and other devices at my office in Copenhagen; I couldn’t reproduce it. None of our unit tests were able to reproduce it. The user that was reporting it was on an old Samsung tablet that went out of sale a couple of years ago, but we didn’t have that one. I managed to find a used model - it took me a month to find it!

I ran all our unit tests, no failures. We also distribute a few examples, very small apps that do almost nothing. The intro example creates a Realm, writes a few objects to it, and reads them again. This is the hello world example for Realm. I could reproduce the crash on that one. But the debugging capabilities, especially when you’re working on C++ code, are fairly limited on Android. I only had stack traces.

Temporary fix

With the stack trace, we saw something was wrong in a call to a method called create_metadata_tables, which is internal. We create some tables for storing primary keys and schema version, some metadata about the database. They are not visible to users, but they are used when you initialize the Realm for the first time. I could see the crash was there. The code is creating two tables, and some columns and search indexes.

I had no clue what was going on at first. I started fooling around, modifying the order of the lines, trying to disable search indexes. Suddenly, when I swapped the order in which I created the two tables, the bug disappeared. That was a red flag because the order should have nothing to do with it. I had also written unit tests that created exactly the same strings and stored them in the database, and I couldn’t reproduce it that way. But the fact that swapping the order made the bug disappear told me there was something going on.

Insights from temporary fix

Internally, Realm tries to store as little as possible. Say you first create a table called “pk”, for example. The names of the tables are also strings that are stored in the database, so they are also using this method.

When you create a string of “pk”, two characters, you only store those two characters. When you insert the next one, which is eight characters, in order to match them you expand the other ones. You expand “pk” up to eight characters, with spaces or null characters. Realm is a column store, and otherwise it won’t work with these columns.

The temporary fix I was doing where I swapped the order, meant I was inserting first one with eight characters and then inserting one with two, so when I’m going to insert one with two, the two will be expanded up to eight characters first, and then inserted.

There are two different code paths in this set method. One is expanding the existing values, and the other one is expanding the new value before inserting it. They are two branches of an if statement. I could see one branch was crashing and the other branch was not. That gave me a clue as to where the bug was.
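To make the two branches concrete, here is a simplified sketch in C of a fixed-width string column. It is not Realm's ArrayString: the StringColumn type, the function names and the zero padding are assumptions made for illustration. Widening moves the slots from the back to the front with memmove(), because old and new positions overlap.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Strings stored back to back, each padded with zero bytes to `width` bytes. */
typedef struct {
    char*  data;
    size_t width; /* bytes per slot; 0 while the column is empty */
    size_t count; /* number of slots in use */
} StringColumn;

/* Widen every existing slot to `new_width` bytes (new_width > width).
 * Slots are moved last to first so no slot is overwritten before it is moved. */
static int column_widen(StringColumn* col, size_t new_width)
{
    if (col->count > 0) {
        char* grown = (char*)realloc(col->data, col->count * new_width);
        if (grown == NULL)
            return -1;
        col->data = grown;
        for (size_t i = col->count; i-- > 0; ) {
            char* slot = col->data + i * new_width;
            memmove(slot, col->data + i * col->width, col->width);
            memset(slot + col->width, 0, new_width - col->width);
        }
    }
    col->width = new_width;
    return 0;
}

/* Append `value`. Returns 0 on success, -1 if memory could not be allocated. */
static int column_append(StringColumn* col, const char* value)
{
    size_t len = strlen(value);
    size_t needed = len > 0 ? len : 1;

    /* Branch 1: the new value is longer, so the existing values are expanded. */
    if (needed > col->width && column_widen(col, needed) != 0)
        return -1;

    char* grown = (char*)realloc(col->data, (col->count + 1) * col->width);
    if (grown == NULL)
        return -1;
    col->data = grown;

    /* Branch 2: the new value is shorter, so it is padded up to the width. */
    char* slot = col->data + col->count * col->width;
    memset(slot, 0, col->width);
    memcpy(slot, value, len);
    col->count++;
    return 0;
}
```

Appending "pk" and then an eight-character name goes through the widening branch; appending them in the opposite order only pads the new value, which mirrors why swapping the table creation order changed the behaviour.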

Digging further

Since that helped in some cases, we released the fix and went on Christmas vacation. But I still couldn’t understand what it was. I disassembled our compiled code to see what the compiler was producing. And then I realized that 176 was the address that was failing. It was failing in a call to memmove(). ArrayString::set() does not contain a memmove() call; it calls fill and copy_backward, but these C++ standard methods call memmove() underneath, and the compiler inlines that call. This is probably the copy_backward we are seeing. I started reading the assembler output to see where it was crashing. Now I had an idea: memmove() is apparently the sinner.

memmove()

memmove() is a very old method in the C standard library. It goes back more than 25 years. It was introduced in BSD Unix in 1990. It moves or copies data from one place to another in memory. memmove() can handle the case where the source and destination overlap. There’s also a sibling called memcpy(), which does not allow overlaps.

memcpy() and memmove() are siblings. Very often they share the same reference implementation. But most operating systems write a very specialized version of memmove() because it’s used a lot, and we want it to be extremely fast. People are hand-coding it in assembler, with all sorts of code paths for aligned and unaligned memory, and it depends on whether you’re on an ARM or Intel CPU, and so on.

If you start looking at FreeBSD, for example, they have ARM assembler code and they have Intel assembler code. If you go into Android, they also have their own assembler code. Everybody has their own implementation, and they are all very highly optimized. Android on ARM uses NEON instructions, which are extremely fast for moving stuff in memory, at least in some versions, for some vendors. Android vendors also write their own version because they know exactly which one is fastest on their hardware.
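Why overlap matters can be shown with a deliberately naive byte-wise version in C. This is only an illustration of the copy-direction rule; it is not the vendor code, and the name simple_memmove is made up here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise move that tolerates overlapping regions (illustrative, not optimized). */
static void* simple_memmove(void* dest, const void* src, size_t n)
{
    unsigned char* d = (unsigned char*)dest;
    const unsigned char* s = (const unsigned char*)src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination is below the source: copy front to back. */
        for (size_t i = 0; i < n; ++i)
            d[i] = s[i];
    }
    else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination is above the source: copy back to front, so bytes that
         * still have to be read are not overwritten first. */
        for (size_t i = n; i-- > 0; )
            d[i] = s[i];
    }
    return dest;
}
```

Shifting the first five bytes of "Foobar" one position to the right with this function gives "FFooba". A front-to-back copy in that situation would smear the first byte over the whole region, which is why memcpy() makes no promise for overlapping buffers.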

Now I have the clue that memmove() and memcpy() might be the troublemaker.

I found a blog post by ChengYi He that analyzed memcpy(), and I could see there was a problem with memcpy() on Android. He showed the assembler output and pointed at the NEON instruction involved. He could pinpoint that the root cause is a race condition in the Linux kernel that was fixed in 2013.

We could see what it was: a bug in the Linux kernel used on Android. It’s hard for me to say, “We’re not able to fix this one because it’s in the kernel” - people would not be happy about that.

I could also see that Unity3D had been hit by it. They had no clue, and were just saying “wait until the next version” - they did not do anything about it. Qt was also hit by it or a similar one, and they were able to write a small unit test for it, which they contributed to Android. The Android Open Source Project did not have a unit test for memmove() until version five of Android, even though it’s used by everybody in any library, everywhere.

The solution at Qt was modifying some of the compiler flags for GCC and then they could get it rolling. I also tried those compiler flags, and it didn’t fix the bug. But I had learned that memmove() was the problem here, and since we still have to support old devices, we have to find a workaround.

Ready For Workaround

The workaround is not simple. First of all, we have this simple test case from Qt. We decided to roll out our own memmove(). We know that some devices have a version that has a bug. Instead of using the ones that are supplied by the vendor of the phone, let’s use our own version.

We ended up using the Don’t Repeat Yourself project. They have all the standard methods in the C library, and they have a very liberal license. We were also considering using the FreeBSD version of memmove(), but they have a little bit more strict license.

At link time, when we’re linking our .so file, GCC has its own memmove(). At link time, we take that out and use our own instead. There’s a flag for GCC (passed on to the linker as -Wl,--wrap=memmove) that wraps memmove(); you can wrap any of the intrinsic functions this way. They have a lot of intrinsic functions, and you can remove those and use your own. You take a function pointer in C, point it at another function, and call that. I left out the memmove() implementation here, but we check the vendor’s memmove(), and if the check fails we switch to our own version:

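#include <stdlib.h>
#include <string.h>

// Resolved by the linker when memmove is wrapped: the original memmove().
// (Declare it extern "C" when compiling as C++.)
void* __real_memmove(void *dest, const void *src, size_t n);

typedef void* (*MemMoveFunc) (void *dest, const void *src, size_t n);
// Start out with the vendor's memmove(); check_memmove() may replace it.
static MemMoveFunc s_wrap_memmove_ptr = &__real_memmove;

static void* hacked_memmove(void* s1, const void* s2, size_t n)
{
    // DRY implementation
}

// Shift "Foobar" one byte to the right with the vendor's memmove().
// A correct implementation returns array + 1 and leaves "FFooba" behind.
static void check_memmove()
{
    char* array = strdup("Foobar");
    size_t len = strlen(array);
    void* ptr = __real_memmove(array + 1, array, len - 1);
    if (ptr != array + 1 || strncmp(array, "FFooba", len) != 0) {
        // The vendor's version is broken: use our own from now on.
        s_wrap_memmove_ptr = &hacked_memmove;
    }
    free(array);
}

// Calls to memmove() in the wrapped library are redirected here.
void* __wrap_memmove(void *dest, const void *src, size_t n)
{
    return (*s_wrap_memmove_ptr) (dest, src, n);
}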

We are checking if you are affected. If you are not affected, you’ll keep using the vendor’s version of memmove() because it’s faster. But if you are affected by this bug, we switch to our own version. Remember this was a bug that was fixed in the Linux kernel more than three years ago, and we are still hit by it.

Summary

If you have enough users, your code will run on every Android version that has ever been released. People are not upgrading, they’re not buying new phones, they have these old phones. You will be hit by all these old bugs at some point. Sometimes it’s your fault, but sometimes it’s because there are old bugs that have not been fixed, or because the fixes have not reached the device yet.

Realm Java is an open source project, so you are welcome to check out the repo and see exactly how we resolved these bugs.

Acknowledgements

Finn Schiermer Anderson for debugging memmove() and reading assembler outputs together with me and trying to understand ARM assembler. He built MIPS processors in the old days, so he understands assembler quite well.
Mulong Chen figured out how we could wrap these using GCC.
Some of our users:
- Jonas Bark, a German developer, was very kind. Every time we came up with a new idea to solve this memmove() problem, we sent him a new APK of his app, and he tested it on his test devices to see if he could crash it. In the beginning he was quite good at crashing it, but I haven’t heard from him in a month - I guess that issue has been solved.

Q & A

Q: If you started all over, what would you do differently? Would C++ still be a core part of the strategy, or would you guys use a different strategy?

Kenneth: Yes, C++ would definitely still be a core part, because that has something to do with speed. You don’t write a database in anything but C or C++, I guess. But you have to be careful to search for closed bugs as well, because some of these are closed by Google, so they’re very hard to find. You have to be very persistent in searching for bug reports. That is probably the best advice, because closed bugs will not show up easily in searches on the internet.

Q: How do you determine what bugs maybe not to fix, or that you can’t fix?

Kenneth: Every bug we have seen and been able to identify a workaround or fix for, we have fixed. Bugs we cannot reproduce at the moment wait until we get a chance to find the cause and figure out what it really is. These are often solved in a matter of months. We dedicate a person to look into it and figure out what it is, and if we cannot do that in a couple of months, then typically we have to park it somewhere and say, this is hopefully not affecting too many people.

About the content

This content has been published here with the express permission of the author.

Kenneth Geisshirt

Kenneth holds a Ph.D. in chemistry (and a B.Sc. in computer science), and in the 1990s he primarily worked on simulating chemical reactions on supercomputers. After graduating, he has been working as a software developer focusing on open-source software. Currently, he is working for Realm where he is part of the Android team. In his spare time, he has been speaking at meetups, conferences, and user groups and writing articles and books on topics related to software development and open source software.
