Random Linux Oddity #1: ru_maxrss is Inherited

These days I do roughly 100% of my development on and for systems running Linux. Since my work and personal interests are pretty “low level”, I have spent quite a bit of time investigating the weird and surprising details of the Linux kernel. Every time I spend hours or days tracking one of these little oddities down, I have a strong desire to shout it from the rooftops and “share it with the world”. Unfortunately, these tidbits are too small to really fill out a full-length “blog entry”, and to date I have avoided posting them because of this. But today, after spending another 6 hours investigating yet another quirk, I’ve once again become overwhelmed with the need to share. So today, I’m starting a series on all the small annoyances and surprising wonders I’ve run into working on Linux, called “Random Linux Oddities”. Each will be numbered starting from 1, for no particular reason.

The oddity that has the honor of being first in the series is the one I spent several hours on today, the one that prompted this post: ru_maxrss.

Background #

Today I was working on ashuffle, a little music shuffling client I wrote for the music playing server MPD. One of the earliest bug reports I got for ashuffle was that it would often crash or get OOM-killed when running on memory-constrained devices (think Raspberry Pis) for users with large (10k-50k song) music libraries. This is unfortunately not entirely surprising. Due to the quirks of the MPD protocol, I have to keep a copy of the “URI” for every song the user wants to shuffle. For most users, this is their entire library. The URI for a song is just the path to that song on the user’s local filesystem (with the common “library root” prefix removed), so these can be a bit long. When you scale that to tens of thousands of songs, running out of memory is plausible.

Investigation #

One of the most important things I’ve learned about performance work is that you have to measure the thing you actually care about. Otherwise you’re playing whack-a-mole in the dark, just guessing at what is and isn’t causing you problems. The first step was to write a test harness that could say how much memory ashuffle was using. From previous experience I knew that the wait4 syscall provides a helpful rusage parameter that reports several interesting statistics. One of those looked useful: ru_maxrss, the maximum resident set size of the process, in kilobytes (roughly, the peak amount of physical memory the process ever had in use). Seems like a perfect fit!
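
To make that concrete, here’s a minimal C sketch of reading ru_maxrss through wait4 (the actual harness is written in Go, and the ./ashuffle path, like everything else here, is just illustrative):

    #define _DEFAULT_SOURCE /* for wait4() on glibc */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        pid_t child = fork();
        if (child == 0) {
            /* Child: run the program under test. */
            execlp("./ashuffle", "ashuffle", (char *)NULL);
            _exit(127);
        }

        int status = 0;
        struct rusage usage;
        /* wait4 fills `usage` with resource statistics for the child. */
        if (wait4(child, &status, 0, &usage) < 0) {
            perror("wait4");
            return 1;
        }
        /* On Linux, ru_maxrss is the child's peak RSS in kilobytes. */
        printf("peak RSS: %ld KiB\n", usage.ru_maxrss);
        return 0;
    }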

I wrote a new test that ran ashuffle against a large test library, and then logged the ru_maxrss value returned by wait4. Much to my dismay, it was ~800MiB for a library of “only” 20k or so songs! That could certainly be an issue on a device like the Raspberry Pi Zero, which has only 512 MiB of RAM in total.

So, with the target clear, I set to work figuring out how all that memory was being used. I hooked up a heap profiler (specifically the one bundled with tcmalloc). After re-running the test with the profiler enabled, I got this:

[Figure: memory usage graph depicting ~60 MiB used by the `main` function]

Only 60MiB in use when the program exited! That’s quite a bit less than the 840MiB reported by ru_maxrss. What is going on here?

Investigating the Investigation #

I’m no expert in memory profilers, so at this point I didn’t know what to think. Since I was just looking at the final memory usage, and ru_maxrss measures peak usage, I figured there was probably something I was missing, given the more-than-10x discrepancy. I re-ran the profiler, but this time had it dump samples for every 10MiB that was allocated. Looking at each of the samples, I didn’t see any weird peak-y behavior. The in-use memory increase was linear, gradually ramping up to the final 60MiB count.

So where to go from here? The ru_maxrss value was consistently high; it wasn’t just a one-off. ru_maxrss measures the maximum “Resident Set Size” (RSS), which is not quite the same thing the heap profiler was checking. The heap profiler captures memory allocated through “normal” user-level memory allocation APIs like malloc/free or C++‘s new/delete. The resident set size contains other things, like memory obtained directly from the operating system using mmap or sbrk. Maybe the “resident set size” captures some other big chunk of memory as well?

After a bit of digging, I found an alternative way to get the RSS of a process: /proc/<pid>/statm. This file contains several helpful numbers that represent the current memory usage of the process. The man page for proc also helpfully explained that the RSS is actually the sum of three different values: anonymous pages (RssAnon), file-backed pages (RssFile), and shared-memory pages (RssShmem).

To figure out whether some of these other components of the RSS were inflating my ru_maxrss value, I rigged up a bit of code to print the size of the RSS, the shared + file-mapped pages, and the anonymous pages right at the start of my program (the first line of main) and right before I called exit. The numbers lined up perfectly with the results from the heap profiler:

Startup: rss 1.61 MiB, anon 164.00 KiB, shared 1.45 MiB
Exit: rss 67.34 MiB, anon 63.74 MiB, shared 3.60 MiB
Change: rss 65.73 MiB, anon 63.58 MiB, shared 2.15 MiB

A growth of ~60MiB over the life of the program. So the heap profiler was right, but where the hell was that 800 MiB coming from!?
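
The check itself is only a few lines of code. Here’s a rough, self-contained sketch of the idea, reading /proc/self/statm directly (the print_rss helper and the 16 MiB allocation are just for illustration, not the code I actually used):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Print the current RSS breakdown using /proc/self/statm. The first three
     * fields are, in pages: total program size, resident set size, and resident
     * shared pages (file-backed + shmem). Anonymous pages are resident - shared. */
    static void print_rss(const char *label) {
        FILE *f = fopen("/proc/self/statm", "r");
        long size, resident, shared;
        if (!f || fscanf(f, "%ld %ld %ld", &size, &resident, &shared) != 3) {
            perror("statm");
            if (f) fclose(f);
            return;
        }
        fclose(f);
        double page_mib = sysconf(_SC_PAGESIZE) / (1024.0 * 1024.0);
        printf("%s: rss %.2f MiB, anon %.2f MiB, shared %.2f MiB\n", label,
               resident * page_mib, (resident - shared) * page_mib,
               shared * page_mib);
    }

    int main(void) {
        print_rss("Startup");
        /* Touch some memory so the numbers visibly change. */
        char *buf = malloc(16 * 1024 * 1024);
        memset(buf, 1, 16 * 1024 * 1024);
        print_rss("Exit");
        free(buf);
        return 0;
    }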

As a final lark, I decided to print ru_maxrss from within the process itself using the getrusage syscall (rather than getting it from wait4). I thought that maybe the semantics of wait4 were different, or that my Go-based test runner was using different units. The result was even more surprising:

Startup MaxRSS: 846.71 MiB
Startup: rss 1.61 MiB, anon 164.00 KiB, shared 1.45 MiB
...

The ru_maxrss I was getting from wait4 matched exactly what I was seeing on the very first line of main. How did I manage to allocate 846MiB before my program even started?
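
For completeness: checking ru_maxrss from inside the process is just a call to getrusage(RUSAGE_SELF). A minimal sketch (the print_maxrss helper name is mine):

    #include <stdio.h>
    #include <sys/resource.h>

    /* Print this process's own peak RSS as reported by the kernel.
     * On Linux, ru_maxrss is measured in kilobytes. */
    static void print_maxrss(const char *label) {
        struct rusage usage;
        if (getrusage(RUSAGE_SELF, &usage) != 0) {
            perror("getrusage");
            return;
        }
        printf("%s MaxRSS: %.2f MiB\n", label, usage.ru_maxrss / 1024.0);
    }

    int main(void) {
        print_maxrss("Startup");
        return 0;
    }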

Original Sin #

Googling didn’t turn up anyone else surprised to see their ru_maxrss value being sky-high on the first line of main, so I started digging into the Linux source. Grepping for ru_maxrss, I found this line in the implementation of getrusage:

        if (maxrss < p->signal->maxrss)
            maxrss = p->signal->maxrss;

p in this case is the current task. From my previous adventures in the kernel, I knew that p->signal is used to keep track of lots of “process”-related stuff (i.e. state shared by all threads in a process), not just signal information. Further grepping for signal->maxrss turned up this bit of code:

    if (old_mm) {
        mmap_read_unlock(old_mm);
        BUG_ON(active_mm != old_mm);
        setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
        mm_update_next_owner(old_mm);
        mmput(old_mm);
        return 0;
    }

This code runs during the implementation of execve, specifically in exec_mmap, which officially switches the task over to its new, clean, empty virtual memory map. It says that if the newly exec’d task already had a memory map (if (old_mm) {, which is always true for userland tasks), then the process’s maxrss value is bumped up to the old memory map’s RSS high-water mark. This means that task->signal->maxrss is effectively inherited! A new process is created with a fork() call followed by an execve() call. fork() preserves the parent’s memory map (and with it the RSS high-water mark), and then, as we see here, execve() copies that high-water mark into the new process’s maxrss just before discarding the old map. If one of our parents pegged its RSS at some high value, say 800MiB, then our process too will report that as its max RSS.
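
If you want to see the inheritance for yourself, a small experiment along these lines should reproduce it (the file name, the 256 MiB figure, and the “child” argument convention are all invented for the demo). The re-exec’d child has allocated essentially nothing, yet it reports roughly the parent’s peak RSS:

    /* inherit_demo.c: show that ru_maxrss survives fork() + execve().
     * Run with no arguments; it allocates ~256 MiB, touches it, then
     * re-execs itself as a "child", which immediately prints its maxrss. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static long maxrss_kib(void) {
        struct rusage usage;
        getrusage(RUSAGE_SELF, &usage);
        return usage.ru_maxrss; /* kilobytes on Linux */
    }

    int main(int argc, char **argv) {
        if (argc > 1) {
            /* Child: we've allocated nothing ourselves, and yet... */
            printf("child MaxRSS: %ld KiB\n", maxrss_kib());
            return 0;
        }

        /* Parent: drive our RSS up by touching a large allocation. */
        size_t size = 256 * 1024 * 1024;
        char *buf = malloc(size);
        if (!buf) return 1;
        memset(buf, 1, size);
        printf("parent MaxRSS: %ld KiB\n", maxrss_kib());
        free(buf);

        pid_t pid = fork();
        if (pid == 0) {
            /* Re-exec ourselves; the fresh process inherits our peak RSS. */
            execl(argv[0], argv[0], "child", (char *)NULL);
            _exit(127);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }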

Why is this? I have no clue. As with most other oddities, it’s probably some legacy reason lost to time. That’s just how it worked, and we cannot break backwards compatibility, so that is how it shall always work, no matter how surprising or useless it is. Hyrum’s law dictates that someone, somewhere, must be relying on this.

After this discovery, I did a more targeted search and found this Go issue where someone else found this surprising behavior (and Ian Lance Taylor confirmed ru_maxrss is inherited). I’m not the first person to have been tripped up by this. Hopefully this post will save someone else a bit of time.

P.S. #

Another good takeaway from this experience: always RTFM. While I was adding links to this post, I searched for “exec” in the getrusage manual page and found this helpful line in the NOTES section:

Resource usage metrics are preserved across an execve(2).

 