Extreme I/O slowdown with PAE kernel

Last modified: 2022-06-26 10:36

The problem

In December 2014 I migrated from a 32-bit PC with 4 GiB of RAM to a 64-bit PC with 32 GiB of RAM. I ran 64-bit Slackware for routine use but installed 32-bit Slackware alongside for the convenience of quickly compiling software to run on slower 32-bit PCs. It is possible to use up to 64 GiB of RAM in 32-bit mode thanks to the Physical Address Extension (PAE) CPU feature, which is optionally enabled by Linux kernels.

Eventually I noticed an extreme I/O slowdown happening only in 32-bit mode, and only on the new PC. It would run normally for a while, but if I ran a big compile or tried to unpack a Slackware installation, the write bandwidth at some point would slow to a trickle, with no obvious indicators of where the bottleneck was.

Bug reports

None of these bug reports received a proper fix. The last one is marked "Fix Released", but it was actually closed without a fix on 2021-04-24.

Posted workarounds include:

Run echo 3 > /proc/sys/vm/drop_caches to relieve symptoms temporarily (confirmed, very temporary)
Limit memory to 8 GiB or less with the mem= kernel parameter (confirmed, but not what I want)
Revert to an ancient kernel, circa 3.2 (did not try)
Set a bit with sysctl -w vm.highmem_is_dirtyable=1 or the kernel parameter sysctl.vm.highmem_is_dirtyable=1 (confirmed, seems good)
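The last workaround can also be applied by writing to the sysctl's procfs file directly. The following is a minimal sketch, assuming the standard procfs location /proc/sys/vm/highmem_is_dirtyable, which exists only on kernels built with CONFIG_HIGHMEM. The function names read_flag and enable_flag are hypothetical helpers invented for this illustration.

```python
from pathlib import Path

# Standard procfs location of the sysctl. The file is present only on
# kernels built with CONFIG_HIGHMEM (that is, 32-bit kernels).
SYSCTL_PATH = Path("/proc/sys/vm/highmem_is_dirtyable")


def read_flag(path=SYSCTL_PATH):
    """Return the current value as an int, or None if the file is absent."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None


def enable_flag(path=SYSCTL_PATH):
    """Write 1 to the sysctl file (requires root on a real system).

    Returns True if the value reads back as 1, or False if the file is
    absent, as on a 64-bit kernel.
    """
    path = Path(path)
    if not path.exists():
        return False
    path.write_text("1\n")
    return read_flag(path) == 1


if __name__ == "__main__":
    # Read-only by default; call enable_flag() explicitly to change it.
    print("highmem_is_dirtyable =", read_flag())
```

Like sysctl -w, a write made this way does not survive a reboot, so a permanent fix still needs the kernel parameter or an entry in the system's sysctl configuration.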

Context

The bad behavior is triggered by a shortage of lowmem that results primarily from having 32 GiB of RAM. By default, 32-bit kernels allocate only 1 GiB of address space for the kernel and all of the data that it must keep available in lowmem, leaving 3 GiB of address space for user processes. Larger amounts of RAM cause more of the 1 GiB to be used up by overhead.
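One way to watch for a lowmem shortage is to read the LowTotal and LowFree fields that /proc/meminfo reports on highmem kernels. The sketch below parses meminfo-style text. The sample numbers are made up for illustration, and parse_meminfo and lowmem_report are hypothetical helper names.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}.

    Most values are in KiB; a few fields (such as HugePages_Total) are
    plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name:" prefix
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def lowmem_report(fields):
    """Return (low_total, low_free, free_fraction), or None.

    None means LowTotal/LowFree are missing, which is the normal case on
    a kernel built without CONFIG_HIGHMEM.
    """
    if "LowTotal" not in fields or "LowFree" not in fields:
        return None
    total, free = fields["LowTotal"], fields["LowFree"]
    return total, free, (free / total if total else 0.0)


# Made-up sample, NOT measurements from the machine described here.
SAMPLE = """\
MemTotal:       33000000 kB
HighTotal:      32200000 kB
HighFree:       20000000 kB
LowTotal:         800000 kB
LowFree:           40000 kB
"""

if __name__ == "__main__":
    print(lowmem_report(parse_meminfo(SAMPLE)))
```

On a real 32-bit system the text would come from open("/proc/meminfo").read() instead of SAMPLE.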

It has been recognized since 2003 if not earlier that the 1 GiB lowmem allocation is insufficient to manage 32 GiB of RAM. Ingo Molnar wrote:

But as the amount of RAM increases, the 3/1 split becomes a real bottleneck. Despite highmem being utilized by a number of large-size caches, one of the most crucial data structures, the mem_map[], is allocated out of the 1 GB kernel VM. With 32 GB of RAM the remaining 0.5 GB lowmem area is quite limited and only represents 1.5% of all RAM. Various common workloads exhaust the lowmem area and create artificial bottlenecks. With 64 GB RAM, the mem_map[] alone takes up nearly 1 GB of RAM, making the kernel unable to boot.
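The arithmetic behind the quoted figures can be reproduced roughly as follows. This is a back-of-the-envelope sketch, not kernel code: it assumes 4 KiB pages, treats the kernel's whole 1 GiB as available (real lowmem is somewhat smaller), and assumes 64 bytes per struct page. That last value is chosen only because it reproduces the quoted numbers; it is not taken from the kernel source.

```python
GIB = 1024 ** 3

PAGE_SIZE = 4096          # 4 KiB pages on x86
STRUCT_PAGE_BYTES = 64    # ASSUMPTION: picked to match the quoted figures
KERNEL_SPACE = 1 * GIB    # kernel share under the default 3G/1G split


def mem_map_bytes(ram_bytes):
    """Size of mem_map[]: one struct page per physical page of RAM."""
    return (ram_bytes // PAGE_SIZE) * STRUCT_PAGE_BYTES


if __name__ == "__main__":
    for ram_gib in (4, 8, 32, 64):
        used = mem_map_bytes(ram_gib * GIB)
        left = KERNEL_SPACE - used
        print(f"{ram_gib:2d} GiB RAM: mem_map = {used / GIB:.3f} GiB, "
              f"remaining kernel space = {left / GIB:.3f} GiB "
              f"({100 * left / (ram_gib * GIB):.2f}% of RAM)")
```

Under these assumptions, 32 GiB of RAM needs 0.5 GiB for mem_map[], leaving 0.5 GiB (about 1.5% of RAM), and 64 GiB consumes the entire 1 GiB, which is consistent with the quote.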

In practice, the limit before problems set in has proven to be only 8 GiB. This limit, and kernel devs' disinclination to remedy it, is documented obscurely by the final sentence of the undated Documentation/vm/highmem.txt file in the kernel source tree (4.6.7):

The general recommendation is that you don't use more than 8GiB on a 32-bit machine—although more might work for you and your workload, you're pretty much on your own—don't expect kernel developers to really care much if things come apart.

The benevolent dictator himself acknowledged the problem in a rant in 2007, in which he said (among other things):

PAE was a total and utter disaster. ... Directory caches, inodes, etc. couldn't use it, and in general it meant that under Linux, if you had more than 4GB of physical memory, you generally ran into problems (since only 25% of memory was available for normal kernel stuff—the rest had to be addressed through small holes in the tiny virtual address space).

It is not clear whether the I/O slowdown is a regression or merely a new symptom. It is possible that the slowdown was always there, but over time the kernel started using more lowmem, causing the slowdown to be triggered more often.

Blame

Although a lowmem shortage is necessary to trigger the problem, I assign ultimate blame to a bad default setting of the obscure highmem_is_dirtyable parameter. The parameter's description in the kernel's Documentation/admin-guide/sysctl/vm.rst describes exactly the bad behavior that was happening:

Available only for systems with CONFIG_HIGHMEM enabled (32b systems).

This parameter controls whether the high memory is considered for dirty writers throttling. This is not the case by default which means that only the amount of memory directly visible/usable by the kernel can be dirtied. As a result, on systems with a large amount of memory and lowmem basically depleted writers might be throttled too early and streaming writes can get very slow.

Changing the value to non zero would allow more memory to be dirtied and thus allow writers to write more data which can be flushed to the storage more effectively. Note this also comes with a risk of pre-mature OOM killer because some writers (e.g. direct block device writes) can only use the low memory and they can fill it up with dirty data without any throttling.

Defaulting this parameter to on risks uncommon operations getting OOM-killed. Defaulting it to off results in common operations putting the OS into a useless state. I don't expect to need to run the uncommon operations under 32-bit, so for me there is no upside to having this off.
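To see why the setting matters so much, consider a deliberately simplified model of the throttling described in the quoted documentation: writers are throttled once dirty data exceeds a percentage of the "dirtyable" memory, and highmem counts toward that total only when highmem_is_dirtyable is non-zero. The real kernel calculation has more terms. The memory sizes below are hypothetical, and the 20% ratio assumes the usual vm.dirty_ratio default.

```python
MIB = 1024 ** 2


def dirty_limit_bytes(low_dirtyable, high_dirtyable,
                      highmem_is_dirtyable, dirty_ratio=20):
    """Simplified dirty-data limit before writers are throttled.

    low_dirtyable / high_dirtyable: bytes of free or reclaimable memory
    in lowmem and highmem. Highmem is counted only if the flag is set.
    dirty_ratio: percentage, modelled on vm.dirty_ratio (assumed 20).
    """
    base = low_dirtyable + (high_dirtyable if highmem_is_dirtyable else 0)
    return base * dirty_ratio // 100


if __name__ == "__main__":
    # Hypothetical machine: lowmem nearly depleted, plenty of highmem.
    low = 300 * MIB
    high = 30 * 1024 * MIB
    for flag in (False, True):
        limit = dirty_limit_bytes(low, high, flag)
        print(f"highmem_is_dirtyable={int(flag)}: "
              f"limit = {limit // MIB} MiB")
```

In this model the limit is 60 MiB with the flag off and 6204 MiB with it on, a difference of two orders of magnitude, which fits the observed collapse of streaming writes.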

Another solution that worked for me (until it didn't)

Ingo's 4G/4G patch provided a full 4 GiB address space to both kernel and user at the cost of marginal performance overhead. It solved real problems and got deployed by Red Hat Enterprise Linux, but it was never applied upstream. Most people then migrated to 64-bit, leaving just a few of us "weirdos" still suffering from the unsolved PAE problem.

Although there is no 4G/4G patch for a modern kernel, it is possible to give the kernel more lowmem address space by taking some away from user space. The following description was accurate for kernel version 4.6.7 (2016) and remains accurate for kernel version 5.18.5 (2022).

Original configuration:

CONFIG_HIGHMEM64G: Processor type and features → High Memory Support = 64GB
CONFIG_X86_PAE: Processor type and features → PAE (Physical Address Extension) Support = y
CONFIG_HIGHPTE: Processor type and features → Allocate 3rd-level pagetables from highmem = y

Changes to adjust the memory split:

CONFIG_EXPERT: General setup → Configure standard kernel features (expert users) = y (was n)
CONFIG_VMSPLIT_2G: Processor type and features → Memory split = 2G/2G user/kernel split (was 3G/1G user/kernel split)
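A quick way to confirm that a kernel .config actually carries these settings is to parse it and compare against the list above. This is a sketch only: parse_config and missing_options are hypothetical helpers, and the option names are written with the CONFIG_ prefix that appears in .config files.

```python
# Options from the lists above, as they would appear in a .config file.
WANTED = {
    "CONFIG_HIGHMEM64G": "y",
    "CONFIG_X86_PAE": "y",
    "CONFIG_HIGHPTE": "y",
    "CONFIG_EXPERT": "y",
    "CONFIG_VMSPLIT_2G": "y",
}


def parse_config(text):
    """Parse kernel .config text into {option: value}.

    Comment lines, including '# CONFIG_FOO is not set', are skipped, so
    unset options are simply absent from the result.
    """
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            opts[key] = value
    return opts


def missing_options(text, wanted=WANTED):
    """Return a sorted list of wanted options not set to the wanted value."""
    opts = parse_config(text)
    return sorted(k for k, v in wanted.items() if opts.get(k) != v)


if __name__ == "__main__":
    sample = "CONFIG_X86_PAE=y\nCONFIG_VMSPLIT_3G=y\n"
    print(missing_options(sample))
```

On a real build tree, the text would be read from the .config file at the top of the kernel source directory.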

The help text for "Memory split" reads:

Select the desired split between kernel and user memory.

If the address range available to the kernel is less than the physical memory installed, the remaining memory will be available as "high memory". Accessing high memory is a little more costly than low memory, as it needs to be mapped into the kernel first. Note that increasing the kernel address space limits the range available to user programs, making the address space there tighter. Selecting anything other than the default 3G/1G split will also likely make your kernel incompatible with binary-only kernel modules.

If you are not absolutely sure what you are doing, leave this option alone!
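The effect of the split on kernel address space is simple arithmetic on the 32-bit address range. In the sketch below, the boundary addresses are the usual x86 values for these two splits (0xC0000000 and 0x80000000). The 128 MiB subtracted for the vmalloc area is an approximation of the default reserve, so the lowmem figures are estimates rather than exact values.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

ADDRESS_SPACE = 4 * GIB   # total 32-bit virtual address space

# Start of kernel space for each split.
PAGE_OFFSET = {
    "3G/1G": 0xC0000000,
    "2G/2G": 0x80000000,
}

# APPROXIMATION: default vmalloc reserve carved out of kernel space.
VMALLOC_RESERVE = 128 * MIB


def kernel_space_bytes(page_offset):
    """Virtual address range left for the kernel above page_offset."""
    return ADDRESS_SPACE - page_offset


def approx_lowmem_bytes(page_offset):
    """Rough upper bound on directly mapped lowmem for a given split."""
    return kernel_space_bytes(page_offset) - VMALLOC_RESERVE


if __name__ == "__main__":
    for name, offset in PAGE_OFFSET.items():
        print(f"{name}: kernel space = "
              f"{kernel_space_bytes(offset) // MIB} MiB, "
              f"approx. lowmem = {approx_lowmem_bytes(offset) // MIB} MiB")
```

By this estimate the 2G/2G split roughly doubles lowmem, from about 896 MiB to about 1920 MiB, at the cost of 1 GiB of user address space per process.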

In 2016, changing the memory split to 2G/2G fixed the problem for me and did not break the Nvidia driver in the process.

It worked for a while, but the problem came back in 2022. It flared up severely while installing Slackware 15.0 even though I was running kernel 5.18.5 with the 2G/2G split. I had to use the drop_caches trick to get the installation to finish. This relapse lends support to the idea that kernel bloat has worsened the problem over time.

It was only at this point that I learned of the highmem_is_dirtyable workaround from Ubuntu bug #1333294 (posted by Norbert, nrbrtx). With that bit set, I was able to complete the same installation with the standard 3G/1G memory split. If things go downhill, it should be possible to combine the new workaround with the previous 2G/2G split.

P.S.: a PAE-killing kernel fault was fixed in 4.15.2

Kernel 4.12.x: works
Kernel 4.13.x: panics during boot
Kernel 4.14.x: immediate spontaneous reboot at "booting the kernel"
Kernel 4.15.4: works again

commit 62c00e6122a6b5aa7b1350023967a2d7a12b54c9
Author: William Grant <william.grant@canonical.com>
Date:   Tue Jan 30 22:22:55 2018 +1100

    x86/mm: Fix overlap of i386 CPU_ENTRY_AREA with FIX_BTMAP
    
    commit 55f49fcb879fbeebf2a8c1ac7c9e6d90df55f798
    
    Since commit 92a0f81d8957 ("x86/cpu_entry_area: Move it out of the
    fixmap"), i386's CPU_ENTRY_AREA has been mapped to the memory area just
    below FIXADDR_START. But already immediately before FIXADDR_START is the
    FIX_BTMAP area, which means that early_ioremap can collide with the entry
    area.
    
    It's especially bad on PAE where FIX_BTMAP_BEGIN gets aligned to exactly
    match CPU_ENTRY_AREA_BASE, so the first early_ioremap slot clobbers the
    IDT and causes interrupts during early boot to reset the system.
    
    The overlap wasn't a problem before the CPU entry area was introduced,
    as the fixmap has classically been preceded by the pkmap or vmalloc
    areas, neither of which is used until early_ioremap is out of the
    picture.
    
    Relocate CPU_ENTRY_AREA to below FIX_BTMAP, not just below the permanent
    fixmap area.
    
    Fixes: commit 92a0f81d8957 ("x86/cpu_entry_area: Move it out of the fixmap")
    Signed-off-by: William Grant <william.grant@canonical.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/7041d181-a019-e8b9-4e4e-48215f841e2c@canonical.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
