From: Philipp Falk <philipp.falk@thinkparq.com>
To: linux-fsdevel@vger.kernel.org
Subject: Throughput drop and high CPU load on fast NVMe drives
Date: Tue, 22 Jun 2021 19:15:58 +0200 [thread overview]
Message-ID: <YNIaztBNK+I5w44w@xps13> (raw)
We are facing a performance issue on XFS and other filesystems running on
fast NVMe drives when reading large amounts of data through the page cache
with fio.
Streaming read performance starts off near the NVMe hardware limit until roughly a system-memory's worth of data has been read. Performance then drops to around half the hardware limit and CPU load increases significantly. Using perf, we were able to establish that most of the CPU time is spent spinning in native_queued_spin_lock_slowpath:
-   58,93%  58,92%  fio  [kernel.kallsyms]  [k] native_queued_spin_lock_slowpath
   - 45,72% __libc_read
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        ksys_read
        vfs_read
        new_sync_read
        xfs_file_read_iter
        xfs_file_buffered_aio_read
      - generic_file_read_iter
         - 45,72% ondemand_readahead
            - __do_page_cache_readahead
               - 34,64% __alloc_pages_nodemask
                  - 34,34% __alloc_pages_slowpath
                     - 34,33% try_to_free_pages
                          do_try_to_free_pages
                        - shrink_node
                           - 34,33% shrink_lruvec
                              - shrink_inactive_list
                                 - 28,22% shrink_page_list
                                    - 28,10% __remove_mapping
                                       - 28,10% _raw_spin_lock_irqsave
                                            native_queued_spin_lock_slowpath
                                 + 6,10% _raw_spin_lock_irq
               + 11,09% read_pages
When direct I/O is used, hardware-level read throughput is sustained for the entire experiment and CPU load stays low; threads sit in D state (waiting on I/O) most of the time.
Very similar results are described around halfway through this article [1].
Is this a known issue with the page cache under high-throughput I/O? Is there any tuning that can be applied to get around the CPU bottleneck? We have tried disabling readahead on the drives, which led to very poor throughput (roughly a 90% drop). Various other scheduler-related tuning was tried as well, but the results were always similar.
Experiment setup can be found below. I am happy to provide more detail if
required. If this is the wrong place to post this, please kindly let me
know.
Best regards
- Philipp
[1] https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/

Experiment setup:
CPU: 2x Intel(R) Xeon(R) Platinum 8352Y 2.2 GHz, 32c/64t each, 512GB memory
NVMe: 16x 1.6TB, 8 per NUMA node
FS: one XFS per disk, but reproducible on ext4 and ZFS
Kernel: Linux 5.3 (SLES), but reproducible on 5.12 (SUSE Tumbleweed)
NVMe scheduler: both "none" and "mq-deadline", very similar results
fio: 4 threads per NVMe drive, 20GiB of data per thread, ioengine=sync
Sustained read throughput, direct=1: ~52 GiB/s (~3.2 GiB/s per disk)
Sustained read throughput, direct=0: ~25 GiB/s (~1.5 GiB/s per disk)
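As a quick sanity check on the aggregate figures (assuming all 16 drives contribute equally):

```shell
# 16 drives at ~3.2 GiB/s each in the direct=1 case:
awk 'BEGIN { printf "%.1f GiB/s\n", 16 * 3.2 }'   # → 51.2 GiB/s
# 16 drives at ~1.5 GiB/s each in the direct=0 case:
awk 'BEGIN { printf "%.1f GiB/s\n", 16 * 1.5 }'   # → 24.0 GiB/s
```

Both match the measured ~52 GiB/s and ~25 GiB/s totals to within rounding.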
Thread overview: 3 messages
2021-06-22 17:15 Philipp Falk [this message]
2021-06-22 17:36 ` Throughput drop and high CPU load on fast NVMe drives Matthew Wilcox
2021-06-23 11:33 ` Philipp Falk