Date: Tue, 22 Jun 2021 19:15:58 +0200
From: Philipp Falk
To: linux-fsdevel@vger.kernel.org
Subject: Throughput drop and high CPU load on fast NVMe drives

We are facing a performance issue on XFS and other filesystems running on
fast NVMe drives when reading large amounts of data through the page cache
with fio. Streaming read performance starts off near the NVMe hardware
limit and stays there until roughly a full system memory's worth of data
has been read. Performance then drops to around half the hardware limit
and CPU load increases significantly.

Using perf, we were able to establish that most of the CPU load is caused
by spin lock contention in native_queued_spin_lock_slowpath:

- 58,93% 58,92% fio [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
     45,72% __libc_read
     entry_SYSCALL_64_after_hwframe
     do_syscall_64
     ksys_read
     vfs_read
     new_sync_read
     xfs_file_read_iter
     xfs_file_buffered_aio_read
   - generic_file_read_iter
      - 45,72% ondemand_readahead
         - __do_page_cache_readahead
            - 34,64% __alloc_pages_nodemask
               - 34,34% __alloc_pages_slowpath
                  - 34,33% try_to_free_pages
                       do_try_to_free_pages
                     - shrink_node
                        - 34,33% shrink_lruvec
                           - shrink_inactive_list
                              - 28,22% shrink_page_list
                                 - 28,10% __remove_mapping
                                    - 28,10% _raw_spin_lock_irqsave
                                         native_queued_spin_lock_slowpath
                              + 6,10% _raw_spin_lock_irq
            + 11,09% read_pages

When direct I/O is used, hardware-level read throughput is sustained for
the entire experiment and CPU load stays low; the threads stay in D state
most of the time. Very similar results are described around half-way
through this article [1].

Is this a known issue with the page cache and high-throughput I/O? Is
there any tuning that can be applied to get around the CPU bottleneck? We
have tried disabling readahead on the drives, which led to very poor
throughput (roughly a 90% drop). Various other scheduler-related tuning
was tried as well, but the results were always similar.

The experiment setup can be found below, together with rough sketches of
the fio job, the readahead tuning, and the perf invocation. I am happy to
provide more detail if required. If this is the wrong place to post this,
please kindly let me know.

Best regards
- Philipp

Experiment setup:

CPU: 2x Intel(R) Xeon(R) Platinum 8352Y 2.2 GHz, 32c/64t each, 512GB memory
NVMe: 16x 1.6TB drives, 8 per NUMA node
FS: one XFS per disk, but reproducible on ext4 and ZFS
Kernel: Linux 5.3 (SLES), but reproducible on 5.12 (SUSE Tumbleweed)
NVMe scheduler: both "none" and "mq-deadline", very similar results
fio: 4 threads per NVMe drive, 20GiB of data per thread, ioengine=sync
Sustained read throughput, direct=1: ~52 GiB/s (~3.2 GiB/s per disk)
Sustained read throughput, direct=0: ~25 GiB/s (~1.5 GiB/s per disk)

[1] https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation/
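
For completeness, the per-drive fio job corresponds roughly to the job
file below (block size and mount point are placeholders; the actual jobs
are generated per drive, one mount point each):

    # streaming-read.fio - rough sketch of the per-drive read job
    # direct=0 reproduces the page-cache drop, direct=1 the sustained case
    [global]
    ioengine=sync
    rw=read
    bs=1M
    thread
    numjobs=4
    size=20g
    direct=0
    directory=/mnt/nvme0

    [streaming-read]

One such job runs per mount point (16 in total); the throughput numbers
above are the aggregate across all drives.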
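
For the readahead experiment mentioned above, readahead was toggled per
device through the usual sysfs knob, along these lines (device name is an
example, repeated for all 16 drives):

    # disable readahead for one drive
    echo 0 > /sys/block/nvme0n1/queue/read_ahead_kb
    # restore the default of 128 KiB afterwards
    echo 128 > /sys/block/nvme0n1/queue/read_ahead_kb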
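
The call graph above was captured with plain perf call-graph sampling,
roughly as follows (exact duration and options may have differed):

    # sample the whole system with call graphs while fio is running
    perf record -a -g -- sleep 30
    # browse the profile; the tree above is taken from this view
    perf report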