From: Ivan Babrou <ivan@cloudflare.com>
Date: Wed, 23 Feb 2022 11:28:39 -0800
Subject: Re: Regression in workingset_refault latency on 5.15
To: Shakeel Butt
Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

> Can you please first confirm if you are running the 5.15.19 kernel? If
> that is the case then it means the machine is seeing (nr_cpus * 32)
> stat updates.

Yes, we're on 5.15.19.

> Couple of questions:
> 1) Did you see any call stack other than xfs read where expensive
> mem_cgroup_flush_stats() is showing up?

Yes:

mem_cgroup_flush_stats: 7717 us
stack:
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30

> 2) Can you please use similar bpf+kprobe tracing for
> memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> source of frequent stat updates.

memcg_rstat_updated() is static inline, so it can't be probed directly. With the following:

bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'

The top stacks are:

@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    clear_page_dirty_for_io+292
    write_cache_pages+323
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
    __writeback_inodes_wb+76
]: 36429
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    __test_set_page_writeback+421
    iomap_do_writepage+1241
    write_cache_pages+393
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
]: 36431
@stacks[
    __mod_memcg_lruvec_state+1
    mod_objcg_state+421
    memcg_slab_post_alloc_hook+345
    __kmalloc_node_track_caller+386
    __alloc_skb+139
    alloc_skb_with_frags+75
    sock_alloc_send_pskb+495
    unix_stream_sendmsg+599
    sock_sendmsg+94
    sock_write_iter+151
]: 39459
@stacks[
    __mod_memcg_lruvec_state+1
    release_pages+760
    tlb_finish_mmu+89
    zap_page_range+265
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54648
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_remove_rmap+85
    unmap_page_range+1413
    zap_page_range+224
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54719
@stacks[
    __mod_memcg_lruvec_state+1
    __pagevec_lru_add+256
    lru_cache_add+82
    __handle_mm_fault+4686
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 98103
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_add_new_anon_rmap+78
    __handle_mm_fault+4675
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 105071

> 3) I am still pondering why disabling swap resolves the issue for you.
> Is that only for a workload different from xfs read?

My understanding is that any block IO (including swap) triggers the new memcg accounting code. Our process doesn't do any IO other than swap, so disabling swap removes the major (if not the only) vector for triggering this issue.

> 4) Can you please also trace cgroup_rstat_flush_locked()?

Stacks are:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @stacks[kstack(10)]++ }'
Attaching 1 probe...
^C

@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    shrink_node+163
    balance_pgdat+777
    kswapd+508
    kthread+295
    ret_from_fork+34
]: 1
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
]: 5
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+569
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 18
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 31
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
]: 223

Let me know if you only need the slow ones.

I should mention that there are really two issues:

1. Expensive workingset_refault, which shows up on flamegraphs. We see it for our RocksDB-based database, which persists data on xfs (local nvme).
2. Expensive workingset_refault that causes latency hiccups but doesn't show up on flamegraphs. We see it in our nginx-based proxy with swap enabled (either zram or a regular file on xfs on local nvme).

We solved the latter by disabling swap. I think the proper solution would be for workingset_refault to be fast enough to be invisible, in line with what was happening on Linux 5.10.
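
For completeness, this is roughly the kind of kprobe/kretprobe latency probe that produces per-call numbers like the "mem_cgroup_flush_stats: 7717 us" sample above. It's only a minimal sketch, not the exact script we ran; the 1 ms reporting threshold is an arbitrary value picked for illustration, and it assumes mem_cgroup_flush_stats is a probe-able (non-inlined) symbol on the running kernel:

bpftrace -e '
kprobe:mem_cgroup_flush_stats { @start[tid] = nsecs; }
kretprobe:mem_cgroup_flush_stats /@start[tid]/ {
  $us = (nsecs - @start[tid]) / 1000;
  if ($us > 1000) {                  // report only calls slower than 1 ms (illustrative threshold)
    printf("mem_cgroup_flush_stats: %d us\n", $us);
    printf("stack:\n%s\n", kstack(10));
  }
  delete(@start[tid]);               // always drop the entry timestamp for this thread
}'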
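
On the second issue (hiccups that don't show up on flamegraphs), a latency histogram makes that kind of tail visible even when the average cost is small. A sketch along the same lines, again assuming workingset_refault can be kprobed on the running kernel:

bpftrace -e '
kprobe:workingset_refault { @start[tid] = nsecs; }
kretprobe:workingset_refault /@start[tid]/ {
  @us = hist((nsecs - @start[tid]) / 1000);   // per-call latency in microseconds, dumped on Ctrl-C
  delete(@start[tid]);
}'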