From: Ivan Babrou <ivan@cloudflare.com>
Date: Wed, 23 Feb 2022 11:28:39 -0800
Subject: Re: Regression in workingset_refault latency on 5.15
To: Shakeel Butt
Cc: Daniel Dao, kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang, Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds

> Can you please first confirm if you are running the 5.15.19 kernel? If
> that is the case then it means the machine is seeing (nr_cpus * 32)
> stat updates.

Yes, we're on 5.15.19.

> Couple of questions:
> 1) Did you see any call stack other than xfs read where expensive
> mem_cgroup_flush_stats() is showing up?

Yes:

mem_cgroup_flush_stats: 7717 us
stack:
    mem_cgroup_flush_stats+1
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30

> 2) Can you please use similar bpf+kprobe tracing for
> memcg_rstat_updated() (or __mod_memcg_lruvec_state()) to find the
> source of frequent stat updates.

memcg_rstat_updated() is static inline, so it can't be probed directly. With the following:

bpftrace -e 'kprobe:__mod_memcg_lruvec_state { @stacks[kstack(10)]++ }'

The top stacks are:

@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    clear_page_dirty_for_io+292
    write_cache_pages+323
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
    __writeback_inodes_wb+76
]: 36429
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    __test_set_page_writeback+421
    iomap_do_writepage+1241
    write_cache_pages+393
    iomap_writepages+28
    xfs_vm_writepages+113
    do_writepages+195
    __writeback_single_inode+55
    writeback_sb_inodes+493
]: 36431
@stacks[
    __mod_memcg_lruvec_state+1
    mod_objcg_state+421
    memcg_slab_post_alloc_hook+345
    __kmalloc_node_track_caller+386
    __alloc_skb+139
    alloc_skb_with_frags+75
    sock_alloc_send_pskb+495
    unix_stream_sendmsg+599
    sock_sendmsg+94
    sock_write_iter+151
]: 39459
@stacks[
    __mod_memcg_lruvec_state+1
    release_pages+760
    tlb_finish_mmu+89
    zap_page_range+265
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54648
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_remove_rmap+85
    unmap_page_range+1413
    zap_page_range+224
    do_madvise.part.0+1844
    __x64_sys_madvise+69
    do_syscall_64+59
    entry_SYSCALL_64_after_hwframe+68
]: 54719
@stacks[
    __mod_memcg_lruvec_state+1
    __pagevec_lru_add+256
    lru_cache_add+82
    __handle_mm_fault+4686
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 98103
@stacks[
    __mod_memcg_lruvec_state+1
    __mod_lruvec_page_state+96
    page_add_new_anon_rmap+78
    __handle_mm_fault+4675
    handle_mm_fault+191
    do_user_addr_fault+428
    exc_page_fault+103
    asm_exc_page_fault+30
]: 105071

> 3) I am still pondering why disabling swap resolves the issue for you.
> Is that only for a workload different from xfs read?

My understanding is that any block IO (including swap) triggers the new memcg accounting code. Our process doesn't do any IO other than swap, so disabling swap removes the major (if not the only) vector for triggering this issue.

> 4) Can you please also trace cgroup_rstat_flush_locked()?

Stacks are:

bpftrace -e 'kprobe:cgroup_rstat_flush_locked { @stacks[kstack(10)]++ }'
Attaching 1 probe...
^C

@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    shrink_node+163
    balance_pgdat+777
    kswapd+508
    kthread+295
    ret_from_fork+34
]: 1
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    __read_swap_cache_async+512
    swapin_readahead+817
    do_swap_page+561
    __handle_mm_fault+2318
    handle_mm_fault+191
    do_user_addr_fault+428
]: 5
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+569
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 18
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
    xfs_file_read_iter+106
]: 31
@stacks[
    cgroup_rstat_flush_locked+1
    cgroup_rstat_flush_irqsafe+36
    __mem_cgroup_flush_stats+51
    workingset_refault+296
    add_to_page_cache_lru+159
    page_cache_ra_unbounded+340
    force_page_cache_ra+226
    filemap_get_pages+233
    filemap_read+164
    xfs_file_buffered_read+152
]: 223

Let me know if you only need the slow ones.

I should mention that there are really two issues:

1. Expensive workingset_refault, which shows up on flamegraphs. We see it for our RocksDB-based database, which persists data on xfs (local nvme).
2. Expensive workingset_refault that causes latency hiccups but doesn't show up on flamegraphs. We see it in our nginx-based proxy with swap enabled (either zram or a regular file on xfs on local nvme).

We solved the latter by disabling swap. I think the proper solution would be for workingset_refault to be fast enough to be invisible, in line with what was happening on Linux 5.10.
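
For completeness, this is roughly the kind of kprobe/kretprobe latency probe that produces per-call numbers like the "mem_cgroup_flush_stats: 7717 us" sample above. It's only a minimal sketch, not the exact script we ran; the 1 ms reporting threshold is an arbitrary value picked for illustration, and it assumes mem_cgroup_flush_stats is a probe-able (non-inlined) symbol on the running kernel:

bpftrace -e '
kprobe:mem_cgroup_flush_stats { @start[tid] = nsecs; }
kretprobe:mem_cgroup_flush_stats /@start[tid]/ {
  $us = (nsecs - @start[tid]) / 1000;
  if ($us > 1000) {                  // report only calls slower than 1 ms (illustrative threshold)
    printf("mem_cgroup_flush_stats: %d us\n", $us);
    printf("stack:\n%s\n", kstack(10));
  }
  delete(@start[tid]);               // always drop the entry timestamp for this thread
}'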
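
On the second issue (hiccups that don't show up on flamegraphs), a latency histogram makes that kind of tail visible even when the average cost is small. A sketch along the same lines, again assuming workingset_refault can be kprobed on the running kernel:

bpftrace -e '
kprobe:workingset_refault { @start[tid] = nsecs; }
kretprobe:workingset_refault /@start[tid]/ {
  @us = hist((nsecs - @start[tid]) / 1000);   // per-call latency in microseconds, dumped on Ctrl-C
  delete(@start[tid]);
}'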