From: Shakeel Butt <shakeelb@google.com>
Date: Wed, 23 Feb 2022 07:57:35 -0800
Subject: Re: Regression in workingset_refault latency on 5.15
To: Daniel Dao, Ivan Babrou
Cc: kernel-team, Linux MM, Johannes Weiner, Roman Gushchin, Feng Tang,
 Michal Hocko, Hillf Danton, Michal Koutný, Andrew Morton, Linus Torvalds
Content-Type: text/plain; charset="UTF-8"
+Ivan Babrou

Hi Daniel,

On Wed, Feb 23, 2022 at 5:51 AM Daniel Dao wrote:
>
> Hi all,
>
> We are observing some regressions in workingset_refault on our newly upgraded
> 5.15.19 nodes with zram as swap. This manifests in several ways:
>
> 1) Regression of workingset_refault duration observed in flamegraphs
>
> We regularly collect flamegraphs for running services on the node. Since the
> upgrade to 5.15.19, we see that workingset_refault occupies a more significant
> part of the service flamegraph (13%), with the following call trace:
>
> workingset_refault+0x128
> add_to_page_cache_lru+0x9f
> page_cache_ra_unbounded+0x154
> force_page_cache_ra+0xe2
> filemap_get_pages+0xe9
> filemap_read+0xa4
> xfs_file_buffered_read+0x98
> xfs_file_read_iter+0x6a
> new_sync_read+0x118
> vfs_read+0xf2
> __x64_sys_pread64+0x89
> do_syscall_64+0x3b
> entry_SYSCALL_64_after_hwframe+0x44
>
> 2) Regression of performance-sensitive userspace code
>
> We have some performance-sensitive code running in userspace whose runtime
> is measured with CLOCK_THREAD_CPUTIME_ID. It looks roughly like:
>
> now = clock_gettime(CLOCK_THREAD_CPUTIME_ID)
> func()
> elapsed = clock_gettime(CLOCK_THREAD_CPUTIME_ID) - now
>
> Since the 5.15 upgrade, we have observed long `elapsed` values in the range
> of 4-10ms much more frequently than before. This went away after we disabled
> swap for the service using the `memory.swap.max=0` memcg configuration.
>
> The new thing in the 5.15 workingset_refault path seems to be the
> introduction of mem_cgroup_flush_stats() by commit
> 1f828223b7991a228bc2aef837b78737946d44b2 ("memcg: flush lruvec stats in
> the refault").
>
> Given that mem_cgroup_flush_stats() can take quite a long time for us on the
> standard systemd cgroupv2 hierarchy ( root / system.slice / workload.service ):
>
> sudo /usr/share/bcc/tools/funcslower -m 10 -t mem_cgroup_flush_stats
> Tracing function calls slower than 10 ms... Ctrl+C to quit.
> TIME      COMM  PID     LAT(ms)  RVAL  FUNC
> 0.000000        804776    11.50   200  mem_cgroup_flush_stats
> 0.343383        647496    10.58   200  mem_cgroup_flush_stats
> 0.604309        804776    10.50   200  mem_cgroup_flush_stats
> 1.230416        803293    10.01   200  mem_cgroup_flush_stats
> 1.248442        646400    11.02   200  mem_cgroup_flush_stats
>
> Could it be that workingset_refault in some unfortunate case takes much
> longer than before, such that it increases the time observed via
> CLOCK_THREAD_CPUTIME_ID from userspace, or the overall duration of
> workingset_refault observed by perf?

Ivan reported this issue last month, and since then we have backported three
patches to 5.15-stable to fix it. Those were merged into 5.15.18. However,
you are reporting that the issue is still present in 5.15.19. Can you share
a bit more detail on your hardware configuration (number of CPUs) and, if
possible, the flamegraph?

thanks,
Shakeel