Re: [PATCH v4 2/2] memcg: infrastructure to flush memcg stats

From: Marek Szyprowski <m.szyprowski@samsung.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: "Tejun Heo" <tj@kernel.org>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Muchun Song" <songmuchun@bytedance.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <guro@fb.com>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Huang Ying" <ying.huang@intel.com>,
	"Hillf Danton" <hdanton@sina.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	Cgroups <cgroups@vger.kernel.org>,
	"Linux MM" <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v4 2/2] memcg: infrastructure to flush memcg stats
Date: Fri, 16 Jul 2021 17:58:37 +0200
Message-ID: <75599651-b3eb-45a7-56c8-f83546650c94@samsung.com>
In-Reply-To: <CALvZod5SONQ6=ewesLhMSampu=sxbA3iDS3f+rsHkEUY5G2Cyg@mail.gmail.com>

Hi,

On 16.07.2021 17:14, Shakeel Butt wrote:
> Hi Marek
>
> On Fri, Jul 16, 2021 at 8:03 AM Marek Szyprowski
> <m.szyprowski@samsung.com> wrote:
>> Hi,
>>
>> On 14.07.2021 03:39, Shakeel Butt wrote:
>>> At the moment memcg stats are read in four contexts:
>>>
>>> 1. memcg stat user interfaces
>>> 2. dirty throttling
>>> 3. page fault
>>> 4. memory reclaim
>>>
>>> Currently the kernel flushes the stats for the first two cases. Flushing
>>> the stats for the remaining two cases may have a performance impact.
>>> Always flushing the memcg stats on the page fault code path may negatively
>>> impact the performance of applications. In addition, flushing in the
>>> memory reclaim code path, though treated as slowpath, can become the
>>> source of contention for the global lock taken for stat flushing because
>>> when the system or a memcg is under memory pressure, many tasks may enter the
>>> reclaim path.
>>>
>>> This patch uses the following mechanisms to solve these challenges:
>>>
>>> 1. Periodically flush the stats from the root memcg every 2 seconds. This
>>> puts a time limit on how long the stats can be out of sync.
>>>
>>> 2. Asynchronously flush the stats after a fixed number of stat updates.
>>> In the worst case the stat can be out of sync by O(nr_cpus * BATCH) for
>>> 2 seconds.
>>>
>>> 3. To avoid a thundering herd of stat flushers, particularly from the
>>> memory reclaim context, introduce a memcg-local spinlock and let only one
>>> flusher be active at a time. This could have been done through
>>> cgroup_rstat_lock, but that lock is used by other subsystems and for
>>> userspace reading memcg stats. So, it is better to keep the flushers
>>> introduced by this patch decoupled from cgroup_rstat_lock.
>>>
>>> Signed-off-by: Shakeel Butt <shakeelb@google.com>
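
For anyone reading along, here is a minimal userspace sketch of mechanisms 2
and 3 from the commit message above. It is not the kernel code: MODEL_BATCH
and all model_* names are hypothetical, a C11 atomic flag stands in for the
memcg-local spinlock, and the flush runs synchronously here whereas the patch
flushes asynchronously.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical batch threshold; the patch's real value is not restated here. */
#define MODEL_BATCH 4

static atomic_int model_pending;                        /* updates since last flush */
static atomic_flag model_flush_lock = ATOMIC_FLAG_INIT; /* stand-in for the spinlock */
static int model_flush_count;                           /* flushes actually performed */

/*
 * Trylock-style flush: if another flusher already holds the lock, skip
 * instead of waiting, so at most one flusher is active at a time.
 */
static bool model_flush_stats(void)
{
	if (atomic_flag_test_and_set(&model_flush_lock))
		return false;                   /* someone else is flushing */

	atomic_store(&model_pending, 0);        /* "flush" the pending updates */
	model_flush_count++;
	atomic_flag_clear(&model_flush_lock);
	return true;
}

/* Record one stat update and flush once the batch threshold is reached. */
static void model_stat_update(void)
{
	if (atomic_fetch_add(&model_pending, 1) + 1 >= MODEL_BATCH)
		model_flush_stats();
}
```

In this model a caller that loses the trylock race returns at once, which is
the property that keeps many reclaimers from piling up behind one lock.
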
>> This patch landed in today's linux-next (next-20210716) as commit
>> 42265e014ac7 ("memcg: infrastructure to flush memcg stats"). On my test
>> systems I found that it triggers a kernel BUG on all ARM64 boards:
>>
>>    BUG: sleeping function called from invalid context at kernel/cgroup/rstat.c:200
>>    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 7, name: kworker/u8:0
>>    3 locks held by kworker/u8:0/7:
>>     #0: ffff00004000c938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x200/0x718
>>     #1: ffff80001334bdd0 ((stats_flush_dwork).work){+.+.}-{0:0}, at: process_one_work+0x200/0x718
>>     #2: ffff8000124f6d40 (stats_flush_lock){+.+.}-{2:2}, at: mem_cgroup_flush_stats+0x20/0x48
>>    CPU: 2 PID: 7 Comm: kworker/u8:0 Tainted: G        W 5.14.0-rc1+ #3713
>>    Hardware name: Raspberry Pi 4 Model B (DT)
>>    Workqueue: events_unbound flush_memcg_stats_dwork
>>    Call trace:
>>     dump_backtrace+0x0/0x1d0
>>     show_stack+0x14/0x20
>>     dump_stack_lvl+0x88/0xb0
>>     dump_stack+0x14/0x2c
>>     ___might_sleep+0x1dc/0x200
>>     __might_sleep+0x4c/0x88
>>     cgroup_rstat_flush+0x2c/0x58
>>     mem_cgroup_flush_stats+0x34/0x48
>>     flush_memcg_stats_dwork+0xc/0x38
>>     process_one_work+0x2a8/0x718
>>     worker_thread+0x48/0x460
>>     kthread+0x12c/0x160
>>     ret_from_fork+0x10/0x18
>>
>> This can also be reproduced with QEMU. Please let me know if I can help
>> fix this issue.
>>
> Thanks for the report. The issue can be fixed by changing
> cgroup_rstat_flush() to cgroup_rstat_flush_irqsafe() in
> mem_cgroup_flush_stats(). I will send out the updated patch in a
> couple of hours after a bit more testing.

Right, this fixes the issue on my test systems. Feel free to add:

Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>

to the fixup patch if the target kernel tree won't be rebased and the 
original patch (42265e014ac7) stays.
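
As a rough illustration of the splat and the fix discussed above, here is a
small userspace model, not kernel code. All ctx_* names are hypothetical. A
counter stands in for atomic context (a held spinlock), ctx_might_sleep()
stands in for the might_sleep() check seen in the trace, and the two flush
helpers stand in for cgroup_rstat_flush() and cgroup_rstat_flush_irqsafe().

```c
#include <assert.h>

static int ctx_atomic_depth;  /* > 0 while a model spinlock is held */
static int ctx_bug_reports;   /* how often the might-sleep check fired */

static void ctx_spin_lock(void)   { ctx_atomic_depth++; }
static void ctx_spin_unlock(void) { ctx_atomic_depth--; }

/* Stand-in for might_sleep(): report when called in atomic context. */
static void ctx_might_sleep(void)
{
	if (ctx_atomic_depth > 0)
		ctx_bug_reports++;
}

/* Stand-in for a flush that may sleep, as cgroup_rstat_flush() does in the trace. */
static void ctx_flush_may_sleep(void)
{
	ctx_might_sleep();
}

/* Stand-in for a flush that never sleeps, like cgroup_rstat_flush_irqsafe(). */
static void ctx_flush_nosleep(void)
{
}

/* Flusher as reported: the sleeping flush runs with the spinlock held. */
static void ctx_flusher_reported(void)
{
	ctx_spin_lock();
	ctx_flush_may_sleep();
	ctx_spin_unlock();
}

/* Flusher with the proposed change: the non-sleeping flush under the lock. */
static void ctx_flusher_fixed(void)
{
	ctx_spin_lock();
	ctx_flush_nosleep();
	ctx_spin_unlock();
}
```

In this model only ctx_flusher_reported() trips the check, which mirrors why
switching the flush variant makes the warning go away.
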

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland