From: Wei Xu
Date: Mon, 7 Mar 2022 14:53:40 -0800
Subject: Re: [RFC] Mechanism to induce memory reclaim
To: Johannes Weiner
Cc: David Rientjes, Andrew Morton, Michal Hocko, Yu Zhao, Dave Hansen,
    Linux MM, Yosry Ahmed, Shakeel Butt, Greg Thelen

On Mon, Mar 7,
2022 at 12:50 PM Johannes Weiner wrote:
>
> On Sun, Mar 06, 2022 at 03:11:23PM -0800, David Rientjes wrote:
> > Hi everybody,
> >
> > We'd like to discuss formalizing a mechanism to induce memory reclaim by
> > the kernel.
> >
> > The current multigenerational LRU proposal introduces a debugfs
> > mechanism[1] for this. The "TMO: Transparent Memory Offloading in
> > Datacenters" paper also discusses a per-memcg mechanism[2]. While the
> > former can be used for debugging of MGLRU, both can quite powerfully be
> > used for proactive reclaim.
> >
> > Google's datacenters use a similar per-memcg mechanism for the same
> > purpose. Thus, formalizing the mechanism would allow our userspace to use
> > an upstream supported interface that will be stable and consistent.
> >
> > This could be an incremental addition to MGLRU's lru_gen debugfs mechanism
> > but, since the concept has no direct dependency on the work, we believe it
> > is useful independent of the reclaim mechanism in use (both with and
> > without CONFIG_LRU_GEN).
> >
> > Idea: introduce a per-node sysfs mechanism for inducing memory reclaim
> > that can be useful for global (non-memcg constrained) reclaim and possible
> > even if memcg is not enabled in the kernel or mounted. This could
> > optionally take a memcg id to induce reclaim for a memcg hierarchy.
> >
> > IOW, this would be a /sys/devices/system/node/nodeN/reclaim mechanism for
> > each NUMA node N on the system. (It would be similar to the existing
> > per-node sysfs "compact" mechanism used to trigger compaction from
> > userspace.)
>
> I generally think a proactive reclaim interface is a good idea.

It is great to hear this.

> A per-cgroup control knob would make more sense to me, as cgroupfs
> takes care of delegation, namespacing etc. and so would permit
> self-directed proactive reclaim inside containers.

A per-cgroup control works perfectly for Google's data center use case
as well.
But a sysfs interface, such as /sys/kernel/mm/reclaim, that takes a
node mask and a memcg id as arguments can also be used by proactive
reclaimers on systems that don't use memcg (e.g. some desktop Linux
distros), which makes it more general. A special value for the memcg id
indicating global reclaim can be passed to support non-memcg use cases.

> > Userspace would write the following to this file:
> > - nr_to_reclaim pages
>
> This makes sense, although (and you hinted at this below), I'm
> thinking it should be in bytes, especially if part of cgroupfs.
>
> > - swappiness factor
>
> This I'm not sure about.
>
> Mostly because I'm not sure about swappiness in general. It balances
> between anon and file, but both of them are aged according to the same
> LRU rules. The only reason to prefer one over the other seems to be
> when the cost of reloading one (refault vs swapin) isn't the same as
> the other. That's usually a hardware property, which in a perfect
> world we'd auto-tune inside the kernel based on observed IO
> performance. Not sure why you'd want this per reclaim request.

The choice between anon and file pages is not only a hardware property,
but also a policy decision. It is useful to give the userspace policy
daemon the flexibility to reclaim from anon pages, file pages, or both,
for the exact reasons that you have described. This is important for
the use cases at Google (where anon pages are the primary focus of
proactive reclaim). Maybe we can replace the swappiness factor with a
page type mask that selects more explicitly which types of pages to
reclaim.

> > - flags to specify context, if any[**]
> >
> > [**] this is offered for extensibility to specify the context in which
> > reclaim is being done (clean file pages only, demotion for memory
> > tiering vs eviction, etc), otherwise 0
>
> This one is curious.
> I don't understand the use cases for either of
> these examples, and I can't think of other flags a user may pass on a
> per-invocation basis. Would you care to elaborate some?

One example is a flag that controls whether the requested proactive
reclaim is allowed to induce I/O. This can be especially useful for
memory tiering to lower-cost memory devices, where I/O is likely
undesirable for proactively requested, reclaim-based demotion.

Wei
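
P.S. To make the parameters discussed above a bit more concrete, here is
a rough sketch of how a userspace policy daemon might assemble one
reclaim request. Everything in it is an assumption for illustration
only: the space-separated text format, the field order, the names
(ReclaimRequest, PageType, ReclaimFlag, MEMCG_GLOBAL, NO_IO), and the use
of 0 as the special "global reclaim" memcg id. None of this is an agreed
interface, and the sketch only builds the string; it does not write to
any file.

```python
from dataclasses import dataclass
from enum import IntFlag


class PageType(IntFlag):
    """Hypothetical page type mask, replacing the swappiness factor."""
    ANON = 1
    FILE = 2


class ReclaimFlag(IntFlag):
    """Hypothetical per-request context flags."""
    NO_IO = 1  # assumption: this request must not induce I/O


# Assumption: a special memcg id that selects global (non-memcg) reclaim.
MEMCG_GLOBAL = 0


@dataclass(frozen=True)
class ReclaimRequest:
    nodes: frozenset                 # NUMA node ids to reclaim from
    nr_bytes: int                    # amount to reclaim, in bytes
    memcg_id: int = MEMCG_GLOBAL
    page_types: PageType = PageType.ANON | PageType.FILE
    flags: ReclaimFlag = ReclaimFlag(0)

    def node_mask(self) -> int:
        """Fold the node ids into a bitmask (bit N set for node N)."""
        mask = 0
        for node in self.nodes:
            mask |= 1 << node
        return mask

    def encode(self) -> str:
        """Render the request in a made-up, space-separated text format."""
        if not self.nodes:
            raise ValueError("at least one node is required")
        if self.nr_bytes <= 0:
            raise ValueError("nr_bytes must be positive")
        if not self.page_types:
            raise ValueError("at least one page type is required")
        return (f"{self.node_mask():#x} {self.memcg_id} {self.nr_bytes} "
                f"{int(self.page_types):#x} {int(self.flags):#x}")


# Example: reclaim 1 GiB of anon pages from nodes 0 and 1, globally,
# without inducing I/O.
request = ReclaimRequest(nodes=frozenset({0, 1}), nr_bytes=1 << 30,
                         page_types=PageType.ANON, flags=ReclaimFlag.NO_IO)
print(request.encode())
```

A daemon on a system without memcg would leave memcg_id at the global
value, while a per-cgroup variant of the interface could drop the memcg
id field entirely, since the cgroup directory would already identify the
target.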