Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
From: Yang Shi <yang.shi@linux.alibaba.com>
To: David Rientjes
Cc: Dave Hansen, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kbusch@kernel.org,
 ying.huang@intel.com, dan.j.williams@intel.com
Date: Wed, 1 Jul 2020 10:21:24 -0700
Message-ID: <33028a57-24fd-e618-7d89-5f35a35a6314@linux.alibaba.com>
References: <20200629234503.749E5340@viggo.jf.intel.com>
 <20200629234509.8F89C4EF@viggo.jf.intel.com>
 <039a5704-4468-f662-d660-668071842ca3@linux.alibaba.com>
On 6/30/20 10:41 PM, David Rientjes wrote:
> On Tue, 30 Jun 2020, Yang Shi wrote:
>
>>>> From: Dave Hansen
>>>>
>>>> If a memory node has a preferred migration path to demote cold pages,
>>>> attempt to move those inactive pages to that migration node before
>>>> reclaiming. This will better utilize available memory, provide a faster
>>>> tier than swapping or discarding, and allow such pages to be reused
>>>> immediately without IO to retrieve the data.
>>>>
>>>> When handling anonymous pages, this will be considered before swap if
>>>> enabled. Should the demotion fail for any reason, the page reclaim
>>>> will proceed as if the demotion feature was not enabled.
>>>>
>>> Thanks for sharing these patches and kick-starting the conversation, Dave.
>>>
>>> Could this cause us to break a user's mbind() or allow a user to
>>> circumvent their cpuset.mems?
>>>
>>> Because we don't have a mapping of the page back to its allocation
>>> context (or the process context in which it was allocated), it seems like
>>> both are possible.
>> Yes, this could break the memory placement policy enforced by mbind and
>> cpuset. I discussed this with Michal on the mailing list and tried to find
>> a way to solve it, but unfortunately it does not seem easy, as you mentioned
>> above. The memory policy and cpuset are stored in task_struct rather than
>> mm_struct. It is not easy to trace back to task_struct from the page (the
>> owner field of mm_struct might be helpful, but it depends on CONFIG_MEMCG
>> and is not the preferred way).
>>
> Yeah, and Ying made a similar response to this message.
>
> We can do this if we consider pmem not to be a separate memory tier from
> the system perspective, however, but rather the socket perspective. In
> other words, a node can only demote to a series of exclusive pmem ranges
> and promote to the same series of ranges in reverse order. So DRAM node 0
> can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> one DRAM node.
>
> This naturally takes care of mbind() and cpuset.mems if we consider pmem
> just to be slower volatile memory and we don't need to deal with the
> latency concerns of cross socket migration. A user page will never be
> demoted to a pmem range across the socket and will never be promoted to a
> different DRAM node that it doesn't have access to.

But I don't see too much benefit in limiting the migration target to the
so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
different socket) pmem node, since even the cross-socket access should be
much faster than a refault or swap from disk.
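To make the *paired* idea concrete, here is a minimal, self-contained sketch
of a per-node demotion target table (the names and the fixed node pairs below
are purely illustrative, not taken from Dave's patches); relaxing the pairing
would simply mean allowing more than one candidate target per DRAM node:

/*
 * Illustrative sketch only: one demotion target per source node.
 * MAX_NUMNODES and NUMA_NO_NODE are redefined here so the sketch is
 * self-contained; in the kernel they already exist.
 */
#define MAX_NUMNODES	64
#define NUMA_NO_NODE	(-1)

static int node_demotion[MAX_NUMNODES] = {
	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE,
};

/* "Paired" setup: DRAM node 0 -> PMEM node 2, DRAM node 1 -> PMEM node 3 */
static void init_demotion_pairs(void)
{
	node_demotion[0] = 2;
	node_demotion[1] = 3;
}

/* Where should a cold page from @node go?  NUMA_NO_NODE means just reclaim. */
static int next_demotion_node(int node)
{
	return node_demotion[node];
}

Allowing remote pmem targets would turn node_demotion[] into a per-node list
(or nodemask) ordered by distance, rather than a single fixed entry.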
> That can work with the NUMA abstraction for pmem, but it could also
> theoretically be a new memory zone instead. If all memory living on pmem
> is migratable (the natural way that memory hotplug is done, so we can
> offline), this zone would live above ZONE_MOVABLE. Zonelist ordering
> would determine whether we can allocate directly from this memory based on
> system config or a new gfp flag that could be set for users of a mempolicy
> that allows allocations directly from pmem. If abstracted as a NUMA node
> instead, interleave over nodes {0,2,3} or a cpuset.mems of {0,2,3} doesn't
> make much sense.
>
> Kswapd would need to be enlightened for proper pgdat and pmem balancing
> but in theory it should be simpler because it only has its own node to
> manage. Existing per-zone watermarks might be easy to use to fine tune
> the policy from userspace: the scale factor determines how much memory we
> try to keep free on DRAM for migration from pmem, for example. We also
> wouldn't have to deal with node hotplug or updating of demotion/promotion
> node chains.
>
> Maybe the strongest advantage of the node abstraction is the ability to
> use autonuma and the migrate_pages()/move_pages() API for moving pages
> explicitly? Mempolicies could be used for migration to "top-tier" memory,
> i.e. ZONE_NORMAL or ZONE_MOVABLE, instead.

I think using pmem as a node is more natural than a zone and less intrusive
since we can just reuse all the NUMA APIs. If we treat pmem as a new zone I
think the implementation would be more intrusive and complicated (i.e. need a
new gfp flag) and the user can't control the memory placement. Actually there
has been such a proposal before, please see
https://www.spinics.net/lists/linux-mm/msg151788.html
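As a rough illustration of how the node abstraction lets user space reuse the
existing NUMA APIs directly, the sketch below moves a single page onto a PMEM
node with move_pages(2), with no new interface needed. It assumes node 2 is a
PMEM node (purely an assumption for the example) and is built with -lnuma:

/* User-space sketch: explicitly "demote" one page to an assumed PMEM node. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *buf;
	void *pages[1];
	int nodes[1] = { 2 };	/* assumption: node 2 is a PMEM node */
	int status[1];
	long rc;

	if (posix_memalign(&buf, page_size, page_size))
		return 1;
	memset(buf, 0, page_size);	/* fault the page in first */

	pages[0] = buf;
	/* pid 0 == current process; MPOL_MF_MOVE only touches our own pages */
	rc = move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
	if (rc < 0) {
		perror("move_pages");
		return 1;
	}
	/* status[0] is the node the page landed on, or a negative errno */
	printf("page is now on node %d\n", status[0]);
	return 0;
}

The same call with a DRAM target node would be the explicit "promotion"
direction; a zone-based abstraction would not get this for free.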