Date: Wed, 1 Jul 2020 12:45:17 -0700 (PDT)
From: David Rientjes
To: Yang Shi
cc: Dave Hansen, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    kbusch@kernel.org, ying.huang@intel.com, dan.j.williams@intel.com
Subject: Re: [RFC][PATCH 3/8] mm/vmscan: Attempt to migrate page in lieu of discard
In-Reply-To: <33028a57-24fd-e618-7d89-5f35a35a6314@linux.alibaba.com>
References: <20200629234503.749E5340@viggo.jf.intel.com>
    <20200629234509.8F89C4EF@viggo.jf.intel.com>
    <039a5704-4468-f662-d660-668071842ca3@linux.alibaba.com>
    <33028a57-24fd-e618-7d89-5f35a35a6314@linux.alibaba.com>

On Wed, 1 Jul 2020, Yang Shi wrote:

> > We can do this if we consider pmem not to be a separate memory tier from
> > the system perspective, however, but rather the socket perspective. In
> > other words, a node can only demote to a series of exclusive pmem ranges
> > and promote to the same series of ranges in reverse order. So DRAM node 0
> > can only demote to PMEM node 2 while DRAM node 1 can only demote to PMEM
> > node 3 -- a pmem range cannot be demoted to, or promoted from, more than
> > one DRAM node.
> >
> > This naturally takes care of mbind() and cpuset.mems if we consider pmem
> > just to be slower volatile memory and we don't need to deal with the
> > latency concerns of cross socket migration.
> > A user page will never be demoted to a pmem range across the socket and
> > will never be promoted to a different DRAM node that it doesn't have
> > access to.
>
> But I don't see too much benefit to limit the migration target to the
> so-called *paired* pmem node. IMHO it is fine to migrate to a remote (on a
> different socket) pmem node since even the cross socket access should be much
> faster than refault or swap from disk.
>

Hi Yang,

Right, but any eventual promotion path would allow this to subvert the
user mempolicy or cpuset.mems if the demoted memory is eventually promoted
to a DRAM node on its socket. We've discussed not having the ability to
map from the demoted page to either of these contexts, and it becomes more
difficult for shared memory. We have page_to_nid() and page_zone(), so we
can always find the appropriate demotion or promotion node for a given
page if there is a 1:1 relationship.

Do we lose anything with the strict 1:1 relationship between DRAM and PMEM
nodes? It seems much simpler in terms of implementation and is more
intuitive.

> I think using pmem as a node is more natural than zone and less intrusive
> since we can just reuse all the numa APIs. If we treat pmem as a new zone I
> think the implementation may be more intrusive and complicated (i.e. need a
> new gfp flag) and user can't control the memory placement.
>

This is an important decision to make. I'm not sure that we actually
*want* all of these NUMA APIs :) If my memory is demoted, I can simply do
migrate_pages() back to DRAM and cause other memory to be demoted in its
place. Things like MPOL_INTERLEAVE over nodes {0,1,2} don't make sense.
Kswapd for a DRAM node putting pressure on a PMEM node for demotion, which
then puts the kswapd for the PMEM node under pressure to reclaim it,
serves *only* to spend unnecessary cpu cycles.

Users could control the memory placement through a new mempolicy flag,
which I think is needed anyway for explicit allocation policies for PMEM
nodes.
Consider if PMEM is a zone so that it has the natural 1:1 relationship
with DRAM: now your system only has nodes {0,1} as today, there is no new
NUMA topology to consider, and there is a mempolicy flag MPOL_F_TOPTIER
that specifies memory must be allocated from ZONE_MOVABLE or ZONE_NORMAL
(and I can then mlock() if I want to disable demotion on memory pressure).
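
As an illustration of the strict 1:1 pairing discussed above, here is a
minimal userspace sketch in C. It is not kernel code and not taken from the
patch series: the table layout and the names NO_NODE, NR_NODES,
demotion_target(), promotion_target() and pairing_is_one_to_one() are all
hypothetical. It only assumes the example topology from the thread (DRAM
nodes 0 and 1, PMEM nodes 2 and 3) and shows that, with exclusive pairs,
promoting a demoted page always lands back on the DRAM node it came from.

```c
#include <stdbool.h>

#define NO_NODE  (-1)   /* hypothetical "no target" marker */
#define NR_NODES 4      /* assumed topology: DRAM {0,1}, PMEM {2,3} */

/* Hypothetical per-node tables: DRAM 0 <-> PMEM 2, DRAM 1 <-> PMEM 3. */
static const int demote_to[NR_NODES]  = { 2, 3, NO_NODE, NO_NODE };
static const int promote_to[NR_NODES] = { NO_NODE, NO_NODE, 0, 1 };

static bool valid_node(int nid)
{
	return nid >= 0 && nid < NR_NODES;
}

/* Node that memory on @nid may be demoted to, or NO_NODE if none. */
int demotion_target(int nid)
{
	return valid_node(nid) ? demote_to[nid] : NO_NODE;
}

/* Node that memory on @nid may be promoted to, or NO_NODE if none. */
int promotion_target(int nid)
{
	return valid_node(nid) ? promote_to[nid] : NO_NODE;
}

/*
 * Check that the pairing is strictly 1:1: every demotion edge has a
 * promotion edge leading straight back, and vice versa. If two DRAM
 * nodes shared one PMEM node, one of the round trips would fail.
 */
bool pairing_is_one_to_one(void)
{
	for (int nid = 0; nid < NR_NODES; nid++) {
		int d = demotion_target(nid);
		int p = promotion_target(nid);

		if (d != NO_NODE && promotion_target(d) != nid)
			return false;
		if (p != NO_NODE && demotion_target(p) != nid)
			return false;
	}
	return true;
}
```

With these tables, promotion_target(demotion_target(1)) is 1 again, so a
page bound to DRAM node 1 by mbind() or cpuset.mems never ends up on DRAM
node 0 after a demote/promote round trip. In the kernel the node id would
come from page_to_nid(); here it is just an integer argument.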