From: Dan Williams <dan.j.williams@intel.com>
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Rik van Riel <riel@surriel.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Dave Hansen <dave.hansen@intel.com>,
	Keith Busch <keith.busch@intel.com>,
	Fengguang Wu <fengguang.wu@intel.com>,
	"Du, Fan" <fan.du@intel.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	Linux MM <linux-mm@kvack.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
Date: Thu, 28 Mar 2019 01:21:03 -0700
Message-ID: <CAPcyv4g2FuormkwNNWy7kU4JF6_-sX3WnSVS7YggMJMMOCehMQ@mail.gmail.com>
In-Reply-To: <6f8b4c51-3f3c-16f9-ca2f-dbcd08ea23e6@linux.alibaba.com>

On Wed, Mar 27, 2019 at 7:09 PM Yang Shi <yang.shi@linux.alibaba.com> wrote:
> On 3/27/19 1:09 PM, Michal Hocko wrote:
> > On Wed 27-03-19 11:59:28, Yang Shi wrote:
> >>
> >> On 3/27/19 10:34 AM, Dan Williams wrote:
> >>> On Wed, Mar 27, 2019 at 2:01 AM Michal Hocko <mhocko@kernel.org> wrote:
> >>>> On Tue 26-03-19 19:58:56, Yang Shi wrote:
> > [...]
> >>>>> It is still NUMA; users can still see all the NUMA nodes.
> >>>> No, the Linux NUMA implementation makes all NUMA nodes available by
> >>>> default and provides an API to opt in to finer tuning. What you are
> >>>> suggesting goes against that semantic, and I am asking why. How is a
> >>>> pmem NUMA node any different from any other distant node in principle?
> >>> Agree. It's just another NUMA node and shouldn't be special cased.
> >>> Userspace policy can choose to avoid it, but typical node distance
> >>> preference should otherwise let the kernel fall back to it as
> >>> additional memory pressure relief for "near" memory.
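> >>>
> >>> For example (a sketch, assuming node 0 is the DRAM node and node 1 is
> >>> the PMEM node):
> >>>
> >>>   # hard-bind to DRAM: allocations fail rather than spill to PMEM
> >>>   numactl --membind=0 ./app
> >>>
> >>>   # or only express a preference, still allowing fallback to PMEM
> >>>   numactl --preferred=0 ./app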
> >> In the ideal case, yes, I agree. However, in the real world performance
> >> is a concern. It is well known that PMEM (not considering NVDIMM-F or
> >> HBM) has higher latency and lower bandwidth than DRAM. We observed much
> >> higher latency on PMEM than on DRAM with multiple threads.
> > One rule of thumb is: do not design user-visible interfaces around
> > contemporary technology and its upsides and downsides. That will almost
> > always backfire.
>
> Thanks. It does make sense to me.
>
> >
> > Btw. you keep arguing about performance without any numbers. Can you
> > present something specific?
>
> Yes, I do have some numbers. We ran a simple sequential read/write memory
> latency test with an in-house test program, bound to PMEM and to DRAM
> respectively. With 20 threads the results are as below:
>
>              Threads    write lat    read lat
> PMEM         20         537.15       68.06
> DRAM         20         14.19        6.47
>
> And a sysbench test with the command:
>
>   sysbench --time=600 memory --memory-block-size=8G \
>     --memory-total-size=1024T --memory-scope=global \
>     --memory-oper=read --memory-access-mode=rnd \
>     --rand-type=gaussian --rand-pareto-h=0.1 --threads=1 run
>
> The result is:
>
>              lat/ms
> PMEM         103766.09
> DRAM         31946.30
>
> >
> >> In a real production environment we don't know what kind of applications
> >> will end up on PMEM (DRAM may be full and allocations fall back to PMEM)
> >> and then see unexpected performance degradation. I understand mempolicy
> >> can be used to opt out of it, but there might be hundreds or thousands
> >> of applications running on the machine; it does not sound feasible to
> >> have every single application set a mempolicy to avoid PMEM.
> > We have the cpuset cgroup controller to help here.
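> >
> > For example (a sketch, assuming a cgroup v1 cpuset mount, cpus 0-31 on
> > the DRAM socket, and that node 1 is the PMEM node):
> >
> >   mkdir /sys/fs/cgroup/cpuset/dram_only
> >   echo 0-31 > /sys/fs/cgroup/cpuset/dram_only/cpuset.cpus
> >   echo 0 > /sys/fs/cgroup/cpuset/dram_only/cpuset.mems
> >   echo $$ > /sys/fs/cgroup/cpuset/dram_only/tasks
> >
> > Every task started from that shell is then confined to the DRAM node
> > without touching the applications themselves.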
> >
> >> So I think we still need a default allocation node mask. The default
> >> value may include all nodes or just DRAM nodes, but it should be
> >> possible for the user to override it globally, not only on a
> >> per-process basis.
> >>
> >> Due to the performance disparity, our current use cases treat PMEM as
> >> second-tier memory for demoting cold pages, or for binding applications
> >> that are not sensitive to memory access latency (this is the reason for
> >> inventing a new mempolicy), even though it is a NUMA node.
> > If the performance sucks that badly then do not use the pmem as NUMA,
> > really. There are certainly other ways to export the pmem storage. Use
> > it as a fast swap device, or work on a swap caching mechanism that still
> > allows much faster access than a slow swap device. But do not abuse the
> > NUMA interface while breaking some of its long-established semantics.
>
> Yes, we are looking into using it as fast swap storage too, and perhaps
> other use cases.
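>
> As a sketch of the swap direction (assuming the region is exposed as
> /dev/pmem0 through an fsdax namespace):
>
>   ndctl create-namespace --mode=fsdax
>   mkswap /dev/pmem0
>   swapon --priority 10 /dev/pmem0
>
> so cold pages would swap to PMEM at near-memory speed rather than to
> disk.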
>
> Anyway, it seems nobody thinks it makes sense to restrict the default
> allocation nodes, and it sounds over-engineered, so I'm going to drop it.
>
> One question: when doing demotion and promotion we need to define a path,
> for example DRAM <-> PMEM (assuming two memory tiers). When determining
> which nodes are "DRAM" nodes, does it make sense to assume that nodes
> with both cpu and memory are DRAM nodes, since PMEM nodes are typically
> cpuless?
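>
> For instance (a sketch; the node lists shown are illustrative), the sysfs
> node masks could be compared and the difference treated as the PMEM set:
>
>   cat /sys/devices/system/node/has_cpu      # e.g. 0-1
>   cat /sys/devices/system/node/has_memory   # e.g. 0-3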

For ACPI platforms the HMAT is effectively going to enforce "cpu-less"
nodes for any memory range that has differentiated performance from the
conventional memory pool, or differentiated performance for a specific
initiator. So "cpu-less == PMEM" is not a robust assumption.

The plan is to use the HMAT to populate the default fallback order,
but allow for an override if the HMAT information is missing or
incorrect.
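
For illustration (a sketch; these attributes depend on the pending HMAT
sysfs export, and the node number is an assumption), the per-node
performance data would look like:

  cat /sys/devices/system/node/node1/access0/initiators/read_latency
  cat /sys/devices/system/node/node1/access0/initiators/read_bandwidth

and an override would key off values like these when the firmware tables
are missing or wrong.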


