From: Jerome Glisse <jglisse@redhat.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: linux-mm@kvack.org, "Andrew Morton" <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Keith Busch" <keith.busch@intel.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Dave Hansen" <dave.hansen@intel.com>,
	"Haggai Eran" <haggaie@mellanox.com>,
	"Balbir Singh" <bsingharora@gmail.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Felix Kuehling" <felix.kuehling@amd.com>,
	"Philip Yang" <Philip.Yang@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Paul Blinzer" <Paul.Blinzer@amd.com>,
	"Logan Gunthorpe" <logang@deltatee.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Jonathan Cameron" <jonathan.cameron@huawei.com>,
	"Mark Hairgrove" <mhairgrove@nvidia.com>,
	"Vivek Kini" <vkini@nvidia.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Dave Airlie" <airlied@redhat.com>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Rik van Riel" <riel@surriel.com>,
	"Ben Woodard" <woodard@redhat.com>,
	linux-acpi@vger.kernel.org
Subject: Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
Date: Tue, 4 Dec 2018 09:44:28 -0500
Message-ID: <20181204144422.GA3917@redhat.com>
In-Reply-To: <d6899765-a3b1-464b-d310-862161e07d98@linux.ibm.com>

On Tue, Dec 04, 2018 at 01:14:14PM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 5:04 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>

[...]

> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> >      - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >        each has a UID and you can find the usual values in that folder
> >        (node id, size, ...)
> > 
> >      - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >        (CPU or device), each has an HMS UID but also a CPU id for CPUs
> >        (which matches the CPU id in /sys/bus/cpu/); for devices you
> >        have a path that can be the PCIe bus ID, for instance
> > 
> >      - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
> >        UID and a file per property (bandwidth, latency, ...); you also
> >        find a symlink to every target and initiator connected to that
> >        link.
> > 
> >      - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >        a UID and a file per property (bandwidth, latency, ...); you
> >        also find a symlink to all initiators that can use that bridge.
> 
> Is that version tagging really needed? What changes do you envision with
> versions?

I kind of dislike it myself, but this is really to keep userspace from
inadvertently using some kind of memory/initiator/link/bridge that it
should not be using if it does not understand the implications.

If the version were a file inside the directory, there is a good chance
that userspace would overlook it. So an old program on a new platform
with a new kind of unusual memory, like non-coherent memory, might start
using it and get all kinds of weird results. If the version is in the
directory name, it forces userspace to only look at the
memory/initiator/link/bridge entries it understands and can use safely.

So I am doing this in the hope that it will protect applications when new
types of things pop up. We have too many examples where we cannot evolve
something because existing applications have baked-in assumptions about
it.
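
As a rough illustration (this is not code from the patchset; the glob
pattern just follows the v%version-%id-target naming quoted above), a
userspace program that only understands version 1 of the layout could
restrict itself like this:

#include <glob.h>
#include <stdio.h>

int main(void)
{
	glob_t g;
	size_t i;

	/* Only consider version 1 targets; newer versions are skipped. */
	if (glob("/sys/bus/hms/devices/v1-*-target", 0, NULL, &g))
		return 0;

	for (i = 0; i < g.gl_pathc; i++)
		printf("usable target: %s\n", g.gl_pathv[i]);

	globfree(&g);
	return 0;
}

An old binary built around that pattern simply never sees a v2 entry,
which is exactly the protection I am after here.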


[...]

> > 3) Tracking and applying heterogeneous memory policies
> > ------------------------------------------------------
> > 
> > The current memory policy infrastructure is node oriented; instead of
> > changing that and risking breakage and regressions, this patchset adds
> > a new heterogeneous policy tracking infrastructure. The expectation is
> > that existing applications can keep using mbind() and all existing
> > infrastructure undisturbed and unaffected, while new applications
> > will use the new API and should avoid mixing and matching both (as
> > they can achieve the same thing with the new API).
> > 
> > Also the policy is not directly tied to the vma structure, for a few
> > reasons:
> >      - avoid having to split vmas for policies that do not cover a full vma
> >      - avoid changing too much vma code
> >      - avoid growing the vma structure with an extra pointer
> > So instead this patchset uses the mmu_notifier API to track vma liveness
> > (munmap(), mremap(), ...).
> > 
> > This patchset is not tied to process memory allocation either (as said
> > at the beginning, this is not an end-to-end patchset but a starting
> > point). It does however demonstrate how migration to device memory can
> > work under this scheme (using nouveau as a demonstration vehicle).
> > 
> > The overall design is simple: on an hbind() call, an HMS policy
> > structure is created for the supplied range and HMS uses the callback
> > associated with the target memory. This callback is provided by the
> > device driver for device memory, or by core HMS for regular main
> > memory. The callback can decide to migrate the range to the target
> > memories or do nothing (this can be influenced by flags provided to
> > hbind() too).
> > 
> > 
> > Later patches can tie page faults in with HMS policy to direct memory
> > allocation to the right target. For now I would rather postpone that
> > discussion until a consensus is reached on how to move forward on all
> > the topics presented in this email. Start small, grow big ;)
> > 
> > 
> 
> I liked the simplicity of keeping it outside all the existing memory
> management policy code. But that is also the drawback, isn't it?
> We now have multiple entities tracking CPUs and memory. (This reminded
> me of how we started with memcg in the early days.)

This is a hard choice. The rationale is that an application uses either
this new API or the old one, so the expectation is that both should not
co-exist in a process. Eventually both can be consolidated into one
inside the kernel while maintaining the different userspace APIs, but I
feel that it is better to get to that point slowly while we experiment
with the new API. We need to gain some experience with the new API on
real workloads to convince ourselves that it is something we can live
with. If we reach that point, then we can work on consolidating the
kernel code into one. In the meantime this experiment does not disrupt
or regress the existing API. I took the cautious road.
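
To make the either/or expectation concrete, here is a purely hypothetical
sketch of a call into the new API from userspace, assuming an ioctl-based
interface as the hbind() ioctl patches suggest. The device node path, the
structure layout, and the ioctl number below are all invented for
illustration and do not match the actual patches:

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request layout, NOT the one from the patchset. */
struct fake_hbind_args {
	uint64_t start;      /* start of the virtual address range */
	uint64_t end;        /* end of the virtual address range */
	uint32_t target_uid; /* HMS UID of the desired target memory */
	uint32_t flags;      /* e.g. "migrate now" vs "policy only" */
};

#define FAKE_HBIND _IOW('H', 0x00, struct fake_hbind_args)

static int bind_range(void *addr, size_t len, uint32_t target_uid)
{
	struct fake_hbind_args args = {
		.start = (uint64_t)(uintptr_t)addr,
		.end = (uint64_t)(uintptr_t)addr + len,
		.target_uid = target_uid,
		.flags = 0,
	};
	int fd = open("/dev/hms", O_RDWR); /* device path is an assumption */
	int ret;

	if (fd < 0)
		return -1;
	ret = ioctl(fd, FAKE_HBIND, &args);
	close(fd);
	return ret;
}

The point is just that a given process picks one side: it either keeps
calling mbind() with node masks, or does something like the above against
HMS targets, but it should not mix both on the same ranges.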


> Once we have these different types of targets, ideally the system should
> be able to place them in the ideal location based on the affinity of the
> access, i.e. we should automatically place the memory such that the
> initiator can access the target optimally. That is what we try to do on
> the current system with autonuma. (You did mention that you are not
> looking at how this patch series will evolve to automatic handling of
> placement right now.) But I guess we want to see if the framework indeed
> helps in achieving that goal. Will having HMS outside the core memory
> handling routines be a road blocker there?

Evolving autonuma is going to be a thing of its own. The issue is that
autonuma revolves around CPU ids and uses a handful of bits to try to
catch CPU access patterns. With devices in the mix it is much harder:
first, using autonuma's page fault trick might not be the best idea;
second, we can get a lot of information from the IOMMU, the bridge
chipset, or the device itself about what is accessed by whom.

So my belief on that front is that it is going to be something different,
like tracking ranges of virtual addresses and maintaining a data
structure per range (not per page).
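
As a sketch only (the structure and field names here are assumptions for
illustration, not what the patches implement), the kind of per-range
record I have in mind would look roughly like:

#include <linux/rbtree.h>

/*
 * Illustrative only: one policy record per virtual address range,
 * kept in a per-mm tree, instead of keeping per-page state.
 */
struct hms_range_policy {
	unsigned long start;    /* first virtual address covered */
	unsigned long end;      /* one past the last covered address */
	unsigned long targets;  /* set of preferred HMS target UIDs */
	struct rb_node node;    /* lives in a per-mm rb/interval tree */
};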

All this is done in core mm code; I am just keeping it out of the vma
struct and other structs to avoid growing them, and to avoid wasting
space when this is not in use. So it is very much inside the core
handling routines, it is just optional.

In any case I believe that explicit placement (where the application
hbind()s things) will be the first main use case. Once we have that
figured out (or at least once we believe we have it figured out :)) then
we can look into automatic heterogeneous placement.

Cheers,
Jérôme
