From: Jerome Glisse <jglisse@redhat.com>
To: Dave Hansen <dave.hansen@intel.com>
Cc: linux-mm@kvack.org, "Andrew Morton" <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"Ross Zwisler" <ross.zwisler@linux.intel.com>,
	"Keith Busch" <keith.busch@intel.com>,
	"Dan Williams" <dan.j.williams@intel.com>,
	"Haggai Eran" <haggaie@mellanox.com>,
	"Balbir Singh" <bsingharora@gmail.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>,
	"Benjamin Herrenschmidt" <benh@kernel.crashing.org>,
	"Felix Kuehling" <felix.kuehling@amd.com>,
	"Philip Yang" <Philip.Yang@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Paul Blinzer" <Paul.Blinzer@amd.com>,
	"Logan Gunthorpe" <logang@deltatee.com>,
	"John Hubbard" <jhubbard@nvidia.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Jonathan Cameron" <jonathan.cameron@huawei.com>,
	"Mark Hairgrove" <mhairgrove@nvidia.com>,
	"Vivek Kini" <vkini@nvidia.com>,
	"Mel Gorman" <mgorman@techsingularity.net>,
	"Dave Airlie" <airlied@redhat.com>,
	"Ben Skeggs" <bskeggs@redhat.com>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Rik van Riel" <riel@surriel.com>,
	"Ben Woodard" <woodard@redhat.com>,
	linux-acpi@vger.kernel.org
Subject: Re: [RFC PATCH 00/14] Heterogeneous Memory System (HMS) and hbind()
Date: Thu, 6 Dec 2018 14:20:51 -0500
Message-ID: <20181206192050.GC3544@redhat.com>
In-Reply-To: <b8fab9a7-62ed-5d8d-3cb1-aea6aacf77fe@intel.com>

On Thu, Dec 06, 2018 at 10:25:08AM -0800, Dave Hansen wrote:
> On 12/5/18 9:53 AM, Jerome Glisse wrote:
> > No so there is 2 kinds of applications:
> >     1) average one: i am using device {1, 3, 9} give me best memory for
> >        those devices
> ...
> > 
> > For case 1 you can pre-parse stuff but this can be done by helper library
> 
> How would that work?  Would each user/container/whatever do this once?
> Where would they keep the pre-parsed stuff?  How do they manage their
> cache if the topology changes?

Short answer: I don't expect a cache. I expect that each program will
have an init function that queries the topology and updates the
application code accordingly. This is what people do today: query all
available devices, decide which ones to use and how, create a context
for each selected one, and define a memory migration job/memory policy
for each part of the program so that memory is migrated/has the proper
policy in place when the code that runs on some device is executed.
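
To make that concrete, here is a minimal sketch of such an init
function. The topology_t type and the api_*() helpers are made up for
illustration (they stand in for whatever the runtime library would
provide); only accelerator_t reuses the type from the examples further
below:

#define MAX_ACCELERATORS 64                     /* arbitrary for the sketch */

static accelerator_t *selected[MAX_ACCELERATORS];
static unsigned nselected;

void program_init_accelerators(void)
{
    /* Parse the topology (sysfs) once at startup. */
    topology_t *topology = api_query_topology();
    unsigned i, n = api_topology_device_count(topology);

    for (i = 0; i < n && nselected < MAX_ACCELERATORS; ++i) {
        accelerator_t *dev = api_topology_device(topology, i);

        /* Keep only devices this program knows how to use. */
        if (!program_wants_device(dev))
            continue;

        /* Create a context now so dispatching work later is cheap. */
        api_create_context(dev);
        selected[nselected++] = dev;
    }

    /* Memory migration/policy is then set per program phase, as in
     * compute_serious0_stuff() below. */
}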


Long answer:

I cannot dictate how user folks write their programs, sadly :) I expect
that many applications will do it once during startup. Then you will
have all those container folks or VM folks who will get pressure to
react to hot-plug. For instance, if you upgrade your instance with your
cloud provider to get more GPUs or more TPUs, it is likely to appear as
a hotplug from the VM/container point of view and thus as a hotplug
from the application point of view. So far the demonstrations I have
seen handle that by relaunching the application ... More on that through
the live re-patching issues below.

Oh, and I expect an application will crash if you hot-unplug anything
it is using (I believe this is what happens now with most APIs). Again,
I expect that some pressure from cloud users and providers will force
programmers to be a bit more reactive to this kind of event.
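
If an application did want to react instead of crashing, a sketch of
what that could look like, again with a made-up
api_register_topology_notifier() (no such notifier API exists today):

static void on_topology_change(topology_t *new_topology, void *data)
{
    /* Quiesce work on devices that may be going away ... */
    program_quiesce_accelerators();
    /* ... then redo the same device selection as at startup. */
    program_init_accelerators();
}

/* Somewhere during startup: */
api_register_topology_notifier(on_topology_change, NULL);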


Live re-patching application code can be difficult, I am told. Let's
say you have:

void compute_serious0_stuff(accelerator_t *accelerator, void *inputA,
                            size_t sinputA, void *inputB, size_t sinputB,
                            void *outputA, size_t soutputA)
{
    ...

    // Migrate inputA to the accelerator memory
    api_migrate_memory_to_accelerator(accelerator, inputA, sinputA);

    // The inputB buffer is fine in its default placement

    // The outputA buffer is assumed to be an empty vma, ie no pages
    // allocated yet, so set a policy directing all allocations done on
    // page fault to the accelerator memory
    api_set_memory_policy_to_accelerator(accelerator, outputA, soutputA);

    ...
    for_parallel<accelerator> (i = 0; i < THEYAREAMILLIONSITEMS; ++i) {
        // Do something serious
    }
    ...
}

void serious0_orchestrator(topology topology, void *inputA,
                           void *inputB, void *outputA)
{
    static accelerator_t **selected = NULL;
    static serious0_job_partition *partition;
    ...
    if (selected == NULL) {
        serious0_select_and_partition(topology, &selected, &partition,
                                      inputA, inputB, outputA);
    }
    ...
    for (i = 0; i < nselected; ++i) {
        ...
        compute_serious0_stuff(selected[i],
                               inputA + partition[i].inputA_offset,
                               partition[i].inputA_size,
                               inputB + partition[i].inputB_offset,
                               partition[i].inputB_size,
                               outputA + partition[i].outputA_offset,
                               partition[i].outputA_size);
        ...
    }
    ...
    for (i = 0; i < nselected; ++i) {
        accelerator_wait_finish(selected[i]);
    }
    ...
    // outputA is ready to be used by the next function in the program
}

If you start without a GPU/TPU, your for_parallel loop runs on the CPU
with the code the compiler emitted at build time. For GPU/TPU, at build
time you compile your for_parallel loop to some intermediate
representation (a virtual ISA); then at runtime, during application
initialization, that intermediate representation gets lowered down to
all the GPUs/TPUs available on your system and each for_parallel loop
is patched to turn into a call to:

void dispatch_accelerator_function(accelerator_t *accelerator,
                                   void *function, ...)
{
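    /* Presumably the runtime looks up the virtual-ISA version of
     * 'function' lowered for this accelerator, binds the variadic
     * arguments and launches it on the device (or on CPU threads if
     * the accelerator is plain CPU); left empty here as this is
     * runtime specific. */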
}

So in the above example the for_parallel loop becomes:
dispatch_accelerator_function(accelerator, i_compute_serious_stuff,
                              inputA, inputB, outputA);

This hot patching of code is easy to do when no CPU thread is running
the code. However, when CPU threads are running, it can be problematic.
I am sure you can do trickery like delaying the patching until the next
time the function gets called, by doing clever things at build time
such as prepending each for_parallel section with enough nops to let
you replace them with a call to the dispatch function and a jump over
the normal CPU code.
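
An alternative that sidesteps live instruction patching entirely (just
a sketch of the idea, not what any existing runtime does) is for the
compiler to emit each for_parallel site behind a per-site function
pointer, and have the runtime atomically swap that pointer instead of
rewriting code. The price is an indirect call on every invocation.
Here selected_accelerator is a made-up global naming the device chosen
at init time:

#include <stdatomic.h>

extern void *i_compute_serious_stuff;       /* virtual-ISA blob from above */
extern accelerator_t *selected_accelerator; /* made up: device chosen at init */

typedef void (*loop_fn_t)(void *inputA, void *inputB, void *outputA);

/* Compiler-emitted CPU fallback for this for_parallel site. */
static void loop0_cpu(void *inputA, void *inputB, void *outputA)
{
    /* ... the normal CPU loop emitted at build time ... */
}

/* Accelerator path installed by the runtime once a device is selected. */
static void loop0_accel(void *inputA, void *inputB, void *outputA)
{
    dispatch_accelerator_function(selected_accelerator,
                                  i_compute_serious_stuff,
                                  inputA, inputB, outputA);
}

/* Starts on the CPU path; swapped atomically after device init. */
static _Atomic(loop_fn_t) loop0 = loop0_cpu;

/* ... and at the for_parallel site, inside compute_serious0_stuff(),
 * the compiler emits: */
    atomic_load(&loop0)(inputA, inputB, outputA);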


I think compiler people want to solve the static case first, ie during
application initialization decide which devices are going to be used
and then update the application accordingly. But I expect it will grow
to support hotplug, as relaunching the application is not that user
friendly, even in this day and age where people start millions of
containers with one mouse click.


Anyway, the above example is how it looks today, and the accelerator
can turn out to be just regular CPU cores if you do not have any
devices. The idea is that we would like a common API that covers both
CPU threads and device threads. Same for the migration/policy
functions: if it happens that the accelerator is just a plain old CPU,
then you want to migrate memory to that CPU's NUMA node and set the
memory policy to that node too.
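
As a sketch of what that CPU fallback could look like (the is_cpu/node
fields on accelerator_t and the device_migrate() helper are made up;
mbind() is the real NUMA syscall):

#include <stddef.h>
#include <numaif.h>     /* mbind(), MPOL_BIND, MPOL_MF_MOVE */

#define NODE_BITS (8 * sizeof(unsigned long))

long api_migrate_memory_to_accelerator(accelerator_t *accelerator,
                                       void *addr, size_t size)
{
    if (accelerator->is_cpu) {
        /* Plain old CPU: bind the range to its NUMA node and move the
         * pages that are already allocated. */
        unsigned long nodemask[16] = { 0 };     /* room for 1024 nodes */

        nodemask[accelerator->node / NODE_BITS] |=
                1UL << (accelerator->node % NODE_BITS);
        return mbind(addr, size, MPOL_BIND, nodemask,
                     sizeof(nodemask) * 8, MPOL_MF_MOVE);
    }

    /* Otherwise take the device specific path (hbind(), driver ioctl,
     * ...), elided here. */
    return device_migrate(accelerator, addr, size);
}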

Cheers,
Jérôme

