linux-kernel.vger.kernel.org archive mirror
From: Jerome Glisse <j.glisse@gmail.com>
To: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mhocko@suse.com, js1304@gmail.com, vbabka@suse.cz,
	mgorman@suse.de, minchan@kernel.org, akpm@linux-foundation.org,
	bsingharora@gmail.com
Subject: Re: [RFC 0/8] Define coherent device memory node
Date: Tue, 25 Oct 2016 14:52:47 -0400
Message-ID: <20161025185247.GA7188@gmail.com>
In-Reply-To: <87shrkjpyb.fsf@linux.vnet.ibm.com>

On Tue, Oct 25, 2016 at 11:01:08PM +0530, Aneesh Kumar K.V wrote:
> Jerome Glisse <j.glisse@gmail.com> writes:
> 
> > On Tue, Oct 25, 2016 at 10:29:38AM +0530, Aneesh Kumar K.V wrote:
> >> Jerome Glisse <j.glisse@gmail.com> writes:
> >> > On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
> >
> > [...]
> >
> >> > You can take a look at hmm-v13 if you want to see how i do non LRU page
> >> > migration. While i put most of the migration code inside hmm_migrate.c it
> >> > could easily be move to migrate.c without hmm_ prefix.
> >> >
> >> > There is 2 missing piece with existing migrate code. First is to put memory
> >> > allocation for destination under control of who call the migrate code. Second
> >> > is to allow offloading the copy operation to device (ie not use the CPU to
> >> > copy data).
> >> >
> >> > I believe same requirement also make sense for platform you are targeting.
> >> > Thus same code can be use.
> >> >
> >> > hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
> >> >
> >> > I haven't posted this patchset yet because we are doing some modifications
> >> > to the device driver API to accomodate some new features. But the ZONE_DEVICE
> >> > changes and the overall migration code will stay the same more or less (i have
> >> > patches that move it to migrate.c and share more code with existing migrate
> >> > code).
> >> >
> >> > If you think i missed anything about lru and page cache please point it to
> >> > me. Because when i audited code for that i didn't see any road block with
> >> > the few fs i was looking at (ext4, xfs and core page cache code).
> >> >
> >> 
> >> The other restriction around ZONE_DEVICE is, it is not a managed zone.
> >> That prevents any direct allocation from coherent device by application.
> >> ie, we would like to force allocation from coherent device using
> >> interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?
> >
> > To achieve this we rely on device fault code path ie when device take a page fault
> > with help of HMM it will use existing memory if any for fault address but if CPU
> > page table is empty (and it is not file back vma because of readback) then device
> > can directly allocate device memory and HMM will update CPU page table to point to
> > newly allocated device memory.
> >
> 
> That is ok if the device touch the page first. What if we want the
> allocation touched first by cpu to come from GPU ?. Should we always
> depend on GPU driver to migrate such pages later from system RAM to GPU
> memory ?
> 

I am not sure what kind of workload would rather have every first CPU access to
a range use device memory. So no, my code does not handle that, and it would be
pointless for it, since in my case the CPU cannot access device memory.
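
As a rough illustration of the device fault path described in the quoted text
above, the placement decision could be modeled as below. This is only a sketch:
every name in it (fault_ctx, device_fault_placement, the enum values) is
hypothetical and not taken from the hmm-v13 tree, and the fallback for a
file-backed vma is an assumption, since the text only says that device memory
is not allocated directly in that case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of a device page fault (illustration only). */
enum placement {
	USE_EXISTING_PAGE,      /* CPU page table already maps the address */
	ALLOC_DEVICE_MEMORY,    /* empty CPU page table, not file backed */
	FALLBACK_REGULAR_FAULT  /* assumption: file backed vma, regular path */
};

/* Hypothetical summary of what the fault handler knows about the address. */
struct fault_ctx {
	bool cpu_pte_present;   /* CPU page table has an entry for the address */
	bool file_backed_vma;   /* the vma is backed by a file */
};

/* Decide where the page backing a device fault should come from. */
enum placement device_fault_placement(const struct fault_ctx *ctx)
{
	assert(ctx);

	/* Existing memory always wins: the device uses what is already there. */
	if (ctx->cpu_pte_present)
		return USE_EXISTING_PAGE;

	/* File backed vma is excluded from direct device allocation. */
	if (ctx->file_backed_vma)
		return FALLBACK_REGULAR_FAULT;

	/* Empty CPU page table: allocate device memory directly; the CPU
	 * page table is then updated to point to it. */
	return ALLOC_DEVICE_MEMORY;
}
```

Note that nothing in this model covers a first touch by the CPU, which is the
case being asked about.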

That said, nothing forbids adding support for ZONE_DEVICE to an mbind()-like syscall.
Though my personal preference would still be to avoid such a generic syscall and
have the device driver set allocation policy through its own userspace API (the device
driver could reuse the internals of mbind() to achieve the end result).
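
For reference, the generic-syscall route under discussion would look roughly
like the sketch below from the application side. It is Linux specific and calls
the existing mbind() syscall directly. The helper names (nodemask_set_single,
bind_range_to_node) are made up for the example, and whether device memory
would be reachable as a node number at all is exactly the open question in this
thread, so the node argument is purely hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/mempolicy.h>	/* MPOL_BIND */

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Clear a nodemask of nwords unsigned longs and set only the bit for node.
 * Returns 0 on success, -1 (mask untouched) if node does not fit.
 */
int nodemask_set_single(unsigned long *mask, size_t nwords, unsigned int node)
{
	assert(mask != NULL);

	if (node / BITS_PER_ULONG >= nwords)
		return -1;

	memset(mask, 0, nwords * sizeof(*mask));
	mask[node / BITS_PER_ULONG] = 1UL << (node % BITS_PER_ULONG);
	return 0;
}

/*
 * Ask the kernel to allocate pages of [addr, addr + len) only from node.
 * addr must be page aligned. Returns 0 on success, -1 with errno set.
 */
long bind_range_to_node(void *addr, size_t len, unsigned int node)
{
	unsigned long mask[4];
	size_t nwords = sizeof(mask) / sizeof(mask[0]);

	if (nodemask_set_single(mask, nwords, node) != 0) {
		errno = EINVAL;
		return -1;
	}

	/* One extra bit is passed for maxnode, mirroring what libnuma does,
	 * so that the last bit of the mask is not ignored. */
	return syscall(SYS_mbind, addr, len, MPOL_BIND, mask,
		       nwords * BITS_PER_ULONG + 1, 0UL);
}
```

An application would call something like bind_range_to_node(buf, len, node)
right after mmap() and before first touch. A driver-specific userspace API
could carry the same information plus whatever else the device needs.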

I am not saying that everything you want to do is doable now with HMM, but nothing
precludes achieving what you want using ZONE_DEVICE. I really don't think any of the
existing mm mechanisms (kswapd, lru, numa, ...) are a nice fit or can be reused with
device memory.

Each device is so different from the others that I don't believe in a one-API-fits-all.
The drm GPU subsystem of the kernel is a testimony to how little can be shared when it
comes to GPUs. The only common code is modesetting. Everything that deals with how to
use the GPU to compute stuff is per device, and most of the logic is in userspace. So I do
not see any commonality that could be abstracted at the syscall level. I would rather let
the device driver stack (kernel and userspace) take such decisions and have the higher level
APIs (OpenCL, Cuda, C++17, ...) expose something that makes sense for each of them.
Programmers target those high level APIs and they intend to use the mechanisms each offers
to manage memory and memory placement. I would say forcing them to use a second, Linux
specific API to achieve the latter is wrong, at least for now.

So in the end, if the mbind() syscall is done by the userspace side of the device driver,
then why not just have the device driver communicate this through its own kernel
API (which can be much more expressive than what a standardized syscall offers)? I would
rather avoid making changes to any syscall for now.

If later, down the road, once the userspace ecosystem stabilizes, we see that there
is a good level at which we can abstract memory policy for enough devices, then and
only then would it make sense to either introduce a new syscall or grow/modify an existing
one. Right now I fear we could only make bad decisions that we would regret down the
road.

I think we can achieve memory device support with the minimum amount of changes to mm
code and existing mm mechanisms. Using ZONE_DEVICE already makes sure that such memory
is kept out of most mm mechanisms and hence avoids all the changes you had to make for
the CDM node. It just looks like a better fit from my point of view. I think it is worth
considering for your use case too. I am sure folks writing the device driver would
rather share more code between platforms with a grown-up bus system (CAPI, CCIX, ...)
and platforms with a kid bus system (PCIE, let's forget about PCI and ISA :)).

Cheers,
Jérôme
