linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
To: Jerome Glisse <j.glisse@gmail.com>,
	Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mhocko@suse.com, js1304@gmail.com, vbabka@suse.cz,
	mgorman@suse.de, minchan@kernel.org, akpm@linux-foundation.org,
	bsingharora@gmail.com
Subject: Re: [RFC 0/8] Define coherent device memory node
Date: Tue, 25 Oct 2016 10:29:38 +0530	[thread overview]
Message-ID: <877f8xaurp.fsf@linux.vnet.ibm.com> (raw)
In-Reply-To: <20161024170902.GA5521@gmail.com>

Jerome Glisse <j.glisse@gmail.com> writes:

> On Mon, Oct 24, 2016 at 10:01:49AM +0530, Anshuman Khandual wrote:
>
>> [...]
>
>> 	Core kernel memory features like reclamation, evictions etc. might
>> need to be restricted or modified on the coherent device memory node as
>> they can be performance limiting. The RFC does not propose anything on this
>> yet but it can be looked into later on. For now it just disables Auto NUMA
>> for any VMA which has coherent device memory.
>> 
>> 	Seamless integration of coherent device memory with system memory
>> will enable various other features, some of which can be listed as follows.
>> 
>> 	a. Seamless migrations between system RAM and the coherent memory
>> 	b. Will have asynchronous and high throughput migrations
>> 	c. Be able to allocate huge order pages from these memory regions
>> 	d. Restrict allocations to a large extent to the tasks using the
>> 	   device for workload acceleration
>> 
>> 	Before concluding, will look into the reasons why the existing
>> solutions don't work. There are two basic requirements which have to be
>> satisfies before the coherent device memory can be integrated with core
>> kernel seamlessly.
>> 
>> 	a. PFN must have struct page
>> 	b. Struct page must able to be inside standard LRU lists
>> 
>> 	The above two basic requirements discard the existing method of
>> device memory representation approaches like these which then requires the
>> need of creating a new framework.
>
> I do not believe the LRU list is a hard requirement, yes when faulting in
> a page inside the page cache it assumes it needs to be added to lru list.
> But i think this can easily be work around.
>
> In HMM i am using ZONE_DEVICE and because memory is not accessible from CPU
> (not everyone is bless with decent system bus like CAPI, CCIX, Gen-Z, ...)
> so in my case a file back page must always be spawn first from a regular
> page and once read from disk then i can migrate to GPU page.
>
> So if you accept this intermediary step you can easily use ZONE_DEVICE for
> device memory. This way no lru, no complex dance to make the memory out of
> reach from regular memory allocator.
>
> I think we would have much to gain if we pool our effort on a single common
> solution for device memory. In my case the device memory is not accessible
> by the CPU (because PCIE restrictions), in your case it is. Thus the only
> difference is that in my case it can not be map inside the CPU page table
> while in yours it can.
>
>> 
>> (1) Traditional ioremap
>> 
>> 	a. Memory is mapped into kernel (linear and virtual) and user space
>> 	b. These PFNs do not have struct pages associated with it
>> 	c. These special PFNs are marked with special flags inside the PTE
>> 	d. Cannot participate in core VM functions much because of this
>> 	e. Cannot do easy user space migrations
>> 
>> (2) Zone ZONE_DEVICE
>> 
>> 	a. Memory is mapped into kernel and user space
>> 	b. PFNs do have struct pages associated with it
>> 	c. These struct pages are allocated inside it's own memory range
>> 	d. Unfortunately the struct page's union containing LRU has been
>> 	   used for struct dev_pagemap pointer
>> 	e. Hence it cannot be part of any LRU (like Page cache)
>> 	f. Hence file cached mapping cannot reside on these PFNs
>> 	g. Cannot do easy migrations
>> 
>> 	I had also explored non LRU representation of this coherent device
>> memory where the integration with system RAM in the core VM is limited only
>> to the following functions. Not being inside LRU is definitely going to
>> reduce the scope of tight integration with system RAM.
>> 
>> (1) Migration support between system RAM and coherent memory
>> (2) Migration support between various coherent memory nodes
>> (3) Isolation of the coherent memory
>> (4) Mapping the coherent memory into user space through driver's
>>     struct vm_operations
>> (5) HW poisoning of the coherent memory
>> 
>> 	Allocating the entire memory of the coherent device node right
>> after hot plug into ZONE_MOVABLE (where the memory is already inside the
>> buddy system) will still expose a time window where other user space
>> allocations can come into the coherent device memory node and prevent the
>> intended isolation. So traditional hot plug is not the solution. Hence
>> started looking into CMA based non LRU solution but then hit the following
>> roadblocks.
>> 
>> (1) CMA does not support hot plugging of new memory node
>> 	a. CMA area needs to be marked during boot before buddy is
>> 	   initialized
>> 	b. cma_alloc()/cma_release() can happen on the marked area
>> 	c. Should be able to mark the CMA areas just after memory hot plug
>> 	d. cma_alloc()/cma_release() can happen later after the hot plug
>> 	e. This is not currently supported right now
>> 
>> (2) Mapped non LRU migration of pages
>> 	a. Recent work from Michan Kim makes non LRU page migratable
>> 	b. But it still does not support migration of mapped non LRU pages
>> 	c. With non LRU CMA reserved, again there are some additional
>> 	   challenges
>> 
>> 	With hot pluggable CMA and non LRU mapped migration support there
>> may be an alternate approach to represent coherent device memory. Please
>> do review this RFC proposal and let me know your comments or suggestions.
>> Thank you.
>
> You can take a look at hmm-v13 if you want to see how i do non LRU page
> migration. While i put most of the migration code inside hmm_migrate.c it
> could easily be move to migrate.c without hmm_ prefix.
>
> There is 2 missing piece with existing migrate code. First is to put memory
> allocation for destination under control of who call the migrate code. Second
> is to allow offloading the copy operation to device (ie not use the CPU to
> copy data).
>
> I believe same requirement also make sense for platform you are targeting.
> Thus same code can be use.
>
> hmm-v13 https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-v13
>
> I haven't posted this patchset yet because we are doing some modifications
> to the device driver API to accomodate some new features. But the ZONE_DEVICE
> changes and the overall migration code will stay the same more or less (i have
> patches that move it to migrate.c and share more code with existing migrate
> code).
>
> If you think i missed anything about lru and page cache please point it to
> me. Because when i audited code for that i didn't see any road block with
> the few fs i was looking at (ext4, xfs and core page cache code).
>

The other restriction around ZONE_DEVICE is, it is not a managed zone.
That prevents any direct allocation from coherent device by application.
ie, we would like to force allocation from coherent device using
interface like mbind(MPOL_BIND..) . Is that possible with ZONE_DEVICE ?

-aneeesh

  parent reply	other threads:[~2016-10-25  4:59 UTC|newest]

Thread overview: 68+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-10-24  4:31 [RFC 0/8] Define coherent device memory node Anshuman Khandual
2016-10-24  4:31 ` [RFC 1/8] mm: " Anshuman Khandual
2016-10-24 17:09   ` Dave Hansen
2016-10-25  1:22     ` Anshuman Khandual
2016-10-25 15:47       ` Dave Hansen
2016-10-24  4:31 ` [RFC 2/8] mm: Add specialized fallback zonelist for coherent device memory nodes Anshuman Khandual
2016-10-24 17:10   ` Dave Hansen
2016-10-25  1:27     ` Anshuman Khandual
2016-11-17  7:40   ` Anshuman Khandual
2016-11-17  7:59     ` [DRAFT 1/2] mm/cpuset: Exclude CDM nodes from each task's mems_allowed node mask Anshuman Khandual
2016-11-17  7:59       ` [DRAFT 2/2] mm/hugetlb: Restrict HugeTLB allocations only to the system RAM nodes Anshuman Khandual
2016-11-17  8:28       ` [DRAFT 1/2] mm/cpuset: Exclude CDM nodes from each task's mems_allowed node mask kbuild test robot
2016-10-24  4:31 ` [RFC 3/8] mm: Isolate coherent device memory nodes from HugeTLB allocation paths Anshuman Khandual
2016-10-24 17:16   ` Dave Hansen
2016-10-25  4:15     ` Aneesh Kumar K.V
2016-10-25  7:17       ` Balbir Singh
2016-10-25  7:25         ` Balbir Singh
2016-10-24  4:31 ` [RFC 4/8] mm: Accommodate coherent device memory nodes in MPOL_BIND implementation Anshuman Khandual
2016-10-24  4:31 ` [RFC 5/8] mm: Add new flag VM_CDM for coherent device memory Anshuman Khandual
2016-10-24 17:38   ` Dave Hansen
2016-10-24 18:00     ` Dave Hansen
2016-10-25 12:36     ` Balbir Singh
2016-10-25 19:20     ` Aneesh Kumar K.V
2016-10-25 20:01       ` Dave Hansen
2016-10-24  4:31 ` [RFC 6/8] mm: Make VM_CDM marked VMAs non migratable Anshuman Khandual
2016-10-24  4:31 ` [RFC 7/8] mm: Add a new migration function migrate_virtual_range() Anshuman Khandual
2016-10-24  4:31 ` [RFC 8/8] mm: Add N_COHERENT_DEVICE node type into node_states[] Anshuman Khandual
2016-10-25  7:22   ` Balbir Singh
2016-10-26  4:52     ` Anshuman Khandual
2016-10-24  4:42 ` [DEBUG 00/10] Test and debug patches for coherent device memory Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 01/10] dt-bindings: Add doc for ibm,hotplug-aperture Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 02/10] powerpc/mm: Create numa nodes for hotplug memory Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 03/10] powerpc/mm: Allow memory hotplug into a memory less node Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 04/10] mm: Enable CONFIG_MOVABLE_NODE on powerpc Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 05/10] powerpc/mm: Identify isolation seeking coherent memory nodes during boot Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 06/10] mm: Export definition of 'zone_names' array through mmzone.h Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 07/10] mm: Add debugfs interface to dump each node's zonelist information Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 08/10] powerpc: Enable CONFIG_MOVABLE_NODE for PPC64 platform Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 09/10] drivers: Add two drivers for coherent device memory tests Anshuman Khandual
2016-10-24  4:42   ` [DEBUG 10/10] test: Add a script to perform random VMA migrations across nodes Anshuman Khandual
2016-10-24 17:09 ` [RFC 0/8] Define coherent device memory node Jerome Glisse
2016-10-25  4:26   ` Aneesh Kumar K.V
2016-10-25 15:16     ` Jerome Glisse
2016-10-26 11:09       ` Aneesh Kumar K.V
2016-10-26 16:07         ` Jerome Glisse
2016-10-28  5:29           ` Aneesh Kumar K.V
2016-10-28 16:16             ` Jerome Glisse
2016-11-05  5:21     ` Anshuman Khandual
2016-11-05 18:02       ` Jerome Glisse
2016-10-25  4:59   ` Aneesh Kumar K.V [this message]
2016-10-25 15:32     ` Jerome Glisse
2016-10-25 17:31       ` Aneesh Kumar K.V
2016-10-25 18:52         ` Jerome Glisse
2016-10-26 11:13           ` Anshuman Khandual
2016-10-26 16:02             ` Jerome Glisse
2016-10-27  4:38               ` Anshuman Khandual
2016-10-27  7:03                 ` Anshuman Khandual
2016-10-27 15:05                   ` Jerome Glisse
2016-10-28  5:47                     ` Anshuman Khandual
2016-10-28 16:08                       ` Jerome Glisse
2016-10-26 12:56           ` Anshuman Khandual
2016-10-26 16:28             ` Jerome Glisse
2016-10-27 10:23               ` Balbir Singh
2016-10-25 12:07   ` Balbir Singh
2016-10-25 15:21     ` Jerome Glisse
2016-10-24 18:04 ` Dave Hansen
2016-10-24 18:32   ` David Nellans
2016-10-24 19:36     ` Dave Hansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=877f8xaurp.fsf@linux.vnet.ibm.com \
    --to=aneesh.kumar@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=bsingharora@gmail.com \
    --cc=j.glisse@gmail.com \
    --cc=js1304@gmail.com \
    --cc=khandual@linux.vnet.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mgorman@suse.de \
    --cc=mhocko@suse.com \
    --cc=minchan@kernel.org \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).