archive mirror
 help / color / mirror / Atom feed
From: Felix Kuehling <>
To: Dave Chinner <>
Cc: Christoph Hellwig <>,
	"Sierra Guiza, Alejandro (Alex)" <>,,,,,,,,,
Subject: Re: [PATCH v1 03/14] mm: add iomem vma selection for memory migration
Date: Thu, 9 Sep 2021 00:55:23 -0400	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <20210902011438.GM2566745@dread.disaster.area>

Am 2021-09-01 um 9:14 p.m. schrieb Dave Chinner:
> On Wed, Sep 01, 2021 at 07:07:34PM -0400, Felix Kuehling wrote:
>> On 2021-09-01 6:03 p.m., Dave Chinner wrote:
>>> On Wed, Sep 01, 2021 at 11:40:43AM -0400, Felix Kuehling wrote:
>>>> Am 2021-09-01 um 4:29 a.m. schrieb Christoph Hellwig:
>>>>> On Mon, Aug 30, 2021 at 01:04:43PM -0400, Felix Kuehling wrote:
>>>>>>>> driver code is not really involved in updating the CPU mappings. Maybe
>>>>>>>> it's something we need to do in the migration helpers.
>>>>>>> It looks like I'm totally misunderstanding what you are adding here
>>>>>>> then.  Why do we need any special treatment at all for memory that
>>>>>>> has normal struct pages and is part of the direct kernel map?
>>>>>> The pages are like normal memory for purposes of mapping them in CPU
>>>>>> page tables and for coherent access from the CPU.
>>>>> That's the user page tables.  What about the kernel direct map?
>>>>> If there is a normal kernel struct page backing there really should
>>>>> be no need for the pgmap.
>>>> I'm not sure. The physical address ranges are in the UEFI system address
>>>> map as special-purpose memory. Does Linux create the struct pages and
>>>> kernel direct map for that without a pgmap call? I didn't see that last
>>>> time I went digging through that code.
>>>>>>  From an application
>>>>>> perspective, we want file-backed and anonymous mappings to be able to
>>>>>> use DEVICE_PUBLIC pages with coherent CPU access. The goal is to
>>>>>> optimize performance for GPU heavy workloads while minimizing the need
>>>>>> to migrate data back-and-forth between system memory and device memory.
>>>>> I don't really understand that part.  file backed pages are always
>>>>> allocated by the file system using the pagecache helpers, that is
>>>>> using the page allocator.  Anonymouns memory also always comes from
>>>>> the page allocator.
>>>> I'm coming at this from my experience with DEVICE_PRIVATE. Both
>>>> anonymous and file-backed pages should be migrateable to DEVICE_PRIVATE
>>>> memory by the migrate_vma_* helpers for more efficient access by our
>>>> GPU. (*) It's part of the basic premise of HMM as I understand it. I
>>>> would expect the same thing to work for DEVICE_PUBLIC memory.
>>>> (*) I believe migrating file-backed pages to DEVICE_PRIVATE doesn't
>>>> currently work, but that's something I'm hoping to fix at some point.
>>> FWIW, I'd love to see the architecture documents that define how
>>> filesystems are supposed to interact with this device private
>>> memory. This whole "hand filesystem controlled memory to other
>>> devices" is a minefield that is trivial to get wrong iand very
>>> difficult to fix - just look at the historical mess that RDMA
>>> to/from file backed and/or DAX pages has been.
>>> So, really, from my perspective as a filesystem engineer, I want to
>>> see an actual specification for how this new memory type is going to
>>> interact with filesystem and the page cache so everyone has some
>>> idea of how this is going to work and can point out how it doesn't
>>> work before code that simply doesn't work is pushed out into
>>> production systems and then merged....
>> OK. To be clear, that's not part of this patch series. And I have no
>> authority to push anything in this part of the kernel, so you have nothing
>> to fear. ;)
> I know this isn't part of the series. but this patchset is laying
> the foundation for future work that will impact subsystems that
> currently have zero visibility and/or knowledge of these changes.

I don't think this patchset is the foundation. Jerome Glisse's work on
HMM is, which was merged 4 years ago and is being used by multiple
drivers now, with the AMD GPU driver being a fairly recent addition.

> There must be an overall architectural plan for this functionality,
> regardless of the current state of implementation. It's that overall
> architectural plan I'm asking about here, because I need to
> understand that before I can sanely comment on the page
> cache/filesystem aspect of the proposed functionality...

The overall HMM and ZONE_DEVICE architecture is documented to some
extent in Documentation/vm/hmm.rst, though it may not go into the level
of detail you are looking for.

>> FWIW, we already have the ability to map file-backed system memory pages
>> into device page tables with HMM and interval notifiers. But we cannot
>> currently migrate them to ZONE_DEVICE pages.
> Sure, but sharing page cache pages allocated and managed by the
> filesystem is not what you are talking about. You're talking about
> migrating page cache data to completely different memory allocated
> by a different memory manager that the filesystems currently have no
> knowledge of or have any way of interfacing with.

This is not part of the current patch series. It is only my intention to
look into ways to migrate file-backed pages to ZONE_DEVICE memory in the

> So I'm asking basic, fundamental questions about how these special
> device based pages are going to work.  How are these pages different
> to normal pages - does page_lock() still guarantee exclusive access
> to the page state across all hardware that can access it?

Yes. The migration API guarantees that pages are locked during the
migration. The driver code doesn't touch the page state itself. It only
uses the migrate_vma_* helpers to deal with that.

This is not new or changed by this patch series.

>  If not,
> what provides access serialisation for pages that are allocated in
> device memory rather than CPU memory (e.g. for truncate
> serialisation)?  Does the hardware that own these pages raise page
> faults on the CPU when those pages are accessed/dirtied?

Yes. This is done by the hmm_range_fault API, which the driver calls in
order to populate its device page tables. It is synchronized with any
mapping changes through mmu_interval_notifiers.

This is not new or changed by this patch series.

>  How does
> demand paging in and out of device memory work (i.e. mapping files
> larger than device memory).

That depends on how the device driver handles device page faults. The
AMD GPU driver can handle recoverable device page faults and update the
device page table on demand with updated pfns from hmm_range_fault.

This is not new or changed by this patch series.

>   How does IO to/from storage work - can
> the filesystem build normal bios out of these device pages and issue
> IO on them?

DEVICE_PUBLIC pages introduced by this patch series, are CPU and
peer-accessible like normal system memory.

DEVICE_PRIVATE pages are not CPU or peer-accessible. Any access to them
would go through the CPU page fault path and cause a
dev_pagemap_ops.migrate_to_ram callback into the AMD GPU driver to unmap
the memory from the GPU and migrate it back to system memory.

>   Are the additional constraints on IO because p2p DMA is
> needed to move the data from the storage HBA directly into/out of
> the GPU memory?
> I can think of lots more complex questions about how filesystems are
> supposed to manage remote device memory in the page cache, but these
> are just some of the basic things that make file-backed mappings
> different to anonymous mappings that I need to understand before I
> can make head or tail of what is being proposed here.....
>> Beyond that, my understanding
>> of how filesystems and page cache work is rather superficial at this point.
>> I'll keep your name in mind for when I am ready to discuss this in more
>> detail.
> If you don't know what the bigger picture is, then who does?
> Somebody built the design/architecture you are working towards, and
> they had to communicate it to you somehow. I'm asking for that
> information to documented and made available to all the people these
> changes might impact, not whether you personally know how it
> works....

This patch series builds on top of existing HMM work with major
contributions from several people on this thread: Jerome Glisse, Jason
Gunthorpe, Christoph Hellwig, Ralph Campbell.

Beyond the reintroduction of DEVICE_PUBLIC memory in this patch series
I'm not looking to invent a major new design here. Immediate future work
is more about chipping away on a few remaining limitations of the
implementation, with respect to migration of file-backed pages and maybe
transparent huge pages.


> Cheers,
> Dave.

  reply	other threads:[~2021-09-09  4:55 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-25  3:48 [PATCH v1 00/14] Add MEMORY_DEVICE_PUBLIC for CPU-accessible coherent device memory Alex Sierra
2021-08-25  3:48 ` [PATCH v1 01/14] ext4/xfs: add page refcount helper Alex Sierra
2021-08-25  7:35   ` Christoph Hellwig
2021-08-25 15:09   ` Theodore Ts'o
2021-08-25 15:49   ` Darrick J. Wong
2021-08-25  3:48 ` [PATCH v1 02/14] mm: remove extra ZONE_DEVICE struct page refcount Alex Sierra
2021-08-25  7:35   ` Christoph Hellwig
2021-08-25 11:15   ` Vlastimil Babka
2021-08-25 17:49     ` Ralph Campbell
2021-08-27 11:26       ` Vlastimil Babka
2021-08-25  3:48 ` [PATCH v1 03/14] mm: add iomem vma selection for memory migration Alex Sierra
2021-08-25  7:40   ` Christoph Hellwig
2021-08-25  7:46   ` Christoph Hellwig
2021-08-25 18:24     ` Sierra Guiza, Alejandro (Alex)
2021-08-26 22:27       ` Felix Kuehling
2021-08-30  8:28         ` Christoph Hellwig
2021-08-30 17:04           ` Felix Kuehling
2021-09-01  8:29             ` Christoph Hellwig
2021-09-01 15:40               ` Felix Kuehling
2021-09-01 22:03                 ` Dave Chinner
2021-09-01 23:07                   ` Felix Kuehling
2021-09-02  1:14                     ` Dave Chinner
2021-09-09  4:55                       ` Felix Kuehling [this message]
2021-09-02  8:18                 ` Christoph Hellwig
2021-09-02 18:07                   ` Dan Williams
2021-09-09  4:02                   ` Felix Kuehling
2021-08-25 14:24   ` Felix Kuehling
2021-08-25  3:48 ` [PATCH v1 04/14] mm: add zone device public type memory support Alex Sierra
2021-08-25  3:48 ` [PATCH v1 05/14] drm/amdkfd: ref count init for device pages Alex Sierra
2021-08-25 14:34   ` Felix Kuehling
2021-08-25  3:48 ` [PATCH v1 06/14] drm/amdkfd: add SPM support for SVM Alex Sierra
2021-08-25 14:45   ` Felix Kuehling
2021-08-25  3:48 ` [PATCH v1 07/14] drm/amdkfd: public type as sys mem on migration to ram Alex Sierra
2021-08-25  3:48 ` [PATCH v1 08/14] mm: add public type support to migrate_vma helpers Alex Sierra
2021-08-25  7:47   ` Christoph Hellwig
2021-08-25  3:48 ` [PATCH v1 09/14] mm: call pgmap->ops->page_free for DEVICE_PUBLIC pages Alex Sierra
2021-08-25  7:46   ` Christoph Hellwig
2021-08-25  3:48 ` [PATCH v1 10/14] lib: test_hmm add ioctl to get zone device type Alex Sierra
2021-08-25  3:48 ` [PATCH v1 11/14] lib: test_hmm add module param for " Alex Sierra
2021-08-25  3:48 ` [PATCH v1 12/14] lib: add support for device public type in test_hmm Alex Sierra
2021-08-25  3:48 ` [PATCH v1 13/14] tools: update hmm-test to support device public type Alex Sierra
2021-08-25  3:48 ` [PATCH v1 14/14] tools: update test_hmm script to support SP config Alex Sierra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).