From: Joao Martins <joao.m.martins@oracle.com>
To: linux-mm@kvack.org
Cc: Dan Williams <dan.j.williams@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
Matthew Wilcox <willy@infradead.org>,
Jason Gunthorpe <jgg@ziepe.ca>,
John Hubbard <jhubbard@nvidia.com>,
Jane Chu <jane.chu@oracle.com>,
Muchun Song <songmuchun@bytedance.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>,
Jonathan Corbet <corbet@lwn.net>, Christoph Hellwig <hch@lst.de>,
nvdimm@lists.linux.dev, linux-doc@vger.kernel.org,
Joao Martins <joao.m.martins@oracle.com>
Subject: [PATCH v7 00/11] mm, device-dax: Introduce compound pages in devmap
Date: Thu, 2 Dec 2021 20:44:11 +0000
Message-ID: <20211202204422.26777-1-joao.m.martins@oracle.com>
Changes since v6[10]:
* Patch 4, Wrap commit message to 73 characters max (Christoph Hellwig)
* Patch 4, Move pfn_next() in for_each_device_pfn() to new line (Christoph Hellwig)
* Patch 4, Move pfn range computation to a pfn_len() helper. (Christoph Hellwig)
* Patch 9, Remove @fault_size as it's no longer used (also reported by kbuild robot).
* New Patch 10, remove unneeded @pfn output parameter from dev_dax_huge_fault()
(Christoph Hellwig) -- this is done in a new patch
"device-dax: remove pfn from __dev_dax_{pte,pmd,pud}_fault()"
Series is meant to replace what's merged in mmotm/linux-next. Only patch 4 has
changed, but I added a new cleanup patch suggested by Christoph, which is what
prompted me to resend the entire series. It is based on linux-next tag
next-20211124 (commit 4b74e088fef6), the same base as the v6 it replaces.
Let me know if there's another preferred way of doing this (e.g. send patch 10
separately as a follow-up and just pick up patch 4 from this series, as mmotm
already has the patch 9 fix).
---
This series converts device-dax to use compound pages, and moves away from the
'struct page per basepage on PMD/PUD' that is done today. Doing so 1) unlocks
a few noticeable improvements on unpin_user_pages() and makes the
device-dax+altmap case 4x faster in pinning (numbers below and in the last
patch), and 2) as mentioned in various other threads, is one important step
towards cleaning up ZONE_DEVICE refcounting.
I've split the compound pages on devmap part from the rest based on recent
discussions on devmap pending and future work planned[5][6]. There is consensus
that device-dax should be using compound pages to represent its PMD/PUDs just
like HugeTLB and THP, and that leads to less specialization of the dax parts.
I will pursue the rest of the work in parallel once this part is merged,
in particular the GUP-{slow,fast} improvements [7] and the tail struct page
deduplication memory savings part[8].
To summarize what the series does:
Patch 1: Prepare hwpoisoning to work with dax compound pages.
Patches 2-3: Split the current utility function of prep_compound_page()
into head and tail and use those two helpers where appropriate to take
advantage of caches being warm after __init_single_page(). This is used
when initializing zone device when we bring up device-dax namespaces.
Patches 4-10: Add devmap support for compound pages in device-dax.
memmap_init_zone_device() initializes its metadata as compound pages, and it
introduces a new devmap property known as vmemmap_shift which
outlines how the vmemmap is structured (defaults to base pages as done today).
The property essentially describes the page order of the metadata.
While at it do a few cleanups in device-dax in patches 5-9.
Finally, enable device-dax usage of devmap @vmemmap_shift, set to a value
based on its own @align property. @vmemmap_shift is 0 by default (which
is today's case of base pages in devmap, like fsdax or the others) and the
usage of compound devmap is optional. Starting with device-dax (*not* fsdax) we
enable it by default. There are a few pinning improvements, particularly in the
unpinning case and with altmap, as well as unpin_user_page_range_dirty_lock()
being just as effective as with THP/hugetlb[0] pages.
$ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
[altmap]
(pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
$ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
(pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
[altmap with -m 127004]
(pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
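For readers unfamiliar with the @vmemmap_shift property summarized above, here
is a minimal userspace sketch of the arithmetic it implies: deriving a page
order from an @align value, the number of base pages each compound page then
covers, and how many compound pages a PFN range holds. This is an illustration
only, not the code in the patches. `struct devmap_sketch`, `align_to_shift()`,
`pages_per_compound()` and `compound_count()` are hypothetical names, and
PAGE_SHIFT is assumed to be 12 (4K base pages, as on x86).

```c
#include <assert.h>

#define PAGE_SHIFT 12UL	/* assumption: 4K base pages */

/* Hypothetical stand-in for the devmap property; not the kernel struct. */
struct devmap_sketch {
	unsigned long vmemmap_shift;	/* page order of each compound page */
};

/* Page order such that (1 << order) base pages cover @align bytes. */
static unsigned long align_to_shift(unsigned long align)
{
	unsigned long nr_pages = align >> PAGE_SHIFT;
	unsigned long order = 0;

	/* This sketch only handles power-of-two alignments. */
	assert(align && (align & (align - 1)) == 0);

	while ((1UL << order) < nr_pages)
		order++;
	return order;
}

/* Number of base pages represented by one compound page. */
static unsigned long pages_per_compound(const struct devmap_sketch *m)
{
	return 1UL << m->vmemmap_shift;
}

/* Number of compound pages in the PFN range [pfn_first, pfn_end). */
static unsigned long compound_count(const struct devmap_sketch *m,
				    unsigned long pfn_first,
				    unsigned long pfn_end)
{
	return (pfn_end - pfn_first) >> m->vmemmap_shift;
}
```

Under these assumptions a 2M @align gives an order of 9 (512 base pages per
compound page) and a 1G @align gives an order of 18, while an order of 0
corresponds to today's base-page case.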
Tested on x86 with 1TB+ of pmem (alongside registering it with RDMA with and
without altmap), alongside gup_test selftests with dynamic dax regions and
static dax regions. Coupled with ndctl unit tests for dynamic dax devices
that exercise all of this. Note, for dynamic dax regions I had to revert
commit 8aa83e6395 ("x86/setup: Call early_reserve_memory() earlier"), as it
is a known issue that this commit broke efi_fake_mem=.
Patches apply on top of linux-next tag next-20211124 (commit 4b74e088fef6).
Thanks for all the review so far.
As always, comments and suggestions are very much appreciated!
Older Changelog,
v5[9] -> v6[10]:
* Keep @dev on the previous line to improve readability on
patch 5 (Christoph Hellwig)
* Document is_static() function to clarify what are static and
dynamic dax regions in patch 7 (Christoph Hellwig)
* Deduce @f_mapping and @pgmap from vmf->vma->vm_file to reduce
the number of arguments of set_{page,compound}_mapping() in last
patch (Christoph Hellwig)
* Factor out @mapping initialization to a separate helper ([new] patch 8)
and rename set_page_mapping() to dax_set_mapping() in the process.
* Remove set_compound_mapping() and instead adjust dax_set_mapping()
to handle @vmemmap_shift case on the last patch. This greatly
simplifies the last patch, and addresses a similar comment by Christoph
on having an earlier return. No functional change on the changes
to dax_set_mapping compared to its earlier version so I retained
Dan's Rb on last patch.
* Initialize the mapping prior to inserting the PTE/PMD/PUD as opposed
to after the fact. ([new] patch 9, Jason Gunthorpe)
Patches 8 and 9 are new (small cleanups) in v6.
Patches 6 - 9 are the ones missing Rb tags.
v4[4] -> v5[9]:
* Remove patches 8-14 as they will go in 2 separate (parallel) series;
* Rename @geometry to @vmemmap_shift (Christoph Hellwig)
* Make @vmemmap_shift an order rather than nr of pages (Christoph Hellwig)
* Consequently remove helper pgmap_geometry_order() as it's no longer
needed, in place of accessing directly the structure member [Patch 4 and 8]
* Rename pgmap_geometry() to pgmap_vmemmap_nr() in patches 4 and 8;
* Remove usage of pgmap_geometry() in favour of testing
@vmemmap_shift for non-zero directly in patch 8;
* Patch 5 is new for using `struct_size()` (Dan Williams)
* Add a 'static_dev_dax()' helper for testing pgmap == NULL handling
for dynamic dax devices.
* Expand patch 6 to be explicitly on those !pgmap cases, and replace
those with static_dev_dax().
* Add performance numbers on patch 8 on gup/pin_user_pages() numbers with
this series.
* Massage commit description to remove mentions of @geometry.
* Add Dan's Reviewed-by on patch 8 (Dan Williams)
v3[3] -> v4[4]:
* Collect Dan's Reviewed-by on patches 1-5,8,9,11
* Collect Muchun Reviewed-by on patch 1,2,11
* Reorder patches to first introduce compound pages in ZONE_DEVICE with
device-dax (for pmem) as first user (patches 1-8) followed by implementing
the sparse-vmemmap changes to minimize struct page overhead for devmap (patches 9-14)
* Eliminate remnant @align references to use @geometry (Dan)
* Convert mentions of 'compound pagemap' to 'compound devmap' throughout
the series to avoid confusions of this work conflicting/referring to
anything Folio or pagemap related.
* Delete pgmap_pfn_geometry() on patch 4
and rework other patches to use pgmap_geometry() instead (Dan)
* Convert @geometry to be a number of pages rather than page size in patch 4 (Dan)
* Make pgmap_geometry() more readable (Christoph)
* Simplify pgmap refcount pfn computation in memremap_pages() (Christoph)
* Rework memmap_init_compound() in patch 4 to use the same style as
memmap_init_zone_device i.e. iterating over PFNs, rather than struct pages (Dan)
* Add comment on devmap prep_compound_head callsite explaining why it needs
to be used after first+second tail pages have been initialized (Dan, Jane)
* Initialize tail page refcount to zero in patch 4
* Make sure pfn_next() iterates over compound pages (rather than base pages) in
patch 4 to tackle the zone_device elevated page refcount.
[ Note these last two bullet points above are unneeded once this patch is merged:
https://lore.kernel.org/linux-mm/20210825034828.12927-3-alex.sierra@amd.com/ ]
* Remove usage of ternary operator when computing @end in gup_device_huge() in patch 8 (Dan)
* Remove pinned_head variable in patch 8
* Remove put_dev_pagemap() need for compound case as that is now fixed for the general case
in patch 8
* Switch to PageHead() instead of PageCompound() as we only work with either base pages
or head pages in patch 8 (Matthew)
* Fix kdoc of @altmap and improve kdoc for @pgmap in patch 9 (Dan)
* Fix up missing return in vmemmap_populate_address() in patch 10
* Change error handling style in all patches (Dan)
* Change title of vmemmap_dedup.rst to be more representative of the purpose in patch 12 (Dan)
* Move some of the section and subsection tail page reuse code into helpers
reuse_compound_section() and compound_section_tail_page() for readability in patch 12 (Dan)
* Commit description fixes for clarity in various patches (Dan)
* Add pgmap_geometry_order() helper and
drop unneeded geometry_size, order variables in patch 12
* Drop unneeded byte based computation to be PFN in patch 12
* Handle the dynamic dax region properly when ensuring a stable dev_dax->pgmap in patch 6.
* Add a compound_nr_pages() helper and use it in memmap_init_zone_device to calculate
the number of unique struct pages to initialize depending on @altmap existence in patch 13 (Dan)
* Add compound_section_tail_huge_page() for the tail page PMD reuse in patch 14 (Dan)
* Reword cover letter.
v2 -> v3[3]:
* Collect Mike's Ack on patch 2 (Mike)
* Collect Naoya's Reviewed-by on patch 1 (Naoya)
* Rename compound_pagemaps.rst doc page (and its mentions) to vmemmap_dedup.rst (Mike, Muchun)
* Rebased to next-20210714
v1[1] -> v2[2]:
(New patches 7, 10, 11)
* Remove occurrences of 'we' in the commit descriptions (now for real) [Dan]
* Add comment on top of compound_head() for fsdax (Patch 1) [Dan]
* Massage commit descriptions of cleanup/refactor patches to reflect
that it's in preparation for bigger infra in sparse-vmemmap. (Patch 2,3,5) [Dan]
* Greatly improve all commit messages in terms of grammar/wording and clarity. [Dan]
* Rename variable/helpers from dev_pagemap::align to @geometry, reflecting
that it's not the same thing as dev_dax->align, Patch 4 [Dan]
* Move compound page init logic into separate memmap_init_compound() helper, Patch 4 [Dan]
* Simplify patch 9 as a result of having compound initialization differently [Dan]
* Rename @pfn_align variable in memmap_init_zone_device to @pfns_per_compound [Dan]
* Rename Subject of patch 6 [Dan]
* Move hugetlb_vmemmap.c comment block to Documentation/vm Patch 7 [Dan]
* Add some type-safety to @block and use 'struct page *' rather than
void, Patch 8 [Dan]
* Add some comments to less obvious parts on 1G compound page case, Patch 8 [Dan]
* Remove vmemmap lookup function in place of
pmd_off_k() + pte_offset_kernel() given some guarantees on section onlining
serialization, Patch 8
* Add a comment to get_page() mentioning where/how it is freed, Patch 8 [Dan]
* Add docs about device-dax usage of tail dedup technique in newly added
compound_pagemaps.rst doc entry.
* Add cleanup patch for device-dax for ensuring dev_dax::pgmap is always set [Dan]
* Add cleanup patch for device-dax for using ALIGN() [Dan]
* Store pinned head in separate @pinned_head variable and fix error case, patch 13 [Dan]
* Add comment on difference of @next value for PageCompound(), patch 13 [Dan]
* Move PUD compound page to be last patch [Dan]
* Add vmemmap layout for PUD compound geometry in compound_pagemaps.rst doc, patch 14 [Dan]
* Rebased to next-20210617
RFC[0] -> v1:
(New patches 1-3, 5-8 but the diffstat isn't that different)
* Fix hwpoisoning of devmap pages reported by Jane (Patch 1 is new in v1)
* Fix/Massage commit messages to be more clear and remove the 'we' occurrences (Dan, John, Matthew)
* Use pfn_align to be clear it's nr of pages for @align value (John, Dan)
* Add two helpers pgmap_align() and pgmap_pfn_align() as accessors of pgmap->align;
* Remove the gup_device_compound_huge special path and have the same code
work both ways while special casing when devmap page is compound (Jason, John)
* Avoid usage of vmemmap_populate_basepages() and introduce a first class
loop that doesn't care about passing an altmap for memmap reuse. (Dan)
* Completely rework the vmemmap_populate_compound() to avoid the sparse_add_section
hack into passing block across sparse_add_section calls. It's a lot easier to
follow and more explicit in what it does.
* Replace the vmemmap refactoring with adding a @pgmap argument and moving
parts of the vmemmap_populate_base_pages(). (Patch 5 and 6 are new as a result)
* Add PMD tail page vmemmap area reuse for 1GB pages. (Patch 8 is new)
* Improve memmap_init_zone_device() to initialize compound pages when
struct pages are cache warm. That led to an even further speedup over
the RFC series, from 190ms -> 80-120ms. Patches 2 and 3 are the new ones
as a result (Dan)
* Remove PGMAP_COMPOUND and use @align as the property to detect whether
or not to reuse vmemmap areas (Dan)
[0] https://lore.kernel.org/linux-mm/20201208172901.17384-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/20210325230938.30752-1-joao.m.martins@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210617184507.3662-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20210714193542.21857-1-joao.m.martins@oracle.com/
[4] https://lore.kernel.org/linux-mm/20210827145819.16471-1-joao.m.martins@oracle.com/
[5] https://lore.kernel.org/linux-mm/20211018182559.GC3686969@ziepe.ca/
[6] https://lore.kernel.org/linux-mm/499043a0-b3d8-7a42-4aee-84b81f5b633f@oracle.com/
[7] https://lore.kernel.org/linux-mm/20210827145819.16471-9-joao.m.martins@oracle.com/
[8] https://lore.kernel.org/linux-mm/20210827145819.16471-13-joao.m.martins@oracle.com/
[9] https://lore.kernel.org/linux-mm/20211112150824.11028-1-joao.m.martins@oracle.com/
[10] https://lore.kernel.org/linux-mm/20211124191005.20783-1-joao.m.martins@oracle.com/
Joao Martins (11):
memory-failure: fetch compound_head after pgmap_pfn_valid()
mm/page_alloc: split prep_compound_page into head and tail subparts
mm/page_alloc: refactor memmap_init_zone_device() page init
mm/memremap: add ZONE_DEVICE support for compound pages
device-dax: use ALIGN() for determining pgoff
device-dax: use struct_size()
device-dax: ensure dev_dax->pgmap is valid for dynamic devices
device-dax: factor out page mapping initialization
device-dax: set mapping prior to vmf_insert_pfn{,_pmd,pud}()
device-dax: remove pfn from __dev_dax_{pte,pmd,pud}_fault()
device-dax: compound devmap support
drivers/dax/bus.c | 32 +++++++++
drivers/dax/bus.h | 1 +
drivers/dax/device.c | 124 +++++++++++++++++++++--------------
include/linux/memremap.h | 11 ++++
mm/memory-failure.c | 6 ++
mm/memremap.c | 18 +++--
mm/page_alloc.c | 138 +++++++++++++++++++++++++++------------
7 files changed, 233 insertions(+), 97 deletions(-)
--
2.17.2