From: Dan Williams <dan.j.williams@intel.com>
To: akpm@linux-foundation.org
Cc: mhocko@suse.com, david@redhat.com, linux-nvdimm@lists.01.org,
    stable@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, "Jérôme Glisse" <jglisse@redhat.com>,
    "Vlastimil Babka" <vbabka@suse.cz>
Subject: [PATCH v6 00/12] mm: Sub-section memory hotplug support
Date: Wed, 17 Apr 2019 11:38:55 -0700
Message-ID: <155552633539.2015392.2477781120122237934.stgit@dwillia2-desk3.amr.corp.intel.com>

Changes since v5 [1]:

- Rebase on next-20190416 and the new 'struct mhp_restrictions'
  infrastructure.
- Extend mhp_restrictions to the 'remove' case so the sub-section policy
  can be clarified with respect to the memblock-api in a symmetric
  manner with the 'add' case.
- Kill is_dev_zone() since cleanups have now made it moot.

[1]: https://lwn.net/Articles/783808/

---

The memory hotplug section is an arbitrary / convenient unit for memory
hotplug. 'Section-size' units have bled into the user interface
('memblock' sysfs) and cannot be changed without breaking existing
userspace. The section-size constraint, while mostly benign for typical
memory hotplug, has wreaked and continues to wreak havoc with
'device-memory' use cases, persistent memory (pmem) in particular.
Recall that pmem uses devm_memremap_pages(), and subsequently
arch_add_memory(), to allocate a 'struct page' memmap for pmem. However,
it does not use the 'bottom half' of memory hotplug, i.e. never marks
pmem pages online and never exposes the userspace memblock interface for
pmem. This leaves an opening to redress the section-size constraint.

To date, the libnvdimm subsystem has attempted to inject padding to
satisfy the internal constraints of arch_add_memory(). Beyond
complicating the code, leading to bugs [2], wasting memory, and limiting
configuration flexibility, the padding hack is broken when the platform
changes the physical memory alignment of pmem from one boot to the next.
Device failure (intermittent or permanent) and physical reconfiguration
are events that can cause the platform firmware to change the physical
placement of pmem on a subsequent boot, and device failure is an
everyday event in a data-center.

It turns out that sections are only a hard requirement of the
user-facing interface for memory hotplug, and with a bit more
infrastructure sub-section arch_add_memory() support can be added for
kernel internal usages like devm_memremap_pages(). Here is an analysis
of the design assumptions in the current code and how they are
addressed in the new implementation:

Current design assumptions:

- Sections that describe boot memory (early sections) are never
  unplugged / removed.
- pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
  valid_section() check.
- __add_pages() and helper routines assume all operations occur in
  PAGES_PER_SECTION units.
- The memblock sysfs interface only comprehends full sections.

New design assumptions:

- Sections are instrumented with a sub-section bitmask to track (on x86)
  individual 2MB sub-divisions of a 128MB section.
- Partially populated early sections can be extended with additional
  sub-sections, and those sub-sections can be removed with
  arch_remove_memory(). With this in place we no longer lose usable
  memory capacity to padding.
- pfn_valid() is updated to look deeper than valid_section() to also
  check the active-sub-section mask. This indication is in the same
  cacheline as the valid_section() so the performance impact is expected
  to be negligible. So far the lkp robot has not reported any
  regressions.
- Outside of the core vmemmap population routines which are replaced,
  other helper routines like shrink_{zone,pgdat}_span() are updated to
  handle the smaller granularity. Core memory hotplug routines that deal
  with online memory are not touched.
- The existing memblock sysfs user api guarantees / assumptions are not
  touched since this capability is limited to !online
  !memblock-sysfs-accessible sections.

Meanwhile the issue reports continue to roll in from users that do not
understand when and how the 128MB constraint will bite them. The current
implementation relied on being able to support at least one misaligned
namespace, but that immediately falls over on any moderately complex
namespace creation attempt. Beyond the initial problem of 'System RAM'
colliding with pmem, and the unsolvable problem of physical alignment
changes, Linux is now being exposed to platforms that collide pmem
ranges with other pmem ranges by default [3]. In short,
devm_memremap_pages() has pushed the venerable section-size constraint
past the breaking point, and the simplicity of section-aligned
arch_add_memory() is no longer tenable.

These patches are exposed to the kbuild robot on my libnvdimm-pending
branch [4], and a preview of the unit test for this functionality is
available on the 'subsection-pending' branch of ndctl [5].
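To make the sub-section bitmask from the new design assumptions above
concrete, here is a minimal userspace sketch. It is not code from the
patches: the struct, macro, and function names are illustrative
assumptions, and the only numbers taken from this cover letter are the
x86 sizes (128MB sections, 2MB sub-sections), combined with the usual
4KB page size. It shows how 64 sub-sections fit in one 64-bit mask and
how a pfn_valid()-style check can look past the section-level valid bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 sizes from the cover letter: 128MB sections, 2MB sub-sections. */
#define PAGE_SHIFT              12 /* 4KB pages (assumed) */
#define SECTION_SIZE_BITS       27 /* 128MB */
#define SUBSECTION_SHIFT        21 /* 2MB */

#define PFN_SECTION_SHIFT       (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PFN_SUBSECTION_SHIFT    (SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SECTION       (1UL << PFN_SECTION_SHIFT)
#define PAGES_PER_SUBSECTION    (1UL << PFN_SUBSECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

/* Hypothetical per-section state; the names are illustrative only. */
struct demo_section {
	bool valid;              /* section-level "valid" indication */
	uint64_t subsection_map; /* one bit per 2MB sub-section */
};

/* Index (0..63) of the sub-section containing @pfn within its section. */
static inline unsigned int subsection_index(unsigned long pfn)
{
	return (pfn & (PAGES_PER_SECTION - 1)) >> PFN_SUBSECTION_SHIFT;
}

/*
 * Mask of the sub-sections touched by [pfn, pfn + nr_pages).
 * The range must not cross a section boundary.
 */
static inline uint64_t subsection_mask(unsigned long pfn,
				       unsigned long nr_pages)
{
	unsigned int first, last, count;
	uint64_t bits;

	assert(nr_pages > 0);
	assert((pfn & (PAGES_PER_SECTION - 1)) + nr_pages <= PAGES_PER_SECTION);

	first = subsection_index(pfn);
	last = subsection_index(pfn + nr_pages - 1);
	count = last - first + 1;
	/* Avoid an undefined 64-bit shift when the whole section is covered. */
	bits = (count == 64) ? ~0ULL : ((1ULL << count) - 1);
	return bits << first;
}

/* pfn_valid()-style check: section valid AND sub-section active. */
static inline bool demo_pfn_valid(const struct demo_section *ms,
				  unsigned long pfn)
{
	if (!ms->valid)
		return false;
	return (ms->subsection_map >> subsection_index(pfn)) & 1;
}
```

For example, a 4MB range that starts 2MB into a section activates only
bits 1 and 2 of the mask, so pfns in the first 2MB of that section fail
the check even though the section itself is valid.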

[2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
[3]: https://github.com/pmem/ndctl/issues/76
[4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending
[5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

---

Dan Williams (12):
      mm/sparsemem: Introduce struct mem_section_usage
      mm/sparsemem: Introduce common definitions for the size and mask of a section
      mm/sparsemem: Add helpers track active portions of a section at boot
      mm/hotplug: Prepare shrink_{zone,pgdat}_span for sub-section removal
      mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap()
      mm/hotplug: Add mem-hotplug restrictions for remove_memory()
      mm: Kill is_dev_zone() helper
      mm/sparsemem: Prepare for sub-section ranges
      mm/sparsemem: Support sub-section hotplug
      mm/devm_memremap_pages: Enable sub-section remap
      libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields
      libnvdimm/pfn: Stop padding pmem namespaces to section alignment

 arch/ia64/mm/init.c            |    4
 arch/powerpc/mm/mem.c          |    5 -
 arch/s390/mm/init.c            |    2
 arch/sh/mm/init.c              |    4
 arch/x86/mm/init_32.c          |    4
 arch/x86/mm/init_64.c          |    9 +
 drivers/nvdimm/dax_devs.c      |    2
 drivers/nvdimm/pfn.h           |   12 -
 drivers/nvdimm/pfn_devs.c      |   93 +++------
 include/linux/memory_hotplug.h |   12 +
 include/linux/mm.h             |    4
 include/linux/mmzone.h         |   72 ++++++--
 kernel/memremap.c              |   70 +++-----
 mm/hmm.c                       |    2
 mm/memory_hotplug.c            |  148 +++++++++-------
 mm/page_alloc.c                |    8 +
 mm/sparse-vmemmap.c            |   21 ++
 mm/sparse.c                    |  371 +++++++++++++++++++++++++++-------------
 18 files changed, 503 insertions(+), 340 deletions(-)