From: Dan Williams <dan.j.williams@intel.com>
To: David Hildenbrand <david@redhat.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Jérôme Glisse" <jglisse@redhat.com>,
"Logan Gunthorpe" <logang@deltatee.com>,
"Toshi Kani" <toshi.kani@hpe.com>,
"Jeff Moyer" <jmoyer@redhat.com>,
"Michal Hocko" <mhocko@suse.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
stable <stable@vger.kernel.org>, "Linux MM" <linux-mm@kvack.org>,
linux-nvdimm <linux-nvdimm@lists.01.org>,
"Linux Kernel Mailing List" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v5 00/10] mm: Sub-section memory hotplug support
Date: Thu, 28 Mar 2019 14:32:42 -0700 [thread overview]
Message-ID: <CAPcyv4ivBagzsZ1fCDb2Cr3scz+R8ZVgyie5c=LWNd6QZuw36g@mail.gmail.com> (raw)
In-Reply-To: <24c163f2-3b78-827f-257e-70e5a9655806@redhat.com>
On Thu, Mar 28, 2019 at 2:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> You are using the term "Sub-section memory hotplug support", but is it
> >> actually what you mean? To rephrase, aren't we talking here about
> >> "Sub-section device memory hotplug support" or similar?
> >
> > Specifically it is support for passing @start and @size arguments to
> > arch_add_memory() that are not section aligned. It's not limited to
> > "device memory" which is otherwise not a concept that
> > arch_add_memory() understands, it just groks spans of pfns.
>
> Okay, so everything that does not have a memory block devices as of now.
>
> >
> >> Reason I am asking is because I wonder how that would interact with the
> >> memory block device infrastructure and hotplugging of system ram -
> >> add_memory()/add_memory_resource(). I *assume* you are not changing the
> >> add_memory() interface, so that one still only works with whole sections
> >> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
> >
> > Like you found below, the implementation enforces that add_memory_*()
> > interfaces maintain section alignment for @start and @size.
> >
> >> In general, mix and matching system RAM and persistent memory per
> >> section, I am not a friend of that.
> >
> > You have no choice. The platform may decide to map PMEM and System RAM
> > in the same section because the Linux section is too large compared to
> > typical memory controller mapping granularity capability.
>
> I might be very wrong here, but do we actually care about something like
> 64MB getting lost in the cracks? I mean if it simplifies core MM, let go
> of the couple of MB of system ram and handle the PMEM part only. Treat
> the system ram parts like memory holes we already have in ordinary
> sections (well, there we simply set the relevant struct pages to
> PG_reserved). Of course, if we have hundreds of unaligned devices and
> stuff will start to add up ... but I assume this is not the case?
That's precisely what we do today and it has become untenable as the
collision scenarios pile up. This thread [1] is worth a read if you
care about some of the gory details why I'm back to pushing for
sub-section support, but most if it has already been summarized in the
current discussion on this thread.
[1]: https://lore.kernel.org/lkml/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> >
> >> Especially when it comes to memory
> >> block devices. But I am getting the feeling that we are rather targeting
> >> PMEM vs. PMEM with this patch series.
> >
> > The collisions are between System RAM, PMEM regions, and PMEM
> > namespaces (sub-divisions of regions that each need their own mapping
> > lifetime).
>
> Understood. I wonder if that PMEM only mapping (including separate
> lifetime) could be handled differently. But I am absolutely no expert,
> just curious.
I refer you to the above thread trying to fix the libnvdimm-local hacks.
>
> >
> >>> Quote patch7:
> >>>
> >>> "The libnvdimm sub-system has suffered a series of hacks and broken
> >>> workarounds for the memory-hotplug implementation's awkward
> >>> section-aligned (128MB) granularity. For example the following backtrace
> >>> is emitted when attempting arch_add_memory() with physical address
> >>> ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> >>> within a given section:
> >>>
> >>> WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> >>> devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> >>> [..]
> >>> Call Trace:
> >>> dump_stack+0x86/0xc3
> >>> __warn+0xcb/0xf0
> >>> warn_slowpath_fmt+0x5f/0x80
> >>> devm_memremap_pages+0x3b5/0x4c0
> >>> __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >>> pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >>>
> >>> Recently it was discovered that the problem goes beyond RAM vs PMEM
> >>> collisions as some platform produce PMEM vs PMEM collisions within a
> >>
> >> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
> >> implemented "on top" of what we have right now. Or is this what we
> >> already have that you call "hacks in nvdimm" code? (no NVDIMM expert,
> >> sorry for the stupid questions)
> >
> > It doesn't work, because even if the padding was implemented 100%
> > correct, which thus far has failed to be the case, the platform may
> > change physical alignments from one boot to the next for a variety of
> > reasons.
>
> Would ignoring the System RAM parts (as mentioned above) help or doesn't
> it make any difference in terms of complexity?
Doesn't help much, that's only one of many collision sources.
> >>> given section. The libnvdimm workaround for that case revealed that the
> >>> libnvdimm section-alignment-padding implementation has been broken for a
> >>> long while. A fix for that long-standing breakage introduces as many
> >>> problems as it solves as it would require a backward-incompatible change
> >>> to the namespace metadata interpretation. Instead of that dubious route
> >>> [2], address the root problem in the memory-hotplug implementation."
> >>>
> >>> The approach is taken is to observe that each section already maintains
> >>> an array of 'unsigned long' values to hold the pageblock_flags. A single
> >>> additional 'unsigned long' is added to house a 'sub-section active'
> >>> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> >>> capacity which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> >>>
> >>> The implication of allowing sections to be piecemeal mapped/unmapped is
> >>> that the valid_section() helper is no longer authoritative to determine
> >>> if a section is fully mapped. Instead pfn_valid() is updated to consult
> >>> the section-active bitmask. Given that typical memory hotplug still has
> >>> deep "section" dependencies the sub-section capability is limited to
> >>> 'want_memblock=false' invocations of arch_add_memory(), effectively only
> >>> devm_memremap_pages() users for now.
> >>
> >> Ah, there it is. And my point would be, please don't ever unlock
> >> something like that for want_memblock=true. Especially not for memory
> >> added after boot via device drivers (add_memory()).
> >
> > I don't see a strong reason why not, as long as it does not regress
> > existing use cases. It might need to be an opt-in for new tooling that
> > is aware of finer granularity hotplug. That said, I have no pressing
> > need to go there and just care about the arch_add_memory() capability
> > for now.
>
> Especially onlining/offlining of memory might end up very ugly. And that
> goes hand in hand with memory block devices. They are either online or
> offline, not something in between. (I went that path and Michal
> correctly told me why it is not a good idea)
Thread reference?
> I was recently trying to teach memory block devices who their owner is /
> of which type they are. Right now I am looking into the option of using
> drivers. Memory block devices that could belong to different drivers at
> a time are well ... totally broken.
Sub-section support is aimed at a similar case where different
portions of an 128MB span need to handed out to devices / drivers with
independent lifetimes.
> I assume it would still be a special
> case, though, but conceptually speaking about the interface it would be
> allowed.
>
> Memory block devices (and therefore 1..X sections) should have one owner
> only. Anything else just does not fit.
Yes, but I would say the problem there is that the
memory-block-devices interface design is showing its age and is being
pressured with how systems want to deploy and use memory today.
next prev parent reply other threads:[~2019-03-28 21:32 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-03-22 16:57 [PATCH v5 00/10] mm: Sub-section memory hotplug support Dan Williams
2019-03-22 16:57 ` [PATCH v5 01/10] mm/sparsemem: Introduce struct mem_section_usage Dan Williams
2019-03-22 16:58 ` [PATCH v5 02/10] mm/sparsemem: Introduce common definitions for the size and mask of a section Dan Williams
2019-03-22 16:58 ` [PATCH v5 03/10] mm/sparsemem: Add helpers track active portions of a section at boot Dan Williams
2019-03-22 16:58 ` [PATCH v5 04/10] mm/hotplug: Prepare shrink_{zone, pgdat}_span for sub-section removal Dan Williams
2019-03-22 16:58 ` [PATCH v5 05/10] mm/sparsemem: Convert kmalloc_section_memmap() to populate_section_memmap() Dan Williams
2019-03-22 16:58 ` [PATCH v5 06/10] mm/sparsemem: Prepare for sub-section ranges Dan Williams
2019-03-22 16:58 ` [PATCH v5 07/10] mm/sparsemem: Support sub-section hotplug Dan Williams
2019-03-22 16:58 ` [PATCH v5 08/10] mm/devm_memremap_pages: Enable sub-section remap Dan Williams
2019-03-22 16:58 ` [PATCH v5 09/10] libnvdimm/pfn: Fix fsdax-mode namespace info-block zero-fields Dan Williams
2019-03-27 14:00 ` Sasha Levin
2019-03-22 16:58 ` [PATCH v5 10/10] libnvdimm/pfn: Stop padding pmem namespaces to section alignment Dan Williams
2019-03-22 18:05 ` [PATCH v5 00/10] mm: Sub-section memory hotplug support Michal Hocko
2019-03-22 18:32 ` Dan Williams
2019-03-25 10:19 ` Michal Hocko
2019-03-25 14:28 ` Jeff Moyer
2019-03-25 14:50 ` Michal Hocko
2019-03-25 20:03 ` Dan Williams
2019-03-26 8:04 ` Michal Hocko
2019-03-27 0:20 ` Dan Williams
2019-03-27 16:13 ` Michal Hocko
2019-03-27 16:17 ` Dan Williams
2019-03-28 13:38 ` David Hildenbrand
2019-03-28 14:16 ` Michal Hocko
2019-04-01 9:18 ` David Hildenbrand
2019-03-28 20:10 ` David Hildenbrand
2019-03-28 20:43 ` Dan Williams
2019-03-28 21:17 ` David Hildenbrand
2019-03-28 21:32 ` Dan Williams [this message]
2019-03-28 21:54 ` David Hildenbrand
2019-04-10 9:51 ` David Hildenbrand
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAPcyv4ivBagzsZ1fCDb2Cr3scz+R8ZVgyie5c=LWNd6QZuw36g@mail.gmail.com' \
--to=dan.j.williams@intel.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=jglisse@redhat.com \
--cc=jmoyer@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-nvdimm@lists.01.org \
--cc=logang@deltatee.com \
--cc=mhocko@suse.com \
--cc=stable@vger.kernel.org \
--cc=toshi.kani@hpe.com \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).