From: Dan Williams
Date: Thu, 28 Mar 2019 14:32:42 -0700
Subject: Re: [PATCH v5 00/10] mm: Sub-section memory hotplug support
To: David Hildenbrand
Cc: Andrew Morton, Jérôme Glisse, Logan Gunthorpe, Toshi Kani,
 Jeff Moyer, Michal Hocko, Vlastimil Babka, stable, Linux MM,
 linux-nvdimm, Linux Kernel Mailing List
In-Reply-To: <24c163f2-3b78-827f-257e-70e5a9655806@redhat.com>
References: <155327387405.225273.9325594075351253804.stgit@dwillia2-desk3.amr.corp.intel.com>
 <24c163f2-3b78-827f-257e-70e5a9655806@redhat.com>
List-ID: linux-kernel@vger.kernel.org
On Thu, Mar 28, 2019 at 2:17 PM David Hildenbrand wrote:
>
> >> You are using the term "Sub-section memory hotplug support", but is it
> >> actually what you mean? To rephrase, aren't we talking here about
> >> "Sub-section device memory hotplug support" or similar?
> >
> > Specifically it is support for passing @start and @size arguments to
> > arch_add_memory() that are not section aligned. It's not limited to
> > "device memory", which is otherwise not a concept that
> > arch_add_memory() understands; it just groks spans of pfns.
>
> Okay, so everything that does not have memory block devices as of now.
>
> >
> >> Reason I am asking is because I wonder how that would interact with the
> >> memory block device infrastructure and hotplugging of system ram -
> >> add_memory()/add_memory_resource(). I *assume* you are not changing the
> >> add_memory() interface, so that one still only works with whole sections
> >> (or well, memory_block_size_bytes()) - check_hotplug_memory_range().
> >
> > Like you found below, the implementation enforces that add_memory_*()
> > interfaces maintain section alignment for @start and @size.
> >
> >> In general, mixing and matching system RAM and persistent memory per
> >> section, I am not a friend of that.
> >
> > You have no choice. The platform may decide to map PMEM and System RAM
> > in the same section because the Linux section is too large compared to
> > the typical memory controller mapping granularity capability.
>
> I might be very wrong here, but do we actually care about something like
> 64MB getting lost in the cracks? I mean if it simplifies core MM, let go
> of the couple of MB of system ram and handle the PMEM part only. Treat
> the system ram parts like memory holes we already have in ordinary
> sections (well, there we simply set the relevant struct pages to
> PG_reserved). Of course, if we have hundreds of unaligned devices and
> stuff starts to add up ... but I assume this is not the case?

That's precisely what we do today, and it has become untenable as the
collision scenarios pile up. This thread [1] is worth a read if you
care about some of the gory details of why I'm back to pushing for
sub-section support, but most of it has already been summarized in the
current discussion on this thread.

[1]: https://lore.kernel.org/lkml/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com/

> >
> >> Especially when it comes to memory
> >> block devices. But I am getting the feeling that we are rather targeting
> >> PMEM vs. PMEM with this patch series.
> >
> > The collisions are between System RAM, PMEM regions, and PMEM
> > namespaces (sub-divisions of regions that each need their own mapping
> > lifetime).
>
> Understood. I wonder if that PMEM-only mapping (including separate
> lifetime) could be handled differently. But I am absolutely no expert,
> just curious.

I refer you to the above thread trying to fix the libnvdimm-local hacks.
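To make the granularity problem concrete, here is a quick user-space
model of the alignment math (my sketch; the constant is the x86-64
default, and the helper name is made up, not the kernel's actual check):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SECTION_SIZE (1ULL << 27)       /* 128MB, x86-64 default */

/* made-up helper modeling the check_hotplug_memory_range()-style rule */
static bool range_is_section_aligned(uint64_t start, uint64_t size)
{
        return !(start & (SECTION_SIZE - 1)) && !(size & (SECTION_SIZE - 1));
}

int main(void)
{
        /* the mixed region from the backtrace quoted below */
        uint64_t start = 0x200000000ULL;
        uint64_t size = 0xfc000000ULL;  /* 4032MB, i.e. 31.5 sections */

        /* prints 0: the size leaves half a section hanging */
        printf("section aligned: %d\n", range_is_section_aligned(start, size));
        return 0;
}

Any range that trips that mask forces the section-granularity hotplug
code to either round out, and collide with a neighbor, or fail.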
> >
> >>> Quote patch7:
> >>>
> >>> "The libnvdimm sub-system has suffered a series of hacks and broken
> >>> workarounds for the memory-hotplug implementation's awkward
> >>> section-aligned (128MB) granularity. For example, the following
> >>> backtrace is emitted when attempting arch_add_memory() with physical
> >>> address ranges that intersect 'System RAM' (RAM) with 'Persistent
> >>> Memory' (PMEM) within a given section:
> >>>
> >>>  WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> >>>  devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> >>>  [..]
> >>>  Call Trace:
> >>>   dump_stack+0x86/0xc3
> >>>   __warn+0xcb/0xf0
> >>>   warn_slowpath_fmt+0x5f/0x80
> >>>   devm_memremap_pages+0x3b5/0x4c0
> >>>   __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> >>>   pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >>>
> >>> Recently it was discovered that the problem goes beyond RAM vs PMEM
> >>> collisions as some platforms produce PMEM vs PMEM collisions within a
> >>
> >> As side-noted by Michal, I wonder if PMEM vs. PMEM cannot rather be
> >> implemented "on top" of what we have right now. Or is this what we
> >> already have that you call "hacks in nvdimm" code? (no NVDIMM expert,
> >> sorry for the stupid questions)
> >
> > It doesn't work, because even if the padding was implemented 100%
> > correctly, which thus far has failed to be the case, the platform may
> > change physical alignments from one boot to the next for a variety of
> > reasons.
>
> Would ignoring the System RAM parts (as mentioned above) help, or doesn't
> it make any difference in terms of complexity?

Doesn't help much; that's only one of many collision sources.

> >>> given section. The libnvdimm workaround for that case revealed that the
> >>> libnvdimm section-alignment-padding implementation has been broken for a
> >>> long while. A fix for that long-standing breakage introduces as many
> >>> problems as it solves, as it would require a backward-incompatible change
> >>> to the namespace metadata interpretation. Instead of that dubious route
> >>> [2], address the root problem in the memory-hotplug implementation."
> >>>
> >>> The approach taken is to observe that each section already maintains
> >>> an array of 'unsigned long' values to hold the pageblock_flags. A single
> >>> additional 'unsigned long' is added to house a 'sub-section active'
> >>> bitmask. Each bit tracks the mapped state of one sub-section's worth of
> >>> capacity, which is SECTION_SIZE / BITS_PER_LONG, or 2MB on x86-64.
> >>>
> >>> The implication of allowing sections to be piecemeal mapped/unmapped is
> >>> that the valid_section() helper is no longer authoritative for determining
> >>> whether a section is fully mapped. Instead, pfn_valid() is updated to
> >>> consult the section-active bitmask. Given that typical memory hotplug
> >>> still has deep "section" dependencies, the sub-section capability is
> >>> limited to 'want_memblock=false' invocations of arch_add_memory(),
> >>> effectively only devm_memremap_pages() users for now.
> >>
> >> Ah, there it is. And my point would be, please don't ever unlock
> >> something like that for want_memblock=true. Especially not for memory
> >> added after boot via device drivers (add_memory()).
> >
> > I don't see a strong reason why not, as long as it does not regress
> > existing use cases. It might need to be an opt-in for new tooling that
> > is aware of finer-granularity hotplug. That said, I have no pressing
> > need to go there and just care about the arch_add_memory() capability
> > for now.
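For illustration, the sub-section bookkeeping quoted above boils down to
bitmap math along these lines (a user-space model with my own names,
assuming the x86-64 defaults of 4K pages and 128MB sections; not the
actual patch code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT           12                    /* 4K pages */
#define SECTION_SHIFT        27                    /* 128MB sections */
#define PAGES_PER_SECTION    (1UL << (SECTION_SHIFT - PAGE_SHIFT))
#define SUBSECTIONS          64                    /* BITS_PER_LONG */
#define PAGES_PER_SUBSECTION (PAGES_PER_SECTION / SUBSECTIONS) /* 512 = 2MB */

/* model of the one extra bitmask added per 'struct mem_section' */
struct section_model {
        uint64_t subsection_map;   /* bit N set => sub-section N is mapped */
};

static unsigned int subsection_index(uint64_t pfn)
{
        return (pfn & (PAGES_PER_SECTION - 1)) / PAGES_PER_SUBSECTION;
}

/* what pfn_valid() would consult instead of valid_section() alone */
static bool subsection_valid(const struct section_model *ms, uint64_t pfn)
{
        return ms->subsection_map & (1ULL << subsection_index(pfn));
}

int main(void)
{
        struct section_model ms = { 0 };
        unsigned int i;

        /* map only the first 64MB (32 sub-sections) of the section */
        for (i = 0; i < 32; i++)
                ms.subsection_map |= 1ULL << i;

        printf("pfn 0x100:  %d\n", subsection_valid(&ms, 0x100));  /* 1: 1MB in */
        printf("pfn 0x7e00: %d\n", subsection_valid(&ms, 0x7e00)); /* 0: 126MB in */
        return 0;
}

The point being that validity becomes a per-2MB question rather than a
per-128MB one.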
> Especially onlining/offlining of memory might end up very ugly. And that
> goes hand in hand with memory block devices. They are either online or
> offline, not something in between. (I went that path and Michal
> correctly told me why it is not a good idea)

Thread reference?

> I was recently trying to teach memory block devices who their owner is /
> of which type they are. Right now I am looking into the option of using
> drivers. Memory block devices that could belong to different drivers at
> a time are, well ... totally broken.

Sub-section support is aimed at a similar case where different portions
of a 128MB span need to be handed out to devices / drivers with
independent lifetimes.

> I assume it would still be a special
> case, though, but conceptually speaking about the interface it would be
> allowed.
>
> Memory block devices (and therefore 1..X sections) should have one owner
> only. Anything else just does not fit.

Yes, but I would say the problem there is that the memory-block-device
interface design is showing its age and is being pressured by how
systems want to deploy and use memory today.
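To put rough numbers on that mismatch (my arithmetic, not from the
series), carving the mixed region from the warning above into ownable
units at each granularity:

#include <stdio.h>

int main(void)
{
        unsigned long long span = 0xfc000000ULL;   /* mixed region above */
        unsigned long long block = 128ULL << 20;   /* memory block / section */
        unsigned long long subsec = 2ULL << 20;    /* sub-section */

        /* 32 all-or-nothing blocks vs 2016 independently mappable pieces */
        printf("blocks: %llu  sub-sections: %llu\n",
               (span + block - 1) / block, (span + subsec - 1) / subsec);
        return 0;
}

2016 sub-sections versus 32 all-or-nothing blocks is the difference
between modeling the independent lifetimes and papering over them.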