From: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
To: Michal Hocko <mhocko@kernel.org>, Igor Mammedov <imammedo@redhat.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Greg KH <gregkh@linuxfoundation.org>,
"K. Y. Srinivasan" <kys@microsoft.com>,
David Rientjes <rientjes@google.com>,
Daniel Kiper <daniel.kiper@oracle.com>,
linux-api@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
linux-s390@vger.kernel.org, xen-devel@lists.xenproject.org,
linux-acpi@vger.kernel.org, qiuxishi@huawei.com,
toshi.kani@hpe.com, xieyisheng1@huawei.com, slaoub@gmail.com,
iamjoonsoo.kim@lge.com, vbabka@suse.cz,
Zhang Zhen <zhenzhang.zhang@huawei.com>,
Reza Arbab <arbab@linux.vnet.ibm.com>,
Tang Chen <tangchen@cn.fujitsu.com>
Subject: Re: WTH is going on with memory hotplug sysfs interface
Date: Fri, 10 Mar 2017 12:39:27 -0500
Message-ID: <75ee9d3f-7027-782a-9cde-5192396a4a8c@gmail.com>
In-Reply-To: <20170310135807.GI3753@dhcp22.suse.cz>
On 03/10/2017 08:58 AM, Michal Hocko wrote:
> Let's CC people touching this logic. A short summary is that onlining
> memory via udev is currently unusable for online_movable because blocks
> are added from lower addresses while only the last blocks are allowed
> to be Movable. More below.
>
> On Thu 09-03-17 13:54:00, Michal Hocko wrote:
>> On Tue 07-03-17 13:40:04, Igor Mammedov wrote:
>>> On Mon, 6 Mar 2017 15:54:17 +0100
>>> Michal Hocko <mhocko@kernel.org> wrote:
>>>
>>>> On Fri 03-03-17 18:34:22, Igor Mammedov wrote:
>> [...]
>>>>> in the current mainline kernel it triggers the following code path:
>>>>>
>>>>> online_pages()
>>>>>     ...
>>>>>     if (online_type == MMOP_ONLINE_KERNEL) {
>>>>>         if (!zone_can_shift(pfn, nr_pages, ZONE_NORMAL, &zone_shift))
>>>>>             return -EINVAL;
>>>>
>>>> Are you sure? I would expect MMOP_ONLINE_MOVABLE here
>>> pretty much; the reproducer is above, so try it and see for yourself
>>
>> I will play with this...
>
> OK, so I tried it with -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G, which generated
> [...]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.000000] DMA32 [mem 0x0000000001000000-0x000000007ffdffff]
> [ 0.000000] Normal empty
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x000000003fffffff]
> [ 0.000000] node 1: [mem 0x0000000040000000-0x000000007ffdffff]
>
> so there is neither a Normal nor a Movable zone at boot time.
> Then I hotplugged a 1G slot
> (qemu) object_add memory-backend-ram,id=mem1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
>
> unfortunately the memory didn't show up automatically and I got
> [ 116.375781] acpi PNP0C80:00: Enumeration failure
>
> so I had to probe it manually (probably the BIOS my qemu uses doesn't
> support auto probing - I haven't really dug further). Anyway, the SRAT
> table printed during boot told us that we should start at 0x100000000
>
> # echo 0x100000000 > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory32/valid_zones
> Normal Movable
>
> which looks reasonably right? Both Normal and Movable zones are allowed
>
> # echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
>
> Huh, so our valid_zones have changed under our feet...
>
> # echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal
> /sys/devices/system/memory/memory34/valid_zones:Normal Movable
>
> and again. So only the last memblock is considered movable. Let's try to
> online them now.
>
> # echo online_movable > /sys/devices/system/memory/memory34/state
> # grep . /sys/devices/system/memory/memory3?/valid_zones
> /sys/devices/system/memory/memory32/valid_zones:Normal
> /sys/devices/system/memory/memory33/valid_zones:Normal Movable
> /sys/devices/system/memory/memory34/valid_zones:Movable Normal
>
I think there is no strong reason why the kernel has this restriction.
Setting the restriction seems to have made managing these zone
structs simpler.
Thanks,
Yasuaki Ishimatsu
> This would explain why onlining from the last block actually works, but
> to me this sounds like completely crappy behavior. All we need to
> guarantee AFAICS is that Normal and Movable zones do not overlap. I
> believe there is not even a real requirement about the ordering of physical
> memory in Normal vs. Movable zones as long as they do not overlap. But
> let's keep it simple for the start and always enforce the current status
> quo that the Normal zone physically precedes the Movable zone.
> Can somebody explain why we cannot have a simple rule for Normal vs.
> Movable which would be:
> - block [pfn, pfn+block_size] can be Normal if
>   !zone_populated(MOVABLE) || pfn+block_size < ZONE_MOVABLE->zone_start_pfn
> - block [pfn, pfn+block_size] can be Movable if
>   !zone_populated(NORMAL) || ZONE_NORMAL->zone_end_pfn < pfn
>
> I haven't fully grokked all the restrictions on the movable zone size
> based on the kernel parameters (find_zone_movable_pfns_for_nodes) but
> this shouldn't make the situation much more complicated, I
> believe, because those parameters should be mostly about static
> initialization rather than hotplug but I might be easily missing
> something.
>