From mboxrd@z Thu Jan 1 00:00:00 1970
From: Michal Hocko
Subject: Re: WTH is going on with memory hotplug sysf interface (was: Re: [RFC PATCH] mm, hotplug: get rid of auto_online_blocks)
Date: Mon, 13 Mar 2017 15:36:17 +0100
Message-ID: <20170313143617.GR31518@dhcp22.suse.cz>
References: <20170302180315.78975d4b@nial.brq.redhat.com>
 <20170303082723.GB31499@dhcp22.suse.cz>
 <20170303183422.6358ee8f@nial.brq.redhat.com>
 <20170306145417.GG27953@dhcp22.suse.cz>
 <20170307134004.58343e14@nial.brq.redhat.com>
 <20170309125400.GI11592@dhcp22.suse.cz>
 <20170310135807.GI3753@dhcp22.suse.cz>
 <20170313113110.6a9636a1@nial.brq.redhat.com>
 <20170313104302.GK31518@dhcp22.suse.cz>
 <20170313145712.49a2d346@nial.brq.redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170313145712.49a2d346@nial.brq.redhat.com>
Sender: owner-linux-mm@kvack.org
To: Igor Mammedov
Cc: Heiko Carstens, Vitaly Kuznetsov, linux-mm@kvack.org, Andrew Morton,
 Greg KH, "K. Y. Srinivasan", David Rientjes, Daniel Kiper,
 linux-api@vger.kernel.org, LKML, linux-s390@vger.kernel.org,
 xen-devel@lists.xenproject.org, linux-acpi@vger.kernel.org,
 qiuxishi@huawei.com, toshi.kani@hpe.com, xieyisheng1@huawei.com,
 slaoub@gmail.com, iamjoonsoo.kim@lge.com, vbabka@suse.cz, Zhang Zhen,
 Reza Arbab, Yasuaki Ishimatsu, Tang Chen
List-Id: linux-acpi@vger.kernel.org

On Mon 13-03-17 14:57:12, Igor Mammedov wrote:
> On Mon, 13 Mar 2017 11:43:02 +0100
> Michal Hocko wrote:
>
> > On Mon 13-03-17 11:31:10, Igor Mammedov wrote:
> > > On Fri, 10 Mar 2017 14:58:07 +0100
> > [...]
> > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
> > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
> > > > [ 0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
> > > > [ 0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
> > > > [ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0x3fffffff] -> [mem 0x00000000-0x3fffffff]
> > > > [ 0.000000] NODE_DATA(0) allocated [mem 0x3fffc000-0x3fffffff]
> > > > [ 0.000000] NODE_DATA(1) allocated [mem 0x7ffdc000-0x7ffdffff]
> > > > [ 0.000000] Zone ranges:
> > > > [ 0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
> > > > [ 0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffdffff]
> > > > [ 0.000000]   Normal   empty
> > > > [ 0.000000] Movable zone start for each node
> > > > [ 0.000000] Early memory node ranges
> > > > [ 0.000000]   node 0: [mem 0x0000000000001000-0x000000000009efff]
> > > > [ 0.000000]   node 0: [mem 0x0000000000100000-0x000000003fffffff]
> > > > [ 0.000000]   node 1: [mem 0x0000000040000000-0x000000007ffdffff]
> > > >
> > > > so there is neither any normal zone nor movable one at the boot time.
> > > it could be if hotpluggable memory were present at boot time in E820 table
> > > (if I remember right when running on hyperv there is movable zone at boot time),
> > >
> > > but in qemu hotpluggable memory isn't put into E820,
> > > so zone is allocated later when memory is enumerated
> > > by ACPI subsystem and onlined.
> > > It causes fewer issues wrt movable zone and works for
> > > different versions of linux/windows as well.
> > >
> > > That's where in-kernel auto-onlining could also be useful,
> > > since user would be able to start up with small
> > > non-removable memory plus several removable DIMMs
> > > and have all the memory onlined/available by the time
> > > initrd is loaded. (missing piece here is onlining
> > > removable memory as movable by default).
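
As a cross-check on the boot log quoted above, here is a minimal sketch that parses the four SRAT lines and does the address arithmetic. The helper name `parse_srat` and the regular expression are made up for illustration; only the log lines themselves come from the thread. It assumes x86_64, where everything below 4 GiB falls into DMA/DMA32.

```python
import re

# The four SRAT lines from the quoted boot log (timestamps dropped).
DMESG = """\
ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x3fffffff]
ACPI: SRAT: Node 1 PXM 1 [mem 0x40000000-0x7fffffff]
ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x27fffffff] hotplug
"""

# Hypothetical pattern, written only for the lines shown above.
SRAT_RE = re.compile(
    r"ACPI: SRAT: Node (\d+) PXM (\d+) "
    r"\[mem (0x[0-9a-f]+)-(0x[0-9a-f]+)\]( hotplug)?"
)

FOUR_GIB = 1 << 32  # upper bound of DMA32 on x86_64


def parse_srat(text):
    """Return one dict per SRAT memory range found in `text`."""
    ranges = []
    for m in SRAT_RE.finditer(text):
        start = int(m.group(3), 16)
        end = int(m.group(4), 16)  # inclusive end address
        ranges.append({
            "node": int(m.group(1)),
            "start": start,
            "end": end,
            "size": end - start + 1,
            "hotplug": m.group(5) is not None,
        })
    return ranges


ranges = parse_srat(DMESG)
boot = [r for r in ranges if not r["hotplug"]]
hotplug = [r for r in ranges if r["hotplug"]]

# Every boot-time range ends below 4 GiB, so nothing lands in Normal.
boot_below_4g = all(r["end"] < FOUR_GIB for r in boot)
# The hotpluggable window starts at 4 GiB; its size is given in GiB.
hotplug_gib = hotplug[0]["size"] / (1 << 30)
```

With these inputs the hotpluggable window works out to 6 GiB starting exactly at the 4 GiB boundary, and all boot-time ranges sit below it, which matches "Normal empty" in the zone listing.
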
> >
> > Why should we even care to online that memory that early rather than
> > making it available via e820?
>
> It's not forbidden by spec and has fewer complications
> when it comes to removable memory. Declaring it in E820
> would add the following limitations/drawbacks:
>  - firmware should be able to exclude removable memory
>    from its usage (currently neither SeaBIOS nor EFI has to
>    know/care about it) => less qemu-guest ABI to maintain.
>  - OS should be taught to avoid/move (early) nonmovable
>    allocations from removable address ranges.
>    There were patches targeting that in recent kernels,
>    but it won't work with older kernels that don't have them.
>    So limiting the range of OSes that could run on QEMU
>    and do memory removal.
>
> The E820-less approach works reasonably well with a wide range
> of guest OSes and is less complex than if removable memory
> were present in E820. Hence I don't have a compelling
> reason to introduce removable memory in E820 as it
> only adds to hot(un)plug issues.

OK, I see, and that sounds like an argument not to put those ranges into
E820. I still fail to see why we have to online the memory early during
boot and cannot wait for userspace to run.
-- 
Michal Hocko
SUSE Labs
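
For reference, a minimal sketch of what "waiting for userspace" amounts to: once memory blocks show up under /sys/devices/system/memory, a userspace agent walks them and writes the desired state into each offline block's `state` file. The function name `online_offline_blocks` and its parameters are hypothetical; the sysfs path and the `online_movable` value are the documented memory hotplug interface. This is only an illustration under those assumptions, not what any particular distribution ships.

```python
import os
import re


def online_offline_blocks(base="/sys/devices/system/memory",
                          state="online_movable"):
    """Write `state` to every offline memory block under `base`.

    Returns the names of the blocks the write succeeded for.
    `base` is a parameter so the walk can be tried on a fake tree.
    """
    # Only memoryN directories are memory blocks; skip control files
    # such as auto_online_blocks or block_size_bytes.
    blocks = [n for n in os.listdir(base) if re.fullmatch(r"memory\d+", n)]
    blocks.sort(key=lambda n: int(n[len("memory"):]))

    onlined = []
    for name in blocks:
        path = os.path.join(base, name, "state")
        try:
            with open(path) as f:
                current = f.read().strip()
        except OSError:
            continue  # block went away or is unreadable
        if current != "offline":
            continue
        try:
            with open(path, "w") as f:
                f.write(state)
        except OSError:
            continue  # the kernel refused the request
        onlined.append(name)
    return onlined
```

Calling `online_offline_blocks()` with the defaults needs root on a real system. Passing `state="online"` instead leaves the zone choice to the kernel, which is the "missing piece" mentioned in the quoted text: the zone has to be requested explicitly to get movable memory.
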