Date: Tue, 11 Sep 2018 11:16:08 +0200
From: Michal Hocko
To: Pasha Tatashin
Cc: "zaslonko@linux.ibm.com", Andrew Morton, LKML,
	Linux Memory Management List, "osalvador@suse.de",
	"gerald.schaefer@de.ibm.com"
Subject: Re: [PATCH] memory_hotplug: fix the panic when memory end is not on the section boundary
Message-ID: <20180911091608.GQ10951@dhcp22.suse.cz>
References: <20180910123527.71209-1-zaslonko@linux.ibm.com>
 <20180910131754.GG10951@dhcp22.suse.cz>
 <20180910135959.GI10951@dhcp22.suse.cz>
 <20180910141946.GJ10951@dhcp22.suse.cz>
 <20180910144152.GL10951@dhcp22.suse.cz>

On Mon 10-09-18 15:26:55, Pavel Tatashin wrote:
> 
> 
> On 9/10/18 10:41 AM, Michal Hocko wrote:
> > On Mon 10-09-18 14:32:16, Pavel Tatashin wrote:
> >> On Mon, Sep 10, 2018 at 10:19 AM Michal Hocko wrote:
> >>>
> >>> On Mon 10-09-18 14:11:45, Pavel Tatashin wrote:
> >>>> Hi Michal,
> >>>>
> >>>> It is tricky, but it probably can be done. Either change
> >>>> memmap_init_zone() or its caller to also cover the ends and starts of
> >>>> unaligned sections to initialize and reserve pages.
> >>>>
> >>>> The same thing would also need to be done in deferred_init_memmap() to
> >>>> cover the deferred init case.
> >>>
> >>> Well, I am not sure TBH. I have to think about that much more. Maybe it
> >>> would be much simpler to make sure that we never add incomplete
> >>> memblocks and simply refuse them during discovery. At least for now.
> >>
> >> On x86 memblocks can be up to 2G on machines with over 64G of RAM.
> > 
> > Sorry, I meant pageblock_nr_pages rather than memblocks.
> 
> OK. This sounds reasonable, but, to be honest, I am not sure how to
> achieve this yet; I need to think more about it. In theory, if we have
> a sparse memory model, it makes sense to enforce memory alignment to
> section sizes, which sounds a lot safer.

Memory hotplug is sparsemem only. You do not have to think about other
memory models, fortunately.
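For illustration only, a minimal sketch of such a section-alignment check
(untested, and the function name is hypothetical; PFN_DOWN(), PAGES_PER_SECTION
and IS_ALIGNED() are the usual helpers from <linux/pfn.h>, <linux/mmzone.h> and
<linux/kernel.h>):

/*
 * Illustrative sketch, not a tested patch: refuse hotplug ranges which
 * do not start and end on a sparsemem section boundary.
 */
static int check_section_aligned_range(u64 start, u64 size)
{
	unsigned long start_pfn = PFN_DOWN(start);
	unsigned long nr_pages = size >> PAGE_SHIFT;

	if (!nr_pages ||
	    !IS_ALIGNED(start_pfn, PAGES_PER_SECTION) ||
	    !IS_ALIGNED(nr_pages, PAGES_PER_SECTION)) {
		pr_err("section-unaligned hotplug range: start %#llx, size %#llx\n",
		       (unsigned long long)start, (unsigned long long)size);
		return -EINVAL;
	}

	return 0;
}

Whether such a check should refuse the whole range or only warn and clip the
unaligned tail is a separate question; outright refusal is exactly what would
generate the "missing memory" noise mentioned below.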

> >> Also, memory size is way too easy to change via qemu arguments when a VM
> >> starts. If we simply disable unaligned trailing memblocks, I am sure
> >> we would get tons of noise about missing memory.
> >>
> >> I think adding check_hotplug_memory_range() would work to fix the
> >> immediate problem. But we do need to figure out a better solution.
> >>
> >> The memblock design is based on the archaic assumption that hotplug units
> >> are physical DIMMs. VMs and hypervisors changed all of that, and we can
> >> have much finer hotplug requests on machines with huge DIMMs. Yet we
> >> do not want to pollute sysfs with millions of tiny memory devices. I
> >> am not sure what a long-term proper solution for this problem should
> >> be, but I see that the Linux hotplug/hotremove subsystems must be
> >> redesigned based on the new requirements.
> > 
> > Not an easy task though. Anyway, the sparse memory model is heavily based
> > on memory sections, so it makes some sense to have hotplug section-based
> > as well. Memblocks as a higher logical unit on top of that are kind of a
> > hack. The userspace API has never been properly thought through, I am
> > afraid.
> 
> I agree memoryblock is a hack; it fails at both things it was
> designed to do:
> 
> 1. On bare metal you cannot free a physical DIMM of memory at
> memoryblock granularity, because memory devices do not correspond to
> physical DIMMs. Thus, if for some reason a particular DIMM must be
> removed/replaced, memoryblock does not help us.

agreed

> 2. On machines with hypervisors it fails to provide an adequate
> granularity to add/remove memory.
> 
> We should define a new user interface where memory can be added/removed
> at a finer granularity, sparse section size, but without a memory
> device for each section. We should also provide optional access to a
> legacy interface where memory devices are exported but each is of
> section size.
> 
> So, when the legacy interface is enabled, the current way would work:
> 
> echo offline > /sys/devices/system/memory/memoryXXX/state
> 
> And the new interface would allow us to do something like this:
> 
> echo offline 256M > /sys/devices/system/node/nodeXXX/memory
> 
> With an optional start address for the offlined memory:
> echo offline [start_pa] size > /sys/devices/system/node/nodeXXX/memory
> start_pa and size must be section-size aligned (128M).

I am not sure what the expected semantics of the version without start_pa
would be.

> It would probably be a good topic for the next MM Summit: how to
> solve the current memory hotplug interface limitations.

Yes, sounds good to me. In any case, let's not pollute this email thread
with that discussion now.
-- 
Michal Hocko
SUSE Labs