From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754312AbdDDTla (ORCPT ); Tue, 4 Apr 2017 15:41:30 -0400
Received: from mx2.suse.de ([195.135.220.15]:44495 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753403AbdDDTl2 (ORCPT ); Tue, 4 Apr 2017 15:41:28 -0400
Date: Tue, 4 Apr 2017 21:41:22 +0200
From: Michal Hocko
To: Reza Arbab
Cc: Mel Gorman, linux-mm@kvack.org, Andrew Morton, Vlastimil Babka,
	Andrea Arcangeli, Yasuaki Ishimatsu, Tang Chen, qiuxishi@huawei.com,
	Kani Toshimitsu, slaoub@gmail.com, Joonsoo Kim, Andi Kleen,
	Zhang Zhen, David Rientjes, Daniel Kiper, Igor Mammedov,
	Vitaly Kuznetsov, LKML, Chris Metcalf, Dan Williams,
	Heiko Carstens, Lai Jiangshan, Martin Schwidefsky
Subject: Re: [PATCH 0/6] mm: make movable onlining suck less
Message-ID: <20170404194122.GS15132@dhcp22.suse.cz>
References: <20170403115545.GK24661@dhcp22.suse.cz>
 <20170403195830.64libncet5l6vuvb@arbab-laptop>
 <20170403202337.GA12482@dhcp22.suse.cz>
 <20170403204213.rs7k2cvsnconel2z@arbab-laptop>
 <20170404072329.GA15132@dhcp22.suse.cz>
 <20170404073412.GC15132@dhcp22.suse.cz>
 <20170404082302.GE15132@dhcp22.suse.cz>
 <20170404160239.ftvuxklioo6zvuxl@arbab-laptop>
 <20170404164452.GQ15132@dhcp22.suse.cz>
 <20170404183012.a6biape5y7vu6cjm@arbab-laptop>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170404183012.a6biape5y7vu6cjm@arbab-laptop>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 04-04-17 13:30:13, Reza Arbab wrote:
> On Tue, Apr 04, 2017 at 06:44:53PM +0200, Michal Hocko wrote:
> >Thanks for your testing! This is highly appreciated.
> >Can I assume your Tested-by?
> 
> Of course! Not quite done, though.

Ohh, I didn't mean to rush you to that!

> I think I found another edge case. You get an oops when removing all
> of a node's memory:
> 
>  __nr_to_section
>  __pfn_to_section
>  find_biggest_section_pfn
>  shrink_pgdat_span
>  __remove_zone
>  __remove_section
>  __remove_pages
>  arch_remove_memory
>  remove_memory

Is this something new or an old issue? I believe the state after the
online should be the same as before. So if you onlined the full node
then there shouldn't be any difference. Let me have a look...

> I stuck some debugging prints in, for context:
> 
> shrink_pgdat_span: start_pfn=0x10000, end_pfn=0x10100, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> shrink_pgdat_span: start_pfn=0x10100, end_pfn=0x10200, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> ...%<...
> shrink_pgdat_span: start_pfn=0x1fe00, end_pfn=0x1ff00, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> shrink_pgdat_span: start_pfn=0x1ff00, end_pfn=0x20000, pgdat_start_pfn=0x0, pgdat_end_pfn=0x20000
> find_biggest_section_pfn: start_pfn=0x0, end_pfn=0x1ff00
> find_biggest_section_pfn loop: pfn=0x1feff, sec_nr = 0x1fe
> find_biggest_section_pfn loop: pfn=0x1fdff, sec_nr = 0x1fd
> ...%<...
> find_biggest_section_pfn loop: pfn=0x1ff, sec_nr = 0x1
> find_biggest_section_pfn loop: pfn=0xff, sec_nr = 0x0
> find_biggest_section_pfn loop: pfn=0xffffffffffffffff, sec_nr = 0xffffffffffffff
> Unable to handle kernel paging request for data at address 0xc000800000f19e78

...this looks like a straight underflow and it is clear that the code
is just broken. Have a look at the loop:

	pfn = end_pfn - 1;
	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {

Assume that end_pfn is properly PAGES_PER_SECTION aligned (start_pfn
would be 0, obviously). This is unsigned arithmetic and so it cannot
work for the first section.
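The wraparound is easy to reproduce in isolation. Here is a minimal
userspace sketch of the same arithmetic (a section size of 0x100 pages
is assumed, to match the 0x100 steps in the debug output above):

	#include <stdio.h>

	#define PAGES_PER_SECTION 0x100UL	/* assumed, matches the output above */

	int main(void)
	{
		unsigned long start_pfn = 0;		/* the node starts at pfn 0 */
		unsigned long end_pfn = 0x1ff00;	/* from the debug output */
		unsigned long pfn;

		/*
		 * The same loop shape as find_biggest_section_pfn(). With
		 * unsigned pfn and start_pfn == 0 the condition is always
		 * true, so nothing stops the walk below section 0.
		 */
		for (pfn = end_pfn - 1; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
			if (pfn > end_pfn) {
				/* 0xff - 0x100 wrapped around */
				printf("underflow: pfn=0x%lx\n", pfn);
				return 1;
			}
		}
		return 0;
	}

This prints underflow: pfn=0xffffffffffffffff, matching the last loop
iteration in the debug trace right before the oops.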
So the code is broken and has been broken since it was introduced.
Nobody has noticed because the low pfns are usually reserved and out of
the hotplug reach.

We could tweak it (a sketch of one possible guard is below), but I am
not even sure we really want/need this behavior. It complicates the
code and I am not really sure we need to support

	online_movable(range)
	offline_movable(range)
	online_kernel(range)

While the flexibility is attractive, I do not think it is worth the
additional complexity without any proof of the usecase, especially when
we consider that this only works when we offline from the start or the
end of the zone, or the whole zone.

I guess it would be best to simply revert this whole thing. It is quite
a lot of code with a dubious use. What do the Fujitsu guys think about
it?
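For reference, one way such a guard could look (an untested sketch only;
the patch below deliberately removes the code instead of fixing it; the
sketch also returns unsigned long, since returning a pfn as int would
truncate it):

	/* find the biggest valid pfn in the range [start_pfn, end_pfn) */
	static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
						      unsigned long start_pfn,
						      unsigned long end_pfn)
	{
		unsigned long pfn = end_pfn - 1;

		for (;;) {
			struct mem_section *ms = __pfn_to_section(pfn);

			if (valid_section(ms) && pfn_to_nid(pfn) == nid &&
			    (!zone || zone == page_zone(pfn_to_page(pfn))))
				return pfn;

			/* stop before the subtraction can wrap below start_pfn */
			if (pfn - start_pfn < PAGES_PER_SECTION)
				break;
			pfn -= PAGES_PER_SECTION;
		}

		return 0;
	}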
---
>From 1b08ecef3e8ebcef585fe8f2b23155be54cce335 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 4 Apr 2017 21:09:00 +0200
Subject: [PATCH] mm, hotplug: get rid of zone/node shrinking

This is basically a revert of 815121d2b5cd ("memory_hotplug: clear zone
when removing the memory"). While the node/zone shrinking sounds
attractive at first sight, because it allows

	online_movable(range)
	[...]
	offline_movable(range)
	[...]
	online_kernel(range)

it requires that the range is at the beginning or the end of a zone, or
that we operate on the whole zone. The code is also broken, as noticed
by Reza Arbab. He has triggered an oops when offlining a full node:

	Unable to handle kernel paging request for data at address 0xc000800000f19e78
	__nr_to_section
	__pfn_to_section
	find_biggest_section_pfn
	shrink_pgdat_span
	__remove_zone
	__remove_section
	__remove_pages
	arch_remove_memory
	remove_memory

which is caused by an underflow in find_biggest_section_pfn. This code
simply cannot work on the first section [0, PAGES_PER_SECTION) because

	pfn = end_pfn - 1;
	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {

pfn would underflow in the unsigned arithmetic. This usually doesn't
happen because the lowest pfns are usually reserved and out of the
hotplug reach.

The changelog of the above commit doesn't mention any such usecase and
sounds more like a nice-to-have, an inverse to __add_zone, which we are
trying to get rid of in this series. So let's simplify the code and
remove the complication. We can reintroduce it along with a valid
usecase description.

Signed-off-by: Michal Hocko
---
 mm/memory_hotplug.c | 207 ----------------------------------------------------
 1 file changed, 207 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index a358d7a67651..d48a4456b20d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -379,212 +379,9 @@ EXPORT_SYMBOL_GPL(__add_pages);
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
-static int find_smallest_section_pfn(int nid, struct zone *zone,
-				     unsigned long start_pfn,
-				     unsigned long end_pfn)
-{
-	struct mem_section *ms;
-
-	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(start_pfn);
-
-		if (unlikely(!valid_section(ms)))
-			continue;
-
-		if (unlikely(pfn_to_nid(start_pfn) != nid))
-			continue;
-
-		if (zone && zone != page_zone(pfn_to_page(start_pfn)))
-			continue;
-
-		return start_pfn;
-	}
-
-	return 0;
-}
-
-/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
-static int find_biggest_section_pfn(int nid, struct zone *zone,
-				    unsigned long start_pfn,
-				    unsigned long end_pfn)
-{
-	struct mem_section *ms;
-	unsigned long pfn;
-
-	/* pfn is the end pfn of a memory section. */
-	pfn = end_pfn - 1;
-	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!valid_section(ms)))
-			continue;
-
-		if (unlikely(pfn_to_nid(pfn) != nid))
-			continue;
-
-		if (zone && zone != page_zone(pfn_to_page(pfn)))
-			continue;
-
-		return pfn;
-	}
-
-	return 0;
-}
-
-static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
-			     unsigned long end_pfn)
-{
-	unsigned long zone_start_pfn = zone->zone_start_pfn;
-	unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
-	unsigned long zone_end_pfn = z;
-	unsigned long pfn;
-	struct mem_section *ms;
-	int nid = zone_to_nid(zone);
-
-	zone_span_writelock(zone);
-	if (zone_start_pfn == start_pfn) {
-		/*
-		 * If the section is smallest section in the zone, it need
-		 * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
-		 * In this case, we find second smallest valid mem_section
-		 * for shrinking zone.
-		 */
-		pfn = find_smallest_section_pfn(nid, zone, end_pfn,
-						zone_end_pfn);
-		if (pfn) {
-			zone->zone_start_pfn = pfn;
-			zone->spanned_pages = zone_end_pfn - pfn;
-		}
-	} else if (zone_end_pfn == end_pfn) {
-		/*
-		 * If the section is biggest section in the zone, it need
-		 * shrink zone->spanned_pages.
-		 * In this case, we find second biggest valid mem_section for
-		 * shrinking zone.
-		 */
-		pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
-					       start_pfn);
-		if (pfn)
-			zone->spanned_pages = pfn - zone_start_pfn + 1;
-	}
-
-	/*
-	 * The section is not biggest or smallest mem_section in the zone, it
-	 * only creates a hole in the zone. So in this case, we need not
-	 * change the zone. But perhaps, the zone has only hole data. Thus
-	 * it check the zone has only hole or not.
-	 */
-	pfn = zone_start_pfn;
-	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!valid_section(ms)))
-			continue;
-
-		if (page_zone(pfn_to_page(pfn)) != zone)
-			continue;
-
-		/* If the section is current section, it continues the loop */
-		if (start_pfn == pfn)
-			continue;
-
-		/* If we find valid section, we have nothing to do */
-		zone_span_writeunlock(zone);
-		return;
-	}
-
-	/* The zone has no valid section */
-	zone->zone_start_pfn = 0;
-	zone->spanned_pages = 0;
-	zone_span_writeunlock(zone);
-}
-
-static void shrink_pgdat_span(struct pglist_data *pgdat,
-			      unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
-	unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
-	unsigned long pgdat_end_pfn = p;
-	unsigned long pfn;
-	struct mem_section *ms;
-	int nid = pgdat->node_id;
-
-	if (pgdat_start_pfn == start_pfn) {
-		/*
-		 * If the section is smallest section in the pgdat, it need
-		 * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
-		 * In this case, we find second smallest valid mem_section
-		 * for shrinking zone.
-		 */
-		pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
-						pgdat_end_pfn);
-		if (pfn) {
-			pgdat->node_start_pfn = pfn;
-			pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
-		}
-	} else if (pgdat_end_pfn == end_pfn) {
-		/*
-		 * If the section is biggest section in the pgdat, it need
-		 * shrink pgdat->node_spanned_pages.
-		 * In this case, we find second biggest valid mem_section for
-		 * shrinking zone.
-		 */
-		pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
-					       start_pfn);
-		if (pfn)
-			pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
-	}
-
-	/*
-	 * If the section is not biggest or smallest mem_section in the pgdat,
-	 * it only creates a hole in the pgdat. So in this case, we need not
-	 * change the pgdat.
-	 * But perhaps, the pgdat has only hole data. Thus it check the pgdat
-	 * has only hole or not.
-	 */
-	pfn = pgdat_start_pfn;
-	for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
-
-		if (unlikely(!valid_section(ms)))
-			continue;
-
-		if (pfn_to_nid(pfn) != nid)
-			continue;
-
-		/* If the section is current section, it continues the loop */
-		if (start_pfn == pfn)
-			continue;
-
-		/* If we find valid section, we have nothing to do */
-		return;
-	}
-
-	/* The pgdat has no valid section */
-	pgdat->node_start_pfn = 0;
-	pgdat->node_spanned_pages = 0;
-}
-
-static void __remove_zone(struct zone *zone, unsigned long start_pfn)
-{
-	struct pglist_data *pgdat = zone->zone_pgdat;
-	int nr_pages = PAGES_PER_SECTION;
-	int zone_type;
-	unsigned long flags;
-
-	zone_type = zone - pgdat->node_zones;
-
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
-	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
-}
-
 static int __remove_section(struct zone *zone, struct mem_section *ms,
 		unsigned long map_offset)
 {
-	unsigned long start_pfn;
-	int scn_nr;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -594,10 +391,6 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
 	if (ret)
 		return ret;
 
-	scn_nr = __section_nr(ms);
-	start_pfn = section_nr_to_pfn(scn_nr);
-	__remove_zone(zone, start_pfn);
-
 	sparse_remove_one_section(zone, ms, map_offset);
 	return 0;
 }
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs