From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754257AbdGUOlC (ORCPT ); Fri, 21 Jul 2017 10:41:02 -0400 Received: from mail-wm0-f68.google.com ([74.125.82.68]:36576 "EHLO mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754106AbdGUOjd (ORCPT ); Fri, 21 Jul 2017 10:39:33 -0400 From: Michal Hocko To: Andrew Morton Cc: Mel Gorman , Johannes Weiner , Vlastimil Babka , , LKML , Michal Hocko Subject: [PATCH 7/9] mm, page_alloc: remove stop_machine from build_all_zonelists Date: Fri, 21 Jul 2017 16:39:13 +0200 Message-Id: <20170721143915.14161-8-mhocko@kernel.org> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20170721143915.14161-1-mhocko@kernel.org> References: <20170721143915.14161-1-mhocko@kernel.org> Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Michal Hocko build_all_zonelists has been (ab)using stop_machine to make sure that zonelists do not change while somebody is looking at them. This is is just a gross hack because a) it complicates the context from which we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug: switch locking to a percpu rwsem")) and b) is is not really necessary especially after "mm, page_alloc: simplify zonelist initialization" and c) it doesn't really provide the protection it claims (see below). Updates of the zonelists happen very seldom, basically only when a zone becomes populated during memory online or when it loses all the memory during offline. A racing iteration over zonelists could either miss a zone or try to work on one zone twice. Both of these are something we can live with occasionally because there will always be at least one zone visible so we are not likely to fail allocation too easily for example. Please note that the original stop_machine approach doesn't really provide a better exclusion because the iteration might be interrupted half way (unless the whole iteration is preempt disabled which is not the case in most cases) so the some zones could still be seen twice or a zone missed. I have run the pathological online/offline of the single memblock in the movable zone while stressing the same small node with some memory pressure. Node 1, zone DMA pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 943, 943, 943) Node 1, zone DMA32 pages free 227310 min 8294 low 10367 high 12440 spanned 262112 present 262112 managed 241436 protection: (0, 0, 0, 0) Node 1, zone Normal pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 0, 1024) Node 1, zone Movable pages free 32722 min 85 low 117 high 149 spanned 32768 present 32768 managed 32768 protection: (0, 0, 0, 0) root@test1:/sys/devices/system/node/node1# while true do echo offline > memory34/state echo online_movable > memory34/state done root@test1:/mnt/data/test/linux-3.7-rc5# numactl --preferred=1 make -j4 and it survived without any unexpected behavior. While this is not really a great testing coverage it should exercise the allocation path quite a lot. Acked-by: Vlastimil Babka Signed-off-by: Michal Hocko --- mm/page_alloc.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0d78dc5a708f..cf2eb3cf2cc5 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5073,8 +5073,7 @@ static DEFINE_PER_CPU(struct per_cpu_nodestat, boot_nodestats); */ DEFINE_MUTEX(zonelists_mutex); -/* return values int ....just for stop_machine() */ -static int __build_all_zonelists(void *data) +static void __build_all_zonelists(void *data) { int nid; int __maybe_unused cpu; @@ -5110,8 +5109,6 @@ static int __build_all_zonelists(void *data) set_cpu_numa_mem(cpu, local_memory_node(cpu_to_node(cpu))); #endif } - - return 0; } static noinline void __init @@ -5153,9 +5150,7 @@ void __ref build_all_zonelists(pg_data_t *pgdat) if (system_state == SYSTEM_BOOTING) { build_all_zonelists_init(); } else { - /* we have to stop all cpus to guarantee there is no user - of zonelist */ - stop_machine_cpuslocked(__build_all_zonelists, pgdat, NULL); + __build_all_zonelists(pgdat); /* cpuset refresh routine should be here */ } vm_total_pages = nr_free_pagecache_pages(); -- 2.11.0