From: Mel Gorman <mgorman@suse.de>
To: Michal Hocko <mhocko@kernel.org>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	LKML <linux-kernel@vger.kernel.org>,
	Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH 6/9] mm, page_alloc: simplify zonelist initialization
Date: Fri, 14 Jul 2017 13:46:46 +0100
Message-ID: <20170714124645.i3duhuie6cczlybr@suse.de>
In-Reply-To: <20170714080006.7250-7-mhocko@kernel.org>

On Fri, Jul 14, 2017 at 10:00:03AM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> build_zonelists gradually builds zonelists from the nearest to the most
> distant node. As we do not know how many populated zones we will have in
> each node we rely on the _zoneref to terminate initialized part of the
> zonelist by a NULL zone. While this is functionally correct it is quite
> suboptimal because we cannot allow updaters to race with zonelists
> users because they could see an empty zonelist and fail the allocation
> or hit the OOM killer in the worst case.
>
> We can do much better, though. We can store the node ordering into an
> already existing node_order array and then give this array to
> build_zonelists_in_node_order and do the whole initialization at once.
> zonelists consumers still might see halfway initialized state but that
> should be much more tolerateable because the list will not be empty and
> they would either see some zone twice or skip over some zone(s) in the
> worst case which shouldn't lead to immediate failures.
>
> This patch alone doesn't introduce any functional change yet, though, it
> is merely a preparatory work for later changes.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/page_alloc.c | 42 ++++++++++++++++++------------------------
>  1 file changed, 18 insertions(+), 24 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 00e117922b3f..78bd62418380 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4913,17 +4913,20 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
>   * This results in maximum locality--normal zone overflows into local
>   * DMA zone, if any--but risks exhausting DMA zone.
>   */
> -static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
> +static void build_zonelists_in_node_order(pg_data_t *pgdat, int *node_order)
>  {
> -	int j;
>  	struct zonelist *zonelist;
> +	int i, zoneref_idx = 0;
>
>  	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
> -	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
> -		;
> -	j = build_zonelists_node(NODE_DATA(node), zonelist, j);
> -	zonelist->_zonerefs[j].zone = NULL;
> -	zonelist->_zonerefs[j].zone_idx = 0;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++) {
> +		pg_data_t *node = NODE_DATA(node_order[i]);
> +
> +		zoneref_idx = build_zonelists_node(node, zonelist, zoneref_idx);
> +	}

The naming here is weird to say the least and makes this a lot more
confusing than it needs to be. Primarily, it's because the zoneref_idx
parameter gets renamed to nr_zones in build_zonelists_node, where it has
nothing to do with the number of zones at all.

It also iterates for longer than it needs to. MAX_NUMNODES can be a large
value of mostly empty nodes but it happily goes through them anyway. Pass
zoneref_idx in as a pointer that is updated by the function and use the
return value to break the loop when an empty node is encountered?

> +	zonelist->_zonerefs[zoneref_idx].zone = NULL;
> +	zonelist->_zonerefs[zoneref_idx].zone_idx = 0;
>  }
>

It *might* be safer given the next patch to zero out the remainder of
the _zonerefs so that there is no combination of node add/remove that has
an iterator working with a semi-valid _zoneref which is beyond the last
correct value. It *should* be safe as the very last entry will always be
NULL, but if you don't zero it out, it is possible for iterators to be
working beyond the "end" of the zonelist for a short window.

Otherwise I think it's ok, including my stupid comment about node_order
stack usage.

-- 
Mel Gorman
SUSE Labs
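For readers following the thread, the two suggestions in this review (pass
zoneref_idx by pointer and stop at the first empty node; zero the rest of
_zonerefs rather than only the terminator) could look roughly like the sketch
below. It is a standalone userspace mock-up, not kernel code and not what was
eventually merged. The struct layouts, the small MAX_NUMNODES and MAX_NR_ZONES
values, the `populated` flag and the helper name append_node_zones() are all
hypothetical stand-ins. It also assumes that every node listed ahead of the
unused tail of node_order has at least one populated zone, and that the tail
entries refer to a node with none; otherwise the early break would drop nodes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, deliberately small stand-ins for the kernel constants. */
#define MAX_NUMNODES 8
#define MAX_NR_ZONES 3
#define MAX_ZONEREFS (MAX_NUMNODES * MAX_NR_ZONES + 1)

/* Simplified mock types; the real kernel structures are far richer. */
struct zone { int populated; };
struct zoneref { struct zone *zone; int zone_idx; };
struct zonelist { struct zoneref _zonerefs[MAX_ZONEREFS]; };
struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; };

/*
 * Append the populated zones of one node, highest zone first.
 * *zoneref_idx is advanced in place; the return value is the number of
 * zones appended, so 0 means the node contributed nothing.
 */
static int append_node_zones(struct pglist_data *node, struct zonelist *zl,
			     int *zoneref_idx)
{
	int added = 0;

	for (int z = MAX_NR_ZONES - 1; z >= 0; z--) {
		if (!node->node_zones[z].populated)
			continue;
		/* Always leave room for the NULL terminator. */
		assert(*zoneref_idx < MAX_ZONEREFS - 1);
		zl->_zonerefs[*zoneref_idx].zone = &node->node_zones[z];
		zl->_zonerefs[*zoneref_idx].zone_idx = z;
		(*zoneref_idx)++;
		added++;
	}
	return added;
}

static void build_zonelists_in_node_order(struct zonelist *zl,
					  struct pglist_data *nodes,
					  const int *node_order)
{
	int zoneref_idx = 0;

	/* Stop at the first empty node instead of walking all MAX_NUMNODES. */
	for (int i = 0; i < MAX_NUMNODES; i++) {
		if (!append_node_zones(&nodes[node_order[i]], zl, &zoneref_idx))
			break;
	}

	/* Clear the terminator and every stale entry after it. */
	for (int j = zoneref_idx; j < MAX_ZONEREFS; j++) {
		zl->_zonerefs[j].zone = NULL;
		zl->_zonerefs[j].zone_idx = 0;
	}
}
```

Returning the count rather than the new index is one way to give the caller
the "empty node" signal while keeping the running index in a single place.
The sketch says nothing about ordering against concurrent readers, which is
the concern raised about the next patch in the series.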