From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 10 Mar 2010 00:46:33 +1100
From: Nick Piggin
To: David Rientjes
Cc: Pekka Enberg, Andi Kleen, Christoph Lameter,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haicheng.li@intel.com, KAMEZAWA Hiroyuki
Subject: Re: [patch] slab: add memory hotplug support
Message-ID: <20100309134633.GM8653@laptop>
References: <20100226114136.GA16335@basil.fritz.box>
	<20100226155755.GE16335@basil.fritz.box>
	<20100305062002.GV8653@laptop>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Mar 08, 2010 at 03:19:48PM -0800, David Rientjes wrote:
> On Fri, 5 Mar 2010, Nick Piggin wrote:
>
> > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > +/*
> > > + * Drains and frees nodelists for a node on each slab cache, used for memory
> > > + * hotplug.  Returns -EBUSY if all objects cannot be drained on memory
> > > + * hot-remove so that the node is not removed.  When used because memory
> > > + * hot-add is canceled, the only result is the freed kmem_list3.
> > > + *
> > > + * Must hold cache_chain_mutex.
> > > + */
> > > +static int __meminit free_cache_nodelists_node(int node)
> > > +{
> > > +	struct kmem_cache *cachep;
> > > +	int ret = 0;
> > > +
> > > +	list_for_each_entry(cachep, &cache_chain, next) {
> > > +		struct array_cache *shared;
> > > +		struct array_cache **alien;
> > > +		struct kmem_list3 *l3;
> > > +
> > > +		l3 = cachep->nodelists[node];
> > > +		if (!l3)
> > > +			continue;
> > > +
> > > +		spin_lock_irq(&l3->list_lock);
> > > +		shared = l3->shared;
> > > +		if (shared) {
> > > +			free_block(cachep, shared->entry, shared->avail, node);
> > > +			l3->shared = NULL;
> > > +		}
> > > +		alien = l3->alien;
> > > +		l3->alien = NULL;
> > > +		spin_unlock_irq(&l3->list_lock);
> > > +
> > > +		if (alien) {
> > > +			drain_alien_cache(cachep, alien);
> > > +			free_alien_cache(alien);
> > > +		}
> > > +		kfree(shared);
> > > +
> > > +		drain_freelist(cachep, l3, l3->free_objects);
> > > +		if (!list_empty(&l3->slabs_full) ||
> > > +				!list_empty(&l3->slabs_partial)) {
> > > +			/*
> > > +			 * Continue to iterate through each slab cache to free
> > > +			 * as many nodelists as possible even though the
> > > +			 * offline will be canceled.
> > > +			 */
> > > +			ret = -EBUSY;
> > > +			continue;
> > > +		}
> > > +		kfree(l3);
> > > +		cachep->nodelists[node] = NULL;
> >
> > What's stopping races of other CPUs trying to access l3 and array
> > caches while they're being freed?
> >
>
> numa_node_id() will not return an offlined nodeid and cache_alloc_node()
> already does a fallback to other onlined nodes in case a nodeid is passed
> to kmalloc_node() that does not have a nodelist.  l3->shared and l3->alien
> cannot be accessed without l3->list_lock (drain, cache_alloc_refill,
> cache_flusharray) or cache_chain_mutex (kmem_cache_destroy, cache_reap).

Yeah, but can't it _have_ a nodelist (ie. before it is set to NULL here)
while it is being accessed by another CPU and concurrently being freed
on this one?
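To be concrete, the interleaving I have in mind is something like this
(a simplified sketch of the two paths, not the exact slab code;
____cache_alloc_node() is the real allocation-side entry point, the
rest is schematic):

	/* CPU A: allocation, as in ____cache_alloc_node() */
	l3 = cachep->nodelists[node];	/* loads the old, non-NULL l3 */
	spin_lock_irq(&l3->list_lock);	/* may be freed memory by now */

	/* CPU B: hot-remove, as in free_cache_nodelists_node() above */
	drain_freelist(cachep, l3, l3->free_objects);
	kfree(l3);			/* frees l3 while CPU A still holds it */
	cachep->nodelists[node] = NULL;	/* too late for CPU A's load */

Nothing orders CPU A's load of nodelists[node] against CPU B's kfree(),
so A can end up taking l3->list_lock inside freed memory.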
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * Onlines nid either as the result of memory hot-add or canceled hot-remove.
> > > + */
> > > +static int __meminit slab_node_online(int nid)
> > > +{
> > > +	int ret;
> > > +	mutex_lock(&cache_chain_mutex);
> > > +	ret = init_cache_nodelists_node(nid);
> > > +	mutex_unlock(&cache_chain_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * Offlines nid either as the result of memory hot-remove or canceled hot-add.
> > > + */
> > > +static int __meminit slab_node_offline(int nid)
> > > +{
> > > +	int ret;
> > > +	mutex_lock(&cache_chain_mutex);
> > > +	ret = free_cache_nodelists_node(nid);
> > > +	mutex_unlock(&cache_chain_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +static int __meminit slab_memory_callback(struct notifier_block *self,
> > > +					unsigned long action, void *arg)
> > > +{
> > > +	struct memory_notify *mnb = arg;
> > > +	int ret = 0;
> > > +	int nid;
> > > +
> > > +	nid = mnb->status_change_nid;
> > > +	if (nid < 0)
> > > +		goto out;
> > > +
> > > +	switch (action) {
> > > +	case MEM_GOING_ONLINE:
> > > +	case MEM_CANCEL_OFFLINE:
> > > +		ret = slab_node_online(nid);
> > > +		break;
> >
> > This would explode if CANCEL_OFFLINE fails. Call it theoretical and
> > put a panic() in here and I don't mind. Otherwise you get corruption
> > somewhere in the slab code.
> >
>
> MEM_CANCEL_ONLINE would only fail here if a struct kmem_list3 couldn't be
> allocated anywhere on the system and if that happens then the node simply
> couldn't be allocated from (numa_node_id() would never return it as the
> cpu's node, so it's possible to fallback in this scenario).

Why would it never return the CPU's node? It's CANCEL_OFFLINE that is
the problem.

> Instead of doing this all at MEM_GOING_OFFLINE, we could delay freeing of
> the array caches and the nodelist until MEM_OFFLINE.  We're guaranteed
> that all pages are freed at that point so there are no existing objects
> that we need to track and then if the offline fails from a different
> callback it would be possible to reset the l3->nodelists[node] pointers
> since they haven't been freed yet.
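For what it's worth, that would look something like the sketch below.
This is only a rough outline of the scheme you describe, not working
code: slab_node_drain() and slab_node_free() are made-up helper names,
and the point is the drain/free split across the two notifications.

	static int __meminit slab_memory_callback(struct notifier_block *self,
						unsigned long action, void *arg)
	{
		struct memory_notify *mnb = arg;
		int ret = 0;
		int nid = mnb->status_change_nid;

		if (nid < 0)
			goto out;

		switch (action) {
		case MEM_GOING_ONLINE:
			ret = slab_node_online(nid);
			break;
		case MEM_GOING_OFFLINE:
			/* drain objects, but keep l3 and the array caches */
			ret = slab_node_drain(nid);	/* hypothetical helper */
			break;
		case MEM_OFFLINE:
			/* all pages on nid are gone: freeing cannot fail now */
			slab_node_free(nid);		/* hypothetical helper */
			break;
		case MEM_CANCEL_OFFLINE:
			/* nothing was freed at GOING_OFFLINE, nothing to undo */
			break;
		}
	out:
		return notifier_from_errno(ret);
	}

That would also mean CANCEL_OFFLINE cannot fail, which addresses the
objection above.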