Date: Mon, 30 Jan 2006 02:00:09 -0800
From: "Paul E. McKenney"
To: "David S. Miller"
Cc: dada1@cosmosbay.com, dipankar@in.ibm.com, rlrevell@joe-job.com,
	mingo@elte.hu, linux-kernel@vger.kernel.org, torvalds@osdl.org
Subject: Re: RCU latency regression in 2.6.16-rc1
Message-ID: <20060130100009.GC17848@us.ibm.com>
Reply-To: paulmck@us.ibm.com
References: <20060130043604.GF16585@us.ibm.com> <43DD9C49.4000000@cosmosbay.com>
	<20060130051156.GK16585@us.ibm.com> <20060129.215244.05901896.davem@davemloft.net>
In-Reply-To: <20060129.215244.05901896.davem@davemloft.net>

On Sun, Jan 29, 2006 at 09:52:44PM -0800, David S. Miller wrote:
> From: "Paul E. McKenney"
> Date: Sun, 29 Jan 2006 21:11:56 -0800
>
> > > If the size is expanded by a 2 factor (or a power of too), can your
> > > proposal works ?
> >
> > Yep!!!
> >
> > Add the following:
>
> This all sounds very exciting and promising.  I'll try to find some
> spare cycles tomorrow to cook something up.

Cool!  My earlier description did not get rid of all the explicit
memory barriers.  The following updated pseudocode does so.

I will of course be happy to review what you come up with!!!  (And I
will be interested to see if this still makes sense after sleeping on
it...)

Improvements welcome as always!!!

						Thanx, Paul

------------------------------------------------------------------------

Data Structures

o	Yes, the names suck, but that is why I usually am happy to have
	other people come up with names...

o	Having separate fvbupd and fvbrdr removes the need for
	(hopefully all) explicit memory barriers.

	struct hashtbl {
		int nbuckets;
		int fvbupd;	/* updater's first valid bucket. */
		int fvbrdr;	/* readers' first valid bucket. */
		struct hash_param params;
		struct list_head buckets[0];
	};

	/* Both fvbupd and fvbrdr need to be initialized to -1. */

	struct hashtbl *current;
	struct hashtbl *not_current;

Switch pseudocode (this assumes that the switcher can block; if it
cannot, it needs to become a state machine driven by call_rcu()):

o	Wait until at least one grace period has elapsed since the
	previous switch.  Easiest to just put "synchronize_rcu()" up
	front, but one can instead take advantage of naturally
	occurring grace periods that have elapsed since the last
	switch, if desired.

o	Initialize the not-current array, including the hash params.
	If the new array is to be a different size, allocate the new
	one and free the not-current array.

o	Use rcu_assign_pointer() to point not_current to (you guessed
	it!) the not-current array.

o	Set current->fvbupd = current->fvbrdr = 0;

o	Wait for a grace period (synchronize_rcu() if blocking is OK,
	otherwise call_rcu()...).  The delay guarantees that readers
	are aware that things might be disappearing before we actually
	start making them disappear.
o	Loop until all buckets of current are emptied:

	o	current->fvbrdr = current->fvbupd

		No memory barriers are needed because all buckets
		preceding current->fvbupd were emptied at least one
		grace period ago, so any readers that were active when
		any of those buckets were non-empty have now completed.

	o	Remove elements from current->bucket[current->fvbupd]
		until the bucket is empty or until we use up our
		allotted quota of time and/or elements.

	o	If we empty the bucket, we of course increment
		current->fvbupd, and if any of the time/element quota
		is left, we can continue on to the next bucket.

	o	Wait for a grace period.

	-Really- big arrays on really big SMP machines might want to
	have multiple CPUs processing buckets in parallel.  In theory,
	this is not hard, but the fvbrdr update gets messier.

	And yes, there needs to be some sort of update-side locking.
	Per-bucket locking seems like it should work -- one would hold
	the bucket lock for the current array throughout, and acquire
	and release the bucket lock for the non-current array on a
	per-element basis.

	One perhaps better way is to simply remove the elements from
	current, and let the readers insert them into not_current only
	upon cache miss.  Getting this right requires more insight into
	the networking code than I currently have.

o	Use rcu_assign_pointer() to set current to not_current.

o	Note: leave not_current pointing to the now-current array to
	allow slow readers to complete successfully.  Alternatively,
	one could wait for a grace period and then set not_current to
	NULL (but that does not seem worth it).

(For concreteness, rough C sketches of this switch procedure and of the
lookup below appear at the end of this message.)

Table lookup (I am assuming that this does rcu_read_unlock() before
	returning, but that only makes sense if you have some sort of
	lock on the element being returned -- if there is no such lock,
	the rcu_read_unlock() primitives must be removed and the caller
	must rcu_read_unlock() instead):

o	rcu_read_lock();

o	c = rcu_dereference(current);

o	Compute the hash based on c->params.

o	If hash >= c->fvbrdr, do the lookup in c->bucket[hash].

	If found, rcu_read_unlock() and return it.

o	If c->fvbrdr is not -1, look the item up in the non-current
	table:

	o	c = rcu_dereference(not_current);

	o	Compute the hash based on c->params.

	o	Do the lookup in c->bucket[hash].

	Note that c->fvbrdr -could- transition to zero here, but if it
	does, we are guaranteed that nothing will actually be removed
	from the current table until we complete.

	And note that we need the non-current lookup even if the switch
	is not populating the non-current array, since the readers
	would be doing so upon cache miss, right?

o	rcu_read_unlock()

o	Return what is found, or complain appropriately if nothing is
	found.
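As promised above, here is a rough (and completely untested) C sketch
of the switch procedure.  Everything not named in the pseudocode is
made up for illustration: the element type (struct hashed_elem), the
placeholder hash in bucket_of(), the single coarse switch_lock (the
per-bucket locks mentioned above would be better), and the spare_tbl
pointer that remembers the drained table so the next switch can free
it.  The pointers are renamed cur_tbl/alt_tbl to avoid colliding with
the kernel's current-task macro, and the sketch follows the "let the
readers repopulate not_current on cache miss" variant, so drained
entries are simply freed after a grace period rather than moved --
moving a live element between two RCU-protected lists reuses its list
pointers while readers might still be traversing them, which needs more
care than a sketch can give.

#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct hash_param {
	unsigned long seed;	/* stand-in for the real hash parameters */
};

struct hashtbl {
	int nbuckets;
	int fvbupd;		/* updater's first valid bucket. */
	int fvbrdr;		/* readers' first valid bucket. */
	struct hash_param params;
	struct list_head buckets[0];
};

struct hashed_elem {		/* hypothetical element type */
	struct list_head list;
	struct rcu_head rcu;
	unsigned long key;
};

/* "current" and "not_current" from the pseudocode; cur_tbl is assumed
 * to have been set up via alloc_hashtbl() at init time. */
static struct hashtbl *cur_tbl;
static struct hashtbl *alt_tbl;
static struct hashtbl *spare_tbl;	/* drained table from the previous switch */
static DEFINE_SPINLOCK(switch_lock);	/* coarse update-side lock */

static int bucket_of(struct hashtbl *t, unsigned long key)
{
	return (int)((key ^ t->params.seed) % t->nbuckets);	/* placeholder hash */
}

static struct hashtbl *alloc_hashtbl(int nbuckets, struct hash_param *p)
{
	struct hashtbl *t;
	int i;

	t = kmalloc(sizeof(*t) + nbuckets * sizeof(t->buckets[0]), GFP_KERNEL);
	if (t == NULL)
		return NULL;
	t->nbuckets = nbuckets;
	t->fvbupd = t->fvbrdr = -1;	/* -1 == no switch in progress. */
	t->params = *p;
	for (i = 0; i < nbuckets; i++)
		INIT_LIST_HEAD(&t->buckets[i]);
	return t;
}

static void free_elem_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct hashed_elem, rcu));
}

/* Switch to a table of new_nbuckets buckets.  May block. */
static int hashtbl_switch(int new_nbuckets, struct hash_param *newp)
{
	struct hashtbl *old, *newtbl;
	struct hashed_elem *e, *n;

	synchronize_rcu();	/* >= one GP since the previous switch... */
	kfree(spare_tbl);	/* ...so the previously drained table may go. */
	spare_tbl = NULL;

	newtbl = alloc_hashtbl(new_nbuckets, newp);
	if (newtbl == NULL)
		return -ENOMEM;
	rcu_assign_pointer(alt_tbl, newtbl);

	old = cur_tbl;
	old->fvbupd = old->fvbrdr = 0;
	synchronize_rcu();	/* Readers now know things may start disappearing. */

	while (old->fvbupd < old->nbuckets) {
		/* Buckets below fvbupd were emptied >= one GP ago. */
		old->fvbrdr = old->fvbupd;

		spin_lock(&switch_lock);
		list_for_each_entry_safe(e, n, &old->buckets[old->fvbupd], list) {
			list_del_rcu(&e->list);
			call_rcu(&e->rcu, free_elem_rcu);
		}
		spin_unlock(&switch_lock);

		old->fvbupd++;		/* This bucket is now empty. */
		synchronize_rcu();	/* Per-bucket grace period. */
	}

	rcu_assign_pointer(cur_tbl, newtbl);
	/* Leave alt_tbl at the now-current table for slow readers, and
	 * remember the drained old table for the next switch to free. */
	spare_tbl = old;
	return 0;
}

Per-bucket locks and parallel draining, as suggested above, would drop
in where switch_lock is, at the cost of a messier fvbrdr update.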
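And a matching sketch of the table lookup, reusing the types and
globals from the sketch above.  Since there is no per-element lock
here, this is the "caller must rcu_read_unlock() instead" variant: the
caller is assumed to invoke it under rcu_read_lock() and to keep that
held for as long as it uses the returned element.

static struct hashed_elem *hashtbl_lookup(unsigned long key)
{
	struct hashtbl *c;
	struct hashed_elem *e;
	int h;

	c = rcu_dereference(cur_tbl);
	h = bucket_of(c, key);

	/* Buckets below fvbrdr have been drained; -1 means no switch is
	 * in progress, so every bucket of the current table is valid. */
	if (h >= c->fvbrdr) {
		list_for_each_entry_rcu(e, &c->buckets[h], list)
			if (e->key == key)
				return e;
	}

	/* A switch is (or recently was) in progress: also check the
	 * not-current table. */
	if (c->fvbrdr != -1) {
		c = rcu_dereference(alt_tbl);
		if (c != NULL) {
			h = bucket_of(c, key);
			list_for_each_entry_rcu(e, &c->buckets[h], list)
				if (e->key == key)
					return e;
		}
	}

	/* Miss.  In the cache-miss variant, the caller would now insert
	 * a fresh entry into the table that alt_tbl references
	 * (insertion is not shown here). */
	return NULL;
}

The NULL check on alt_tbl is pure paranoia for the window before the
very first switch, when not_current has never been assigned.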