From mboxrd@z Thu Jan 1 00:00:00 1970
From: Linus Torvalds
Subject: Re: [PATCH v7 1/4] spinlock: A new lockref structure for lockless
 update of refcount
Date: Sun, 1 Sep 2013 13:59:22 -0700
Message-ID: 
References: <1375758759-29629-1-git-send-email-Waiman.Long@hp.com>
 <1375758759-29629-2-git-send-email-Waiman.Long@hp.com>
 <1377751465.4028.20.camel@pasglop>
 <20130829070012.GC27322@gmail.com>
 <52200DAE.2020303@hp.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: Waiman Long, Ingo Molnar, Benjamin Herrenschmidt, Alexander Viro,
 Jeff Layton, Miklos Szeredi, Ingo Molnar, Thomas Gleixner,
 linux-fsdevel, Linux Kernel Mailing List, Peter Zijlstra,
 Steven Rostedt, Andi Kleen, "Chandramouleeswaran, Aswin",
 "Norton, Scott J"
To: Sedat Dilek
Return-path: 
In-Reply-To: 
Sender: linux-kernel-owner@vger.kernel.org
List-Id: linux-fsdevel.vger.kernel.org

On Sun, Sep 1, 2013 at 8:32 AM, Linus Torvalds wrote:
> On Sun, Sep 1, 2013 at 3:01 AM, Sedat Dilek wrote:
>>
>> Looks like this is now 10x faster: ~2.66Mloops (debug) VS.
>> ~26.60Mloops (no-debug).
>
> Ok, that's getting to be in the right ballpark.

So I installed my new i7-4770S yesterday - somewhat lower frequency
than my previous CPU, but it has four cores plus HT, and boy does that
show the scalability problems better. My test-program used to spend
maybe 15% of its time in the spinlock. On the 4770S, with current -git
(so no lockref) I get this:

  [torvalds@i5 test-lookup]$ for i in 1 2 3 4 5; do ./a.out ; done
  Total loops: 26656873
  Total loops: 26701572
  Total loops: 26698526
  Total loops: 26752993
  Total loops: 26710556

with a profile that looks roughly like:

  84.14%  a.out  _raw_spin_lock
   3.04%  a.out  lg_local_lock
   2.16%  a.out  vfs_getattr
   1.16%  a.out  dput.part.15
   0.67%  a.out  copy_user_enhanced_fast_string
   0.55%  a.out  complete_walk

[ Side note: Al, that lg_local_lock really is annoying: it's
  br_read_lock(&vfsmount_lock), with two thirds of the calls coming
  from mntput_no_expire, and the rest from path_init -> lock_rcu_walk.

  I really, really wonder if we could get rid of the
  br_read_lock(&vfsmount_lock) for rcu_walk_init(), and use just the
  RCU read accesses for the mount-namespaces too. What is that lock
  really protecting against during lookup anyway? ]

With the last lockref patch I sent out, it looks like this:

  [torvalds@i5 test-lookup]$ for i in 1 2 3 4 5; do ./a.out ; done
  Total loops: 54740529
  Total loops: 54568346
  Total loops: 54715686
  Total loops: 54715854
  Total loops: 54790592

  28.55%  a.out  lockref_get_or_lock
  20.65%  a.out  lockref_put_or_lock
   9.06%  a.out  dput
   6.37%  a.out  lg_local_lock
   5.45%  a.out  lookup_fast
   3.77%  a.out  d_rcu_to_refcount
   2.03%  a.out  vfs_getattr
   1.75%  a.out  copy_user_enhanced_fast_string
   1.16%  a.out  link_path_walk
   1.15%  a.out  avc_has_perm_noaudit
   1.14%  a.out  __lookup_mnt

so performance more than doubled (on that admittedly stupid
benchmark), and you can see that the cacheline bouncing for that
reference count is still a big deal, but at least some real work gets
done now, because we're no longer spinning waiting for it.

So you can see the bad case even with just a single socket, when the
benchmark is targeted enough. But two cores just wasn't enough to show
any performance advantage.

              Linus
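
For context on the "Total loops" numbers: the test program itself is
not included in this mail, so the following is only a guess at its
shape, with the thread count, run time, and target path all invented
for illustration. The profile entries (vfs_getattr, lookup_fast, dput)
suggest threads doing repeated path lookups on one name, which is
enough to make every CPU fight over the same dentry's lock and
reference count.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS  8            /* invented: 4 cores + HT */
#define SECONDS   10           /* invented run time */
#define PATHNAME  "testfile"   /* invented target path */

static _Atomic int stop;
static long counts[NTHREADS];

static void *worker(void *arg)
{
	struct stat st;
	long n = 0;

	/* Hammer the same pathname so every lookup walks the same
	 * dentry chain and bounces the same lock/refcount lines. */
	while (!atomic_load(&stop)) {
		if (stat(PATHNAME, &st) < 0) {
			perror("stat");
			exit(1);
		}
		n++;
	}
	counts[(long)arg] = n;
	return NULL;
}

int main(void)
{
	pthread_t th[NTHREADS];
	long total = 0;

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, worker, (void *)i);

	sleep(SECONDS);
	atomic_store(&stop, 1);

	for (long i = 0; i < NTHREADS; i++) {
		pthread_join(th[i], NULL);
		total += counts[i];
	}
	printf("Total loops: %ld\n", total);
	return 0;
}

Built with "gcc -O2 -pthread" and run under perf record, a harness of
this shape produces the kind of kernel-dominated profile quoted above.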
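
And for readers who don't know the lockref_get_or_lock /
lockref_put_or_lock functions at the top of the second profile: the
idea is to pack the spinlock and the reference count into a single
machine word, so that whenever the lock is not held, the count can be
adjusted with one cmpxchg and the spinlock is never taken at all.
Below is a minimal userspace sketch of the "get" fast path; the bit
layout, retry bound, and names are simplifications invented here, not
the kernel's actual lockref implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in for a lockref: lock flag and reference count
 * share one 64-bit word, so a single cmpxchg can simultaneously
 * check "lock not held" and perform "count++".
 */
struct lockref_sketch {
	_Atomic uint64_t lock_count;
};

#define LOCKED_BIT   1ull          /* invented: lock flag in bit 0 */
#define COUNT_ONE    (1ull << 32)  /* invented: count in high bits */
#define MAX_RETRIES  16            /* invented retry bound */

/*
 * Fast path of a get-or-lock style operation: if nobody holds the
 * embedded lock, bump the count with one cmpxchg. Returns false when
 * the caller must fall back to the ordinary spin_lock() slow path.
 */
static bool get_fastpath(struct lockref_sketch *lr)
{
	uint64_t old = atomic_load_explicit(&lr->lock_count,
					    memory_order_relaxed);

	for (int i = 0; i < MAX_RETRIES; i++) {
		if (old & LOCKED_BIT)
			return false;  /* lock held: take it for real */
		if (atomic_compare_exchange_weak_explicit(
				&lr->lock_count, &old, old + COUNT_ONE,
				memory_order_acquire, memory_order_relaxed))
			return true;   /* count bumped, lock untouched */
		/* cmpxchg failed: 'old' was reloaded, so just retry */
	}
	return false;  /* too contended: fall back to the lock */
}

This is why the lockref functions still top the second profile even
though throughput doubled: every cmpxchg must still pull the dentry's
cacheline over in exclusive mode, but a failed attempt costs one retry
instead of a spin on a held lock, so the remaining time goes to real
lookup work.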