From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 4 Sep 2015 17:11:43 +1000
From: Dave Chinner
To: Linus Torvalds
Cc: Linux Kernel Mailing List, Peter Zijlstra, Waiman Long, Ingo Molnar
Subject: Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression
Message-ID: <20150904071143.GZ3902@dastard>
References: <20150904054820.GY3902@dastard>
User-Agent: Mutt/1.5.21 (2010-09-15)
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 03, 2015 at 11:39:21PM -0700, Linus Torvalds wrote:
> On Thu, Sep 3, 2015 at 10:48 PM, Dave Chinner wrote:
> >
> > When I turned spinlock debugging off on 4.2 to get some perf numbers
> > (a request from Linus), I got this:
>
> [ ugly numbers deleted ]
>
> > And then a quick call graph sample to find the lock:
> >
> >    37.19%    37.19%  [kernel]            [k] queued_spin_lock_slowpath
> >    - queued_spin_lock_slowpath
> >       - 99.98% _raw_spin_lock
> >          - 89.16% xfs_log_commit_cil
> [ snip ]
> >
> > This shows that we have catastrophic spinlock contention in the
> > transaction commit path. It must be the cil->xc_cil_lock spinlock,
> > as it's the only spinlock in that path. And while it's the hot lock
> > in the commit path, turning spinlock debugging back on (and no other
> > changes) shows that it shouldn't be contended:
> >
> >    8.92%  [kernel]  [k] _xfs_buf_find
> [ snip ]
>
> So you basically have almost no spinlock overhead at all even when
> debugging is on.

*nod*

> That's unusual, as usually the debug code makes the contention much
> much worse.

Right. The debug behaviour is completely unchanged, which is why I
didn't notice this earlier. And it's not until I scale this workload
to >32p that I tend to see any significant level of contention on the
cil->xc_cil_lock when the basic spinlock debugging is enabled.

> > To confirm that this is indeed caused by the queued spinlocks, I
> > removed the spinlock debugging and did this to arch/x86/Kconfig:
> >
> > -       select ARCH_USE_QUEUED_SPINLOCKS
> >
> > And the results are:
>
> Ok, that's pretty conclusive. It doesn't seem to make much _sense_,
> but numbers talk, BS walks.
>
> If I read things right, the actual spinlock is the "cil->xc_cil_lock"
> that is taken in xlog_cil_insert_items(), and it just shows up in
> xfs_log_commit_cil() in the call graph due to inlining. Correct?

Yup, that's how I read it, too.

> There doesn't seem to be anything even remotely strange going on in
> that area.
>
> Is this a PARAVIRT configuration? There were issues with PV
> interaction at some point. If it is PV, and you don't actually use
> PV, can you test with PV support disabled?

$ grep PARAVIRT .config
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_PARAVIRT_TIME_ACCOUNTING=y
CONFIG_PARAVIRT_CLOCK=y
$

I'll retest with CONFIG_PARAVIRT=n....

> Also, if you look at the instruction-level profile for
> queued_spin_lock_slowpath itself, does anything stand out? For
> example, I note that the for-loop with the atomic_cmpxchg() call in
> it doesn't ever do a cpu_relax(). It doesn't look like that should
> normally loop, but obviously that function also shouldn't normally
> use 2/3rds of the cpu, so.. Maybe some part of
> queued_spin_lock_slowpath() stands out as "it's spending 99% of the
> time in _that_ particular part", and it gives some clue what goes
> wrong.

I'll have a look when the current tests on that machine have finished
running.
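In the meantime, just so we're talking about the same shape of loop:
below is a stand-alone userspace sketch of a bare cmpxchg retry loop
with a relaxation hint added in the retry path, which is how I read
your suggestion. It's built on C11 atomics, not the real
queued_spin_lock_slowpath(), and the names (toy_lock(),
cpu_relax_hint(), the single locked/unlocked word) are made up for
illustration only:

	/*
	 * Illustrative only: a bare cmpxchg retry loop with a CPU
	 * relaxation hint in the retry path. Not the kernel's qspinlock
	 * code, just the shape of the loop in question, written against
	 * C11 atomics so it compiles in userspace.
	 */
	#include <stdatomic.h>

	static inline void cpu_relax_hint(void)
	{
	#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();	/* tell the core we are busy-waiting */
	#endif
	}

	/* 0 == unlocked, 1 == locked; stands in for the qspinlock value word */
	static void toy_lock(atomic_int *lock)
	{
		int expected = 0;

		for (;;) {
			/* try to move 0 -> 1; on failure 'expected' is refreshed */
			if (atomic_compare_exchange_weak(lock, &expected, 1))
				return;

			/*
			 * Retry path: without a hint like this the loop issues
			 * back-to-back cmpxchg attempts on a contended cache line.
			 */
			cpu_relax_hint();
			expected = 0;
		}
	}

The real slowpath obviously has the queueing logic on top of this,
which the sketch doesn't try to model; it's only meant to pin down
which retry path we're looking at in the annotated profile.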
Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com