From: Linus Torvalds
To: Dave Chinner
Cc: Bob Peterson, "Kirill A. Shutemov", "Huang, Ying", Christoph Hellwig,
 Wu Fengguang, LKP, Tejun Heo, LKML
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Date: Tue, 16 Aug 2016 16:23:04 -0700
In-Reply-To: <20160816220250.GI16044@dastard>

On Tue, Aug 16, 2016 at 3:02 PM, Dave Chinner wrote:
>>
>> What does your profile show when you actually dig into
>> __remove_mapping() itself? Looking at your flat profile, I'm assuming
>> you get
>
> - 22.26% 0.93% [kernel] [k] __remove_mapping
>    - 3.86% __remove_mapping
>       - 18.35% _raw_spin_lock_irqsave
>            __pv_queued_spin_lock_slowpath
>         1.32% __delete_from_page_cache
>       - 0.92% _raw_spin_unlock_irqrestore
>            __raw_callee_save___pv_queued_spin_unlock

Ok, that's all very consistent with my profiles, except - obviously -
for the crazy spinlock thing.

One difference is that your unlock has that PV unlock thing - on raw
hardware it's just a single store. But I don't think I saw the
unlock_slowpath in there. There's nothing really expensive going on
there as far as I can tell.

> And the instruction level profile:

Yup. The bulk is in the cmpxchg and a cache miss (it just shows up in
the instruction after it: you can use "cycles:pp" - e.g. "perf record
-e cycles:pp" - to get perf to actually try to pin the blame on the
instruction that _causes_ the stall rather than the instruction
following it, but in this case it's all trivial).

> It's the same code AFAICT, except the pv version jumps straight to
> the "queue" case.

Yes. Your profile looks perfectly fine. Most of the profile is right
after the 'pause', which you'd expect.

From a quick look, it seems like only about 2/3rds of the time is
actually spent in the "pause" loop, but the control flow is complex
enough that maybe I didn't follow it right. The native case is
simpler.

But since I suspect that it's not so much about the spinlocked region
being too costly, but just about locking too damn much, that 2/3rds
actually makes sense: it's not that it's necessarily spinning waiting
for the lock all that long in any individual case, it's just that the
spin_lock code is called so much.

So I still kind of just blame kswapd, rather than any new expense. It
would be interesting to hear if Mel is right about that kswapd
sleeping change between 4.6 and 4.7.
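To make the "single store" point above concrete, here is a minimal
userspace C sketch of the shape of the native unlock. It is an
illustrative stand-in, not the kernel's actual queued_spin_unlock();
the struct and function names are made up for the example. The PV
variant instead bounces through
__raw_callee_save___pv_queued_spin_unlock, which is exactly what shows
up in the profile above.

#include <stdatomic.h>
#include <stdio.h>

struct sketch_qspinlock {		/* hypothetical stand-in type */
	atomic_uint val;		/* bit 0 = locked */
};

/* Native unlock: one release store - a single mov on x86, which is
 * why the unlock barely registers in profiles on raw hardware. */
static void sketch_native_unlock(struct sketch_qspinlock *lock)
{
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}

int main(void)
{
	struct sketch_qspinlock lock = { 1 };	/* start out locked */

	sketch_native_unlock(&lock);
	printf("val after unlock: %u\n", atomic_load(&lock.val));
	return 0;
}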
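The lock side is the same story: the whole qspinlock fast path is a
single cmpxchg of 0 -> locked, and a cold or contended cache line makes
that cmpxchg miss. Again a hedged userspace sketch of the shape with
made-up names, not the kernel code - the real slowpath queues waiters
MCS-style (the "queue" case above) rather than just spinning like this:

#include <stdatomic.h>

#define SKETCH_LOCKED	1u	/* stand-in for _Q_LOCKED_VAL */

struct sketch_qspinlock {
	atomic_uint val;
};

static void sketch_lock_slowpath(struct sketch_qspinlock *lock)
{
	unsigned int expected = 0;

	/* Toy slowpath: spin with 'pause' - this loop is where most of
	 * the profile hits land, right after the pause instruction. */
	while (!atomic_compare_exchange_weak_explicit(&lock->val,
			&expected, SKETCH_LOCKED,
			memory_order_acquire, memory_order_relaxed)) {
		expected = 0;
#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();
#endif
	}
}

static void sketch_lock(struct sketch_qspinlock *lock)
{
	unsigned int expected = 0;

	/* Fast path: one cmpxchg. The cache miss it takes gets blamed
	 * on the *next* instruction unless you profile with cycles:pp. */
	if (atomic_compare_exchange_strong_explicit(&lock->val,
			&expected, SKETCH_LOCKED,
			memory_order_acquire, memory_order_relaxed))
		return;
	sketch_lock_slowpath(lock);
}

int main(void)
{
	struct sketch_qspinlock lock = { 0 };

	sketch_lock(&lock);	/* uncontended: fast path only */
	atomic_store_explicit(&lock.val, 0, memory_order_release);
	return 0;
}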
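And the "locking too damn much" effect is easy to reproduce in
userspace too: make the critical section trivial and the call rate
huge, and the profile lands in the lock code even though no individual
acquisition waits very long. A hypothetical little demo (build with
"cc -O2 demo.c -lpthread", then profile it with perf) - the names and
numbers are invented, it just mimics many contexts hammering one
spinlock the way reclaim hammers mapping->tree_lock:

#include <pthread.h>
#include <stdio.h>

#define THREADS	8
#define ITERS	1000000

static pthread_spinlock_t lock;
static unsigned long pages_removed;	/* stand-in for the real work */

static void *reclaim_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		pages_removed++;	/* tiny critical section */
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, reclaim_worker, NULL);
	for (int i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);

	/* Nearly all cycles show up in the spin lock/unlock, not in
	 * the increment - short critical section, huge call count. */
	printf("removed %lu pages\n", pages_removed);
	return 0;
}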
Linus