From: Linus Torvalds
To: Dave Chinner
Cc: Bob Peterson, "Kirill A. Shutemov", "Huang, Ying", Christoph Hellwig,
 Wu Fengguang, LKP, Tejun Heo, LKML
Subject: Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Date: Tue, 16 Aug 2016 16:23:04 -0700
In-Reply-To: <20160816220250.GI16044@dastard>

On Tue, Aug 16, 2016 at 3:02 PM, Dave Chinner wrote:
>>
>> What does your profile show when you actually dig into
>> __remove_mapping() itself? Looking at your flat profile, I'm assuming
>> you get
>
> - 22.26% 0.93% [kernel] [k] __remove_mapping
>    - 3.86% __remove_mapping
>       - 18.35% _raw_spin_lock_irqsave
>            __pv_queued_spin_lock_slowpath
>         1.32% __delete_from_page_cache
>       - 0.92% _raw_spin_unlock_irqrestore
>            __raw_callee_save___pv_queued_spin_unlock

Ok, that's all very consistent with my profiles, except - obviously -
for the crazy spinlock thing.

One difference is that your unlock has that PV unlock thing - on raw
hardware it's just a single store. But I don't think I saw the
unlock_slowpath in there. There's nothing really expensive going on
there as far as I can tell.

> And the instruction level profile:

Yup. The bulk is in the cmpxchg and a cache miss (it just shows up in
the instruction after it: you can use "cycles:pp" - e.g. "perf record
-e cycles:pp" - to get perf to actually try to pin the blame on the
instruction that _causes_ the stall rather than the instruction
following it, but in this case it's all trivial).

> It's the same code AFAICT, except the pv version jumps straight to
> the "queue" case.

Yes. Your profile looks perfectly fine. Most of the profile is right
after the 'pause', which you'd expect.

From a quick look, it seems like only about 2/3rds of the time is
actually spent in the "pause" loop, but the control flow is complex
enough that maybe I didn't follow it right. The native case is
simpler.

But since I suspect that it's not so much about the spinlocked region
being too costly, but just about locking too damn much, that 2/3rds
actually makes sense: it's not that it's necessarily spinning waiting
for the lock all that long in any individual case, it's just that the
spin_lock code is called so much.

So I still kind of just blame kswapd, rather than any new expense. It
would be interesting to hear if Mel is right about that kswapd
sleeping change between 4.6 and 4.7.
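To make the "single store" point above concrete, here is a minimal
userspace C sketch of the shape of the native unlock. It is an
illustrative stand-in, not the kernel's actual queued_spin_unlock();
the struct and function names are made up for the example. The PV
variant instead bounces through
__raw_callee_save___pv_queued_spin_unlock, which is exactly what shows
up in the profile above.

#include <stdatomic.h>
#include <stdio.h>

struct sketch_qspinlock {		/* hypothetical stand-in type */
	atomic_uint val;		/* bit 0 = locked */
};

/* Native unlock: one release store - a single mov on x86, which is
 * why the unlock barely registers in profiles on raw hardware. */
static void sketch_native_unlock(struct sketch_qspinlock *lock)
{
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}

int main(void)
{
	struct sketch_qspinlock lock = { 1 };	/* start out locked */

	sketch_native_unlock(&lock);
	printf("val after unlock: %u\n", atomic_load(&lock.val));
	return 0;
}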
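The lock side is the same story: the whole qspinlock fast path is a
single cmpxchg of 0 -> locked, and a cold or contended cache line makes
that cmpxchg miss. Again a hedged userspace sketch of the shape with
made-up names, not the kernel code - the real slowpath queues waiters
MCS-style (the "queue" case above) rather than just spinning like this:

#include <stdatomic.h>

#define SKETCH_LOCKED	1u	/* stand-in for _Q_LOCKED_VAL */

struct sketch_qspinlock {
	atomic_uint val;
};

static void sketch_lock_slowpath(struct sketch_qspinlock *lock)
{
	unsigned int expected = 0;

	/* Toy slowpath: spin with 'pause' - this loop is where most of
	 * the profile hits land, right after the pause instruction. */
	while (!atomic_compare_exchange_weak_explicit(&lock->val,
			&expected, SKETCH_LOCKED,
			memory_order_acquire, memory_order_relaxed)) {
		expected = 0;
#if defined(__x86_64__) || defined(__i386__)
		__builtin_ia32_pause();
#endif
	}
}

static void sketch_lock(struct sketch_qspinlock *lock)
{
	unsigned int expected = 0;

	/* Fast path: one cmpxchg. The cache miss it takes gets blamed
	 * on the *next* instruction unless you profile with cycles:pp. */
	if (atomic_compare_exchange_strong_explicit(&lock->val,
			&expected, SKETCH_LOCKED,
			memory_order_acquire, memory_order_relaxed))
		return;
	sketch_lock_slowpath(lock);
}

int main(void)
{
	struct sketch_qspinlock lock = { 0 };

	sketch_lock(&lock);	/* uncontended: fast path only */
	atomic_store_explicit(&lock.val, 0, memory_order_release);
	return 0;
}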
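And the "locking too damn much" effect is easy to reproduce in
userspace too: make the critical section trivial and the call rate
huge, and the profile lands in the lock code even though no individual
acquisition waits very long. A hypothetical little demo (build with
"cc -O2 demo.c -lpthread", then profile it with perf) - the names and
numbers are invented, it just mimics many contexts hammering one
spinlock the way reclaim hammers mapping->tree_lock:

#include <pthread.h>
#include <stdio.h>

#define THREADS	8
#define ITERS	1000000

static pthread_spinlock_t lock;
static unsigned long pages_removed;	/* stand-in for the real work */

static void *reclaim_worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		pthread_spin_lock(&lock);
		pages_removed++;	/* tiny critical section */
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[THREADS];

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < THREADS; i++)
		pthread_create(&tid[i], NULL, reclaim_worker, NULL);
	for (int i = 0; i < THREADS; i++)
		pthread_join(tid[i], NULL);

	/* Nearly all cycles show up in the spin lock/unlock, not in
	 * the increment - short critical section, huge call count. */
	printf("removed %lu pages\n", pages_removed);
	return 0;
}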
Linus