From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yury <yury.norov@gmail.com>
Subject: Re: [PATCH] lib: Make _find_next_bit helper function inline
Date: Wed, 29 Jul 2015 00:38:45 +0300
Message-ID: <55B7F665.8050703@gmail.com>
References: <1438110564-19932-1-git-send-email-cburden@codeaurora.org>
 <55B7F2C6.9010000@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from mail-lb0-f194.google.com ([209.85.217.194]:36320 "EHLO
 mail-lb0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1750921AbbG1Viu (ORCPT ); Tue, 28 Jul 2015 17:38:50 -0400
In-Reply-To: <55B7F2C6.9010000@gmail.com>
Sender: linux-arm-msm-owner@vger.kernel.org
List-Id: linux-arm-msm@vger.kernel.org
To: Cassidy Burden
Cc: akpm@linux-foundation.org, linux-arm-msm@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 Alexey Klimov, "David S. Miller", Daniel Borkmann, Hannes Frederic Sowa,
 Lai Jiangshan, Mark Salter, AKASHI Takahiro, Thomas Graf,
 Valentin Rothberg, Chris Wilson, Rasmus Villemoes, linux@horizon.com

On 29.07.2015 00:23, Yury wrote:
> On 28.07.2015 22:09, Cassidy Burden wrote:
>> I've tested Yury Norov's find_bit reimplementation with the
>> test_find_bit module (https://lkml.org/lkml/2015/3/8/141) and measured
>> about 35-40% performance degradation on arm64 3.18, run with fixed
>> CPU frequency.
>>
>> The performance degradation appears to be caused by the helper
>> function _find_next_bit. After inlining this function into
>> find_next_bit and find_next_zero_bit I get slightly better performance
>> than the old implementation:
>>
>>   find_next_zero_bit          find_next_bit
>>   old   new   inline          old   new   inline
>>    26    36       24           24    33       23
>>    25    36       24           24    33       23
>>    26    36       24           24    33       23
>>    25    36       24           24    33       23
>>    25    36       24           24    33       23
>>    25    37       24           24    33       23
>>    25    37       24           24    33       23
>>    25    37       24           24    33       23
>>    25    36       24           24    33       23
>>    25    37       24           24    33       23
>>
>> Signed-off-by: Cassidy Burden
>> Cc: Alexey Klimov
>> Cc: David S. Miller
>> Cc: Daniel Borkmann
>> Cc: Hannes Frederic Sowa
>> Cc: Lai Jiangshan
>> Cc: Mark Salter
>> Cc: AKASHI Takahiro
>> Cc: Thomas Graf
>> Cc: Valentin Rothberg
>> Cc: Chris Wilson
>> ---
>>  lib/find_bit.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/lib/find_bit.c b/lib/find_bit.c
>> index 18072ea..d0e04f9 100644
>> --- a/lib/find_bit.c
>> +++ b/lib/find_bit.c
>> @@ -28,7 +28,7 @@
>>   * find_next_zero_bit. The difference is the "invert" argument, which
>>   * is XORed with each fetched word before searching it for one bits.
>>   */
>> -static unsigned long _find_next_bit(const unsigned long *addr,
>> +static inline unsigned long _find_next_bit(const unsigned long *addr,
>>  		unsigned long nbits, unsigned long start, unsigned long invert)
>>  {
>>  	unsigned long tmp;
>
> Hi Cassidi,
>
> First of all, I'm really surprised that there's no assembler
> implementation of the find_bit routines for aarch64. Aarch32 has them...
>
> I thought about inlining the helper myself, but decided against it:
>
> 1. The test is not very realistic: https://lkml.org/lkml/2015/2/1/224
> The typical usage pattern is to look for a single bit or a range of
> bits, so in practice nobody calls find_next_bit thousands of times in
> a row (a typical caller is sketched below this list).
>
> 2. It is far more important to fit these functions into as few cache
> lines as possible: https://lkml.org/lkml/2015/2/12/114
> In this case, inlining almost doubles their cache line consumption...
>
> 3. Inlining at the source level prevents the compiler from making some
> other optimizations. It's quite probable that in a real module the
> compiler will inline the callers of _find_next_bit itself, and the
> final output will be better. I'd rather not tell the compiler how to
> do its job.
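To illustrate point 1: by the typical pattern I mean a caller that scans
a small bitmap once, roughly like the made-up fragment below (NR_MY_IRQS,
handle_one_irq and the pending bitmap are invented for illustration):

	#include <linux/bitops.h>

	#define NR_MY_IRQS	64	/* invented bitmap size */

	static void handle_one_irq(unsigned int bit);	/* invented handler */

	static void handle_pending(const unsigned long *pending)
	{
		unsigned int bit;

		/*
		 * for_each_set_bit() expands to a find_first_bit() /
		 * find_next_bit() loop: one short scan over a small
		 * bitmap, not thousands of back-to-back calls as in
		 * the benchmark module.
		 */
		for_each_set_bit(bit, pending, NR_MY_IRQS)
			handle_one_irq(bit);
	}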
> Nevertheless, if this is your real-world case and inlining helps, I'm
> OK with it.
>
> But I think before/after numbers for x86 are needed as well.
>
> And why don't you consider __always_inline? A plain 'inline' is only
> a hint and guarantees nothing.
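To make that suggestion concrete: __always_inline (from
<linux/compiler.h>) forces inlining rather than hinting at it. On top of
your patch that would be something like (an untested sketch):

	--- a/lib/find_bit.c
	+++ b/lib/find_bit.c
	@@ -28,7 +28,7 @@
	-static inline unsigned long _find_next_bit(const unsigned long *addr,
	+static __always_inline unsigned long _find_next_bit(const unsigned long *addr,
	 		unsigned long nbits, unsigned long start, unsigned long invert)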
(Sorry for typo in your name. Call me Yuri next time.)

Adding Rasmus and George to CC
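P.S. For those just added to CC: the "invert" argument mentioned in the
hunk above is what lets one helper serve both searches. From memory, the
two wrappers look roughly like this (a sketch, not the verbatim
lib/find_bit.c source):

	unsigned long find_next_bit(const unsigned long *addr,
			unsigned long size, unsigned long offset)
	{
		/* invert == 0UL: the fetched words are searched as-is */
		return _find_next_bit(addr, size, offset, 0UL);
	}

	unsigned long find_next_zero_bit(const unsigned long *addr,
			unsigned long size, unsigned long offset)
	{
		/*
		 * invert == ~0UL: the XOR flips each fetched word, so
		 * zero bits become one bits before the search.
		 */
		return _find_next_bit(addr, size, offset, ~0UL);
	}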