* [PATCH] lib: Make _find_next_bit helper function inline
  From: Cassidy Burden @ 2015-07-28 19:09 UTC
  To: yury.norov, akpm
  Cc: linux-arm-msm, linux-kernel, linux-arm-kernel, Cassidy Burden,
      Alexey Klimov, David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Lai Jiangshan, Mark Salter, AKASHI Takahiro, Thomas Graf,
      Valentin Rothberg, Chris Wilson

I've tested Yury Norov's find_bit reimplementation with the test_find_bit
module (https://lkml.org/lkml/2015/3/8/141) and measured about 35-40%
performance degradation on arm64 3.18 run with fixed CPU frequency.

The performance degradation appears to be caused by the helper function
_find_next_bit. After inlining this function into find_next_bit and
find_next_zero_bit I get slightly better performance than the old
implementation:

find_next_zero_bit      find_next_bit
old  new  inline        old  new  inline
 26   36   24            24   33   23
 25   36   24            24   33   23
 26   36   24            24   33   23
 25   36   24            24   33   23
 25   36   24            24   33   23
 25   37   24            24   33   23
 25   37   24            24   33   23
 25   37   24            24   33   23
 25   36   24            24   33   23
 25   37   24            24   33   23

Signed-off-by: Cassidy Burden <cburden@codeaurora.org>
Cc: Alexey Klimov <klimov.linux@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Valentin Rothberg <valentinrothberg@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 lib/find_bit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 18072ea..d0e04f9 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -28,7 +28,7 @@
  * find_next_zero_bit. The difference is the "invert" argument, which
  * is XORed with each fetched word before searching it for one bits.
  */
-static unsigned long _find_next_bit(const unsigned long *addr,
+static inline unsigned long _find_next_bit(const unsigned long *addr,
 		unsigned long nbits, unsigned long start, unsigned long invert)
 {
 	unsigned long tmp;

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
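[Editor's note] For readers without the tree at hand, the helper under discussion looks roughly like the self-contained userspace sketch below. This is a rendering of lib/find_bit.c's _find_next_bit from around this time, not the exact kernel source: BITMAP_FIRST_WORD_MASK is approximated by a local FIRST_WORD_MASK macro and __ffs by __builtin_ctzl.

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* All bits at positions >= (start % BITS_PER_LONG) set; stand-in for
 * the kernel's BITMAP_FIRST_WORD_MASK. */
#define FIRST_WORD_MASK(start) (~0UL << ((start) % BITS_PER_LONG))

/* Userspace sketch of the helper from lib/find_bit.c. "invert" is XORed
 * into each fetched word, so one loop serves both searches. */
static inline unsigned long _find_next_bit(const unsigned long *addr,
		unsigned long nbits, unsigned long start, unsigned long invert)
{
	unsigned long tmp;

	if (!nbits || start >= nbits)
		return nbits;

	tmp = addr[start / BITS_PER_LONG] ^ invert;
	tmp &= FIRST_WORD_MASK(start);      /* ignore bits below start */
	start -= start % BITS_PER_LONG;     /* round down to a word boundary */

	while (!tmp) {
		start += BITS_PER_LONG;
		if (start >= nbits)
			return nbits;
		tmp = addr[start / BITS_PER_LONG] ^ invert;
	}

	start += __builtin_ctzl(tmp);       /* index of lowest set bit */
	return start < nbits ? start : nbits;
}

unsigned long find_next_bit(const unsigned long *addr, unsigned long nbits,
		unsigned long start)
{
	return _find_next_bit(addr, nbits, start, 0UL);
}

unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long nbits,
		unsigned long start)
{
	return _find_next_bit(addr, nbits, start, ~0UL);
}
```

The single "invert" argument is what lets one body back both find_next_bit (invert = 0) and find_next_zero_bit (invert = ~0UL), which is also why inlining the helper duplicates the search loop into both callers.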
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  From: Yury @ 2015-07-28 21:23 UTC
  To: Cassidy Burden, akpm
  Cc: linux-arm-msm, linux-kernel, linux-arm-kernel, Alexey Klimov,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa, Lai Jiangshan,
      Mark Salter, AKASHI Takahiro, Thomas Graf, Valentin Rothberg,
      Chris Wilson

On 28.07.2015 22:09, Cassidy Burden wrote:
> I've tested Yury Norov's find_bit reimplementation with the test_find_bit
> module (https://lkml.org/lkml/2015/3/8/141) and measured about 35-40%
> performance degradation on arm64 3.18 run with fixed CPU frequency.
>
> The performance degradation appears to be caused by the
> helper function _find_next_bit. After inlining this function into
> find_next_bit and find_next_zero_bit I get slightly better performance
> than the old implementation:
>
> find_next_zero_bit      find_next_bit
> old  new  inline        old  new  inline
>  26   36   24            24   33   23
>  25   36   24            24   33   23
>  26   36   24            24   33   23
>  25   36   24            24   33   23
>  25   36   24            24   33   23
>  25   37   24            24   33   23
>  25   37   24            24   33   23
>  25   37   24            24   33   23
>  25   36   24            24   33   23
>  25   37   24            24   33   23
>
> Signed-off-by: Cassidy Burden <cburden@codeaurora.org>
> Cc: Alexey Klimov <klimov.linux@gmail.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Daniel Borkmann <dborkman@redhat.com>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
> Cc: Mark Salter <msalter@redhat.com>
> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
> Cc: Thomas Graf <tgraf@suug.ch>
> Cc: Valentin Rothberg <valentinrothberg@gmail.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  lib/find_bit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/find_bit.c b/lib/find_bit.c
> index 18072ea..d0e04f9 100644
> --- a/lib/find_bit.c
> +++ b/lib/find_bit.c
> @@ -28,7 +28,7 @@
>   * find_next_zero_bit. The difference is the "invert" argument, which
>   * is XORed with each fetched word before searching it for one bits.
>   */
> -static unsigned long _find_next_bit(const unsigned long *addr,
> +static inline unsigned long _find_next_bit(const unsigned long *addr,
>  		unsigned long nbits, unsigned long start, unsigned long invert)
>  {
>  	unsigned long tmp;

Hi Cassidi,

At first, I'm really surprised that there's no assembler implementation
of find_bit routines for aarch64. Aarch32 has ones...

I was thinking on inlining the helper, but decided not to do this...

1. Test is not too realistic. https://lkml.org/lkml/2015/2/1/224
The typical usage pattern is to look for a single bit or range of bits.
So in practice nobody calls find_next_bit thousand times.

2. Way more important to fit functions into as less cache lines as
possible. https://lkml.org/lkml/2015/2/12/114
In this case, inlining increases cache lines consumption almost twice...

3. Inlining prevents compiler from some other possible optimizations. It's
probable that in real module compiler will inline callers of _find_next_bit,
and final output will be better. I don't like to point out the compiler how
it should do its work.

Nevertheless, if this is your real case, and inlining helps, I'm OK with it.

But I think, before/after for x86 is needed as well.
And why don't you consider '__always_inline__'? Simple inline is only a
hint and guarantees nothing.
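[Editor's note] Yury's first objection is about call patterns: in-tree users mostly walk a bitmap once through wrappers such as for_each_set_bit rather than calling find_next_bit thousands of times in a tight loop. A userspace sketch of that pattern follows; the macro is modeled on the kernel's, and the helper here is a naive bit-at-a-time stand-in, not the real implementation.

```c
#include <limits.h>

#define BPL (sizeof(unsigned long) * CHAR_BIT)

/* Naive stand-in for the kernel's find_next_bit. */
static unsigned long find_next_bit(const unsigned long *addr,
                                   unsigned long nbits, unsigned long start)
{
    for (unsigned long i = start; i < nbits; i++)
        if (addr[i / BPL] & (1UL << (i % BPL)))
            return i;
    return nbits;
}

/* Modeled on the kernel's for_each_set_bit iterator macro. */
#define for_each_set_bit(bit, addr, size)                \
    for ((bit) = find_next_bit((addr), (size), 0);       \
         (bit) < (size);                                 \
         (bit) = find_next_bit((addr), (size), (bit) + 1))

/* Example caller: sum the positions of the set bits in a small mask. */
unsigned long sum_positions(const unsigned long *mask, unsigned long nbits)
{
    unsigned long bit, sum = 0;

    for_each_set_bit(bit, mask, nbits)
        sum += bit;
    return sum;
}
```

In this shape each bit is visited once per walk, so the per-call overhead of the out-of-line helper is paid a handful of times, not thousands — which is the sense in which the micro-benchmark overstates the win from inlining.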
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  From: Yury @ 2015-07-28 21:38 UTC
  To: Cassidy Burden
  Cc: akpm, linux-arm-msm, linux-kernel, linux-arm-kernel, Alexey Klimov,
      David S. Miller, Daniel Borkmann, Hannes Frederic Sowa, Lai Jiangshan,
      Mark Salter, AKASHI Takahiro, Thomas Graf, Valentin Rothberg,
      Chris Wilson, Rasmus Villemoes, linux

On 29.07.2015 00:23, Yury wrote:
> On 28.07.2015 22:09, Cassidy Burden wrote:
>> I've tested Yury Norov's find_bit reimplementation with the test_find_bit
>> module (https://lkml.org/lkml/2015/3/8/141) and measured about 35-40%
>> performance degradation on arm64 3.18 run with fixed CPU frequency.
>>
>> The performance degradation appears to be caused by the
>> helper function _find_next_bit. After inlining this function into
>> find_next_bit and find_next_zero_bit I get slightly better performance
>> than the old implementation:
>>
>> find_next_zero_bit      find_next_bit
>> old  new  inline        old  new  inline
>>  26   36   24            24   33   23
>>  25   36   24            24   33   23
>>  26   36   24            24   33   23
>>  25   36   24            24   33   23
>>  25   36   24            24   33   23
>>  25   37   24            24   33   23
>>  25   37   24            24   33   23
>>  25   37   24            24   33   23
>>  25   36   24            24   33   23
>>  25   37   24            24   33   23
>>
>> Signed-off-by: Cassidy Burden <cburden@codeaurora.org>
>> Cc: Alexey Klimov <klimov.linux@gmail.com>
>> Cc: David S. Miller <davem@davemloft.net>
>> Cc: Daniel Borkmann <dborkman@redhat.com>
>> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
>> Cc: Mark Salter <msalter@redhat.com>
>> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> Cc: Thomas Graf <tgraf@suug.ch>
>> Cc: Valentin Rothberg <valentinrothberg@gmail.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> ---
>>  lib/find_bit.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/lib/find_bit.c b/lib/find_bit.c
>> index 18072ea..d0e04f9 100644
>> --- a/lib/find_bit.c
>> +++ b/lib/find_bit.c
>> @@ -28,7 +28,7 @@
>>   * find_next_zero_bit. The difference is the "invert" argument, which
>>   * is XORed with each fetched word before searching it for one bits.
>>   */
>> -static unsigned long _find_next_bit(const unsigned long *addr,
>> +static inline unsigned long _find_next_bit(const unsigned long *addr,
>>  		unsigned long nbits, unsigned long start, unsigned long invert)
>>  {
>>  	unsigned long tmp;
>
> Hi Cassidi,
>
> At first, I'm really surprised that there's no assembler implementation
> of find_bit routines for aarch64. Aarch32 has ones...
>
> I was thinking on inlining the helper, but decided not to do this...
>
> 1. Test is not too realistic. https://lkml.org/lkml/2015/2/1/224
> The typical usage pattern is to look for a single bit or range of bits.
> So in practice nobody calls find_next_bit thousand times.
>
> 2. Way more important to fit functions into as less cache lines as
> possible. https://lkml.org/lkml/2015/2/12/114
> In this case, inlining increases cache lines consumption almost twice...
>
> 3. Inlining prevents compiler from some other possible optimizations. It's
> probable that in real module compiler will inline callers of _find_next_bit,
> and final output will be better. I don't like to point out the compiler how
> it should do its work.
>
> Nevertheless, if this is your real case, and inlining helps, I'm OK with it.
>
> But I think, before/after for x86 is needed as well.
> And why don't you consider '__always_inline__'? Simple inline is only
> a hint and guarantees nothing.

(Sorry for typo in your name. Call me Yuri next time.)

Adding Rasmus and George to CC
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  From: Andrew Morton @ 2015-07-28 21:45 UTC
  To: Yury
  Cc: Cassidy Burden, linux-arm-msm, linux-kernel, linux-arm-kernel,
      Alexey Klimov, David S. Miller, Daniel Borkmann, Hannes Frederic Sowa,
      Lai Jiangshan, Mark Salter, AKASHI Takahiro, Thomas Graf,
      Valentin Rothberg, Chris Wilson

On Wed, 29 Jul 2015 00:23:18 +0300 Yury <yury.norov@gmail.com> wrote:

> But I think, before/after for x86 is needed as well.

That would be nice.

> And why don't you consider '__always_inline__'? Simple inline is only a
> hint and guarantees nothing.

Yup. My x86_64 compiler just ignores the "inline". When I use
__always_inline, find_bit.o's text goes from 776 bytes to 863.
Hopefully we get something in return for that bloat!

Also, if _find_next_bit() benefits from this then _find_next_bit_le()
will presumably also benefit.
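[Editor's note] The inline-versus-__always_inline distinction Andrew and Yury are discussing can be shown in plain C. The macro below paraphrases the kernel's compiler-header definition (an assumption about the exact spelling, not a quote from a specific header version):

```c
/* Roughly what the kernel's compiler headers provide: */
#define __always_inline inline __attribute__((__always_inline__))

/* "inline" alone is only a hint: the compiler may still emit an
 * out-of-line copy and call it, exactly as Andrew observed. */
static inline int hint_only(int x)
{
    return x * 2;
}

/* The GCC attribute forces inlining wherever the call is possible,
 * independent of the optimizer's own cost model. */
static __always_inline int forced(int x)
{
    return x * 2;
}

int caller(int x)
{
    return hint_only(x) + forced(x);
}
```

With the hint, GCC's inlining decision still follows its size/benefit heuristics at the given -O level; with the attribute, the body is substituted at every call site, which is where the 776-to-863-byte text growth in find_bit.o comes from.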
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  From: Alexey Klimov @ 2015-07-29 13:30 UTC
  To: Andrew Morton, Cassidy Burden
  Cc: Yury, linux-arm-msm, linux-kernel, linux-arm-kernel, David S. Miller,
      Daniel Borkmann, Hannes Frederic Sowa, Lai Jiangshan, Mark Salter,
      AKASHI Takahiro, Thomas Graf, Valentin Rothberg, Chris Wilson

On Tue, 2015-07-28 at 14:45 -0700, Andrew Morton wrote:
> On Wed, 29 Jul 2015 00:23:18 +0300 Yury <yury.norov@gmail.com> wrote:
>
>> But I think, before/after for x86 is needed as well.
>
> That would be nice.
>
>> And why don't you consider '__always_inline__'? Simple inline is only a
>> hint and guarantees nothing.
>
> Yup. My x86_64 compiler just ignores the "inline". When I use
> __always_inline, find_bit.o's text goes from 776 bytes to 863.
> Hopefully we get something in return for that bloat!

On my x86_64 (core-i5 something, with disabled cpufreq) I got the
following numbers:

find_next_zero_bit
old  new  __always_inline
 20   21   22
 20   21   22
 20   22   23
 21   21   22
 21   21   23
 20   21   22
 20   21   23
 21   22   23
 20   22   22
 21   21   22

find_next_bit
old  new  __always_inline
 19   21   24
 19   22   24
 19   22   24
 19   21   24
 20   22   24
 19   21   23
 19   21   23
 20   21   24
 19   22   24
 19   21   24

I will re-check on another machine. It's really interesting if
__always_inline makes things better for aarch64 and worse for x86_64. It
will be nice if someone will check it on x86_64 too.

Best regards,
Alexey Klimov.

> Also, if _find_next_bit() benefits from this then _find_next_bit_le()
> will presumably also benefit.
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  From: Cassidy Burden @ 2015-07-29 20:40 UTC
  To: Alexey Klimov, Andrew Morton
  Cc: Yury, linux-arm-msm, linux-kernel, linux-arm-kernel, David S. Miller,
      Daniel Borkmann, Hannes Frederic Sowa, Lai Jiangshan, Mark Salter,
      AKASHI Takahiro, Thomas Graf, Valentin Rothberg, Chris Wilson

I changed the test module to now set the entire array to all 0/1s and
only flip a few bits. There appears to be a performance benefit, but
it's only 2-3% better (if that). If the main benefit of the original
patch was to save space then inlining definitely doesn't seem worth the
small gains in real use cases.

find_next_zero_bit (us)
old     new     inline
14440   17080   17086
 4779    5181    5069
10844   12720   12746
 9642   11312   11253
 3858    3818    3668
10540   12349   12307
12470   14716   14697
 5403    6002    5942
 2282    1820    1418
13632   16056   15998
11048   13019   13030
 6025    6790    6706
13255   15586   15605
 3038    2744    2539
10353   12219   12239
10498   12251   12322
14767   17452   17454
12785   15048   15052
 1655    1034     691
 9924   11611   11558

find_next_bit (us)
old     new     inline
 8535    9936    9667
14666   17372   16880
 2315    1799    1355
 6578    9092    8806
 6548    7558    7274
 9448   11213   10821
 3467    3497    3449
 2719    3079    2911
 6115    7989    7796
13582   16113   15643
 4643    4946    4766
 3406    3728    3536
 7118    9045    8805
 3174    3011    2701
13300   16780   16252
14285   16848   16330
11583   13669   13207
13063   15455   14989
12661   14955   14500
12068   14166   13790

On 7/29/2015 6:30 AM, Alexey Klimov wrote:
> I will re-check on another machine. It's really interesting if
> __always_inline makes things better for aarch64 and worse for x86_64. It
> will be nice if someone will check it on x86_64 too.

Very odd, this may be related to the other compiler optimizations Yuri
mentioned?

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
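[Editor's note] A rough userspace approximation of the modified test Cassidy describes — a uniform background with only a few flipped bits — might look like the sketch below. The real test_find_bit module, its array size, and its timing harness are not reproduced here; the names, constants, and the naive scanner are invented for illustration.

```c
#include <limits.h>
#include <string.h>

#define BPL     (sizeof(unsigned long) * CHAR_BIT)
#define NBITS   (1UL << 16)
#define NWORDS  (NBITS / BPL)
#define FLIPS   8UL    /* only a few bits differ from the background */

static unsigned long bm[NWORDS];

/* Naive stand-in for the kernel's find_next_bit. */
static unsigned long next_bit(unsigned long start)
{
    for (unsigned long i = start; i < NBITS; i++)
        if (bm[i / BPL] & (1UL << (i % BPL)))
            return i;
    return NBITS;
}

/* The pattern Cassidy describes: an all-zero background with only a few
 * flipped bits, instead of a uniformly random bitmap. */
void setup_sparse_bitmap(void)
{
    memset(bm, 0, sizeof(bm));
    for (unsigned long i = 0; i < FLIPS; i++) {
        unsigned long pos = (NBITS / FLIPS) * i + 17;  /* arbitrary spread */
        bm[pos / BPL] |= 1UL << (pos % BPL);
    }
}

/* Walk the bitmap the way such a benchmark would, counting the hits;
 * a real test would wrap this loop in a timer. */
unsigned long count_set_bits(void)
{
    unsigned long n = 0;

    for (unsigned long b = next_bit(0); b < NBITS; b = next_bit(b + 1))
        n++;
    return n;
}
```

With a sparse bitmap the walk terminates after FLIPS iterations, so the fixed per-call overhead dominates far less than in the dense micro-benchmark — consistent with the 2-3% figure Cassidy reports.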
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  From: Alexey Klimov @ 2015-08-23 22:53 UTC
  To: Cassidy Burden
  Cc: Andrew Morton, Yury, linux-arm-msm, Linux Kernel Mailing List,
      linux-arm-kernel, David S. Miller, Daniel Borkmann,
      Hannes Frederic Sowa, Lai Jiangshan, Mark Salter, AKASHI Takahiro,
      Thomas Graf, Valentin Rothberg, Chris Wilson

Hi Cassidy,

On Wed, Jul 29, 2015 at 11:40 PM, Cassidy Burden <cburden@codeaurora.org> wrote:
> I changed the test module to now set the entire array to all 0/1s and
> only flip a few bits. There appears to be a performance benefit, but
> it's only 2-3% better (if that). If the main benefit of the original
> patch was to save space then inlining definitely doesn't seem worth the
> small gains in real use cases.
>
> find_next_zero_bit (us)
> old     new     inline
> 14440   17080   17086
>  4779    5181    5069
> 10844   12720   12746
>  9642   11312   11253
>  3858    3818    3668
> 10540   12349   12307
> 12470   14716   14697
>  5403    6002    5942
>  2282    1820    1418
> 13632   16056   15998
> 11048   13019   13030
>  6025    6790    6706
> 13255   15586   15605
>  3038    2744    2539
> 10353   12219   12239
> 10498   12251   12322
> 14767   17452   17454
> 12785   15048   15052
>  1655    1034     691
>  9924   11611   11558
>
> find_next_bit (us)
> old     new     inline
>  8535    9936    9667
> 14666   17372   16880
>  2315    1799    1355
>  6578    9092    8806
>  6548    7558    7274
>  9448   11213   10821
>  3467    3497    3449
>  2719    3079    2911
>  6115    7989    7796
> 13582   16113   15643
>  4643    4946    4766
>  3406    3728    3536
>  7118    9045    8805
>  3174    3011    2701
> 13300   16780   16252
> 14285   16848   16330
> 11583   13669   13207
> 13063   15455   14989
> 12661   14955   14500
> 12068   14166   13790
>
> On 7/29/2015 6:30 AM, Alexey Klimov wrote:
>> I will re-check on another machine. It's really interesting if
>> __always_inline makes things better for aarch64 and worse for x86_64. It
>> will be nice if someone will check it on x86_64 too.
>
> Very odd, this may be related to the other compiler optimizations Yuri
> mentioned?

It's better to ask Yury, i hope he can answer some day.

Do you need to re-check this (with more iterations or on another machine(s))?

-- 
Best regards,
Klimov Alexey
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  From: Yury @ 2015-08-29 15:15 UTC
  To: Alexey Klimov, Cassidy Burden
  Cc: Andrew Morton, linux-arm-msm, Linux Kernel Mailing List,
      linux-arm-kernel, David S. Miller, Daniel Borkmann,
      Hannes Frederic Sowa, Lai Jiangshan, Mark Salter, AKASHI Takahiro,
      Thomas Graf, Valentin Rothberg, Chris Wilson, Rasmus Villemoes, linux

On 24.08.2015 01:53, Alexey Klimov wrote:
> Hi Cassidy,
>
> On Wed, Jul 29, 2015 at 11:40 PM, Cassidy Burden <cburden@codeaurora.org> wrote:
>> I changed the test module to now set the entire array to all 0/1s and
>> only flip a few bits. There appears to be a performance benefit, but
>> it's only 2-3% better (if that). If the main benefit of the original
>> patch was to save space then inlining definitely doesn't seem worth the
>> small gains in real use cases.
>>
>> find_next_zero_bit (us)
>> old     new     inline
>> 14440   17080   17086
>>  4779    5181    5069
>> 10844   12720   12746
>>  9642   11312   11253
>>  3858    3818    3668
>> 10540   12349   12307
>> 12470   14716   14697
>>  5403    6002    5942
>>  2282    1820    1418
>> 13632   16056   15998
>> 11048   13019   13030
>>  6025    6790    6706
>> 13255   15586   15605
>>  3038    2744    2539
>> 10353   12219   12239
>> 10498   12251   12322
>> 14767   17452   17454
>> 12785   15048   15052
>>  1655    1034     691
>>  9924   11611   11558
>>
>> find_next_bit (us)
>> old     new     inline
>>  8535    9936    9667
>> 14666   17372   16880
>>  2315    1799    1355
>>  6578    9092    8806
>>  6548    7558    7274
>>  9448   11213   10821
>>  3467    3497    3449
>>  2719    3079    2911
>>  6115    7989    7796
>> 13582   16113   15643
>>  4643    4946    4766
>>  3406    3728    3536
>>  7118    9045    8805
>>  3174    3011    2701
>> 13300   16780   16252
>> 14285   16848   16330
>> 11583   13669   13207
>> 13063   15455   14989
>> 12661   14955   14500
>> 12068   14166   13790
>>
>> On 7/29/2015 6:30 AM, Alexey Klimov wrote:
>>> I will re-check on another machine. It's really interesting if
>>> __always_inline makes things better for aarch64 and worse for x86_64. It
>>> will be nice if someone will check it on x86_64 too.
>>
>> Very odd, this may be related to the other compiler optimizations Yuri
>> mentioned?
>
> It's better to ask Yury, i hope he can answer some day.
>
> Do you need to re-check this (with more iterations or on another machine(s))?

Hi, Alexey, Cassidy, (restoring Rasmus, George)

I found no difference between original and inline versions for x86_64:
(Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz)

find_next_bit           find_next_zero_bit
old  new  inline        old  new  inline
 24   27   28            22   28   28
 24   27   28            23   27   28
 24   27   28            23   27   28

Inspecting assembler code, I found that GCC wants to see helper separated,
even if you provide '__always_inline':

inline <find_next_bit_new>:                  current <find_next_bit_new>:
280: cmp    %rdx,%rsi                        210: cmp    %rdx,%rsi
283: jbe    295 <find_next_bit_new+0x15>     213: jbe    227 <find_next_bit_new+0x17>
285: test   %rsi,%rsi                        215: test   %rsi,%rsi
288: je     295 <find_next_bit_new+0x15>     218: je     227 <find_next_bit_new+0x17>
28a: push   %rbp                             21a: push   %rbp
28b: mov    %rsp,%rbp                        21b: xor    %ecx,%ecx
28e: callq  0 <find_next_bit_new.part.0>     21d: mov    %rsp,%rbp
293: pop    %rbp                             220: callq  0 <_find_next_bit.part.0>
294: retq                                    225: pop    %rbp
295: mov    %rsi,%rax                        226: retq
298: retq                                    227: mov    %rsi,%rax
299: nopl   0x0(%rax)                        22a: retq
                                             22b: nopl   0x0(%rax,%rax,1)

So things are looking like x86_64 gcc (at least 4.9.2 build for Ubuntu)
ignores '__always_inline' hint as well as 'inline'. But in case of
__always_inline compiler does something not really smart: it introduces
<find_next_bit_new.part.0> and <find_next_zero_bit_new.part.1> helpers and
so increases text size from 0x250 to 0x2b9 bytes, but doesn't really inline
to optimize push/pop and call/ret.

I don't like inline, as I already told, but I believe that complete
disabling is bad idea. Maybe someone knows another trick to make inline
work?
* Re: [PATCH] lib: Make _find_next_bit helper function inline
  2015-08-29 15:15 ` Yury
@ 2015-08-30 21:47   ` Rasmus Villemoes
  0 siblings, 0 replies; 18+ messages in thread
From: Rasmus Villemoes @ 2015-08-30 21:47 UTC (permalink / raw)
To: Yury
Cc: Alexey Klimov, Cassidy Burden, Andrew Morton, linux-arm-msm,
    Linux Kernel Mailing List, linux-arm-kernel, David S. Miller,
    Daniel Borkmann, Hannes Frederic Sowa, Lai Jiangshan, Mark Salter,
    AKASHI Takahiro, Thomas Graf, Valentin Rothberg, Chris Wilson, linux

I've lost track of what's up and down in this, but now that I look at
this again let me throw in my two observations of stupid gcc behaviour:

For the current code, both Debian's gcc (4.7) and 5.1 partially inline
_find_next_bit, namely the "if (!nbits || start >= nbits)" test. I know
gcc does it to avoid a function call, but in this case the early-return
condition is unlikely, so there's not much to gain.

Moreover, it fails to optimize the test to simply "if (start >= nbits)" -
everything being unsigned, these are obviously equivalent.

Rasmus
end of thread, other threads:[~2015-08-30 21:47 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-28 19:09 [PATCH] lib: Make _find_next_bit helper function inline Cassidy Burden
2015-07-28 19:09 ` Cassidy Burden
2015-07-28 21:23 ` Yury
2015-07-28 21:23   ` Yury
2015-07-28 21:38   ` Yury
2015-07-28 21:38     ` Yury
2015-07-28 21:45   ` Andrew Morton
2015-07-28 21:45     ` Andrew Morton
2015-07-29 13:30   ` Alexey Klimov
2015-07-29 13:30     ` Alexey Klimov
2015-07-29 20:40     ` Cassidy Burden
2015-07-29 20:40       ` Cassidy Burden
2015-08-23 22:53       ` Alexey Klimov
2015-08-23 22:53         ` Alexey Klimov
2015-08-29 15:15         ` Yury
2015-08-29 15:15           ` Yury
2015-08-30 21:47           ` Rasmus Villemoes
2015-08-30 21:47             ` Rasmus Villemoes