From: yury.norov@gmail.com (Yury)
Date: Sat, 29 Aug 2015 18:15:15 +0300
Subject: [PATCH] lib: Make _find_next_bit helper function inline
In-Reply-To:
References: <1438110564-19932-1-git-send-email-cburden@codeaurora.org>
 <55B7F2C6.9010000@gmail.com>
 <20150728144537.67d46b5714c99d25f0bb33fb@linux-foundation.org>
 <1438176656.18723.8.camel@ceres>
 <55B93A47.90107@codeaurora.org>
Message-ID: <55E1CC83.1010007@gmail.com>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On 24.08.2015 01:53, Alexey Klimov wrote:
> Hi Cassidy,
>
>
> On Wed, Jul 29, 2015 at 11:40 PM, Cassidy Burden wrote:
>> I changed the test module so it now sets the entire array to all
>> 0s/1s and only flips a few bits. There appears to be a performance
>> benefit, but it's only 2-3% better (if that). If the main benefit of
>> the original patch was to save space, then inlining definitely
>> doesn't seem worth the small gains in real use cases.
>>
>> find_next_zero_bit (us)
>>   old     new  inline
>> 14440   17080   17086
>>  4779    5181    5069
>> 10844   12720   12746
>>  9642   11312   11253
>>  3858    3818    3668
>> 10540   12349   12307
>> 12470   14716   14697
>>  5403    6002    5942
>>  2282    1820    1418
>> 13632   16056   15998
>> 11048   13019   13030
>>  6025    6790    6706
>> 13255   15586   15605
>>  3038    2744    2539
>> 10353   12219   12239
>> 10498   12251   12322
>> 14767   17452   17454
>> 12785   15048   15052
>>  1655    1034     691
>>  9924   11611   11558
>>
>> find_next_bit (us)
>>   old     new  inline
>>  8535    9936    9667
>> 14666   17372   16880
>>  2315    1799    1355
>>  6578    9092    8806
>>  6548    7558    7274
>>  9448   11213   10821
>>  3467    3497    3449
>>  2719    3079    2911
>>  6115    7989    7796
>> 13582   16113   15643
>>  4643    4946    4766
>>  3406    3728    3536
>>  7118    9045    8805
>>  3174    3011    2701
>> 13300   16780   16252
>> 14285   16848   16330
>> 11583   13669   13207
>> 13063   15455   14989
>> 12661   14955   14500
>> 12068   14166   13790
>>
>> On 7/29/2015 6:30 AM, Alexey Klimov wrote:
>>>
>>> I will re-check on another machine. It's really interesting if
>>> __always_inline makes things better for aarch64 and worse for
>>> x86_64. It would be nice if someone checked it on x86_64 too.
>>
>> Very odd, this may be related to the other compiler optimizations
>> Yury mentioned?
>
> It's better to ask Yury, I hope he can answer some day.
>
> Do you need to re-check this (with more iterations or on other
> machines)?

Hi Alexey, Cassidy (restoring Rasmus, George),

I found no difference between the original and inline versions on
x86_64 (Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz):

     find_next_bit          find_next_zero_bit
  old    new  inline        old    new  inline
   24     27     28          22     28     28
   24     27     28          23     27     28
   24     27     28          23     27     28

Inspecting the assembler, I found that GCC wants to keep the helper
separate even if you mark it '__always_inline':

inline:                                  current:
 280: cmp    %rdx,%rsi                    210: cmp    %rdx,%rsi
 283: jbe    295                          213: jbe    227
 285: test   %rsi,%rsi                    215: test   %rsi,%rsi
 288: je     295                          218: je     227
 28a: push   %rbp                         21a: push   %rbp
 28b: mov    %rsp,%rbp                    21b: xor    %ecx,%ecx
 28e: callq  0                            21d: mov    %rsp,%rbp
 293: pop    %rbp                         220: callq  0 <_find_next_bit.part.0>
 294: retq                                225: pop    %rbp
 295: mov    %rsi,%rax                    226: retq
 298: retq                                227: mov    %rsi,%rax
 299: nopl   0x0(%rax)                    22a: retq
                                          22b: nopl   0x0(%rax,%rax,1)

So it looks like x86_64 gcc (at least 4.9.2 as built for Ubuntu)
ignores the '__always_inline' hint just as it ignores 'inline'. Worse,
with __always_inline the compiler does something not really smart: it
still emits the out-of-line helper, which grows the text size from
0x250 to 0x2b9 bytes, yet it doesn't actually inline the call to
eliminate the push/pop and call/ret.
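For reference, the helper under discussion looks roughly like this (a
sketch from my reading of lib/find_bit.c around this time; treat the
details, such as the 'invert' parameter, as approximate rather than
authoritative):

/* Sketch; assumes kernel context: <linux/bitmap.h>, <linux/bitops.h>,
 * <linux/kernel.h> for BITMAP_FIRST_WORD_MASK, __ffs, round_down, min. */
static inline unsigned long _find_next_bit(const unsigned long *addr,
		unsigned long nbits, unsigned long start, unsigned long invert)
{
	unsigned long tmp;

	if (!nbits || start >= nbits)
		return nbits;

	/* XOR with 'invert' (0UL or ~0UL) lets one helper serve both
	 * find_next_bit() and find_next_zero_bit(). */
	tmp = addr[start / BITS_PER_LONG] ^ invert;

	/* Mask off bits below 'start' in the first word. */
	tmp &= BITMAP_FIRST_WORD_MASK(start);
	start = round_down(start, BITS_PER_LONG);

	while (!tmp) {
		start += BITS_PER_LONG;
		if (start >= nbits)
			return nbits;
		tmp = addr[start / BITS_PER_LONG] ^ invert;
	}

	return min(start + __ffs(tmp), nbits);
}

find_next_bit() and find_next_zero_bit() are thin wrappers passing
invert = 0UL and ~0UL respectively, which is why a truly inlined helper
should let each wrapper collapse into a straight-line loop with no call
at all.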
I don't like inline, as I already said, but I believe that disabling it
completely is a bad idea. Maybe someone knows another trick to make
inline work?
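One candidate trick, completely untested here and offered only as a
sketch: leave the helper alone and mark the wrappers with GCC's
'flatten' attribute, which asks the compiler to inline every call made
inside the annotated function where possible:

/* Sketch only: 'flatten' inlines all calls inside the annotated
 * function, so the helper body should end up in the wrapper even if
 * GCC would otherwise split out a .part.0 clone. */
unsigned long __attribute__((flatten))
find_next_bit(const unsigned long *addr, unsigned long size,
	      unsigned long offset)
{
	return _find_next_bit(addr, size, offset, 0UL);
}

Whether that actually avoids the _find_next_bit.part.0 split on this
gcc, and whether it buys anything over the measured 2-3%, would need
checking.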