Re: [PATCH] LoongArch: Make -mstrict-align be configurable

From: Jianmin Lv <lvjianmin@loongson.cn>
To: WANG Xuerui <kernel@xen0n.name>,
	Huacai Chen <chenhuacai@loongson.cn>,
	Arnd Bergmann <arnd@arndb.de>,
	Huacai Chen <chenhuacai@kernel.org>
Cc: loongarch@lists.linux.dev, linux-arch@vger.kernel.org,
	Xuefeng Li <lixuefeng@loongson.cn>, Guo Ren <guoren@kernel.org>,
	Jiaxun Yang <jiaxun.yang@flygoat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] LoongArch: Make -mstrict-align be configurable
Date: Mon, 6 Feb 2023 18:24:53 +0800
Message-ID: <5303aeda-5c66-ede6-b3ac-7d8ebd73ec70@loongson.cn>
In-Reply-To: <5fc85453-1e2c-1f00-7879-1b5fa318c78a@xen0n.name>

On 2023/2/2 6:30 PM, WANG Xuerui wrote:
> On 2023/2/2 16:42, Huacai Chen wrote:
>> Introduce Kconfig option ARCH_STRICT_ALIGN to make -mstrict-align be
>> configurable.
>>
>> Not all LoongArch cores support h/w unaligned access, we can use the
>> -mstrict-align build parameter to prevent unaligned accesses.
>>
>> This option is disabled by default to optimise for performance, but you
>> can enabled it manually if you want to run kernel on systems without h/w
>> unaligned access support.
> 
> It's customary to accompany "performance-related" changes like this with 
> some benchmark numbers and concrete use cases where this would be 
> profitable. Especially given that arch/loongarch developer and user base 
> is relatively small, we probably don't want to allow customization of 
> such a low-level characteristic. In general kernel performance does not 
> vary much with compiler flags like this, so I'd really hope to see some 
> numbers here to convince people that this is *really* providing gains.
> 
> Also, defaulting to emitting unaligned accesses would mean those future, 
> likely embedded models (and AFAIK some existing models that haven't 
> reached GA yet) would lose support with the defconfig. Which means 
> downstream packagers that care about those use cases would have one more 
> non-default, non-generic option to carry within their Kconfig. We
> probably don't want to repeat the history of other architectures (think
> arch/arm or arch/mips) where there weren't really generic builds and
> board-specific tweaks proliferated.
> 

Hi, Xuerui

I think kernels built with and without -mstrict-align differ mainly in
the following ways:
- Different size. I built two kernels (vmlinux): the one built with
-mstrict-align is 26533376 bytes, and the one built without
-mstrict-align is 26123280 bytes.
- Different performance. For example, for the kernel function jhash(), the
assembly code excerpts without and with -mstrict-align are as follows:

without -mstrict-align:
900000000032736c <jhash>:
900000000032736c:       15bd5b6d        lu12i.w         $t1, -136485(0xdeadb)
9000000000327370:       03bbbdad        ori             $t1, $t1, 0xeef
9000000000327374:       001019ad        add.w           $t1, $t1, $a2
9000000000327378:       001015ae        add.w           $t2, $t1, $a1
900000000032737c:       0280300c        addi.w          $t0, $zero, 12(0xc)
9000000000327380:       00150091        move            $t5, $a0
9000000000327384:       001501d0        move            $t4, $t2
9000000000327388:       001501c4        move            $a0, $t2
900000000032738c:       6c009585        bgeu            $t0, $a1, 148(0x94)     # 9000000000327420 <jhash+0xb4>
9000000000327390:       02803012        addi.w          $t6, $zero, 12(0xc)
9000000000327394:       24000a2f        ldptr.w         $t3, $t5, 8(0x8)
9000000000327398:       2400022d        ldptr.w         $t1, $t5, 0
900000000032739c:       2400062c        ldptr.w         $t0, $t5, 4(0x4)
90000000003273a0:       001011e4        add.w           $a0, $t3, $a0
90000000003273a4:       001111af        sub.w           $t3, $t1, $a0
90000000003273a8:       001039ef        add.w           $t3, $t3, $t2
90000000003273ac:       004cf08e        rotri.w         $t2, $a0, 0x1c
90000000003273b0:       0010418c        add.w           $t0, $t0, $t4
...

with -mstrict-align:
90000000003310c0 <jhash>:
90000000003310c0:       15bd5b6f        lu12i.w         $t3, -136485(0xdeadb)
90000000003310c4:       03bbbdef        ori             $t3, $t3, 0xeef
90000000003310c8:       001019ef        add.w           $t3, $t3, $a2
90000000003310cc:       001015e6        add.w           $a2, $t3, $a1
90000000003310d0:       0280300d        addi.w          $t1, $zero, 12(0xc)
90000000003310d4:       0015008c        move            $t0, $a0
90000000003310d8:       001500d2        move            $t6, $a2
90000000003310dc:       001500c4        move            $a0, $a2
90000000003310e0:       6c0101a5        bgeu            $t1, $a1, 256(0x100)    # 90000000003311e0 <jhash+0x120>
90000000003310e4:       02803011        addi.w          $t5, $zero, 12(0xc)
90000000003310e8:       2a002589        ld.bu           $a5, $t0, 9(0x9)
90000000003310ec:       2a00218d        ld.bu           $t1, $t0, 8(0x8)
90000000003310f0:       2a002988        ld.bu           $a4, $t0, 10(0xa)
90000000003310f4:       2a000587        ld.bu           $a3, $t0, 1(0x1)
90000000003310f8:       2a002d8e        ld.bu           $t2, $t0, 11(0xb)
90000000003310fc:       2a00018b        ld.bu           $a7, $t0, 0
9000000000331100:       2a000994        ld.bu           $t8, $t0, 2(0x2)
9000000000331104:       2a001593        ld.bu           $t7, $t0, 5(0x5)
9000000000331108:       2a000d8f        ld.bu           $t3, $t0, 3(0x3)
900000000033110c:       00412129        slli.d          $a5, $a5, 0x8
9000000000331110:       2a00118a        ld.bu           $a6, $t0, 4(0x4)
9000000000331114:       2a001990        ld.bu           $t4, $t0, 6(0x6)
9000000000331118:       00153529        or              $a5, $a5, $t1
...
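
To make the difference between the two listings easier to read, here is a
rough user-space sketch of the two ways a 32-bit word can be loaded from a
possibly unaligned address. This is not the kernel's jhash() code; the helper
names load32_bytewise() and load32_memcpy() are made up for illustration, and
little-endian byte order is assumed. The first helper corresponds to the
ld.bu/slli.d/or sequence, the second is the kind of source that a compiler may
lower to a single word load when unaligned accesses are allowed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Byte-wise little-endian load. Only single-byte accesses are made, so
 * this is safe at any address. It mirrors the ld.bu + shift + or pattern
 * seen in the -mstrict-align listing.
 */
static uint32_t load32_bytewise(const unsigned char *p)
{
	assert(p != NULL);
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}

/*
 * memcpy-based load in host byte order. A compiler that is allowed to
 * emit unaligned accesses may reduce this to one word load; with
 * -mstrict-align it has to fall back to something like the byte-wise
 * form above.
 */
static uint32_t load32_memcpy(const unsigned char *p)
{
	uint32_t v;

	assert(p != NULL);
	memcpy(&v, p, sizeof(v));
	return v;
}
```

On a little-endian machine both helpers return the same value for the same
address; only the number of emitted instructions differs.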

It is difficult for me to measure the performance difference on a real
kernel path that performs unaligned accesses. So I used a kernel module
with simple test code to show the difference on a 3A5000, as follows:

C code:

         preempt_disable();
         start = ktime_get_ns();
         for (i = 0; i < n; i++)
                 assign(p1[i], q1[i]);
         end = ktime_get_ns();
         preempt_enable();

         printk("mstrict-align-test took: %lld nsec\n", end - start);

assembly code without -mstrict-align:
0:   260000ac        ldptr.d         $t0, $a1, 0
4:   2700008c        stptr.d         $t0, $a0, 0
8:   4c000020        jirl            $zero, $ra, 0

assembly code with -mstrict-align:
0:   2a0000b3        ld.bu           $t7, $a1, 0
4:   2a0004b2        ld.bu           $t6, $a1, 1(0x1)
8:   2a0008b1        ld.bu           $t5, $a1, 2(0x2)
c:   2a000cb0        ld.bu           $t4, $a1, 3(0x3)
10:   2a0010af        ld.bu           $t3, $a1, 4(0x4)
14:   2a0014ae        ld.bu           $t2, $a1, 5(0x5)
18:   2a0018ad        ld.bu           $t1, $a1, 6(0x6)
1c:   2a001cac        ld.bu           $t0, $a1, 7(0x7)
20:   29000093        st.b            $t7, $a0, 0
24:   29000492        st.b            $t6, $a0, 1(0x1)
28:   29000891        st.b            $t5, $a0, 2(0x2)
2c:   29000c90        st.b            $t4, $a0, 3(0x3)
30:   2900108f        st.b            $t3, $a0, 4(0x4)
34:   2900148e        st.b            $t2, $a0, 5(0x5)
38:   2900188d        st.b            $t1, $a0, 6(0x6)
3c:   29001c8c        st.b            $t0, $a0, 7(0x7)
40:   4c000020        jirl            $zero, $ra, 0
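
The definition of assign() is not included in this mail. As a hypothetical
sketch only, an 8-byte copy through a type the compiler must treat as
possibly unaligned would be expected to produce code of the two shapes shown
above. The names struct packed_u64 and struct holder are assumptions made for
this illustration, and the sketch takes plain pointers, while the real test
passes p1[i] and q1[i].

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: packed, so the compiler must assume 1-byte alignment. */
struct packed_u64 {
	uint64_t v;
} __attribute__((packed));

/*
 * Hypothetical stand-in for assign(): one 8-byte copy. When unaligned
 * accesses are allowed this can become a single load/store pair; with
 * -mstrict-align the compiler has to copy byte by byte.
 */
static void assign(struct packed_u64 *dst, const struct packed_u64 *src)
{
	assert(dst != NULL && src != NULL);
	dst->v = src->v;
}

/* Helper for trying it out: the pad byte puts x at an odd offset. */
struct holder {
	unsigned char pad;
	struct packed_u64 x;
} __attribute__((packed));
```

Calling assign(&d.x, &s.x) on two struct holder objects exercises a copy
between addresses that are not 8-byte aligned.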

The test results (3 runs each) are as follows:

Testing the module built without -mstrict-align:
[root@openEuler loongson]# insmod align-test.ko
[   39.029931] mstrict-align-test took: 29603510 nsec
[root@openEuler loongson]# rmmod align-test.ko
[root@openEuler loongson]# insmod align-test.ko
[   41.356007] mstrict-align-test took: 28816710 nsec
[root@openEuler loongson]# rmmod align-test.ko
[root@openEuler loongson]# insmod align-test.ko
[   43.506624] mstrict-align-test took: 30030700 nsec
[root@openEuler loongson]# rmmod align-test.ko

Testing the module built with -mstrict-align:
[root@openEuler ~]# insmod align-test.ko
[   92.656477] mstrict-align-test took: 59629000 nsec
[root@openEuler ~]# rmmod align-test.ko
[root@openEuler ~]# insmod align-test.ko
[   99.473011] mstrict-align-test took: 58972250 nsec
[root@openEuler ~]# rmmod align-test.ko
[root@openEuler ~]# insmod align-test.ko
[  104.620103] mstrict-align-test took: 59419260 nsec
[root@openEuler ~]# rmmod align-test.ko
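
The numbers above can be condensed with a small stand-alone C sketch. The
sizes and timings are copied from this mail; the helper names mean_u64() and
ratio() are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Figures copied from this mail. */
static const uint64_t size_strict  = 26533376; /* vmlinux with -mstrict-align */
static const uint64_t size_relaxed = 26123280; /* vmlinux without it */
static const uint64_t ns_relaxed[3] = { 29603510, 28816710, 30030700 };
static const uint64_t ns_strict[3]  = { 59629000, 58972250, 59419260 };

/* Arithmetic mean of n samples (integer division, truncating). */
static uint64_t mean_u64(const uint64_t *v, size_t n)
{
	uint64_t sum = 0;

	assert(v != NULL && n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / n;
}

/* Ratio a/b as a double. */
static double ratio(uint64_t a, uint64_t b)
{
	assert(b != 0);
	return (double)a / (double)b;
}
```

With these inputs, the -mstrict-align vmlinux is 410096 bytes (about 1.6%)
larger, the mean run time is 29483640 nsec without -mstrict-align and
59340170 nsec with it, so this micro-test is about 2.01 times slower with
-mstrict-align on the 3A5000.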

Thanks!
Jianmin

>>
>> Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
>> ---
>>   arch/loongarch/Kconfig  | 10 ++++++++++
>>   arch/loongarch/Makefile |  2 ++
>>   2 files changed, 12 insertions(+)
>>
>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>> index 9cc8b84f7eb0..7470dcfb32f0 100644
>> --- a/arch/loongarch/Kconfig
>> +++ b/arch/loongarch/Kconfig
>> @@ -441,6 +441,16 @@ config ARCH_IOREMAP
>>         protection support. However, you can enable LoongArch DMW-based
>>         ioremap() for better performance.
>> +config ARCH_STRICT_ALIGN
>> +    bool "Enable -mstrict-align to prevent unaligned accesses"
>> +    help
>> +      Not all LoongArch cores support h/w unaligned access, we can use
>> +      -mstrict-align build parameter to prevent unaligned accesses.
>> +
>> +      This is disabled by default to optimise for performance, you can
>> +      enabled it manually if you want to run kernel on systems without
>> +      h/w unaligned access support.
>> +
>>   config KEXEC
>>       bool "Kexec system call"
>>       select KEXEC_CORE
>> diff --git a/arch/loongarch/Makefile b/arch/loongarch/Makefile
>> index 4402387d2755..ccfb52700237 100644
>> --- a/arch/loongarch/Makefile
>> +++ b/arch/loongarch/Makefile
>> @@ -91,10 +91,12 @@ KBUILD_CPPFLAGS += -DVMLINUX_LOAD_ADDRESS=$(load-y)
>>   # instead of .eh_frame so we don't discard them.
>>   KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
>> +ifdef CONFIG_ARCH_STRICT_ALIGN
>>   # Don't emit unaligned accesses.
>>   # Not all LoongArch cores support unaligned access, and as kernel we can't
>>   # rely on others to provide emulation for these accesses.
>>   KBUILD_CFLAGS += $(call cc-option,-mstrict-align)
>> +endif
>>
>>   KBUILD_CFLAGS += -isystem $(shell $(CC) -print-file-name=include)
>