From: Guo Ren
Date: Thu, 23 Jun 2022 16:43:19 +0800
Subject: Re: [PATCH V2 1/2] LoongArch: Add subword xchg/cmpxchg emulation
To: Huacai Chen
Cc: Huacai Chen, Arnd Bergmann, loongarch@lists.linux.dev, linux-arch,
 Xuefeng Li, Xuerui Wang, Jiaxun Yang, Peter Zijlstra, Will Deacon,
 Ingo Molnar, Rui Wang
X-Mailing-List: loongarch@lists.linux.dev
References: <20220623044752.2074066-1-chenhuacai@loongson.cn>

On Thu, Jun 23, 2022 at 4:04 PM Huacai Chen wrote:
>
> Hi, Ren,
>
> On Thu, Jun 23, 2022 at 2:49 PM Guo Ren wrote:
> >
> > On Thu, Jun 23, 2022 at 12:46 PM Huacai Chen wrote:
> > >
> > > LoongArch natively supports only 32-bit/64-bit xchg/cmpxchg, but percpu
> > > operations and qspinlock need 8-bit/16-bit xchg/cmpxchg. On NUMA systems,
> > > the performance of subword xchg/cmpxchg emulation is better than that of
> > > the generic implementation (especially for qspinlock; data will be shown
> > > in the next patch). This patch (of 2) adds subword xchg/cmpxchg emulation
> > > with ll/sc and enables its use for percpu operations.
> > The xchg/cmpxchg operations are designed for multi-processor data sharing.
> > Percpu data is not shared with other harts, so disabling irq/preemption
> > is enough. Do you have the same issue as f97fc810798c
> > ("arm64: percpu: Implement this_cpu operations")?
> Yes, very similar; our CSR operations are even slower than atomic
> operations. :(
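(To make the alternative concrete: a percpu op that relies only on masking
local interrupts -- in the spirit of the generic this_cpu_* fallbacks that
f97fc810798c replaced on arm64 -- would look roughly like the sketch below.
The name my_percpu_xchg_1() is made up for illustration and is not code from
this patch.)

#include <linux/irqflags.h>

static inline unsigned char my_percpu_xchg_1(unsigned char *ptr, unsigned char val)
{
	unsigned long flags;
	unsigned char old;

	/*
	 * Per-CPU data is only touched by its owning CPU, so no other CPU can
	 * race with this sequence; masking local interrupts (which also keeps
	 * the task from being preempted and migrated) is sufficient -- no
	 * ll/sc or am* instruction is needed. The cost is the CSR accesses
	 * hidden in local_irq_save()/local_irq_restore(), which is exactly
	 * the overhead being discussed above.
	 */
	local_irq_save(flags);
	old = *ptr;
	*ptr = val;
	local_irq_restore(flags);

	return old;
}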
> >
> > >
> > > Signed-off-by: Rui Wang
> > > Signed-off-by: Huacai Chen
> > > ---
> > >  arch/loongarch/include/asm/cmpxchg.h | 98 +++++++++++++++++++++++++++-
> > >  arch/loongarch/include/asm/percpu.h  |  8 +++
> > >  2 files changed, 105 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/loongarch/include/asm/cmpxchg.h b/arch/loongarch/include/asm/cmpxchg.h
> > > index 75b3a4478652..967e64bf2c37 100644
> > > --- a/arch/loongarch/include/asm/cmpxchg.h
> > > +++ b/arch/loongarch/include/asm/cmpxchg.h
> > > @@ -5,8 +5,9 @@
> > >  #ifndef __ASM_CMPXCHG_H
> > >  #define __ASM_CMPXCHG_H
> > >
> > > -#include
> > > +#include
> > >  #include
> > > +#include
> > >
> > >  #define __xchg_asm(amswap_db, m, val)				\
> > >  ({								\
> > > @@ -21,10 +22,53 @@
> > >  	__ret;							\
> > >  })
> > >
> > > +static inline unsigned int __xchg_small(volatile void *ptr, unsigned int val,
> > > +					unsigned int size)
> > > +{
> > > +	unsigned int shift;
> > > +	u32 old32, mask, temp;
> > > +	volatile u32 *ptr32;
> > > +
> > > +	/* Mask value to the correct size. */
> > > +	mask = GENMASK((size * BITS_PER_BYTE) - 1, 0);
> > > +	val &= mask;
> > > +
> > > +	/*
> > > +	 * Calculate a shift & mask that correspond to the value we wish to
> > > +	 * exchange within the naturally aligned 4 byte integer that includes
> > > +	 * it.
> > > +	 */
> > > +	shift = (unsigned long)ptr & 0x3;
> > > +	shift *= BITS_PER_BYTE;
> > > +	mask <<= shift;
> > > +
> > > +	/*
> > > +	 * Calculate a pointer to the naturally aligned 4 byte integer that
> > > +	 * includes our byte of interest, and load its value.
> > > +	 */
> > > +	ptr32 = (volatile u32 *)((unsigned long)ptr & ~0x3);
> > > +
> > > +	asm volatile (
> > > +	"1:	ll.w	%0, %3		\n"
> > > +	"	andn	%1, %0, %z4	\n"
> > > +	"	or	%1, %1, %z5	\n"
> > > +	"	sc.w	%1, %2		\n"
> > > +	"	beqz	%1, 1b		\n"
> > > +	: "=&r" (old32), "=&r" (temp), "=" GCC_OFF_SMALL_ASM() (*ptr32)
> > > +	: GCC_OFF_SMALL_ASM() (*ptr32), "Jr" (mask), "Jr" (val << shift)
> > > +	: "memory");
> > > +
> > > +	return (old32 & mask) >> shift;
> > > +}
> > > +
> > >  static inline unsigned long __xchg(volatile void *ptr, unsigned long x,
> > >  		int size)
> > >  {
> > >  	switch (size) {
> > > +	case 1:
> > > +	case 2:
> > > +		return __xchg_small(ptr, x, size);
> > > +
> > >  	case 4:
> > >  		return __xchg_asm("amswap_db.w", (volatile u32 *)ptr, (u32)x);
> > >
> > > @@ -67,10 +111,62 @@ static inline unsigned long __xchg(volatile void *ptr, unsigned long x,
> > >  	__ret;							\
> > >  })
> > >
> > > +static inline unsigned int __cmpxchg_small(volatile void *ptr, unsigned int old,
> > > +					unsigned int new, unsigned int size)
> > > +{
> > > +	unsigned int shift;
> > > +	u32 old32, mask, temp;
> > > +	volatile u32 *ptr32;
> > > +
> > > +	/* Mask inputs to the correct size. */
> > > +	mask = GENMASK((size * BITS_PER_BYTE) - 1, 0);
> > > +	old &= mask;
> > > +	new &= mask;
> > > +
> > > +	/*
> > > +	 * Calculate a shift & mask that correspond to the value we wish to
> > > +	 * compare & exchange within the naturally aligned 4 byte integer
> > > +	 * that includes it.
> > > +	 */
> > > +	shift = (unsigned long)ptr & 0x3;
> > > +	shift *= BITS_PER_BYTE;
> > > +	old <<= shift;
> > > +	new <<= shift;
> > > +	mask <<= shift;
> > > +
> > > +	/*
> > > +	 * Calculate a pointer to the naturally aligned 4 byte integer that
> > > +	 * includes our byte of interest, and load its value.
> > > +	 */
> > > +	ptr32 = (volatile u32 *)((unsigned long)ptr & ~0x3);
> > > +
> > > +	asm volatile (
> > > +	"1:	ll.w	%0, %3		\n"
> > > +	"	and	%1, %0, %z4	\n"
> > > +	"	bne	%1, %z5, 2f	\n"
> > > +	"	andn	%1, %0, %z4	\n"
> > > +	"	or	%1, %1, %z6	\n"
> > > +	"	sc.w	%1, %2		\n"
> > > +	"	beqz	%1, 1b		\n"
> > > +	"	b	3f		\n"
> > > +	"2:				\n"
> > > +	__WEAK_LLSC_MB
> > > +	"3:				\n"
> > > +	: "=&r" (old32), "=&r" (temp), "=" GCC_OFF_SMALL_ASM() (*ptr32)
> > > +	: GCC_OFF_SMALL_ASM() (*ptr32), "Jr" (mask), "Jr" (old), "Jr" (new)
> > > +	: "memory");
> > > +
> > > +	return (old32 & mask) >> shift;
> > > +}
> > > +
> > >  static inline unsigned long __cmpxchg(volatile void *ptr, unsigned long old,
> > >  		unsigned long new, unsigned int size)
> > >  {
> > >  	switch (size) {
> > > +	case 1:
> > > +	case 2:
> > > +		return __cmpxchg_small(ptr, old, new, size);
> > > +
> > >  	case 4:
> > >  		return __cmpxchg_asm("ll.w", "sc.w", (volatile u32 *)ptr,
> > >  				     (u32)old, new);
> > >
> > > diff --git a/arch/loongarch/include/asm/percpu.h b/arch/loongarch/include/asm/percpu.h
> > > index e6569f18c6dd..0bd6b0110198 100644
> > > --- a/arch/loongarch/include/asm/percpu.h
> > > +++ b/arch/loongarch/include/asm/percpu.h
> > > @@ -123,6 +123,10 @@ static inline unsigned long __percpu_xchg(void *ptr, unsigned long val,
> > >  						int size)
> > >  {
> > >  	switch (size) {
> > > +	case 1:
> > > +	case 2:
> > > +		return __xchg_small((volatile void *)ptr, val, size);
> > > +
> > >  	case 4:
> > >  		return __xchg_asm("amswap.w", (volatile u32 *)ptr, (u32)val);
> > The percpu operations are local, and we shouldn't combine them with the
> > normal xchg/cmpxchg (they have different semantics: one is local, the
> > other is shared), so please implement your own percpu ops here to fix
> > the irq disable/enable performance issue.
> Yes, percpu operations are local and atomic operations are for
> sharing. But we are using atomic ops to implement percpu ops (for
> performance), so is it OK to just implement the common emulated
> operations and use them for both percpu ops and qspinlock?
No, it would be better to separate them. The qspinlock only needs
xchg16_relaxed, so just keep that in LoongArch's cmpxchg.h. That would
also make it easier for Arnd to clean up xchg/cmpxchg.
>
> Huacai
> >
> > >
> > > @@ -204,9 +208,13 @@ do {									\
> > >  #define this_cpu_write_4(pcp, val) _percpu_write(pcp, val)
> > >  #define this_cpu_write_8(pcp, val) _percpu_write(pcp, val)
> > >
> > > +#define this_cpu_xchg_1(pcp, val) _percpu_xchg(pcp, val)
> > > +#define this_cpu_xchg_2(pcp, val) _percpu_xchg(pcp, val)
> > >  #define this_cpu_xchg_4(pcp, val) _percpu_xchg(pcp, val)
> > >  #define this_cpu_xchg_8(pcp, val) _percpu_xchg(pcp, val)
> > >
> > > +#define this_cpu_cmpxchg_1(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
> > > +#define this_cpu_cmpxchg_2(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
> > >  #define this_cpu_cmpxchg_4(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
> > >  #define this_cpu_cmpxchg_8(ptr, o, n) _protect_cmpxchg_local(ptr, o, n)
> > >
> > > --
> > > 2.27.0
> > >
> >
> >
> > --
> > Best Regards
> >  Guo Ren
> >
> > ML: https://lore.kernel.org/linux-csky/

--
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/
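P.S. For readers who do not follow LoongArch assembly: the ll.w/sc.w sequences
above implement the usual "subword inside an aligned word" trick. Below is a
rough user-space C model of the 16-bit cmpxchg case, using the GCC/Clang
__atomic builtins in place of the ll/sc retry loop. The name cmpxchg16_model
is made up for illustration, little-endian byte ordering is assumed (as on
LoongArch), and the __WEAK_LLSC_MB ordering detail on the failure path is
glossed over -- this is a sketch of the idea, not the patch's code.

#include <stdint.h>

static inline uint16_t cmpxchg16_model(volatile uint16_t *ptr,
				       uint16_t old, uint16_t new)
{
	uintptr_t addr = (uintptr_t)ptr;
	/* The naturally aligned 32-bit word that contains our halfword. */
	volatile uint32_t *ptr32 = (volatile uint32_t *)(addr & ~(uintptr_t)0x3);
	/* Bit offset of the halfword inside that word (0 or 16). */
	unsigned int shift = (unsigned int)(addr & 0x3) * 8;
	uint32_t mask = (uint32_t)0xffff << shift;
	uint32_t old32, new32;

	old32 = __atomic_load_n(ptr32, __ATOMIC_RELAXED);
	do {
		/* Our halfword no longer matches 'old': report what we saw. */
		if ((old32 & mask) != ((uint32_t)old << shift))
			break;
		/* Splice the new halfword into the surrounding word. */
		new32 = (old32 & ~mask) | ((uint32_t)new << shift);
	} while (!__atomic_compare_exchange_n(ptr32, &old32, new32, 1,
					      __ATOMIC_ACQ_REL, __ATOMIC_RELAXED));

	return (uint16_t)((old32 & mask) >> shift);
}

The xchg case (__xchg_small) is the same loop without the compare-and-branch:
it unconditionally splices the shifted value into the word and retries until
the store succeeds. The 8-bit case only changes the mask to 0xff.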