From: Helge Deller <deller@gmx.de>
To: John David Anglin <dave.anglin@bell.net>,
linux-parisc <linux-parisc@vger.kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Subject: Re: [PATCH v3] parisc: Fix spinlock barriers
Date: Sun, 19 Jul 2020 18:35:19 +0200 [thread overview]
Message-ID: <c100c26e-831e-6d65-ec2c-ba42b881f5c7@gmx.de> (raw)
In-Reply-To: <04485b95-01df-d100-cf3a-1944a69ded26@bell.net>
Hi Dave,
On 19.07.20 16:34, John David Anglin wrote:
> Stalls are quite frequent with recent kernels. When the stall is detected by rcu_sched, we
> get a backtrace similar to the following:
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 0-...!: (5998 ticks this GP) idle=3a6/1/0x4000000000000002 softirq=8356938/8356939 fqs=2
> (t=6000 jiffies g=8985785 q=391)
> rcu: rcu_sched kthread starved for 5992 jiffies! g8985785 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
> rcu: RCU grace-period kthread stack dump:
> rcu_sched R running task 0 10 2 0x00000000
> Backtrace:
>
> Task dump for CPU 0:
> collect2 R running task 0 16562 16561 0x00000014
> Backtrace:
> [<000000004017913c>] show_stack+0x44/0x60
> [<00000000401df694>] sched_show_task.part.77+0xf4/0x180
> [<00000000401e70e8>] dump_cpu_task+0x68/0x80
> [<0000000040230a58>] rcu_sched_clock_irq+0x708/0xae0
> [<0000000040237670>] update_process_times+0x58/0xb8
> [<00000000407dc39c>] timer_interrupt+0xa4/0x110
> [<000000004021af30>] __handle_irq_event_percpu+0xb8/0x228
> [<000000004021b0d4>] handle_irq_event_percpu+0x34/0x98
> [<00000000402225b8>] handle_percpu_irq+0xa8/0xe8
> [<000000004021a05c>] generic_handle_irq+0x54/0x70
> [<0000000040180340>] call_on_stack+0x18/0x24
> [<000000004017a63c>] execute_on_irq_stack+0x5c/0xa8
> [<000000004017b76c>] do_cpu_irq_mask+0x2dc/0x410
> [<000000004017f074>] intr_return+0x0/0xc
>
> However, this doesn't provide any information as to the cause. I enabled CONFIG_SOFTLOCKUP_DETECTOR
> and I caught the following stall:
>
> watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [cc1:22803]
> Modules linked in: dm_mod dax binfmt_misc ext4 crc16 jbd2 ext2 mbcache sg ipmi_watchdog ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler nfsd
> ip_tables x_tables ipv6 autofs4 xfs libcrc32c crc32c_generic raid10 raid1 raid0 multipath linear md_mod ses enclosure sd_mod scsi_transport_sas
> t10_pi sr_mod cdrom ata_generic uas usb_storage pata_cmd64x libata ohci_pci ehci_pci ohci_hcd sym53c8xx ehci_hcd scsi_transport_spi tg3 usbcore
> scsi_mod usb_common
> CPU: 0 PID: 22803 Comm: cc1 Not tainted 5.6.17+ #3
> Hardware name: 9000/800/rp3440
>
> YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 00001000000001001111111100001111 Not tainted
> r00-03 000000ff0804ff0f 0000000040891dc0 000000004037d1c4 000000006d5e8890
> r04-07 000000004086fdc0 0000000040ab31ac 000000000004e99a 0000000000001f20
> r08-11 0000000040b24710 000000006d5e8488 0000000040a1d280 000000006d5e89b0
> r12-15 000000006d5e88c4 00000001802c2cb8 000000003c812825 0000004122eb4d18
> r16-19 0000000040b26630 000000006d5e8898 000000000001d330 000000006d5e88c0
> r20-23 000000000800000f 0000000a0ad24270 b6683633143fce3c 0000004122eb4d54
> r24-27 000000006d5e88c4 000000006d5e8488 00000001802c2cb8 000000004086fdc0
> r28-31 0000004122d57b69 000000006d5e89b0 000000006d5e89e0 000000006d5e8000
> sr00-03 000000000c749000 0000000000000000 0000000000000000 000000000c749000
> sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>
> IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004037d414 000000004037d418
> IIR: 0e0010dc ISR: 00000041042d63f0 IOR: 000000004086fdc0
> CPU: 0 CR30: 000000006d5e8000 CR31: ffffffffffffefff
> ORIG_R28: 000000004086fdc0
> IAOQ[0]: d_alloc_parallel+0x384/0x688
> IAOQ[1]: d_alloc_parallel+0x388/0x688
> RP(r2): d_alloc_parallel+0x134/0x688
> Backtrace:
> [<000000004036974c>] __lookup_slow+0xa4/0x200
> [<0000000040369fc8>] walk_component+0x288/0x458
> [<000000004036a9a0>] path_lookupat+0x88/0x198
> [<000000004036e748>] filename_lookup+0xa0/0x168
> [<000000004036e95c>] user_path_at_empty+0x64/0x80
> [<000000004035d93c>] vfs_statx+0x104/0x158
> [<000000004035dfcc>] __do_sys_lstat64+0x44/0x80
> [<000000004035e5a0>] sys_lstat64+0x20/0x38
> [<0000000040180054>] syscall_exit+0x0/0x14
>
> The code was stuck in this loop in d_alloc_parallel:
>
> 4037d414: 0e 00 10 dc ldd 0(r16),ret0
> 4037d418: c7 fc 5f ed bb,< ret0,1f,4037d414 <d_alloc_parallel+0x384>
> 4037d41c: 08 00 02 40 nop
>
> This is the inner loop of bit_spin_lock which is called by hlist_bl_unlock in d_alloc_parallel:
>
> static inline void bit_spin_lock(int bitnum, unsigned long *addr)
> {
> /*
> * Assuming the lock is uncontended, this never enters
> * the body of the outer loop. If it is contended, then
> * within the inner loop a non-atomic test is used to
> * busywait with less bus contention for a good time to
> * attempt to acquire the lock bit.
> */
> preempt_disable();
> #if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
> while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
> preempt_enable();
> do {
> cpu_relax();
> } while (test_bit(bitnum, addr));
> preempt_disable();
> }
> #endif
> __acquire(bitlock);
> }
>
> test_and_set_bit_lock() looks like this:
>
> static inline int test_and_set_bit_lock(unsigned int nr,
> volatile unsigned long *p)
> {
> long old;
> unsigned long mask = BIT_MASK(nr);
>
> p += BIT_WORD(nr);
> if (READ_ONCE(*p) & mask)
> return 1;
>
> old = atomic_long_fetch_or_acquire(mask, (atomic_long_t *)p);
> return !!(old & mask);
> }
>
> After some more consideration, I realized that the previous patches were just affecting the timing
> of the spin lock routines and not fixing the real problem. The actual problem is not with the spin lock
> barriers but with the unlock code.
Ok.
> We enter the inner loop of bit_spin_lock() when the bit lock is held. We stall when the lock is never
> released. This led me to look at the locking in arch/parisc/include/asm/atomic.h. It turns out we are
> missing a define for atomic64_set_release(). The equivalent is present for 32-bit releases.
>
> The release for 64-bit bit operations needs to use atomic64_set() to prevent the loss of release
> operations when there is contention.
Good!!!
> In reviewing the atomic operations in entry.S, I realized that there is also a bug in the
> spin lock release code of the TLB handler. Space ids are 64 bits on 64-bit targets. So,
> using the least significant 32 bits to reset the spin lock is not safe. The lock will not
> be freed if the bits are all zero.
Hmm..
The space ids on 64-bit Linux are limited to (see arch/parisc/mm/init.c):
#define NR_SPACE_IDS 262144
and SID == 0 can't happen for userspace (it's blocked in the space_id[] bitmap).
So, I think this part was ok.
> @@ -467,10 +466,9 @@
> /* Release pa_tlb_lock lock without reloading lock address. */
> .macro tlb_unlock0 spc,tmp,tmp1
> #ifdef CONFIG_SMP
> + ldi 1,\tmp1
> 98: or,COND(=) %r0,\spc,%r0
> - LDCW 0(\tmp),\tmp1
> - or,COND(=) %r0,\spc,%r0
> - stw \spc,0(\tmp)
> + stw \tmp1,0(\tmp)
> 99: ALTERNATIVE(98b, 99b, ALT_COND_NO_SMP, INSN_NOP)
In tlb_lock() we only lock for non-kernel SIDs (!=0),
but now you unlock unconditionally.
> This patch modifies the code to unlock the lock in the TLB handler with the value one. Using
> one is consistent with the release value used in the spin lock code. It would also allow one to
> dirty the lock cache line with a stb to the most significant byte. This optimization may speed
> up the ldcw instruction on some machines.
>
> I removed the LDCW barriers from this code as I don't believe they are necessary. When the page
> is not present, nothing has been done other than to test the page present bit. The release is done
> after the TLB entry is updated. I believe the release store is strongly ordered and forces prior writes to complete.
Ok.
> This fixes the stall in building libpreludedb.
I wonder if the stall is still fixed if you omit your patch to pa_tlb_lock().
Thanks for your work on this!!!
Helge
>
> Signed-off-by: Dave Anglin <dave.anglin@bell.net>
> ---
>
> diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
> index 118953d41763..6dd4171c9530 100644
> --- a/arch/parisc/include/asm/atomic.h
> +++ b/arch/parisc/include/asm/atomic.h
> @@ -212,6 +212,8 @@ atomic64_set(atomic64_t *v, s64 i)
> _atomic_spin_unlock_irqrestore(v, flags);
> }
>
> +#define atomic64_set_release(v, i) atomic64_set((v), (i))
> +
> static __inline__ s64
> atomic64_read(const atomic64_t *v)
> {
> diff --git a/arch/parisc/kernel/entry.S b/arch/parisc/kernel/entry.S
> index 4b484ec7c7da..28a17e3c5383 100644
> --- a/arch/parisc/kernel/entry.S
> +++ b/arch/parisc/kernel/entry.S
> @@ -454,7 +454,6 @@
> nop
> LDREG 0(\ptp),\pte
> bb,<,n \pte,_PAGE_PRESENT_BIT,3f
> - LDCW 0(\tmp),\tmp1
> b \fault
> stw \spc,0(\tmp)
> 99: ALTERNATIVE(98b, 99b, ALT_COND_NO_SMP, INSN_NOP)
> @@ -467,10 +466,9 @@
> /* Release pa_tlb_lock lock without reloading lock address. */
> .macro tlb_unlock0 spc,tmp,tmp1
> #ifdef CONFIG_SMP
> + ldi 1,\tmp1
> 98: or,COND(=) %r0,\spc,%r0
> - LDCW 0(\tmp),\tmp1
> - or,COND(=) %r0,\spc,%r0
> - stw \spc,0(\tmp)
> + stw \tmp1,0(\tmp)
> 99: ALTERNATIVE(98b, 99b, ALT_COND_NO_SMP, INSN_NOP)
> #endif
> .endm
>
Thread overview: 3+ messages
2020-07-19 14:34 [PATCH v3] parisc: Fix spinlock barriers John David Anglin
2020-07-19 16:35 ` Helge Deller [this message]
2020-07-19 17:06 ` John David Anglin