Date: Wed, 26 Jul 2017 16:16:37 +0200 (CEST)
From: Thomas Gleixner
To: qiaozhou
Cc: John Stultz, sboyd@codeaurora.org, LKML, Wang Wilbur, Marc Zyngier,
    Will Deacon, Peter Zijlstra
Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
In-Reply-To: <3d2459c7-defd-a47e-6cea-007c10cecaac@asrmicro.com>
References: <3d2459c7-defd-a47e-6cea-007c10cecaac@asrmicro.com>

On Wed, 26 Jul 2017, qiaozhou wrote:

Cc'ed ARM folks.

> I want to ask for suggestions on how to fix a contention problem between
> expire_timers and try_to_del_timer_sync. Thanks in advance.
>
> The issue is a hard lockup detected on our platform (arm64, one cluster
> with 4 A53 cores and the other with 4 A73 cores). The sequence is as
> below:
>
> 1. core0 checks expired timers and wakes up a process on the A73
>    cluster, in step 1.3.
> 2. core4 starts to run that process and tries to delete the timer
>    synchronously, in step 2.1.
> 3. Before core0 can reach 1.4 and take the lock, core4 has already run
>    from 2.1 to 2.6 in its while loop. In step 2.4 it fails, since the
>    timer is still the running timer.
>
> core0 (A53):                             core4 (A73):
> run_timer_softirq
>  __run_timers
>   spin_lock_irq(&base->lock)
>   expire_timers()
>    1.1: base->running_timer = timer;
>    1.2: spin_unlock_irq(&base->lock);
>    1.3: call_timer_fn(timer, fn, data);
>                                          schedule()
>    1.4: spin_lock_irq(&base->lock);       del_timer_sync(&timer)
>    1.5: back to 1.1 in while loop          2.1: try_to_del_timer_sync()
>                                            2.2: get_timer_base()
>                                            2.3: spin_lock_irqsave(&base->lock, flags);
>                                            2.4: check "base->running_timer != timer"
>                                            2.5: spin_unlock_irqrestore(&base->lock, flags);
>                                            2.6: cpu_relax(); (back to 2.1 in while loop)

This is horribly formatted.

> Core0 runs at 832 MHz, and core4 runs at 1.8 GHz. The A73 is also a much
> more powerful design than the A53. So what actually happens is that the
> A73 keeps looping in spin_lock -> check running_timer -> spin_unlock ->
> spin_lock -> ..., while the A53 cannot even complete a single
> store-exclusive instruction, so it never gets far enough to enter WFE and
> wait for the lock release. It just keeps looping at +0x30 and +0x3c in
> the asm below, because the stxr keeps failing.
>
> So both cores loop forever: the A53 can't get out of the ldaxr/stxr loop,
> and the A73 can't get past the running_timer check. Eventually the hard
> lockup is triggered. (BTW, the issue takes a long time to reproduce.)
> <_raw_spin_lock_irq+0x2c>: prfm   pstl1strm, [x19]
> <_raw_spin_lock_irq+0x30>: ldaxr  w0, [x19]
> <_raw_spin_lock_irq+0x34>: add    w1, w0, #0x10, lsl #12
> <_raw_spin_lock_irq+0x38>: stxr   w2, w1, [x19]
> <_raw_spin_lock_irq+0x3c>: cbnz   w2, 0xffffff80089f52e0 <_raw_spin_lock_irq+0x30>
> <_raw_spin_lock_irq+0x40>: eor    w1, w0, w0, ror #16
> <_raw_spin_lock_irq+0x44>: cbz    w1, 0xffffff80089f530c <_raw_spin_lock_irq+0x5c>
> <_raw_spin_lock_irq+0x48>: sevl
> <_raw_spin_lock_irq+0x4c>: wfe
> <_raw_spin_lock_irq+0x50>: ldaxrh w2, [x19]
> <_raw_spin_lock_irq+0x54>: eor    w1, w2, w0, lsr #16
> <_raw_spin_lock_irq+0x58>: cbnz   w1, 0xffffff80089f52fc <_raw_spin_lock_irq+0x4c>
>
> The loop on the A53 has only 4 instructions, and the loop on the A73 has
> ~100 instructions. Still the A53 gets no chance to complete its
> store-exclusive. It may be related to the ldaxr/stxr cost, the core
> frequencies, the number of cores, etc.
>
> I have no idea how to fix this in the spinlock/unlock implementation, so
> I tried to fix it in the timer code. I prepared a raw patch, and I'm not
> sure it's the correct direction to solve this issue. Could you help with
> some suggestions? Thanks.
>
> From fb8fbfeb8f9f92fdadd9920ce234fd433bc883e1 Mon Sep 17 00:00:00 2001
> From: Qiao Zhou
> Date: Wed, 26 Jul 2017 20:30:33 +0800
> Subject: [PATCH] RFC: timers: try to fix contention between expire_timers
>  and del_timer_sync
>
> Try to fix the contention between expire_timers and del_timer_sync by
> adding a TIMER_WOKEN status, which means that the timer has already
> woken up the corresponding process.

Timers do a lot more things than waking up a process. Just because a wakeup
happens to be involved in your particular case does not mean that this is
generally true.

And that whole flag business is completely broken. There is a comment in
call_timer_fn() which says:

	/*
	 * It is permissible to free the timer from inside the
	 * function that is called from it ....

So you _CANNOT_ touch the timer after the function has returned. It might be
gone already.

For that particular timer case we can clear base->running_timer without the
lock held (see patch below), but this kind of lock -> test -> unlock -> retry
loop is all over the place in the kernel, so this is going to hurt you sooner
or later in some other place.

Thanks,

	tglx

8<------------
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
 		if (timer->flags & TIMER_IRQSAFE) {
 			raw_spin_unlock(&base->lock);
 			call_timer_fn(timer, fn, data);
+			base->running_timer = NULL;
 			raw_spin_lock(&base->lock);
 		} else {
 			raw_spin_unlock_irq(&base->lock);
 			call_timer_fn(timer, fn, data);
+			base->running_timer = NULL;
 			raw_spin_lock_irq(&base->lock);
 		}
 	}
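
For reference, the core4 side described in steps 2.1-2.6 above boils down to
the following lock -> test -> unlock -> retry shape. This is a condensed
sketch of try_to_del_timer_sync()/del_timer_sync(), not the verbatim upstream
source: the debug hooks and the migration-race handling done by
lock_timer_base() are omitted, and the step numbers from the sequence above
are noted in comments.

int try_to_del_timer_sync(struct timer_list *timer)
{
	struct timer_base *base;
	unsigned long flags;
	int ret = -1;

	base = get_timer_base(timer->flags);			/* 2.2 */
	raw_spin_lock_irqsave(&base->lock, flags);		/* 2.3 */

	/* Give up if the timer's callback is running on another CPU,
	 * otherwise detach it and report whether it was still pending. */
	if (base->running_timer != timer)			/* 2.4 */
		ret = detach_if_pending(timer, base, true);

	raw_spin_unlock_irqrestore(&base->lock, flags);		/* 2.5 */

	return ret;
}

int del_timer_sync(struct timer_list *timer)
{
	for (;;) {
		int ret = try_to_del_timer_sync(timer);

		if (ret >= 0)
			return ret;
		cpu_relax();					/* 2.6: retry */
	}
}

Every failed pass takes and releases base->lock, which is what lets the fast
A73 monopolize the lock word. With the hunk above, base->running_timer is
already NULL once call_timer_fn() has returned, so the check at 2.4 succeeds
on core4's next pass and the retry loop terminates without core0 first having
to win base->lock again at 1.4.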