* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() @ 2021-04-13 16:55 Maciej Żenczykowski 2021-04-13 17:14 ` Greg KH 0 siblings, 1 reply; 23+ messages in thread From: Maciej Żenczykowski @ 2021-04-13 16:55 UTC (permalink / raw) To: Ingo Molnar, Greg KH, Anna-Maria Behnsen Cc: linux-kernel, mikael.beckius, tglx, Lorenzo Colitti, Maciej Żenczykowski This patch (or at least the version of it that showed up in 5.10.24 LTS when combined with Android Common Kernel and some arm phone platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps down to ~2.2gbps) when running pktgen egressing via the ncm gadget on a SuperSpeed+ 10gbps USB3 connection. The regression is not there in 5.10.23, and is present in 5.10.24 and 5.10.26. Reverting just this one patch is confirmed to restore performance (both on 5.10.24 and 5.10.26). We don't know the cause, as we know nothing about hrtimers... but we lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not firing as often as it should... Unfortunately I have no idea how to replicate this on commonly available hardware (or with a pure stable or 5.11/5.12 Linus tree) since it requires a gadget-capable usb3.1 10gbps controller (which I only have access to in combination with a 5.10-based arm+dwc3 soc). (the regression is visible with just usb3.0, but it's smaller due to usb3.0 topping out at just under 4gbps, though it still drops to 2.2gbps -- and this still doesn't help since usb3 gadget-capable controllers are nearly as rare) - Maciej & Lorenzo ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-13 16:55 [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() Maciej Żenczykowski @ 2021-04-13 17:14 ` Greg KH 2021-04-14 2:49 ` Lorenzo Colitti 0 siblings, 1 reply; 23+ messages in thread From: Greg KH @ 2021-04-13 17:14 UTC (permalink / raw) To: Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, linux-kernel, mikael.beckius, tglx, Lorenzo Colitti, Maciej Żenczykowski On Tue, Apr 13, 2021 at 09:55:08AM -0700, Maciej Żenczykowski wrote: > This patch (or at least the version of it that showed up in 5.10.24 > LTS when combined with Android Common Kernel and some arm phone > platform drivers) causes a roughly 60% regression (from ~5.3-6 gbps > down to ~2.2gbps) on running pktgen when egressing via ncm gadget on a > SuperSpeed+ 10gbps USB3 connection. > > The regression is not there in 5.10.23, and is present in 5.10.24 and > 5.10.26. Reverting just this one patch is confirmed to restore > performance (both on 5.10.24 and 5.10.26). > > We don't know the cause, as we know nothing about hrtimers... but we > lightly suspect that the ncm->task_timer in f_ncm.c is perhaps not > firing as often as it should... > > Unfortunately I have no idea how to replicate this on commonly > available hardware (or with a pure stable or 5.11/5.12 Linus tree) > since it requires a gadget capable usb3.1 10gbps controller (which I > only have access to in combination with a 5.10-based arm+dwc3 soc). > > (the regression is visible with just usb3.0, but it's smaller due to > usb3.0 topping out at just under 4gbps, though it still drops to > 2.2gbps -- and this still doesn't help since usb3 gadget capable > controllers are nearly as rare) > To give context, the commit is now 46eb1701c046 ("hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()") and is attached below. 
The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should not be happening, we can easily fix it but I guess no one has actually had fast USB devices that noticed this until now :) thanks, greg k-h ---------- commit 46eb1701c046cc18c032fa68f3c8ccbf24483ee4 Author: Anna-Maria Behnsen <anna-maria@linutronix.de> Date: Tue Feb 23 17:02:40 2021 +0100 hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() hrtimer_force_reprogram() and hrtimer_interrupt() invokes __hrtimer_get_next_event() to find the earliest expiry time of hrtimer bases. __hrtimer_get_next_event() does not update cpu_base::[softirq_]_expires_next to preserve reprogramming logic. That needs to be done at the callsites. hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when the first expiring timer is a softirq timer and the soft interrupt is not activated. That's wrong because cpu_base::softirq_expires_next is left stale when the first expiring timer of all bases is a timer which expires in hard interrupt context. hrtimer_interrupt() does never update cpu_base::softirq_expires_next which is wrong too. That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that timer before the stale cpu_base::softirq_expires_next. cpu_base::softirq_expires_next is cached to make the check for raising the soft interrupt fast. In the above case the soft interrupt won't be raised until clock monotonic reaches the stale cpu_base::softirq_expires_next value. That's incorrect, but what's worse it that if the softirq timer becomes the first expiring timer of all clock bases after the hard expiry timer has been handled the reprogramming of the clockevent from hrtimer_interrupt() will result in an interrupt storm. 
That happens because the reprogramming does not use cpu_base::softirq_expires_next, it uses __hrtimer_get_next_event() which returns the actual expiry time. Once clock MONOTONIC reaches cpu_base::softirq_expires_next the soft interrupt is raised and the storm subsides. Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard bases seperately, update softirq_expires_next and handle the case when a soft expiring timer is the first of all bases by comparing the expiry times and updating the required cpu base fields. Split this functionality into a separate function to be able to use it in hrtimer_interrupt() as well without copy paste. Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers") Reported-by: Mikael Beckius <mikael.beckius@windriver.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mikael Beckius <mikael.beckius@windriver.com> Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210223160240.27518-1-anna-maria@linutronix.de diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 743c852e10f2..788b9d137de4 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -546,8 +546,11 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base, } /* - * Recomputes cpu_base::*next_timer and returns the earliest expires_next but - * does not set cpu_base::*expires_next, that is done by hrtimer_reprogram. + * Recomputes cpu_base::*next_timer and returns the earliest expires_next + * but does not set cpu_base::*expires_next, that is done by + * hrtimer[_force]_reprogram and hrtimer_interrupt only. When updating + * cpu_base::*expires_next right away, reprogramming logic would no longer + * work. 
* * When a softirq is pending, we can ignore the HRTIMER_ACTIVE_SOFT bases, * those timers will get run whenever the softirq gets handled, at the end of @@ -588,6 +591,37 @@ __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_ return expires_next; } +static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base) +{ + ktime_t expires_next, soft = KTIME_MAX; + + /* + * If the soft interrupt has already been activated, ignore the + * soft bases. They will be handled in the already raised soft + * interrupt. + */ + if (!cpu_base->softirq_activated) { + soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT); + /* + * Update the soft expiry time. clock_settime() might have + * affected it. + */ + cpu_base->softirq_expires_next = soft; + } + + expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD); + /* + * If a softirq timer is expiring first, update cpu_base->next_timer + * and program the hardware with the soft expiry time. + */ + if (expires_next > soft) { + cpu_base->next_timer = cpu_base->softirq_next_timer; + expires_next = soft; + } + + return expires_next; +} + static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base) { ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset; @@ -628,23 +662,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal) { ktime_t expires_next; - /* - * Find the current next expiration time. - */ - expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL); - - if (cpu_base->next_timer && cpu_base->next_timer->is_soft) { - /* - * When the softirq is activated, hrtimer has to be - * programmed with the first hard hrtimer because soft - * timer interrupt could occur too late. 
- */ - if (cpu_base->softirq_activated) - expires_next = __hrtimer_get_next_event(cpu_base, - HRTIMER_ACTIVE_HARD); - else - cpu_base->softirq_expires_next = expires_next; - } + expires_next = hrtimer_update_next_event(cpu_base); if (skip_equal && expires_next == cpu_base->expires_next) return; @@ -1644,8 +1662,8 @@ void hrtimer_interrupt(struct clock_event_device *dev) __hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD); - /* Reevaluate the clock bases for the next expiry */ - expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL); + /* Reevaluate the clock bases for the [soft] next expiry */ + expires_next = hrtimer_update_next_event(cpu_base); /* * Store the new expiry value so the migration code can verify * against it. ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-13 17:14 ` Greg KH @ 2021-04-14 2:49 ` Lorenzo Colitti 2021-04-15 16:47 ` Thomas Gleixner 0 siblings, 1 reply; 23+ messages in thread From: Lorenzo Colitti @ 2021-04-14 2:49 UTC (permalink / raw) To: Greg KH Cc: Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Thomas Gleixner, Maciej Żenczykowski On Wed, Apr 14, 2021 at 2:14 AM Greg KH <gregkh@linuxfoundation.org> wrote: > To give context, the commit is now 46eb1701c046 ("hrtimer: Update > softirq_expires_next correctly after __hrtimer_get_next_event()") and is > attached below. > > The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode > HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should > not be happening, we can easily fix it but I guess no one has actually > had fast USB devices that noticed this until now :) AIUI the purpose of this timer is to support packet aggregation. USB transfers are relatively expensive/high latency. 6 Gbps is 500k 1500-byte packets per second, or one every 2us. So f_ncm buffers as many packets as will fit into 16k (usually, 10 1500-byte packets), and only initiates a USB transfer when those packets have arrived. That ends up doing only one transfer every 20us. It sets a 300us timer to ensure that if the 10 packets haven't arrived, it still sends out whatever it has when the timer fires. The timer is set/updated on every packet buffered by ncm. Is this somehow much more expensive in 5.10.24 than it was before? Even if this driver is somehow "holding it wrong", might there not be other workloads that have a similar problem? What about regressions on those workloads? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-14 2:49 ` Lorenzo Colitti @ 2021-04-15 16:47 ` Thomas Gleixner 2021-04-20 3:12 ` Maciej Żenczykowski 2021-04-20 8:15 ` Lorenzo Colitti 0 siblings, 2 replies; 23+ messages in thread From: Thomas Gleixner @ 2021-04-15 16:47 UTC (permalink / raw) To: Lorenzo Colitti, Greg KH Cc: Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote: > On Wed, Apr 14, 2021 at 2:14 AM Greg KH <gregkh@linuxfoundation.org> wrote: >> To give context, the commit is now 46eb1701c046 ("hrtimer: Update >> softirq_expires_next correctly after __hrtimer_get_next_event()") and is >> attached below. >> >> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode >> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should >> not be happening, we can easily fix it but I guess no one has actually >> had fast USB devices that noticed this until now :) > > AIUI the purpose of this timer is to support packet aggregation. USB > transfers are relatively expensive/high latency. 6 Gbps is 500k > 1500-byte packets per second, or one every 2us. So f_ncm buffers as > many packets as will fit into 16k (usually, 10 1500-byte packets), and > only initiates a USB transfer when those packets have arrived. That > ends up doing only one transfer every 20us. It sets a 300us timer to > ensure that if the 10 packets haven't arrived, it still sends out > whatever it has when the timer fires. The timer is set/updated on > every packet buffered by ncm. > > Is this somehow much more expensive in 5.10.24 than it was before? > Even if this driver is somehow "holding it wrong", might there not be > other workloads that have a similar problem? What about regressions on > those workloads? Let's put the question of whether this hrtimer usage is sensible or not aside for now. 
I stared at the change for a while and did some experiments to recreate the problem, but that didn't get me anywhere. Could you please do the following? Enable tracing and enable the following tracepoints: timers/hrtimer_cancel timers/hrtimer_start timers/hrtimer_expire_entry irq/softirq_raise irq/softirq_enter irq/softirq_exit and function tracing filtered on ncm_wrap_ntb() and package_for_tx() only (to reduce the noise). Run the test on kernels with and without that commit and collect trace data for both. That should give me a pretty clear picture of what's going on. Thanks, tglx ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-15 16:47 ` Thomas Gleixner @ 2021-04-20 3:12 ` Maciej Żenczykowski 2021-04-20 6:44 ` Thomas Gleixner 2021-04-20 8:15 ` Lorenzo Colitti 1 sibling, 1 reply; 23+ messages in thread From: Maciej Żenczykowski @ 2021-04-20 3:12 UTC (permalink / raw) To: Thomas Gleixner Cc: Lorenzo Colitti, Greg KH, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: > > On Wed, Apr 14 2021 at 11:49, Lorenzo Colitti wrote: > > On Wed, Apr 14, 2021 at 2:14 AM Greg KH <gregkh@linuxfoundation.org> wrote: > >> To give context, the commit is now 46eb1701c046 ("hrtimer: Update > >> softirq_expires_next correctly after __hrtimer_get_next_event()") and is > >> attached below. > >> > >> The f_ncm.c driver is doing a lot of calls to hrtimer_start() with mode > >> HRTIMER_MODE_REL_SOFT for I think every packet it gets. If that should > >> not be happening, we can easily fix it but I guess no one has actually > >> had fast USB devices that noticed this until now :) > > > > AIUI the purpose of this timer is to support packet aggregation. USB > > transfers are relatively expensive/high latency. 6 Gbps is 500k > > 1500-byte packets per second, or one every 2us. So f_ncm buffers as > > many packets as will fit into 16k (usually, 10 1500-byte packets), and > > only initiates a USB transfer when those packets have arrived. That > > ends up doing only one transfer every 20us. It sets a 300us timer to > > ensure that if the 10 packets haven't arrived, it still sends out > > whatever it has when the timer fires. The timer is set/updated on > > every packet buffered by ncm. > > > > Is this somehow much more expensive in 5.10.24 than it was before? > > Even if this driver is somehow "holding it wrong", might there not be > > other workloads that have a similar problem? What about regressions on > > those workloads? 
> > Let's put the question of whether this hrtimer usage is sensible or not > aside for now. > > I stared at the change for a while and did some experiments to recreate > the problem, but that didn't get me anywhere. > > Could you please do the following? > > Enable tracing and enable the following tracepoints: > > timers/hrtimer_cancel > timers/hrtimer_start > timers/hrtimer_expire_entry > irq/softirq_raise > irq/softirq_enter > irq/softirq_exit > > and function tracing filtered on ncm_wrap_ntb() and > package_for_tx() only (to reduce the noise). > > Run the test on a kernels with and without that commit and collect trace > data for both. > > That should give me a pretty clear picture what's going on. Lorenzo is trying to get the traces you asked for, or rather he’s trying to get confirmation he can post them. Our initial observation of these results seems to suggest that updating the timer (hrtimer_start, which seems to also call hrtimer_cancel) takes twice as long as it used to. My gut feeling is that softirq_activated is usually false, and the old code in such a case calls just __hrtimer_get_next_event(, HRTIMER_ACTIVE_ALL). While the new code will first call __hrtimer_get_next_event(, HRTIMER_ACTIVE_SOFT) and then __hrtimer_get_next_event(, HRTIMER_ACTIVE_HARD) Perhaps __hrtimer_get_next_event() should return both soft and hard event times in one function call? Or perhaps hrtimer_start should not call hrtimer_cancel? We've also been looking at the f_ncm driver itself, and trying to reduce the number of hrtimer_cancel/start calls. It's pretty easy to reduce this by a factor of 10x (we’re not yet 100% certain this is race free), which does drastically improve performance. However, it still only takes the regression from 60% to 20%. - Maciej ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-20 3:12 ` Maciej Żenczykowski @ 2021-04-20 6:44 ` Thomas Gleixner 0 siblings, 0 replies; 23+ messages in thread From: Thomas Gleixner @ 2021-04-20 6:44 UTC (permalink / raw) To: Maciej Żenczykowski Cc: Lorenzo Colitti, Greg KH, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius On Mon, Apr 19 2021 at 20:12, Maciej Żenczykowski wrote: > On Thu, Apr 15, 2021 at 9:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Run the test on kernels with and without that commit and collect trace >> data for both. >> >> That should give me a pretty clear picture of what's going on. > > Lorenzo is trying to get the traces you asked for, or rather he’s > trying to get confirmation he can post them. > > Our initial observation of these results seems to suggest that > updating the timer (hrtimer_start, which seems to also call > hrtimer_cancel) takes twice as long as it used to. Which contradicts my measurements. The change in complexity is marginal and the overhead in cycles/instructions is close to noise. It's measurable in a microbenchmark, but it's in the < 1% range which is far away from the 60% you are seeing. > My gut feeling is that softirq_activated is usually false, and the old > code in such a case calls just __hrtimer_get_next_event(, > HRTIMER_ACTIVE_ALL). While the new code will first call > __hrtimer_get_next_event(, HRTIMER_ACTIVE_SOFT) and then > __hrtimer_get_next_event(, HRTIMER_ACTIVE_HARD) > > Perhaps __hrtimer_get_next_event() should return both soft and hard > event times in one function call? > Or perhaps hrtimer_start should not call hrtimer_cancel? Perhaps we do a proper analysis first :) Thanks, tglx ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-15 16:47 ` Thomas Gleixner 2021-04-20 3:12 ` Maciej Żenczykowski @ 2021-04-20 8:15 ` Lorenzo Colitti 2021-04-20 14:19 ` Thomas Gleixner 1 sibling, 1 reply; 23+ messages in thread From: Lorenzo Colitti @ 2021-04-20 8:15 UTC (permalink / raw) To: Thomas Gleixner Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: > Enable tracing and enable the following tracepoints: > [...] Sorry for the delay. I had to learn a bit about how to use the tracing infrastructure. I don't know if I can post here, but to my untrained eye, one big difference between the old (fast) code and the new (slow) code is that the new code calls tick_program_event() much more. It looks like that makes most of the difference. With the old code, hrtimer_start_range_ns almost never calls tick_program_event at all, but the new code seems to call it twice on every timer update. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-20 8:15 ` Lorenzo Colitti @ 2021-04-20 14:19 ` Thomas Gleixner 2021-04-21 14:08 ` Lorenzo Colitti 2021-04-26 8:49 ` [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner 0 siblings, 2 replies; 23+ messages in thread From: Thomas Gleixner @ 2021-04-20 14:19 UTC (permalink / raw) To: Lorenzo Colitti Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Tue, Apr 20 2021 at 17:15, Lorenzo Colitti wrote: > On Fri, Apr 16, 2021 at 1:47 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Enable tracing and enable the following tracepoints: >> [...] > > Sorry for the delay. I had to learn a bit about how to use the tracing > infrastructure. I don't know if I can post here, but to my untrained > eye, one big difference between the old (fast) code and the new (slow) > code is that the new code calls tick_program_event() much more. It > looks like that makes most of the difference. > > With the old code, hrtimer_start_range_ns almost never calls > tick_program_event at all, but the new code seems to call it twice on > every timer update. Yes, I figured out why that happens by now, but the old behaviour was just incorrect. So now there are clearly two issues: 1) hrtimer is contrary to timer_list not really suited for high frequency start/cancel/start/... cycles of a timer. It's optimized for the start and expire precisely case. 2) The cost for reprogramming depends on the underlying hardware. With halfways sane timer hardware it's minimal and as I measured in a micro benchmark in the 1% range. Now when your system ends up having one of the real timer trainwrecks which can be found in drivers/clocksource/ which are even worse than the x86 HPET horror, then it's going to be painful and a performance issue. I assume that's an ARM64 system. 
ARM64 CPUs have an architected per CPU timer where the reprogramming is pretty fast as it's next to the CPU, but who knows what your system is using. Now in the meantime I looked into __hrtimer_start_range_ns() whether that double reprogram can be avoided without creating a total trainwreck and imposing penalty on all sensible use cases. Completely untested patch below should do the trick and it's not ugly enough that I hate it with a passion. Even if that makes your problem go away #1 is still beyond suboptimal and #2 is something you really want to look into. Thanks, tglx --- --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti * remove hrtimer, called with base lock held */ static inline int -remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart) +remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, + bool restart, bool keep_local) { u8 state = timer->state; if (state & HRTIMER_STATE_ENQUEUED) { - int reprogram; + bool reprogram; /* * Remove the timer and force reprogramming when high @@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st debug_deactivate(timer); reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases); + /* + * If the timer is not restarted then reprogramming is + * required if the timer is local. If it is local and about + * to be restarted, avoid programming it twice (on removal + * and a moment later when it's requeued). 
+ */ if (!restart) state = HRTIMER_STATE_INACTIVE; + else + reprogram &= !keep_local; __remove_hrtimer(timer, base, state, reprogram); return 1; @@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru struct hrtimer_clock_base *base) { struct hrtimer_clock_base *new_base; + bool force_local, first; - /* Remove an active timer from the queue: */ - remove_hrtimer(timer, base, true); + /* + * If the timer is on the local cpu base and is the first expiring + * timer then this might end up reprogramming the hardware twice + * (on removal and on enqueue). To avoid that by prevent the + * reprogram on removal, keep the timer local to the current CPU + * and enforce reprogramming after it is queued no matter whether + * it is the new first expiring timer again or not. + */ + force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases); + force_local &= base->cpu_base->next_timer == timer; + + /* + * Remove an active timer from the queue. In case it is not queued + * on the current CPU, make sure that remove_hrtimer() updates the + * remote data correctly. + * + * If it's on the current CPU and the first expiring timer, then + * skip reprogramming, keep the timer local and enforce + * reprogramming later if it was the first expiring timer. This + * avoids programming the underlying clock event twice (once at + * removal and once after enqueue). 
+ */ + remove_hrtimer(timer, base, true, force_local); if (mode & HRTIMER_MODE_REL) tim = ktime_add_safe(tim, base->get_time()); @@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(stru hrtimer_set_expires_range_ns(timer, tim, delta_ns); /* Switch the timer base, if necessary: */ - new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED); + if (!force_local) { + new_base = switch_hrtimer_base(timer, base, + mode & HRTIMER_MODE_PINNED); + } else { + new_base = base; + } - return enqueue_hrtimer(timer, new_base, mode); + first = enqueue_hrtimer(timer, new_base, mode); + if (!force_local) + return first; + + /* + * Timer was forced to stay on the current CPU to avoid + * reprogramming on removal and enqueue. Force reprogram the + * hardware by evaluating the new first expiring timer. + */ + hrtimer_force_reprogram(new_base->cpu_base, 1); + return 0; } /** @@ -1183,7 +1229,7 @@ int hrtimer_try_to_cancel(struct hrtimer base = lock_hrtimer_base(timer, &flags); if (!hrtimer_callback_running(timer)) - ret = remove_hrtimer(timer, base, false); + ret = remove_hrtimer(timer, base, false, false); unlock_hrtimer_base(timer, &flags); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-20 14:19 ` Thomas Gleixner @ 2021-04-21 14:08 ` Lorenzo Colitti 2021-04-21 14:40 ` Lorenzo Colitti 2021-04-22 0:08 ` Thomas Gleixner 2021-04-26 8:49 ` [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner 1 sibling, 2 replies; 23+ messages in thread From: Lorenzo Colitti @ 2021-04-21 14:08 UTC (permalink / raw) To: Thomas Gleixner Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Tue, Apr 20, 2021 at 11:19 PM Thomas Gleixner <tglx@linutronix.de> wrote: > 1) hrtimer is contrary to timer_list not really suited for high > frequency start/cancel/start/... cycles of a timer. It's optimized > for the start and expire precisely case. Ack. This is not what the f_ncm gadget code needs. It just needs a timer to fire "about 300us after the last packet submitted". Under load the timer will almost never fire since there will always be another packet coming. So the speed of adding/updating the timer is much more important than the accuracy. We will try to move it to timer_list. > I assume that's an ARM64 system. ARM64 CPUs have an architected per > CPU timer where the reprogramming is pretty fast as it's next to > the CPU, but who knows what your system is using. This system appears to be using timer hardware that is... slow to program (microseconds). We're looking at whether it's possible to use the arch timer instead. > Now in the meantime I looked into __hrtimer_start_range_ns() whether > that double reprogram can be avoided without creating a total trainwreck > and imposing penalty on all sensible use cases. Completely untested > patch below should do the trick and it's not ugly enough that I hate it > with a passion. 
I tested this and in my simple benchmark, the double calls are gone and hrtimer_start calls tick_program_event approximately once (more precisely, 90.06% of the time; sometimes it doesn't call it at all). This is not enough to cancel the regression we're seeing because the previous code would pretty much never call tick_program_event at all. But it's definitely better. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-21 14:08 ` Lorenzo Colitti @ 2021-04-21 14:40 ` Lorenzo Colitti 2021-04-21 15:22 ` Greg KH 2021-04-22 0:08 ` Thomas Gleixner 1 sibling, 1 reply; 23+ messages in thread From: Lorenzo Colitti @ 2021-04-21 14:40 UTC (permalink / raw) To: Thomas Gleixner Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Wed, Apr 21, 2021 at 11:08 PM Lorenzo Colitti <lorenzo@google.com> wrote: > load the timer will almost never fire since there will always be > another packet coming. So the speed of adding/updating the timer is > much more important than the accuracy. We will try to move it to > timer_list. ... but actually, that's not acceptable, because AFAICS timer_list counts in jiffies so its resolution might be as low as 10ms. It's not acceptable to delay network traffic by up to 10ms under load. Not sure if there is another timer API that can be used here. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-21 14:40 ` Lorenzo Colitti @ 2021-04-21 15:22 ` Greg KH 0 siblings, 0 replies; 23+ messages in thread From: Greg KH @ 2021-04-21 15:22 UTC (permalink / raw) To: Lorenzo Colitti Cc: Thomas Gleixner, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Wed, Apr 21, 2021 at 11:40:54PM +0900, Lorenzo Colitti wrote: > On Wed, Apr 21, 2021 at 11:08 PM Lorenzo Colitti <lorenzo@google.com> wrote: > > load the timer will almost never fire since there will always be > > another packet coming. So the speed of adding/updating the timer is > > much more important than the accuracy. We will try to move it to > > timer_list. > > ... but actually, that's not acceptable, because AFAICS timer_list > counts in jiffies so its resolution might be as low as 10ms. It's not > acceptable to delay network traffic by up to 10ms under load. Not sure > if there is another timer API that can be used here. What do other network drivers do? This has to be a common problem, why is this USB driver somehow unique here? thanks, greg k-h ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-21 14:08 ` Lorenzo Colitti 2021-04-21 14:40 ` Lorenzo Colitti @ 2021-04-22 0:08 ` Thomas Gleixner 2021-04-22 10:07 ` Thomas Gleixner 2021-04-22 14:20 ` Lorenzo Colitti 1 sibling, 2 replies; 23+ messages in thread From: Thomas Gleixner @ 2021-04-22 0:08 UTC (permalink / raw) To: Lorenzo Colitti Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Wed, Apr 21 2021 at 23:08, Lorenzo Colitti wrote: > On Tue, Apr 20, 2021 at 11:19 PM Thomas Gleixner <tglx@linutronix.de> wrote: >> 1) hrtimer is contrary to timer_list not really suited for high >> frequency start/cancel/start/... cycles of a timer. It's optimized >> for the start and expire precisely case. > > Ack. This is not what the f_ncm gadget code needs. It just needs a > timer to fire "about 300us after the last packet submitted". Under > load the timer will almost never fire since there will always be > another packet coming. So the speed of adding/updating the timer is > much more important than the accuracy. We will try to move it to > timer_list. > >> I assume that's an ARM64 system. ARM64 CPUs have an architected per >> CPU timer where the reprogramming is pretty fast as it's next to >> the CPU, but who knows what your system is using. > > This system appears to be using timer hardware that is... slow to > program (microseconds). We're looking at whether it's possible to use > the arch timer instead. > >> Now in the meantime I looked into __hrtimer_start_range_ns() whether >> that double reprogram can be avoided without creating a total trainwreck >> and imposing penalty on all sensible use cases. Completely untested >> patch below should do the trick and it's not ugly enough that I hate it >> with a passion. 
> I tested this and in my simple benchmark, the double calls are gone
> and hrtimer_start calls tick_program_event approximately once (more
> precisely, 90.06% of the time; sometimes it doesn't call it at all).
> This is not enough to cancel the regression we're seeing because the
> previous code would pretty much never call tick_program_event at all.
> But it's definitely better.

It's hardly a regression when the behaviour relies on a bug in the core
code which should have never been there in the first place.

If you look at what you are doing it's obvious that reprogramming has to
take place quite often. Your timeout is relative 300us and it's moved
every N microseconds ahead. So if there is not much other hrtimer
activity on that CPU, which is often the case, and the only other
relevant hrtimer which is armed (and expires) is the tick, then it
depends on the HZ setting how often reprogramming is required.

Let's assume HZ=1000 and a rearm period of 20us, which means the timer
is rearmed 50 times per tick, and reprogramming is needed for each rearm
until the tick timer becomes the first expiring timer, i.e. up to 300us
before the tick. So the hardware needs to be updated 700/20 = 35 times
per 1ms, because the first expiry time moves 20us farther into the
future on every update. That's 70%. So depending on how long that takes
on your machine, 35 updates can be a significant amount of time.

With HZ=250 that's 200 rearms per tick and the number of reprograms is
then 3700/20 = 185, which is 92.5% -- close enough to the 90% you are
seeing...

Now, if your timer hardware needs several microseconds to be programmed
then it's obvious that you lose a lot of CPU time.

Just for comparison: in a VM I'm experimenting with right now the
reprogramming time is ~500ns, which is still a lot of cycles, but
compared to 5us faster by an order of magnitude. And on a sane bare
metal machine it's way faster and therefore less noticeable.
The use case still sucks, but that's not a problem of hrtimers as they are not designed for this use case. Now looking at the usecase. The comment above the hrtimer callback, which is the only hint there, says: * The transmit should only be run if no skb data has been sent for a * certain duration. which is useless word salad. So looking at the commit which added that muck: 6d3865f9d41f ("usb: gadget: NCM: Add transmit multi-frame."). The changelog mutters: It has a time out of 300ms to ensure that smaller number of packets still maintain a normal latency. while the timeout is defined as: +#define TX_TIMEOUT_NSECS 300000 which is clearly 300us and not 300ms. So much for useful information in changelogs. But at least the rest of the commit message gives a clue why this is useful: This adds multi-frame support to the NCM NTB's for the gadget driver. This allows multiple network packets to be put inside a single USB NTB with a maximum size of 16kB. So the timer ensures that for batching xmit packets into a single NTB the maximum delay is 300us if the NTB is not filled up on time. Though there is ZERO explanation why 300us is a sensible number and why it's a brilliant idea to use a hrtimer which is clearly documented to be not suitable for this. I'm pretty sure that the original author did not develop that on hardware which used a timer based on a hardware trainwreck, but alone the fact that: - rbtree reshuffling _is_ always required - high frequency reprogramming of the timer hardware might be required - in case that CONFIG_HIGH_RES_TIMERS is off or not enabled at runtime for whatever reason this is still subject to the CONFIG_HZ setting tells me that this is clearly well thought out and based on "works for me" engineering. Plus at the time when this was hacked up, the bug which was fixed by the commit you refer to, did not exist yet as it was introduced with the softirq based expiry changes years later. 
Back then the timer expiry was scheduling a tasklet to ensure that
->ndo_start_xmit() is called from softirq context. The softirq based
expiry changes which introduced the correctness bug removed that tasklet
indirection. And going back to 4.14 yields the proof of this: the same
microbenchmark as described above, i.e. modifying the timer every 20us,
has exactly the same effect vs. reprogramming the clock event device:
it's programmed twice on every hrtimer_start() ~90% of the time.

So no. The real regression was introduced with:

  5da70160462e ("hrtimer: Implement support for softirq based hrtimers")

and related changes, but that did not have a negative effect on that USB
driver usage; quite the contrary, it papered over it. Commit 46eb1701c0
fixed that regression and thereby unearthed the issue with that USB
driver usage in combination with that timer hardware.

If 5da70160462e et al. had not introduced that bug then the problems
with that hrtimer usage in combination with that timer hardware
trainwreck would have been noticed way earlier. IOW, it would never have
worked at all.

Aside from that, any use case which does not use the _SOFT variant of
hrtimers was not affected by 5da70160462e et al. Which in turn means
that all !_SOFT use cases are not abusing hrtimers and NONE of the _SOFT
use cases cares about correctness and accuracy at all.

So pretty please stop claiming that this is a regression. It was broken
forever, but papered over and therefore just not noticed. IOW, the
problem was introduced with 6d3865f9d41f, i.e. 6+ years ago. The reasons
why this pops up now are:

 - There was an unnoticed bug since 5da70160462e in the hrtimers code
   for 3 years.

 - It's only observable on hardware which has abysmal per-CPU timers.

TBH, I'm telling HW people for 15+ years now that clockevents and
clocksources are performance critical and relevant for correctness, but
obviously they don't listen.
Otherwise we would not have code in tree which has comments like this:

  hpet_clkevt_set_next_event(...)
	 * HPETs are a complete disaster. The compare register is

or any other clock event which is compare equal based and has to do
read-back based interrupt loss prevention like pxa_osmr0_set_next_event()
and other trainwrecks. Along with the insanities of watchdogs to monitor
correctness and clocksources/events which stop working when entering
deep power states. Not to talk about gems like this:

  exynos4_mct_write()
	/* Wait maximum 1 ms until written values are applied */

along with all the magic patches which make this the preferred device
for the very wrong reasons.

I have no idea which particular trainwreck you are dealing with because
you didn't tell, but TBH I don't want to know at all. It's probably just
another incarnation of broken which is not covered by the above, but
fits nicely.

That said, even if you manage to avoid the timer hardware trainwreck on
which your system depends right now, the underlying problem still
remains that hrtimers are not the proper tool for high frequency
modification and never will be.

There are smarter approaches to that USB/NTB problem, but that's beyond
the discussion at hand.

Thanks, tglx

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-22 0:08 ` Thomas Gleixner @ 2021-04-22 10:07 ` Thomas Gleixner 2021-04-22 14:20 ` Lorenzo Colitti 1 sibling, 0 replies; 23+ messages in thread From: Thomas Gleixner @ 2021-04-22 10:07 UTC (permalink / raw) To: Lorenzo Colitti Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon

On Thu, Apr 22 2021 at 02:08, Thomas Gleixner wrote:
> On Wed, Apr 21 2021 at 23:08, Lorenzo Colitti wrote:
> That said, even if you manage to avoid the timer hardware trainwreck on
> which your system depends right now, then still the underlying problem
> remains that hrtimers are not the proper tool for high frequency
> modification and never will be.
>
> There are smarter approaches to that USB/NTB problem, but that's beyond
> the discussion at hand.

And because I was looking at it anyway, here are some hints:

  Buffer size: 16k
  Timeout:     300us

Let's assume small packets with a size of 256 bytes each. The packets
arrive with a period of ~299us, i.e. right before the timer fires, which
means the timer won't fire because it's moved 300us ahead each time, and
the buffer is sent to the wire only when it's full. That means the first
packet in the buffer is sent 63 * 299us = ~18.8ms after it was queued in
the buffer.

How is that supposed to be correct? I know, throughput is what matters
and correctness is overrated.

For the high volume traffic case the timer is not needed at all if the
logic is done properly:

  xmit()
  {
	if (!packet_fits_in_buffer())
		send_buffer_and_cancel_timer_if_armed();

	if (first_packet_in_buffer())
		buffer_deadline = now() + MAX_DELAY;

	add_packet_to_buffer();

	/* Are more packets available for xmit at the core? */
	if (!xmit_queue_empty())
		return;

	/*
	 * Last packet from network core for now, check how long the
	 * first packet sits in the buffer.
	 */
	if (buffer_deadline <= now()) {
		send_buffer_and_cancel_timer_if_armed();
		return;
	}

	if (!timer_armed())
		arm_timer(buffer_deadline);
  }

IOW, as long as the network core has packets outstanding for xmit, there
is no point in arming the timer at all. But yes, that needs more work
and thought than the 'bolt and duct tape' approach which led to the
current situation.

Aside from that it would certainly be interesting to start very simple
with an experiment which does not use a timer in the first place:

  xmit()
  {
	if (!packet_fits_in_buffer())
		send_buffer();

	add_packet_to_buffer();

	/* Are more packets available for xmit at the core? */
	if (!xmit_queue_empty())
		return;

	/* Last packet from network core for now. Drain it. */
	send_buffer();
  }

I wouldn't be surprised if that'd turn out to be the right thing to do
both latency-wise and throughput-wise. But that'd be not convoluted
enough, and of course the commit which introduced that magic does not
explain anything at all, so I can't tell if that was actually tried or
not.

Thanks, tglx

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-22 0:08 ` Thomas Gleixner 2021-04-22 10:07 ` Thomas Gleixner @ 2021-04-22 14:20 ` Lorenzo Colitti 2021-04-22 15:35 ` Thomas Gleixner 1 sibling, 1 reply; 23+ messages in thread From: Lorenzo Colitti @ 2021-04-22 14:20 UTC (permalink / raw) To: Thomas Gleixner Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon

On Thu, Apr 22, 2021 at 9:08 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> Just for comparison. In a VM I'm experimenting with right now the
> reprogramming time is ~500ns which is still a lot of cycles, but
> compared to 5us faster by an order of magnitude. And on the sane
> machine bare metal its way faster and therefore less noticeable.

FWIW, on this hardware, ftrace says that arming the arm64 architected
timer takes 0.7us. Definitely better than 2-3us, but still not free.
This is not a high-end desktop or server, but it's also not super slow,
low-power hardware.

> * The transmit should only be run if no skb data has been sent for a
> * certain duration.
>
> which is useless word salad.

You're the one who wrote that comment - see b1a31a5f5f27. You'll
forgive me for being amused. :-)

Thanks for the history/analysis/suggestions. I think it's a fact that
this is a regression in performance: this particular code has performed
well for a couple of years now. The fact that the good performance only
existed due to a correctness bug in the hrtimer code definitely does
make it harder to argue that the regression should be reverted.

That said: if you have a fix for the double reprogram, then that fix
should probably be applied? 0.5us is not free, and even if hrtimers
aren't designed for frequent updates, touching the hardware twice as
often does seem like a bad idea, since, as you point out, there's a
*lot* of hardware that is slow.

Separately, we're also going to look at making ncm better.
(In defense of the original author, in 2014 I don't think anyone would even have dreamed of USB being fast enough for this to be a problem.) The first thing we're going to try to do is set the timer once per NTB instead of once per packet (so, 10x less). My initial attempt to do that causes the link to die after a while and I need to figure out why before I can send a patch up. I'm suspicious of the threading, which uses non-atomic variables (timer_force_tx, ncm->skb_tx_data) to synchronize control flow between the timer and the transmit function, which can presumably run on different CPUs. That seems wrong since either core could observe stale variables. But perhaps there are memory barriers that I'm not aware of. The idea of getting rid of the timer by doing aggregation based on transmit queue lengths seems like a much larger effort, but probably one that is required to actually improve performance substantially beyond what it is now. ^ permalink raw reply [flat|nested] 23+ messages in thread
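[The stale-variable worry in the last message is the classic shared-flag problem between two CPUs. As a hedged illustration only (the names below are hypothetical stand-ins, not the actual f_ncm fields), C11 atomics provide the cross-CPU visibility that plain variables do not guarantee:]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a flag like ncm->skb_tx_data. */
static atomic_bool tx_pending;

/* xmit path: may run on one CPU. */
static void note_packet_queued(void)
{
	/* ... packet added to the pending NTB before this point ... */
	atomic_store_explicit(&tx_pending, true, memory_order_release);
}

/* timer callback: may run on another CPU. */
static bool timer_should_transmit(void)
{
	/*
	 * The acquire side pairs with the release store above: if the
	 * flag is observed, the packet data written before it is too.
	 * exchange also clears the flag atomically, so a concurrent
	 * set cannot be lost between a separate read and write.
	 */
	return atomic_exchange_explicit(&tx_pending, false,
					memory_order_acq_rel);
}
```

In the kernel the equivalent roles would be played by READ_ONCE/WRITE_ONCE plus memory barriers, or more simply by taking the driver's existing spinlock in both the timer and xmit paths.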
* Re: [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() 2021-04-22 14:20 ` Lorenzo Colitti @ 2021-04-22 15:35 ` Thomas Gleixner 0 siblings, 0 replies; 23+ messages in thread From: Thomas Gleixner @ 2021-04-22 15:35 UTC (permalink / raw) To: Lorenzo Colitti Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Thu, Apr 22 2021 at 23:20, Lorenzo Colitti wrote: > On Thu, Apr 22, 2021 at 9:08 AM Thomas Gleixner <tglx@linutronix.de> wrote: >> Just for comparison. In a VM I'm experimenting with right now the >> reprogramming time is ~500ns which is still a lot of cycles, but >> compared to 5us faster by an order of magnitude. And on the sane >> machine bare metal its way faster and therefore less noticeable. > > FWIW, on this hardware, frtrace says that arming the arm64 architected > timer takes 0.7us. Definitely better than 2-3us, but still not free. > This is not a high-end desktop or server, but it's also not super > slow, low-power hardware. > >> * The transmit should only be run if no skb data has been sent for a >> * certain duration. >> >> which is useless word salad. > > You're the one who wrote that comment - see b1a31a5f5f27. You'll > forgive me for being amused. :-) Rightfully so! I still call it word salat :) > Thanks for the history/analysis/suggestions. I think it's a fact that > this is a regression in performance: this particular code has > performed well for a couple of years now. The fact that the good > performance only existed due to a correctness bug in the hrtimer code > definitely does make it harder to argue that the regression should be > reverted. We tend to disagree about the naming conventions here, but we seem at least to agree that reverting a fix for a correctness bug (which has way worse implications than slowing down a gruesome driver) is not going to happen. 
> That said: if you have a fix for the double reprogram, then that fix
> should probably be applied? 0.5us is not free, and even if hrtimers
> aren't designed for frequent updates, touching the hardware twice as
> often does seem like a bad idea, since, as you point out, there's a
> *lot* of hardware that is slow.

That's an obvious improvement, but not a fix. And I checked quite some
hrtimer users and there are only a few which ever rearm a queued timer,
and that happens infrequently.

> Separately, we're also going to look at making ncm better. (In defense
> of the original author, in 2014 I don't think anyone would even have
> dreamed of USB being fast enough for this to be a problem.) The first
> thing we're going to try to do is set the timer once per NTB instead
> of once per packet (so, 10x less). My initial attempt to do that
> causes the link to die after a while and I need to figure out why
> before I can send a patch up. I'm suspicious of the threading, which
> uses non-atomic variables (timer_force_tx, ncm->skb_tx_data) to
> synchronize control flow between the timer and the transmit function,
> which can presumably run on different CPUs. That seems wrong since
> either core could observe stale variables. But perhaps there are
> memory barriers that I'm not aware of.

Not that I see any.

> The idea of getting rid of the timer by doing aggregation based on
> transmit queue lengths seems like a much larger effort, but probably
> one that is required to actually improve performance substantially
> beyond what it is now.

I don't think it's a huge effort. netdev_xmit_more() should tell you
what you need to know.

Thanks, tglx

^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() 2021-04-20 14:19 ` Thomas Gleixner 2021-04-21 14:08 ` Lorenzo Colitti @ 2021-04-26 8:49 ` Thomas Gleixner 2021-04-26 9:40 ` Peter Zijlstra 1 sibling, 1 reply; 23+ messages in thread From: Thomas Gleixner @ 2021-04-26 8:49 UTC (permalink / raw) To: Lorenzo Colitti Cc: Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon, Peter Zijlstra

If __hrtimer_start_range_ns() is invoked with an already armed hrtimer
then the timer has to be canceled first and then added back. If the
timer is the first expiring timer then on removal the clockevent device
is reprogrammed to the next expiring timer to avoid that the pending
expiry fires needlessly.

If the new expiry time ends up being the first expiry again then the
clock event device has to be reprogrammed again.

Avoid this by checking whether the timer is the first to expire and in
that case, keep the timer on the current CPU and delay the reprogramming
up to the point where the timer has been enqueued again.
Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- kernel/time/hrtimer.c | 60 ++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 53 insertions(+), 7 deletions(-) --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti * remove hrtimer, called with base lock held */ static inline int -remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart) +remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, + bool restart, bool keep_local) { u8 state = timer->state; if (state & HRTIMER_STATE_ENQUEUED) { - int reprogram; + bool reprogram; /* * Remove the timer and force reprogramming when high @@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st debug_deactivate(timer); reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases); + /* + * If the timer is not restarted then reprogramming is + * required if the timer is local. If it is local and about + * to be restarted, avoid programming it twice (on removal + * and a moment later when it's requeued). + */ if (!restart) state = HRTIMER_STATE_INACTIVE; + else + reprogram &= !keep_local; __remove_hrtimer(timer, base, state, reprogram); return 1; @@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru struct hrtimer_clock_base *base) { struct hrtimer_clock_base *new_base; + bool force_local, first; - /* Remove an active timer from the queue: */ - remove_hrtimer(timer, base, true); + /* + * If the timer is on the local cpu base and is the first expiring + * timer then this might end up reprogramming the hardware twice + * (on removal and on enqueue). To avoid that by prevent the + * reprogram on removal, keep the timer local to the current CPU + * and enforce reprogramming after it is queued no matter whether + * it is the new first expiring timer again or not. 
+ */ + force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases); + force_local &= base->cpu_base->next_timer == timer; + + /* + * Remove an active timer from the queue. In case it is not queued + * on the current CPU, make sure that remove_hrtimer() updates the + * remote data correctly. + * + * If it's on the current CPU and the first expiring timer, then + * skip reprogramming, keep the timer local and enforce + * reprogramming later if it was the first expiring timer. This + * avoids programming the underlying clock event twice (once at + * removal and once after enqueue). + */ + remove_hrtimer(timer, base, true, force_local); if (mode & HRTIMER_MODE_REL) tim = ktime_add_safe(tim, base->get_time()); @@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(stru hrtimer_set_expires_range_ns(timer, tim, delta_ns); /* Switch the timer base, if necessary: */ - new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED); + if (!force_local) { + new_base = switch_hrtimer_base(timer, base, + mode & HRTIMER_MODE_PINNED); + } else { + new_base = base; + } - return enqueue_hrtimer(timer, new_base, mode); + first = enqueue_hrtimer(timer, new_base, mode); + if (!force_local) + return first; + + /* + * Timer was forced to stay on the current CPU to avoid + * reprogramming on removal and enqueue. Force reprogram the + * hardware by evaluating the new first expiring timer. + */ + hrtimer_force_reprogram(new_base->cpu_base, 1); + return 0; } /** @@ -1183,7 +1229,7 @@ int hrtimer_try_to_cancel(struct hrtimer base = lock_hrtimer_base(timer, &flags); if (!hrtimer_callback_running(timer)) - ret = remove_hrtimer(timer, base, false); + ret = remove_hrtimer(timer, base, false, false); unlock_hrtimer_base(timer, &flags); ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() 2021-04-26 8:49 ` [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner @ 2021-04-26 9:40 ` Peter Zijlstra 2021-04-26 12:25 ` Peter Zijlstra 2021-04-26 12:33 ` Thomas Gleixner 0 siblings, 2 replies; 23+ messages in thread From: Peter Zijlstra @ 2021-04-26 9:40 UTC (permalink / raw) To: Thomas Gleixner Cc: Lorenzo Colitti, Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Mon, Apr 26, 2021 at 10:49:33AM +0200, Thomas Gleixner wrote: > If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then > the timer has to be canceled first and then added back. If the timer is the > first expiring timer then on removal the clockevent device is reprogrammed > to the next expiring timer to avoid that the pending expiry fires needlessly. > > If the new expiry time ends up to be the first expiry again then the clock > event device has to reprogrammed again. > > Avoid this by checking whether the timer is the first to expire and in that > case, keep the timer on the current CPU and delay the reprogramming up to > the point where the timer has been enqueued again. 
> > Reported-by: Lorenzo Colitti <lorenzo@google.com> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de> > --- > kernel/time/hrtimer.c | 60 ++++++++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 53 insertions(+), 7 deletions(-) > > --- a/kernel/time/hrtimer.c > +++ b/kernel/time/hrtimer.c > @@ -1030,12 +1030,13 @@ static void __remove_hrtimer(struct hrti > * remove hrtimer, called with base lock held > */ > static inline int > -remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, bool restart) > +remove_hrtimer(struct hrtimer *timer, struct hrtimer_clock_base *base, > + bool restart, bool keep_local) > { > u8 state = timer->state; > > if (state & HRTIMER_STATE_ENQUEUED) { > - int reprogram; > + bool reprogram; > > /* > * Remove the timer and force reprogramming when high > @@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st > debug_deactivate(timer); > reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases); > > + /* > + * If the timer is not restarted then reprogramming is > + * required if the timer is local. If it is local and about > + * to be restarted, avoid programming it twice (on removal > + * and a moment later when it's requeued). > + */ > if (!restart) > state = HRTIMER_STATE_INACTIVE; > + else > + reprogram &= !keep_local; reprogram = reprogram && !keep_local; perhaps? > > __remove_hrtimer(timer, base, state, reprogram); > return 1; > @@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru > struct hrtimer_clock_base *base) > { > struct hrtimer_clock_base *new_base; > + bool force_local, first; > > - /* Remove an active timer from the queue: */ > - remove_hrtimer(timer, base, true); > + /* > + * If the timer is on the local cpu base and is the first expiring > + * timer then this might end up reprogramming the hardware twice > + * (on removal and on enqueue). 
To avoid that by prevent the > + * reprogram on removal, keep the timer local to the current CPU > + * and enforce reprogramming after it is queued no matter whether > + * it is the new first expiring timer again or not. > + */ > + force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases); > + force_local &= base->cpu_base->next_timer == timer; Using bitwise ops on a bool is cute and all, but isn't that more readable when written like: force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) && base->cpu_base->next_timer == timer; > + > + /* > + * Remove an active timer from the queue. In case it is not queued > + * on the current CPU, make sure that remove_hrtimer() updates the > + * remote data correctly. > + * > + * If it's on the current CPU and the first expiring timer, then > + * skip reprogramming, keep the timer local and enforce > + * reprogramming later if it was the first expiring timer. This > + * avoids programming the underlying clock event twice (once at > + * removal and once after enqueue). > + */ > + remove_hrtimer(timer, base, true, force_local); > > if (mode & HRTIMER_MODE_REL) > tim = ktime_add_safe(tim, base->get_time()); > @@ -1115,9 +1146,24 @@ static int __hrtimer_start_range_ns(stru > hrtimer_set_expires_range_ns(timer, tim, delta_ns); > > /* Switch the timer base, if necessary: */ > - new_base = switch_hrtimer_base(timer, base, mode & HRTIMER_MODE_PINNED); > + if (!force_local) { > + new_base = switch_hrtimer_base(timer, base, > + mode & HRTIMER_MODE_PINNED); > + } else { > + new_base = base; > + } > > - return enqueue_hrtimer(timer, new_base, mode); > + first = enqueue_hrtimer(timer, new_base, mode); > + if (!force_local) > + return first; > + > + /* > + * Timer was forced to stay on the current CPU to avoid > + * reprogramming on removal and enqueue. Force reprogram the > + * hardware by evaluating the new first expiring timer. 
> + */ > + hrtimer_force_reprogram(new_base->cpu_base, 1); > + return 0; > } There is an unfortunate amount of duplication between hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups don't work however :/ Still, does that in_hrtirq optimization make sense to have in force_reprogram ? ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() 2021-04-26 9:40 ` Peter Zijlstra @ 2021-04-26 12:25 ` Peter Zijlstra 2021-05-14 19:29 ` Thomas Gleixner 2021-04-26 12:33 ` Thomas Gleixner 1 sibling, 1 reply; 23+ messages in thread From: Peter Zijlstra @ 2021-04-26 12:25 UTC (permalink / raw) To: Thomas Gleixner Cc: Lorenzo Colitti, Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon On Mon, Apr 26, 2021 at 11:40:46AM +0200, Peter Zijlstra wrote: > There is an unfortunate amount of duplication between > hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups > don't work however :/ Still, does that in_hrtirq optimization make sense > to have in force_reprogram ? Something like so perhaps? --- kernel/time/hrtimer.c | 75 ++++++++++++++++++++++----------------------------- 1 file changed, 32 insertions(+), 43 deletions(-) diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c index 5c9d968187ae..fb8d2c58443a 100644 --- a/kernel/time/hrtimer.c +++ b/kernel/time/hrtimer.c @@ -652,21 +652,27 @@ static inline int hrtimer_hres_active(void) return __hrtimer_hres_active(this_cpu_ptr(&hrtimer_bases)); } -/* - * Reprogram the event source with checking both queues for the - * next event - * Called with interrupts disabled and base->lock held - */ static void -hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal) +__hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal, + struct hrtimer *next_timer, ktime_t expires_next) { - ktime_t expires_next; + /* + * If the hrtimer interrupt is running, then it will + * reevaluate the clock bases and reprogram the clock event + * device. The callbacks are always executed in hard interrupt + * context so we don't need an extra check for a running + * callback. 
+ */ + if (cpu_base->in_hrtirq) + return; - expires_next = hrtimer_update_next_event(cpu_base); + if (expires_next > cpu_base->expires_next) + return; if (skip_equal && expires_next == cpu_base->expires_next) return; + cpu_base->next_timer = next_timer; cpu_base->expires_next = expires_next; /* @@ -689,7 +695,23 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal) if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected) return; - tick_program_event(cpu_base->expires_next, 1); + tick_program_event(expires_next, 1); +} + +/* + * Reprogram the event source with checking both queues for the + * next event + * Called with interrupts disabled and base->lock held + */ +static void +hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal) +{ + ktime_t expires_next; + + expires_next = hrtimer_update_next_event(cpu_base); + + __hrtimer_force_reprogram(cpu_base, skip_equal, + cpu_base->next_timer, expires_next); } /* High resolution timer related functions */ @@ -835,40 +857,7 @@ static void hrtimer_reprogram(struct hrtimer *timer, bool reprogram) if (base->cpu_base != cpu_base) return; - /* - * If the hrtimer interrupt is running, then it will - * reevaluate the clock bases and reprogram the clock event - * device. The callbacks are always executed in hard interrupt - * context so we don't need an extra check for a running - * callback. - */ - if (cpu_base->in_hrtirq) - return; - - if (expires >= cpu_base->expires_next) - return; - - /* Update the pointer to the next expiring timer */ - cpu_base->next_timer = timer; - cpu_base->expires_next = expires; - - /* - * If hres is not active, hardware does not have to be - * programmed yet. - * - * If a hang was detected in the last timer interrupt then we - * do not schedule a timer which is earlier than the expiry - * which we enforced in the hang detection. We want the system - * to make progress. 
- */ - if (!__hrtimer_hres_active(cpu_base) || cpu_base->hang_detected) - return; - - /* - * Program the timer hardware. We enforce the expiry for - * events which are already in the past. - */ - tick_program_event(expires, 1); + __hrtimer_force_reprogram(cpu_base, true, timer, expires); } /* ^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() 2021-04-26 12:25 ` Peter Zijlstra @ 2021-05-14 19:29 ` Thomas Gleixner 0 siblings, 0 replies; 23+ messages in thread From: Thomas Gleixner @ 2021-05-14 19:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Lorenzo Colitti, Greg KH, Maciej Żenczykowski, Ingo Molnar, Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski, Will Deacon

On Mon, Apr 26 2021 at 14:25, Peter Zijlstra wrote:
> On Mon, Apr 26, 2021 at 11:40:46AM +0200, Peter Zijlstra wrote:
>> There is an unfortunate amount of duplication between
>> hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups
>> don't work however :/ Still, does that in_hrtirq optimization make sense
>> to have in force_reprogram ?

No. It needs to be in hrtimer_reprogram().

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
  2021-04-26  9:40   ` Peter Zijlstra
  2021-04-26 12:25     ` Peter Zijlstra
@ 2021-04-26 12:33     ` Thomas Gleixner
  2021-04-26 12:40       ` Peter Zijlstra
  1 sibling, 1 reply; 23+ messages in thread
From: Thomas Gleixner @ 2021-04-26 12:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lorenzo Colitti, Greg KH, Maciej Żenczykowski, Ingo Molnar,
	Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski,
	Will Deacon

On Mon, Apr 26 2021 at 11:40, Peter Zijlstra wrote:
> On Mon, Apr 26, 2021 at 10:49:33AM +0200, Thomas Gleixner wrote:
>> If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
>> the timer has to be canceled first and then added back. If the timer is the
>> first expiring timer then on removal the clockevent device is reprogrammed
>> to the next expiring timer to avoid that the pending expiry fires needlessly.

>>  	/*
>>  	 * Remove the timer and force reprogramming when high
>> @@ -1048,8 +1049,16 @@ remove_hrtimer(struct hrtimer *timer, st
>>  		debug_deactivate(timer);
>>  		reprogram = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>>  
>> +		/*
>> +		 * If the timer is not restarted then reprogramming is
>> +		 * required if the timer is local. If it is local and about
>> +		 * to be restarted, avoid programming it twice (on removal
>> +		 * and a moment later when it's requeued).
>> +		 */
>>  		if (!restart)
>>  			state = HRTIMER_STATE_INACTIVE;
>> +		else
>> +			reprogram &= !keep_local;
>
> 	reprogram = reprogram && !keep_local;
>
> perhaps?

Maybe

>>  		__remove_hrtimer(timer, base, state, reprogram);
>>  		return 1;
>> @@ -1103,9 +1112,31 @@ static int __hrtimer_start_range_ns(stru
>>  				    struct hrtimer_clock_base *base)
>>  {
>>  	struct hrtimer_clock_base *new_base;
>> +	bool force_local, first;
>>  
>> -	/* Remove an active timer from the queue: */
>> -	remove_hrtimer(timer, base, true);
>> +	/*
>> +	 * If the timer is on the local cpu base and is the first expiring
>> +	 * timer then this might end up reprogramming the hardware twice
>> +	 * (on removal and on enqueue). To avoid that by prevent the
>> +	 * reprogram on removal, keep the timer local to the current CPU
>> +	 * and enforce reprogramming after it is queued no matter whether
>> +	 * it is the new first expiring timer again or not.
>> +	 */
>> +	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>> +	force_local &= base->cpu_base->next_timer == timer;
>
> Using bitwise ops on a bool is cute and all, but isn't that more
> readable when written like:
>
> 	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
> 		      base->cpu_base->next_timer == timer;

Which results in an extra conditional branch.

>> +	/*
>> +	 * Timer was forced to stay on the current CPU to avoid
>> +	 * reprogramming on removal and enqueue. Force reprogram the
>> +	 * hardware by evaluating the new first expiring timer.
>> +	 */
>> +	hrtimer_force_reprogram(new_base->cpu_base, 1);
>> +	return 0;
>> }

> There is an unfortunate amount of duplication between
> hrtimer_force_reprogram() and hrtimer_reprogram(). The obvious cleanups
> don't work however :/ Still, does that in_hrtirq optimization make sense
> to have in force_reprogram ?

Yes, no, do not know. Let me have a look.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
  2021-04-26 12:33     ` Thomas Gleixner
@ 2021-04-26 12:40       ` Peter Zijlstra
  2021-04-26 14:27         ` Thomas Gleixner
  0 siblings, 1 reply; 23+ messages in thread
From: Peter Zijlstra @ 2021-04-26 12:40 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Lorenzo Colitti, Greg KH, Maciej Żenczykowski, Ingo Molnar,
	Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski,
	Will Deacon

On Mon, Apr 26, 2021 at 02:33:00PM +0200, Thomas Gleixner wrote:
>>> +	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>>> +	force_local &= base->cpu_base->next_timer == timer;
>>
>> Using bitwise ops on a bool is cute and all, but isn't that more
>> readable when written like:
>>
>> 	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
>> 		      base->cpu_base->next_timer == timer;
>
> Which results in an extra conditional branch.

Srlsy, compiler not smart enough?

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()
  2021-04-26 12:40       ` Peter Zijlstra
@ 2021-04-26 14:27         ` Thomas Gleixner
  0 siblings, 0 replies; 23+ messages in thread
From: Thomas Gleixner @ 2021-04-26 14:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lorenzo Colitti, Greg KH, Maciej Żenczykowski, Ingo Molnar,
	Anna-Maria Behnsen, lkml, mikael.beckius, Maciej Żenczykowski,
	Will Deacon

On Mon, Apr 26 2021 at 14:40, Peter Zijlstra wrote:
> On Mon, Apr 26, 2021 at 02:33:00PM +0200, Thomas Gleixner wrote:
>>>> +	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases);
>>>> +	force_local &= base->cpu_base->next_timer == timer;
>>>
>>> Using bitwise ops on a bool is cute and all, but isn't that more
>>> readable when written like:
>>>
>>> 	force_local = base->cpu_base == this_cpu_ptr(&hrtimer_bases) &&
>>> 		      base->cpu_base->next_timer == timer;
>>>
>> Which results in an extra conditional branch.
>
> Srlsy, compiler not smart enough?

gcc 8.3 is definitely not ...

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Sv: [PATCH] hrtimer: Interrupt storm on clock_settime
@ 2021-02-12 13:38 Beckius, Mikael
  2021-02-23 16:02 ` [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() Anna-Maria Behnsen
  0 siblings, 1 reply; 23+ messages in thread
From: Beckius, Mikael @ 2021-02-12 13:38 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel; +Cc: Anna-Maria Behnsen

Thanks for the update and sorry for the late reply.

After long-term testing of the patch with improved storm detection, it
turns out that a similar problem can occur if hrtimer_interrupt runs
during clock_settime. In this case it seems the offset can get updated
and later read using hrtimer_update_base in hrtimer_interrupt before
softirq_expires_next gets updated. As soon as softirq_expires_next gets
updated by hrtimer_force_reprogram the storm ends.

To fix this I made the below changes.

--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -529,8 +529,10 @@ static ktime_t __hrtimer_next_event_base(struct hrtimer_cpu_base *cpu_base,
 		if (exclude)
 			continue;
 
-		if (timer->is_soft)
+		if (timer->is_soft) {
 			cpu_base->softirq_next_timer = timer;
+			cpu_base->softirq_expires_next = expires;
+		}
 		else
 			cpu_base->next_timer = timer;
 	}
@@ -633,19 +635,6 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
 	 */
 	expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
 
-	if (cpu_base->next_timer && cpu_base->next_timer->is_soft) {
-		/*
-		 * When the softirq is activated, hrtimer has to be
-		 * programmed with the first hard hrtimer because soft
-		 * timer interrupt could occur too late.
-		 */
-		if (cpu_base->softirq_activated)
-			expires_next = __hrtimer_get_next_event(cpu_base,
-								HRTIMER_ACTIVE_HARD);
-		else
-			cpu_base->softirq_expires_next = expires_next;
-	}
-
 	if (skip_equal && expires_next == cpu_base->expires_next)
 		return;
--

This is similar to hrtimer_reprogram.

I also removed the cpu_base->softirq_activated case since, as far as I
can tell, expires_next must be hard if cpu_base->softirq_activated is
true. I might be missing something as I don't have the whole picture of
the hrtimer subsystem, but at least no interrupt storms are detected
during clock_settime with the latest changes.

Micke

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event()
  2021-02-12 13:38 Sv: [PATCH] hrtimer: Interrupt storm on clock_settime Beckius, Mikael
@ 2021-02-23 16:02 ` Anna-Maria Behnsen
  0 siblings, 0 replies; 23+ messages in thread
From: Anna-Maria Behnsen @ 2021-02-23 16:02 UTC (permalink / raw)
  To: mikael.beckius; +Cc: tglx, linux-kernel

hrtimer_force_reprogram() and hrtimer_interrupt() invoke
__hrtimer_get_next_event() to find the earliest expiry time of all hrtimer
bases. __hrtimer_get_next_event() does not update
cpu_base::[softirq_]expires_next to preserve reprogramming logic. That
needs to be done at the callsites.

hrtimer_force_reprogram() updates cpu_base::softirq_expires_next only when
the first expiring timer is a softirq timer and the soft interrupt is not
activated. That's wrong because cpu_base::softirq_expires_next is left
stale when the first expiring timer of all bases is a timer which expires
in hard interrupt context. hrtimer_interrupt() never updates
cpu_base::softirq_expires_next, which is wrong too.

That becomes a problem when clock_settime() sets CLOCK_REALTIME forward and
the first soft expiring timer is in the CLOCK_REALTIME_SOFT base. Setting
CLOCK_REALTIME forward moves the clock MONOTONIC based expiry time of that
timer before the stale cpu_base::softirq_expires_next.

cpu_base::softirq_expires_next is cached to make the check for raising the
soft interrupt fast. In the above case the soft interrupt won't be raised
until clock monotonic reaches the stale cpu_base::softirq_expires_next
value. That's incorrect, but what's worse is that if the softirq timer
becomes the first expiring timer of all clock bases after the hard expiry
timer has been handled, the reprogramming of the clockevent from
hrtimer_interrupt() will result in an interrupt storm. That happens because
the reprogramming does not use cpu_base::softirq_expires_next; it uses
__hrtimer_get_next_event(), which returns the actual expiry time.
Once clock MONOTONIC reaches cpu_base::softirq_expires_next the soft
interrupt is raised and the storm subsides.

Change the logic in hrtimer_force_reprogram() to evaluate the soft and hard
bases separately, update softirq_expires_next and handle the case when a
soft expiring timer is the first of all bases by comparing the expiry times
and updating the required cpu base fields. Split this functionality into a
separate function to be able to use it in hrtimer_interrupt() as well
without copy and paste.

Micke, please test this patch - it is compile tested only...

Fixes: 5da70160462e ("hrtimer: Implement support for softirq based hrtimers")
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Reported-by: Mikael Beckius <mikael.beckius@windriver.com>
---
 kernel/time/hrtimer.c | 53 +++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 19 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 46bd35d5fc0d..788b9d137de4 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -591,6 +591,37 @@ __hrtimer_get_next_event(struct hrtimer_cpu_base *cpu_base, unsigned int active_
 	return expires_next;
 }
 
+static ktime_t hrtimer_update_next_event(struct hrtimer_cpu_base *cpu_base)
+{
+	ktime_t expires_next, soft = KTIME_MAX;
+
+	/*
+	 * If the soft interrupt has already been activated, ignore the
+	 * soft bases. They will be handled in the already raised soft
+	 * interrupt.
+	 */
+	if (!cpu_base->softirq_activated) {
+		soft = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_SOFT);
+		/*
+		 * Update the soft expiry time. clock_settime() might have
+		 * affected it.
+		 */
+		cpu_base->softirq_expires_next = soft;
+	}
+
+	expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_HARD);
+	/*
+	 * If a softirq timer is expiring first, update cpu_base->next_timer
+	 * and program the hardware with the soft expiry time.
+	 */
+	if (expires_next > soft) {
+		cpu_base->next_timer = cpu_base->softirq_next_timer;
+		expires_next = soft;
+	}
+
+	return expires_next;
+}
+
 static inline ktime_t hrtimer_update_base(struct hrtimer_cpu_base *base)
 {
 	ktime_t *offs_real = &base->clock_base[HRTIMER_BASE_REALTIME].offset;
@@ -631,23 +662,7 @@ hrtimer_force_reprogram(struct hrtimer_cpu_base *cpu_base, int skip_equal)
 {
 	ktime_t expires_next;
 
-	/*
-	 * Find the current next expiration time.
-	 */
-	expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
-
-	if (cpu_base->next_timer && cpu_base->next_timer->is_soft) {
-		/*
-		 * When the softirq is activated, hrtimer has to be
-		 * programmed with the first hard hrtimer because soft
-		 * timer interrupt could occur too late.
-		 */
-		if (cpu_base->softirq_activated)
-			expires_next = __hrtimer_get_next_event(cpu_base,
-								HRTIMER_ACTIVE_HARD);
-		else
-			cpu_base->softirq_expires_next = expires_next;
-	}
+	expires_next = hrtimer_update_next_event(cpu_base);
 
 	if (skip_equal && expires_next == cpu_base->expires_next)
 		return;
@@ -1647,8 +1662,8 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
 
-	/* Reevaluate the clock bases for the next expiry */
-	expires_next = __hrtimer_get_next_event(cpu_base, HRTIMER_ACTIVE_ALL);
+	/* Reevaluate the clock bases for the [soft] next expiry */
+	expires_next = hrtimer_update_next_event(cpu_base);
 	/*
 	 * Store the new expiry value so the migration code can verify
 	 * against it.
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2021-05-14 19:29 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-13 16:55 [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() Maciej Żenczykowski
2021-04-13 17:14 ` Greg KH
2021-04-14  2:49 ` Lorenzo Colitti
2021-04-15 16:47 ` Thomas Gleixner
2021-04-20  3:12 ` Maciej Żenczykowski
2021-04-20  6:44 ` Thomas Gleixner
2021-04-20  8:15 ` Lorenzo Colitti
2021-04-20 14:19 ` Thomas Gleixner
2021-04-21 14:08 ` Lorenzo Colitti
2021-04-21 14:40 ` Lorenzo Colitti
2021-04-21 15:22 ` Greg KH
2021-04-22  0:08 ` Thomas Gleixner
2021-04-22 10:07 ` Thomas Gleixner
2021-04-22 14:20 ` Lorenzo Colitti
2021-04-22 15:35 ` Thomas Gleixner
2021-04-26  8:49 ` [PATCH] hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() Thomas Gleixner
2021-04-26  9:40 ` Peter Zijlstra
2021-04-26 12:25 ` Peter Zijlstra
2021-05-14 19:29 ` Thomas Gleixner
2021-04-26 12:33 ` Thomas Gleixner
2021-04-26 12:40 ` Peter Zijlstra
2021-04-26 14:27 ` Thomas Gleixner
-- strict thread matches above, loose matches on Subject: below --
2021-02-12 13:38 Sv: [PATCH] hrtimer: Interrupt storm on clock_settime Beckius, Mikael
2021-02-23 16:02 ` [PATCH] hrtimer: Update softirq_expires_next correctly after __hrtimer_get_next_event() Anna-Maria Behnsen