From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755586AbcGFS2r (ORCPT );
	Wed, 6 Jul 2016 14:28:47 -0400
Received: from mail-pf0-f178.google.com ([209.85.192.178]:33875 "EHLO
	mail-pf0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1755538AbcGFS2p (ORCPT );
	Wed, 6 Jul 2016 14:28:45 -0400
Date: Wed, 6 Jul 2016 11:28:42 -0700
From: Viresh Kumar
To: Tejun Heo
Cc: Greg Kroah-Hartman , Linux Kernel Mailing List ,
	vlevenetz@mm-sol.com, vaibhav.hiremath@linaro.org,
	alex.elder@linaro.org, johan@kernel.org, akpm@linux-foundation.org,
	jack@suse.cz, rostedt@goodmis.org
Subject: Re: [Query] Preemption (hogging) of the work handler
Message-ID: <20160706182842.GS2671@ubuntu>
References: <20160701165959.GR12473@ubuntu>
	<20160701172232.GD28719@htj.duckdns.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160701172232.GD28719@htj.duckdns.org>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On 01-07-16, 12:22, Tejun Heo wrote:
> Hello, Viresh.
> 
> On Fri, Jul 01, 2016 at 09:59:59AM -0700, Viresh Kumar wrote:
> > The system watchdog uses a delayed-work (1 second) for petting the
> > watchdog (resetting its counter) and if the work doesn't reset the
> > counters in time (another 1 second), the watchdog resets the system.
> > 
> > Petting-time: 1 second
> > Watchdog Reset-time: 2 seconds
> > 
> > The wq is allocated with:
> > 	wdog_wq = alloc_workqueue("wdog", WQ_HIGHPRI, 0);
> 
> You probably want WQ_MEM_RECLAIM there to guarantee that it can run
> quickly under memory pressure. In reality, this shouldn't matter too
> much as there aren't many competing highpri work items and thus it's
> likely that there are ready highpri workers waiting for work items.
> 
> > The watchdog's work-handler looks like this:
> > 
> > static void pet_watchdog_work(struct work_struct *work)
> > {
> > 	...
> > 
> > 	pet_watchdog(); //Reset its counters
> > 
> > 	/* CONFIG_HZ=300, queuing for 1 second */
> > 	queue_delayed_work(wdog_wq, &wdog_dwork, 300);
> > }
> > 
> > kernel: 3.10 (Yeah, you can rant me for that, but its not something I
> > can decide on :)
> 
> Android?
> 
> > Symptoms:
> > 
> > - The watchdog reboots the system sometimes. It is more reproducible
> >   in cases where an (out-of-tree) bus enumerated over USB is suddenly
> >   disconnected, which leads to removal of lots of kernel devices on
> >   that bus and a lot of print messages as well, due to failures for
> >   sending any more data for those devices..
> 
> I see.
> 
> > Observations:
> > 
> > I tried to get more into it and found this..
> > 
> > - The timer used by the delayed work fires at the time it was
> >   programmed for (checked timer->expires with value of jiffies) and
> >   the work-handler gets a chance to run and reset the counters pretty
> >   quickly after that.
> 
> Hmmm...
> 
> > - But somehow, the timer isn't programmed for the right time.
> > 
> > - Something is happening between the time the work-handler starts
> >   running and we read jiffies from the add_timer() function which gets
> >   called from within the queue_delayed_work().
> > 
> > - For example, if the value of jiffies in the pet_watchdog_work()
> >   handler (before calling queue_delayed_work()) is say 1000000, then
> >   the value of jiffies after the call to queue_delayed_work() has
> >   returned becomes 1000310. i.e.
> >   it sometimes increases by a value of over 300, which is 1 second
> >   in our setup. I have seen this delta vary from 50 to 350. If it
> >   crosses 300, the watchdog resets the system (as it was programmed
> >   for 2 seconds).
> 
> That's weird. Once the work item starts executing, there isn't much
> which can delay it. queue_delayed_work() doesn't even take any lock
> before reading jiffies. In the failing cases, what's jiffies right
> before and after pet_watchdog_work()? Can that take long?
> 
> > So, we aren't able to queue the next timer in time and that causes all
> > these problems. I haven't concluded on why that is so..
> > 
> > Questions:
> > 
> > - I hope that the wq handler can be preempted, but can it be this bad?
> 
> It doesn't get preempted more than any other kthread w/ -20 nice
> value, so in most systems it shouldn't get preempted at all.
> 
> > - Is it fine to use the wq-handler for petting the watchdog? Or should
> >   that only be done with help of interrupt-handlers?
> 
> It's absolutely fine but you'd prolly want WQ_HIGHPRI |
> WQ_MEM_RECLAIM.
> 
> > - Any other clues you can give which can help us figure out what's
> >   going on?
> > 
> > Thanks in advance and sorry to bother you :)
> 
> I'd watch sched TPs and see what actually is going on. The described
> scenario should work completely fine.

Okay, so I traced what's going on on the CPU for that long, and I know
what it is now: the *print* messages :)

I enabled traces with '-e all' to look at everything happening on the
CPU. Following is what starts in the middle of the delayed-work
handler:

kworker/0:1H-40 [000] d..1 2994.918766: console: [ 2994.918754] ***
kworker/0:1H-40 [000] d..1 2994.923057: console: [ 2994.918817] ***
kworker/0:1H-40 [000] d..1 2994.935639: console: [ 2994.918842] ***
...
(more messages)
kworker/0:1H-40 [000] d..1 2996.264372: console: [ 2994.943074] ***
kworker/0:1H-40 [000] d..1 2996.268622: console: [ 2994.943410] ***
kworker/0:1H-40 [000] d..1 2996.275050: ipi_entry: (Rescheduling interrupts)

(*** are errors specific to our usecase and aren't relevant here)

These print messages continue from 2994.918 to 2996.268 (1.35 seconds)
and they hog the work-handler for that long, which results in a
watchdog reboot in our setup.

The 3.10 kernel implementation of printk looks like this (if I am not
wrong):

	local_irq_save(flags);
	flush-console-buffer();		/* i.e. console_unlock() */
	local_irq_restore(flags);

So the current CPU will try to print all the messages from the buffer
before enabling interrupts again on the local CPU, and that is why I
don't see the hrtimer fire at all for almost a second.

I tried to find out if something related to this changed between 3.10
and mainline, and found a few patches at least. One of the important
ones is:

commit 5874af2003b1 ("printk: enable interrupts before calling
console_trylock_for_printk()")

I wasn't able to backport it cleanly to 3.10 yet to see if it makes
things work better, but it looks like it was targeting similar
problems. @Jan Kara, right?

Any other useful piece of advice/information you guys have? I see Jan,
Andrew and Steven have done a bunch of work in printk.c to make it
better, and so I'm Cc'ing them.

-- 
viresh
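
For completeness, here is a minimal, self-contained sketch of the
petting pattern discussed above, with Tejun's WQ_HIGHPRI |
WQ_MEM_RECLAIM suggestion folded in. The wdog_wq / wdog_dwork /
pet_watchdog() / pet_watchdog_work() names are the ones from the
snippets earlier in the thread; the init function, the extern
declaration and the jiffies instrumentation are illustrative
assumptions only, not our actual driver code:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

/* Assumed to be provided elsewhere in the driver; resets the hardware counters. */
extern void pet_watchdog(void);

static struct workqueue_struct *wdog_wq;
static struct delayed_work wdog_dwork;

static void pet_watchdog_work(struct work_struct *work)
{
	unsigned long before = jiffies;

	pet_watchdog();

	/* Re-arm for 1 second; HZ == 300 with CONFIG_HZ=300. */
	queue_delayed_work(wdog_wq, &wdog_dwork, HZ);

	/*
	 * Report how long the pet + re-arm took. trace_printk() keeps
	 * the instrumentation out of the console path that is hogging
	 * the CPU in the first place.
	 */
	if (time_after(jiffies, before + HZ / 2))
		trace_printk("watchdog pet delayed by %lu ticks\n",
			     jiffies - before);
}

static int __init wdog_init(void)
{
	/* WQ_MEM_RECLAIM adds a rescuer so the work can run under memory pressure. */
	wdog_wq = alloc_workqueue("wdog", WQ_HIGHPRI | WQ_MEM_RECLAIM, 0);
	if (!wdog_wq)
		return -ENOMEM;

	INIT_DELAYED_WORK(&wdog_dwork, pet_watchdog_work);
	queue_delayed_work(wdog_wq, &wdog_dwork, HZ);
	return 0;
}
module_init(wdog_init);

MODULE_LICENSE("GPL");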