From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753261AbbDBPnD (ORCPT <rfc822;w@1wt.eu>);
	Thu, 2 Apr 2015 11:43:03 -0400
Received: from mail-db3on0065.outbound.protection.outlook.com ([157.55.234.65]:3146
	"EHLO emea01-db3-obe.outbound.protection.outlook.com"
	rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
	id S1751180AbbDBPm7 (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Thu, 2 Apr 2015 11:42:59 -0400
Message-ID: <551D6373.2030000@ezchip.com>
Date: Thu, 2 Apr 2015 11:42:43 -0400
From: Chris Metcalf <cmetcalf@ezchip.com>
User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Frederic Weisbecker <fweisbec@gmail.com>, Don Zickus <dzickus@redhat.com>
CC: Ingo Molnar <mingo@kernel.org>, Andrew Morton <akpm@linux-foundation.org>,
        Andrew Jones <drjones@redhat.com>,
        chai wen <chaiw.fnst@cn.fujitsu.com>,
        Ulrich Obergfell <uobergfe@redhat.com>,
        Fabian Frederick <fabf@skynet.be>, Aaron Tomlin <atomlin@redhat.com>,
        Ben Zhang <benzh@chromium.org>, Christoph Lameter <cl@linux.com>,
        Gilad Ben-Yossef <gilad@benyossef.com>,
        Steven Rostedt <rostedt@goodmis.org>,
        open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] watchdog: nohz: don't run watchdog on nohz_full cores
References: <1427741465-15747-1-git-send-email-cmetcalf@ezchip.com> <20150331072502.GA16754@gmail.com> <551AE7D4.3020608@ezchip.com> <20150402133502.GA175361@redhat.com> <551D48F9.6090101@ezchip.com> <20150402141527.GD175361@redhat.com> <20150402153827.GC10357@lerouge>
In-Reply-To: <20150402153827.GC10357@lerouge>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [12.216.194.146]
X-ClientProxiedBy: BN1PR12CA0033.namprd12.prod.outlook.com (25.160.77.43) To
 DB4PR02MB0543.eurprd02.prod.outlook.com (10.141.45.16)
Authentication-Results: vger.kernel.org; dkim=none (message not signed)
 header.d=none;
X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:DB4PR02MB0543;
X-Microsoft-Antispam-PRVS: <DB4PR02MB054357680235C5098F5E56F0AFF20@DB4PR02MB0543.eurprd02.prod.outlook.com>
X-Forefront-Antispam-Report: BMV:1;SFV:NSPM;SFS:(10009020)(6049001)(6009001)(24454002)(51704005)(377454003)(479174004)(23746002)(86362001)(46102003)(77096005)(62966003)(19580395003)(33656002)(76176999)(92566002)(93886004)(80316001)(87976001)(122386002)(50466002)(65816999)(42186005)(15975445007)(50986999)(87266999)(54356999)(64126003)(36756003)(2950100001)(66066001)(83506001)(77156002)(47776003)(18886065003);DIR:OUT;SFP:1101;SCL:1;SRVR:DB4PR02MB0543;H:[10.7.0.41];FPR:;SPF:None;MLV:sfv;LANG:en;
X-Exchange-Antispam-Report-Test: UriScan:;
X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(601004)(5005006)(5002010);SRVR:DB4PR02MB0543;BCL:0;PCL:0;RULEID:;SRVR:DB4PR02MB0543;
X-Forefront-PRVS: 0534947130
X-OriginatorOrg: ezchip.com
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 02 Apr 2015 15:42:55.4714 (UTC)
X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB4PR02MB0543
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/02/2015 11:38 AM, Frederic Weisbecker wrote:
> On Thu, Apr 02, 2015 at 10:15:27AM -0400, Don Zickus wrote:
>> On Thu, Apr 02, 2015 at 09:49:45AM -0400, Chris Metcalf wrote:
>>>> Can I ask how the NO_HZ_FULL technology works from userspace?  Is there a
>>>> system command that has to be sent?  How does the kernel know to turn off
>>>> ticks and trust userspace to do the right thing?
>>> The NO_HZ_FULL option, when configured into the kernel, lets
>>> you boot with "nohz_full=1-15" (or whatever cpumask you like),
>>> typically in conjunction with "isolcpus=1-15".  At this point no tasks
>>> will run on those cores until explicitly placed there by affinity, and
>>> once there and running in userspace, the kernel will automatically
>>> get out of their way and not interrupt at all.  This lets those tasks
>>> run with 100.000% of the cpu, which is a requirement for many
>>> user-space device drivers running high throughput devices.
>>> (This is typically the use case for the tile architecture customers.)
>>>
>>> So, other than a boot flag, there are no system commands or
>>> other APIs to deal with.
>> Ah, I am starting to understand your approach in the original patch better.
>>
>>> Part of the requirement, though, is that there can be only one task
>>> bound and runnable on that cpu, otherwise the kernel has to be
>>> involved to do the context-switching off of the scheduler tick.
>>> This is why having the standard watchdog kernel thread doesn't
>>> work in this context.
>> So, there is no preemption happening, which means the softlockup is rather
>> pointless.
> Still useful actually because nohz full only takes effect when a single task runs
> on the CPU. But there can still be more than 1 task running, just nohz full will
> be disabled. It all happens dynamically.
>
>> Can interrupts be disabled or handled on that cpu?  I am trying
>> to see if the hardlockup detector becomes rather silly on those cpus too.
> No, interrupts aren't disabled on these CPUs. Now the goal is to avoid them:
> migrate irqs, nohz full, etc...
>
> But there can be irqs. And actually there is at least 1 tick every second in
> order to keep the scheduler stats moving forward. We plan to get rid of it but
> anyway the point is that IRQ can happen on nohz full CPUs.
>
>>> I continue to suspect that the right model here is to disable the
>>> watchdog specifically on the cores that the user has tagged with
>>> the nohz_full boot argument.  I agree that there might be a case
>>> to be made for leaving the watchdog conditionally (as suggested
>>> by Ingo) but it should be possible to have the watchdogs on
>>> the nohz_full cores be turned off completely if desired.
>> I think I might be slowly coming around to your thoughts.  I might request a
>> different patch though based on the answers above.  Maybe even create a
>> subset of the online cpus for the watchdog to work off of.  The watchdog
>> would copy the online cpu mask, mask off the nohz cpus and just function
>> that way.  It would print loud messages for each nohz cpu it was masking
>> off.
> All agreed with that! We should at least keep the watchdog running on
> non-nohz-full CPUs. And also allow to re-enable it everywhere when needed,
> in case we have a lockup to chase on nohz full CPUs.
>
>> Then perhaps as a debug aid, expose a /proc/sys/kernel/watchdog_cpumask for
>> folks to modify in case they want to enable the watchdog on the nohz cpus.
> That sounds like a good idea.

OK, I will respin v2 of the patch as follows:

- Provide a watchdog_cpumask as suggested by Don.
- On a non-NO_HZ_FULL build, it defaults to cpu_possible as normal.
- On a NO_HZ_FULL build, it defaults to the housekeeping cpus.
- If the mask is modified, we disable and then re-enable the watchdog,
   so that the watchdog init code can exit() the threads for cpus that
   are not in the mask as they start up.

This should address the various concerns that have been raised.
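
A rough userspace model of the plan above, not the actual v2 patch.
It assumes at most 64 cpus held in one 64-bit word, and it assumes
"housekeeping cpus" means the possible cpus minus the nohz_full ones.
All names here (mask_t, mask_range, watchdog_default_mask,
struct watchdog_model, watchdog_set_mask, watchdog_should_run) are
made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustration only: one bit per cpu, at most 64 cpus. */
typedef uint64_t mask_t;

/* Build a mask with cpus first..last (inclusive) set, e.g. "1-15". */
static mask_t mask_range(unsigned first, unsigned last)
{
	mask_t m = 0;

	assert(first <= last && last < 64);
	for (unsigned cpu = first; cpu <= last; cpu++)
		m |= (mask_t)1 << cpu;
	return m;
}

/*
 * Default watchdog mask.
 * Non-NO_HZ_FULL build: every possible cpu.
 * NO_HZ_FULL build: housekeeping cpus only, assumed here to be
 * the possible cpus with the nohz_full cpus masked off.
 */
static mask_t watchdog_default_mask(mask_t possible, mask_t nohz_full,
				    bool no_hz_full_build)
{
	return no_hz_full_build ? (possible & ~nohz_full) : possible;
}

/* Hypothetical stand-in for the watchdog's global state. */
struct watchdog_model {
	mask_t mask;		/* cpus the watchdog should cover */
	bool enabled;
	unsigned restarts;	/* disable/enable cycles caused by mask changes */
};

/* Changing the mask is modeled as disable, update, re-enable. */
static void watchdog_set_mask(struct watchdog_model *wd, mask_t new_mask)
{
	if (wd->mask == new_mask)
		return;
	wd->enabled = false;
	wd->mask = new_mask;
	wd->enabled = true;
	wd->restarts++;
}

/*
 * The check a per-cpu watchdog thread would make as it starts up:
 * if this returns false for its cpu, the thread simply exits.
 */
static bool watchdog_should_run(const struct watchdog_model *wd, unsigned cpu)
{
	assert(cpu < 64);
	return wd->enabled && ((wd->mask >> cpu) & 1);
}
```

With 16 possible cpus and nohz_full=1-15, the NO_HZ_FULL default in
this model leaves only cpu 0 covered. Writing the full mask back
covers the nohz_full cpus again, at the cost of one disable/enable
cycle.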

-- 
Chris Metcalf, EZChip Semiconductor
http://www.ezchip.com