From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751996AbaKJU1N (ORCPT );
	Mon, 10 Nov 2014 15:27:13 -0500
Received: from mail-wg0-f52.google.com ([74.125.82.52]:33294 "EHLO
	mail-wg0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751218AbaKJU1L (ORCPT );
	Mon, 10 Nov 2014 15:27:11 -0500
Date: Mon, 10 Nov 2014 21:26:59 +0100
From: Frederic Weisbecker
To: Christoph Lameter
Cc: Thomas Gleixner , linux-kernel@vger.kernel.org, Gilad Ben-Yossef ,
	Tejun Heo , John Stultz , Mike Frysinger , Minchan Kim ,
	Hakan Akkan , Max Krasnyansky , "Paul E. McKenney" , Hugh Dickins ,
	Viresh Kumar , "H. Peter Anvin" , Ingo Molnar , Peter Zijlstra
Subject: Re: [NOHZ] Remove scheduler_tick_max_deferment
Message-ID: <20141110202655.GB29741@lerouge>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Nov 01, 2014 at 04:52:13PM -0500, Christoph Lameter wrote:
> On Sat, 1 Nov 2014, Thomas Gleixner wrote:
>
> > On Fri, 31 Oct 2014, Christoph Lameter wrote:
> > > The reasoning behind this function is not clear to me and removal seems
> >
> > The comment above the function is clear enough.
>
> I looked around into the functions called by the timer interrupt for
> accounting etc. They have measures to compensate if the HZ is not
> occurring for some time.

Not very well. They correctly handle dynticks idle but not full dynticks.
Check out update_cpu_load_active() -> __update_cpu_load() for example.
There is a pending_updates argument that takes care of the tickless
delta, but decay_load_missed() catches up on the missing cpu load by
assuming it was all 0 (idle) for that whole time.

Generally speaking, the scheduler assumes dynticks to be idle dynticks.
And that concerns the above example and probably a lot of other
accounting.

Now the issue with update_cpu_load_active() is there whether we keep
the 1 Hz tick or not: any delta of full dynticks workload makes it buggy
because it's accounted as idle load.

But removing the 1 Hz residual tick is dangerous because a lot of the
accounting in the scheduler tick assumes regular updates. It's mostly ok
as long as the accounting is exclusively updated and read locally. But
some accounting is updated locally and read remotely. So if CPU 0 is
full dynticks and runs for 1 hour in userspace and CPU 1 reads its
stats, those will be buggy because of the missing updates. At best in
this scenario CPU 1 may consider that CPU 0 has been idle for 1 hour; at
worst the stats can be junk and there can be crashes.

Also, a lot of the scheduler's decisions are based on this accounting.
Load balancing at the very least.

So we have two possible solutions:

1) Make the scheduler more full-dynticks aware. Which means that any
   remote stat accounting read must handle out of date results. That's
   going to be tricky: if you check scheduler_tick() and
   sched_class::task_tick(), even simply trying to sort out which stat
   is updated, can handle busy dynticks load, is read only locally or
   can be read remotely, handles overflow, etc... That's enough work for
   an army of ants.

2) Offload scheduler_tick() to the housekeeping CPU. It looks like many
   of the updaters there can easily take a remote rq argument. There
   doesn't seem to be much local rq assumption. So that's the easiest
   solution.

But we can't just remove scheduler_tick_max_deferment() and not fix
things behind it. The result will be unpredictably insane and dangerous.
The only predictable thing that's going to happen if we do that is that
nobody will ever fix it properly.
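
To make the "accounted as idle load" problem concrete, here is a
minimal userspace sketch. It is not the kernel code: the function
names, the 1024 load scale and the simple halving average are all
assumptions chosen for illustration. It only models the idea that a
catch-up path folds in one zero sample per missed tick, which is right
for an idle CPU and wrong for a CPU that was busy in userspace the
whole time.

```c
#include <assert.h>

/* Assumed scale for "one fully busy CPU"; illustrative only. */
#define FULL_LOAD 1024UL

/*
 * Hypothetical per-tick update: a plain exponential average that gives
 * weight 1/2 to the new sample. The real scheduler code is more
 * elaborate; this is only a model.
 */
static unsigned long tick_update(unsigned long avg, unsigned long sample)
{
	return (avg + sample) / 2;
}

/*
 * Catch-up after `missed` ticks, assuming the CPU was idle for all of
 * them, i.e. every missed sample is treated as 0.
 */
static unsigned long catch_up_assuming_idle(unsigned long avg,
					    unsigned long missed)
{
	while (missed--)
		avg = tick_update(avg, 0);
	return avg;
}

/*
 * What regular ticks would have produced if the CPU had really been
 * running at `load` during the same `missed` ticks.
 */
static unsigned long catch_up_with_ticks(unsigned long avg,
					 unsigned long load,
					 unsigned long missed)
{
	while (missed--)
		avg = tick_update(avg, load);
	return avg;
}
```

In this model, a CPU that starts at FULL_LOAD and stays busy for 10
tickless periods ends up at 1 through catch_up_assuming_idle(), while
catch_up_with_ticks() keeps it at FULL_LOAD. A remote reader of the
first value would see an almost idle CPU.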