From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753068Ab1FZKT5 (ORCPT ); Sun, 26 Jun 2011 06:19:57 -0400 Received: from mail-fx0-f52.google.com ([209.85.161.52]:39714 "EHLO mail-fx0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753042Ab1FZKTv (ORCPT ); Sun, 26 Jun 2011 06:19:51 -0400 DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; b=hcPRYLAN1IYi9ro3t8wvHBOcODn+qUg6QPKE1YeFgliAgf9DC6ww2jJl5LxZzmPuAj JNkhvs7GiI8pHf4XyZ5diyRkTbcE7e7iQWo3vsKvBrLPjIOzUg0HUGHTxGeyGx4Fo+wY 8Q/Av3lb51jIi62af0RFpUvb+Q5AueRe8+zMk= Date: Sun, 26 Jun 2011 12:19:45 +0200 From: Tejun Heo To: Thomas Gleixner Cc: Ingo Molnar , LKML , Peter Zijlstra , Jens Axboe , Linus Torvalds Subject: Re: [patch 4/4] sched: Distangle worker accounting from rq->lock Message-ID: <20110626101945.GB12200@mtj.dyndns.org> References: <20110622174659.496793734@linutronix.de> <20110622174919.135236139@linutronix.de> <20110623083722.GF30101@htj.dyndns.org> <20110623101541.GL30101@htj.dyndns.org> <20110623104455.GA9274@elte.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello, Thomas. On Fri, Jun 24, 2011 at 11:01:12AM +0200, Thomas Gleixner wrote: > The whole design is completely plugged into the scheduler, as you > _abuse_ rq->lock to serialize the accounting and allow the unlocked > access to the idle_list. It does not matter at all whether that code > path is fast or not, it simply does not belong there. It didn't happen like that. Hooking into scheduler was the one of the biggest issues and there were several different attempts at using / generalizing existing hooks somehow which took good six months without much agreement from scheduler side. I don't recall all the details now but IIRC it tried to use fire_sched_out_preempt_notifiers() which was called inside rq lock which introduced all the convolution with rq lock. Later somebody suggested just open coding it and being done with it and that's how it ended up inside rq lock. Hindsight is 20/20 and we could have pulled it out of rq lock at that time but the rest of the code has been already running with various different hooking schemes for quite some time by then. Anyways, so, let's clean it up. > The only reason why you want this precise accounting thing is a corner > case of system overload, but not the common case. In case of system > overload you can do extra work to figure out how many of these beasts > are on the fly and whether it makes sense to start some more, simply > because it does not matter in a overload situation whether you run > through a list to gather information. Sure, there are multiple ways to resolve the situation. I was mostly worried about waking up / attempting to start more workers under CPU pressure (this is fine under memory pressure as CPU is idle and memory allocation failure / blocking would quickly recall rescuers). I've been looking at the code and am still not sure but it could be that this isn't really something to worry about. Workqueue code always wakes up the first idle worker and a worker leaves idle state not by its waker but on its own accord when it start running, which means that, with the suggested change, there can be multiple spurious wake ups but they'll all try to wake up the same idle worker which hasn't started running yet. We might still unnecessarily try to summon rescuers tho. It needs more considerations and experimentations. Another concern is the additional spin_lock in the sleeping path to synchronize against CPU unplug operation I mentioned in the other reply. Another way to do it would be using preempt_disable/enable() instead and calling synchronize_sched() from trustee. It would only add an extra preemption flipping compared to the current code. > Can we please stop adding stuff to the most crucial code pathes in the > kernel just to avoid thinking a bit harder about problems? Heh... sure, let's just keep it technical, okay? Thanks. -- tejun