From: Alex Bligh
Date: Tue, 8 Oct 2013 21:01:57 +0100
To: Hans de Goede
Cc: Paolo Bonzini, qemu-devel@nongnu.org, Alex Bligh
Subject: Re: [Qemu-devel] [PATCH] main-loop: Don't lock starve io-threads when main_loop_tlg has pending events
Message-Id: <6CD4D4EE-341E-48D0-98F4-D55C0D3922D4@alex.org.uk>
In-Reply-To: <52546000.6070308@redhat.com>
References: <1381259403-7386-1-git-send-email-hdegoede@redhat.com> <52545950.5070403@redhat.com> <52545B44.70005@redhat.com> <28EE4224-856B-4DC1-8159-A0C274BD269E@alex.org.uk> <52546000.6070308@redhat.com>

On 8 Oct 2013, at 20:41, Hans de Goede wrote:

>>> Wasn't it 1 ms until the offending commit (note 250 us does
>>> sound better to me).
>>
>> I believe you've made it 1 nanosecond not 1 millisecond;
>
> Correct, the 1 ms I referred to was before your commit which changed
> things from ms to ns.

OK, I was looking at the patch as it would apply to master now.

> The purpose of the 1 ns timeout is to cause os_host_main_loop_wait
> to unlock the iothread; as $subject says, the problem I'm seeing seems
> to be lock starvation, not cpu starvation.
> Note as I already indicated I'm in no way an expert in this; if you
> and/or Paolo suspect cpu starvation may happen too, then bumping
> the timeout to 250 us is fine with me too.
>
> If we go with 250 us, that does pose the question of whether we should
> always keep a minimum timeout of 250 us when not non-blocking, or only
> bump it to 250 us when main_loop_tlg has already expired events and
> thus is causing a timeout of 0.

I am by no means an expert in the iothread bit, so let's pool our
ignorance ... :-)

Somewhere within that patch series (7b595f35, I think) I fixed up the
spin counter bit, which made it slightly less yucky and made it work
with milliseconds. I hope I didn't break it, but there seems to be
something slightly odd about the use case here.

If you are getting the spin error, this implies something is pretty much
constantly polling os_host_main_loop_wait with a zero timeout. As you
point out, this is going to be main_loop_wait, and almost certainly
main_loop_wait called with nonblocking set to 1. The comment at line 208
suggests that "the I/O thread is very busy or we are incorrectly busy
waiting in the I/O thread". Do we know which is happening?

Perhaps rather than give up the io_thread mutex on every call (which is
in practice what a 1 nanosecond timeout does), we should give it up only
if we have not released it for X nanoseconds (maybe X = 250 us), or on
every Y calls. I think someone other than me should consider the effect
of dropping and reacquiring a mutex so frequently under heavy I/O load;
I'm not sure it's a great idea.

So on reflection you might be more right with 1 nanosecond than 250 us
as a timeout, but I wonder whether a strategy of just dropping the lock
occasionally (and still using a zero timeout) might be better.

-- 
Alex Bligh
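P.S. To make the "release the mutex after X nanoseconds, or on every Y
calls" idea above concrete, here is a minimal sketch of the bookkeeping
it would need. All names and thresholds here are invented for
illustration; this is not QEMU's actual API, just the shape of the
policy check os_host_main_loop_wait could run on each iteration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy state: when did we last release the iothread
 * mutex, and how many calls have we made since?  (Illustrative only,
 * not QEMU's actual data structures.) */
typedef struct {
    int64_t last_release_ns;       /* clock value at last release */
    unsigned calls_since_release;  /* calls since last release */
} LockPolicy;

#define DROP_INTERVAL_NS   (250 * 1000)  /* X = 250 us, in ns */
#define DROP_EVERY_N_CALLS 64            /* Y = arbitrary call budget */

/* Returns true when the caller should drop and reacquire the mutex:
 * either X ns have elapsed since the last release, or Y calls have
 * gone by without one.  Resets the counters when it fires. */
static bool should_release_lock(LockPolicy *p, int64_t now_ns)
{
    p->calls_since_release++;
    if (now_ns - p->last_release_ns >= DROP_INTERVAL_NS ||
        p->calls_since_release >= DROP_EVERY_N_CALLS) {
        p->last_release_ns = now_ns;
        p->calls_since_release = 0;
        return true;
    }
    return false;
}
```

The point of the sketch is that the poll timeout can then stay at zero:
the loop keeps polling eagerly, and lock fairness is handled by this
separate, cheap check rather than by sleeping for 1 ns.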