From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754250AbbFJWh6 (ORCPT ); Wed, 10 Jun 2015 18:37:58 -0400
Received: from bombadil.infradead.org ([198.137.202.9]:46774 "EHLO
        bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1752276AbbFJWhv (ORCPT );
        Wed, 10 Jun 2015 18:37:51 -0400
Date: Thu, 11 Jun 2015 00:37:36 +0200
From: Peter Zijlstra
To: Oleg Nesterov
Cc: umgwanakikbuti@gmail.com, mingo@elte.hu, ktkhai@parallels.com,
        rostedt@goodmis.org, tglx@linutronix.de, juri.lelli@gmail.com,
        pang.xunlei@linaro.org, wanpeng.li@linux.intel.com,
        linux-kernel@vger.kernel.org
Subject: Re: [PATCH 08/14] hrtimer: Allow hrtimer::function() to free the timer
Message-ID: <20150610223736.GL3644@twins.programming.kicks-ass.net>
References: <20150605084836.364306429@infradead.org>
        <20150605085205.723058588@infradead.org>
        <20150607223317.GA5193@redhat.com>
        <20150608091417.GM19282@twins.programming.kicks-ass.net>
        <20150608124234.GW18673@twins.programming.kicks-ass.net>
        <20150609213318.GA12436@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150609213318.GA12436@redhat.com>
User-Agent: Mutt/1.5.21 (2012-12-30)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jun 09, 2015 at 11:33:18PM +0200, Oleg Nesterov wrote:
> And. Note that we can rewrite these 2 "write" critical sections in
> __run_hrtimer() and enqueue_hrtimer() as
>
>         cpu_base->running = timer;
>
>         write_seqcount_begin(cpu_base->seq);
>         write_seqcount_end(cpu_base->seq);
>
>         __remove_hrtimer(timer);
>
> and
>
>         timer->state |= HRTIMER_STATE_ENQUEUED;
>
>         write_seqcount_begin(cpu_base->seq);
>         write_seqcount_end(cpu_base->seq);
>
>         base->running = NULL;
>
> So we can probably use write_seqcount_barrier() except I am not sure
> about the 2nd wmb...

Which second wmb?

In any case, you use that transform from your reply to Kirill, and I
cannot currently see a hole in that. Let's call this transformation A. It
gets us the quoted bit above.

Now the above is:

        seq++;
        smp_wmb();
        smp_wmb();
        seq++;

Now, double barriers are pointless, so I think we can all agree that the
above is identical to the below. Let's call this transformation B.

        seq++;
        smp_wmb();
        seq++;

And then because you use the traditional seqcount read side, which
stalls when seq&1, we can transform the above into this. Transformation C.

        smp_wmb();
        seq += 2;

Which is write_seqcount_barrier(), as you say above.

And since there are no odd numbers possible in that scheme, it's identical
to my modified read side with the single increment. Transform D.

The only difference at this point is that I have my seq increment on the
'wrong' side on the first state.

        cpu_base->running = timer;

        seq++;
        smp_wmb();

        timer->state = 0;

        ...

        timer->state = 1;

        smp_wmb();
        seq++;

        cpu_base->running = NULL;

Which, per my previous mail provides the following:

        [S] seq++
                                        [R] seq
                                            RMB
                                        [R] ->running (== NULL)
        [S] ->running = timer;
            WMB
        [S] ->state = INACTIVE
                                        [R] ->state (== INACTIVE)
                                            RMB
                                        [R] seq (== seq)

Which is where we had to modify the read side to do:

        [R] ->state
            RMB
        [R] ->running

Now, if we use write_seqcount_barrier() that would become:

        __run_hrtimer()                 hrtimer_active()

        [S] ->running = timer;          [R] seq
            WMB                             RMB
        [S] seq += 2;                   [R] ->running
        [S] ->state = 0;                [R] ->state
                                            RMB
                                        [R] seq

Which we can reorder like:

                                        [R] seq
                                            RMB
                                        [R] ->running (== NULL)
        [S] ->running = timer
            WMB
        [S] ->state = 0
                                        [R] ->state (== 0)
                                            RMB
                                        [R] seq (== seq)
        [S] seq += 2

Which still gives us that false negative and would still require the
read side to be modified to do:

        [R] ->state
            RMB
        [R] ->running

IOW, one of our transforms (A-D) is faulty, for it requires a modification
to the read side. I suspect it's T-C, where we lose the odd count that
holds up the read side.

Because the moment we go from:

        Y = true;
        seq++
        WMB
        seq++
        X = false;

to:

        Y = true;
        WMB
        seq += 2;
        X = false;

It becomes possible to re-order like:

        Y = true;
        WMB
        X = false
        seq += 2;

And we lose our read order; or rather, where previously we ordered the
read side by seq, the seq increments are no longer ordered.

With this I think we can prove my code correct, however it also suggests
that:

        cpu_base->running = timer;
        seq++;
        smp_wmb();
        seq++;
        timer->state = 0;

        ...

        timer->state = 1;
        seq++;
        smp_wmb();
        seq++;
        cpu_base->running = NULL;

vs

        hrtimer_active(timer)
        {
                do {
                        base = READ_ONCE(timer->base->cpu_base);
                        seq = read_seqcount_begin(&base->seq);

                        if (timer->state & ENQUEUED ||
                            base->running == timer)
                                return true;

                } while (read_seqcount_retry(&base->seq, seq) ||
                         base != READ_ONCE(timer->base->cpu_base));

                return false;
        }

Is the all-round cheapest solution. Those extra seq increments are
almost free on all archs as the cacheline will be hot and modified on
the local cpu.

Only under the very rare condition of a concurrent hrtimer_active() call
will that seq line be pulled into shared state.

I shall go sleep now, and update my patch tomorrow, let's see if I will
still agree with myself after a sleep :-)
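
For anyone who wants to poke at the double-increment scheme outside the
kernel, here is a minimal userspace sketch in C11. Everything named
model_* is hypothetical and not kernel code: the release fence merely
stands in for smp_wmb(), the acquire fences stand in for the RMBs inside
read_seqcount_begin()/read_seqcount_retry(), and the re-check of
timer->base->cpu_base is left out because the model has a single base. It
only walks the state machine single-threaded; it is not a proof of the
ordering argument.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct model_timer;

struct model_base {
	atomic_uint seq;                       /* models cpu_base->seq */
	_Atomic(struct model_timer *) running; /* models cpu_base->running */
};

struct model_timer {
	atomic_uint state;                     /* bit 0 models ENQUEUED */
	struct model_base *base;
};

#define MODEL_ENQUEUED 1u

/* seq++; smp_wmb(); seq++;  The odd value in between stalls readers. */
static void model_seq_fence(struct model_base *b)
{
	atomic_fetch_add_explicit(&b->seq, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release); /* assumption: models smp_wmb() */
	atomic_fetch_add_explicit(&b->seq, 1, memory_order_relaxed);
}

/* Callback start: running = timer; fence; state = 0. */
static void model_run_begin(struct model_timer *t)
{
	atomic_store_explicit(&t->base->running, t, memory_order_relaxed);
	model_seq_fence(t->base);
	atomic_store_explicit(&t->state, 0, memory_order_relaxed);
}

/* Requeue after the callback: state = 1; fence; running = NULL. */
static void model_run_end_requeue(struct model_timer *t)
{
	atomic_store_explicit(&t->state, MODEL_ENQUEUED, memory_order_relaxed);
	model_seq_fence(t->base);
	atomic_store_explicit(&t->base->running, (struct model_timer *)NULL,
			      memory_order_relaxed);
}

/* Unmodified, traditional seqcount read side. */
static bool model_hrtimer_active(struct model_timer *t)
{
	struct model_base *b = t->base;
	unsigned int seq;

	do {
		/* like read_seqcount_begin(): wait out odd counts, then RMB */
		while ((seq = atomic_load_explicit(&b->seq,
						   memory_order_relaxed)) & 1)
			;
		atomic_thread_fence(memory_order_acquire);

		if ((atomic_load_explicit(&t->state, memory_order_relaxed) &
		     MODEL_ENQUEUED) ||
		    atomic_load_explicit(&b->running, memory_order_relaxed) == t)
			return true;

		/* like read_seqcount_retry(): RMB, then compare the count */
		atomic_thread_fence(memory_order_acquire);
	} while (atomic_load_explicit(&b->seq, memory_order_relaxed) != seq);

	return false;
}

/* Walk one timer through a callback run and a requeue; returns final seq. */
static unsigned int model_demo(void)
{
	struct model_base b;
	struct model_timer t;

	atomic_init(&b.seq, 0);
	atomic_init(&b.running, (struct model_timer *)NULL);
	atomic_init(&t.state, MODEL_ENQUEUED);
	t.base = &b;

	assert(model_hrtimer_active(&t)); /* enqueued */
	model_run_begin(&t);
	assert(model_hrtimer_active(&t)); /* callback running, state == 0 */
	model_run_end_requeue(&t);
	assert(model_hrtimer_active(&t)); /* enqueued again */

	return atomic_load(&b.seq);
}
```

Calling model_demo() steps through both write sections; the timer reads
as active at every point, and the count is even whenever the writer is
outside model_seq_fence(), which is what lets the plain read side stay
as it is.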