From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:48163) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TShq2-0005hZ-Ph for qemu-devel@nongnu.org; Mon, 29 Oct 2012 01:24:40 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1TShq1-00008n-B7 for qemu-devel@nongnu.org; Mon, 29 Oct 2012 01:24:38 -0400 Received: from mail-wg0-f53.google.com ([74.125.82.53]:42981) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TShq1-00008P-0f for qemu-devel@nongnu.org; Mon, 29 Oct 2012 01:24:37 -0400 Received: by mail-wg0-f53.google.com with SMTP id dr1so2730161wgb.10 for ; Sun, 28 Oct 2012 22:24:35 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <50898979.8000203@siemens.com> References: <1350897839-29593-1-git-send-email-pingfank@linux.vnet.ibm.com> <1350897839-29593-13-git-send-email-pingfank@linux.vnet.ibm.com> <50865DB9.1060106@siemens.com> <50879606.3070004@web.de> <50890002.90701@redhat.com> <508906DC.8090502@web.de> <50896707.9060405@redhat.com> <50896B51.3040708@siemens.com> <50897089.50304@redhat.com> <50898979.8000203@siemens.com> From: liu ping fan Date: Mon, 29 Oct 2012 13:24:15 +0800 Message-ID: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [patch v4 12/16] e1000: apply fine lock on e1000 List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jan Kiszka Cc: Liu Ping Fan , Stefan Hajnoczi , Marcelo Tosatti , "qemu-devel@nongnu.org" , Avi Kivity , Anthony Liguori , Paolo Bonzini On Fri, Oct 26, 2012 at 2:48 AM, Jan Kiszka wrote: > On 2012-10-25 19:02, Avi Kivity wrote: >> On 10/25/2012 06:39 PM, Jan Kiszka wrote: >>>> >>>> That doesn't work cross-thread. >>>> >>>> vcpu A: write to device X, dma-ing to device Y >>>> vcpu B: write to device Y, dma-ing to device X >>> >>> We will deny DMA-ing from device X on behalf of a VCPU, ie. in dispatch >>> context, to Y. >>> >>> What we do not deny, though, is DMA-ing from an I/O thread that >>> processes an event for device X. >> >> I would really like to avoid depending on the context. In real hardware= , there is no such thing. > > The point is how we deal with any kind of access to a device that > requires taking that device's lock while holding another lock, provided > that scenario can also take place in reverse order at the same time. > Known scenarios are: > > - vcpu 1 -> access device A -> access device B > - vcpu 2 -> access device B -> access device A > > - event 1 -> device A event processing -> access device B > - event 2 -> device B event processing -> access device A > > and combinations of those pairs. > >> >>> If the invoked callback of device X >>> holds the device lock across some DMA request to Y, then we risk to run >>> into the same ABBA issue. Hmm... >> >> Yup. >> >>> >>>> >>>> My suggestion was to drop the locks around DMA, then re-acquire the lo= ck >>>> and re-validate data. >>> >>> Maybe possible, but hairy depending on the device model. >> >> It's unpleasant, yes. >> >> Note depending on the device, we may not need to re-validate data, it ma= y be sufficient to load it into local variables to we know it is consistent= at some point. But all those solutions suffer from requiring device model= authors to understand all those issues, rather than just add a simple lock= around access to their data structures. > > Right. And therefor it is a suboptimal way to start (patching). > >> >>>>>>> I see that we have a all-or-nothing problem here: to address this >>>>>>> properly, we need to convert the IRQ path to lock-less (or at least >>>>>>> compatible with holding per-device locks) as well. >>>>>> >>>>>> There is a transitional path where writing to a register that can ca= use >>>>>> IRQ changes takes both the big lock and the local lock. >>>>>> >>>>>> Eventually, though, of course all inner subsystems must be threaded = for >>>>>> this work to have value. >>>>>> >>>>> >>>>> But that transitional path must not introduce regressions. Opening a >>>>> race window between IRQ cause update and event injection is such a >>>>> thing, just like dropping concurrent requests on the floor. >>>> >>>> Can you explain the race? >>> >>> Context A Context B >>> >>> device.lock >>> ... >>> device.set interrupt_cause =3D 0 >>> lower_irq =3D true >>> ... >>> device.unlock >>> device.lock >>> ... >>> device.interrupt_cause =3D 42 >>> raise_irq =3D true >>> ... >>> device.unlock >>> if (raise_irq) >>> bql.lock >>> set_irq(device.irqno) >>> bql.unlock >>> if (lower_irq) >>> bql.lock >>> clear_irq(device.irqno) >>> bql.unlock >>> >>> >>> And there it goes, our interrupt event. >> >> Obviously you'll need to reacquire the device lock after taking bql and = revalidate stuff. But that is not what I am suggesting. Instead, any path= that can lead to an irq update (or timer update etc) will take both the bq= l and the device lock. This will leave after the first pass only side effe= ct free register reads and writes, which is silly if we keep it that way, b= ut we will follow with a threaded timer and irq subsystem and we'll peel aw= ay those big locks. >> >> device_mmio_write: >> if register is involved in irq or timers or block layer or really an= ything that matters: >> bql.acquire >> device.lock.acquire >> do stuff >> device.lock.release >> if that big condition from above was true: >> bql.release > > Looks simpler than it is as you cannot wrap complete handlers with that > pattern. An example where it would fail (until we solved the locking > issues above): > > mmio_write: > bql.conditional_lock > device.lock > device.check_state > issue_dma > device.update_state > update_irq, play_with_timers, etc. > device.unlock > bql.conditional_unlock > > If that DMA request hits an unconverted MMIO region or one that takes > BQL conditionally as above, we will lock up (or bail out as our mutexes > detect the error). E1000's start_xmit looks like this so far, and that's > a pretty import service. > > Moreover, I prefer having a representative cut-through over enjoying to > merge a first step that excludes some 80% of the problems. For that > reason I would be even be inclined to start with addressing the IRQ > injection topic first (patch-wise), then the other necessary backend > services for the e1000 or whatever and convert some device(s) last. > > IOW: cut out anything from this series that touches e1000 until the > building blocks for converting it reasonably are finished. Carrying > experimental, partially broken conversion on top is fine, try to merge > pieces of that not, IMHO. > Agreed. Just want to take this opportunity to discuss what is the next and what is nut. Regards, pingfan > Jan > > -- > Siemens AG, Corporate Technology, CT RTC ITP SDP-DE > Corporate Competence Center Embedded Linux