From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:48163)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <qemulist@gmail.com>) id 1TShq2-0005hZ-Ph
	for qemu-devel@nongnu.org; Mon, 29 Oct 2012 01:24:40 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <qemulist@gmail.com>) id 1TShq1-00008n-B7
	for qemu-devel@nongnu.org; Mon, 29 Oct 2012 01:24:38 -0400
Received: from mail-wg0-f53.google.com ([74.125.82.53]:42981)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <qemulist@gmail.com>) id 1TShq1-00008P-0f
	for qemu-devel@nongnu.org; Mon, 29 Oct 2012 01:24:37 -0400
Received: by mail-wg0-f53.google.com with SMTP id dr1so2730161wgb.10
	for <qemu-devel@nongnu.org>; Sun, 28 Oct 2012 22:24:35 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <50898979.8000203@siemens.com>
References: <1350897839-29593-1-git-send-email-pingfank@linux.vnet.ibm.com>
	<1350897839-29593-13-git-send-email-pingfank@linux.vnet.ibm.com>
	<50865DB9.1060106@siemens.com>
	<CAJnKYQn2hEpQoeOufWZuAbz_ROgN1EuYFS-EC6if3ENKnmXC9w@mail.gmail.com>
	<50879606.3070004@web.de> <50890002.90701@redhat.com>
	<508906DC.8090502@web.de>
	<50896707.9060405@redhat.com> <50896B51.3040708@siemens.com>
	<50897089.50304@redhat.com> <50898979.8000203@siemens.com>
From: liu ping fan <qemulist@gmail.com>
Date: Mon, 29 Oct 2012 13:24:15 +0800
Message-ID: <CAJnKYQkCJvxdvQROZkeVmpf_xRWcfcpFabvvd0TNj3kFYg41eg@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [patch v4 12/16] e1000: apply fine lock on e1000
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Liu Ping Fan <pingfank@linux.vnet.ibm.com>, Stefan Hajnoczi <stefanha@gmail.com>, Marcelo Tosatti <mtosatti@redhat.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, Avi Kivity <avi@redhat.com>, Anthony Liguori <anthony@codemonkey.ws>, Paolo Bonzini <pbonzini@redhat.com>

On Fri, Oct 26, 2012 at 2:48 AM, Jan Kiszka <jan.kiszka@siemens.com> wrote:
> On 2012-10-25 19:02, Avi Kivity wrote:
>> On 10/25/2012 06:39 PM, Jan Kiszka wrote:
>>>>
>>>> That doesn't work cross-thread.
>>>>
>>>> vcpu A: write to device X, dma-ing to device Y
>>>> vcpu B: write to device Y, dma-ing to device X
>>>
>>> We will deny DMA-ing from device X on behalf of a VCPU, ie. in dispatch
>>> context, to Y.
>>>
>>> What we do not deny, though, is DMA-ing from an I/O thread that
>>> processes an event for device X.
>>
>> I would really like to avoid depending on the context.  In real hardware=
, there is no such thing.
>
> The point is how we deal with any kind of access to a device that
> requires taking that device's lock while holding another lock, provided
> that scenario can also take place in reverse order at the same time.
> Known scenarios are:
>
>  - vcpu 1 -> access device A -> access device B
>  - vcpu 2 -> access device B -> access device A
>
>  - event 1 -> device A event processing -> access device B
>  - event 2 -> device B event processing -> access device A
>
> and combinations of those pairs.
>
>>
>>> If the invoked callback of device X
>>> holds the device lock across some DMA request to Y, then we risk to run
>>> into the same ABBA issue. Hmm...
>>
>> Yup.
>>
>>>
>>>>
>>>> My suggestion was to drop the locks around DMA, then re-acquire the lo=
ck
>>>> and re-validate data.
>>>
>>> Maybe possible, but hairy depending on the device model.
>>
>> It's unpleasant, yes.
>>
>> Note depending on the device, we may not need to re-validate data, it ma=
y be sufficient to load it into local variables to we know it is consistent=
 at some point.  But all those solutions suffer from requiring device model=
 authors to understand all those issues, rather than just add a simple lock=
 around access to their data structures.
>
> Right. And therefor it is a suboptimal way to start (patching).
>
>>
>>>>>>> I see that we have a all-or-nothing problem here: to address this
>>>>>>> properly, we need to convert the IRQ path to lock-less (or at least
>>>>>>> compatible with holding per-device locks) as well.
>>>>>>
>>>>>> There is a transitional path where writing to a register that can ca=
use
>>>>>> IRQ changes takes both the big lock and the local lock.
>>>>>>
>>>>>> Eventually, though, of course all inner subsystems must be threaded =
for
>>>>>> this work to have value.
>>>>>>
>>>>>
>>>>> But that transitional path must not introduce regressions. Opening a
>>>>> race window between IRQ cause update and event injection is such a
>>>>> thing, just like dropping concurrent requests on the floor.
>>>>
>>>> Can you explain the race?
>>>
>>> Context A                            Context B
>>>
>>> device.lock
>>> ...
>>> device.set interrupt_cause =3D 0
>>> lower_irq =3D true
>>> ...
>>> device.unlock
>>>                                      device.lock
>>>                                      ...
>>>                                      device.interrupt_cause =3D 42
>>>                                      raise_irq =3D true
>>>                                      ...
>>>                                      device.unlock
>>>                                      if (raise_irq)
>>>                                              bql.lock
>>>                                              set_irq(device.irqno)
>>>                                              bql.unlock
>>> if (lower_irq)
>>>      bql.lock
>>>      clear_irq(device.irqno)
>>>      bql.unlock
>>>
>>>
>>> And there it goes, our interrupt event.
>>
>> Obviously you'll need to reacquire the device lock after taking bql and =
revalidate stuff.  But that is not what I am suggesting.  Instead, any path=
 that can lead to an irq update (or timer update etc) will take both the bq=
l and the device lock.  This will leave after the first pass only side effe=
ct free register reads and writes, which is silly if we keep it that way, b=
ut we will follow with a threaded timer and irq subsystem and we'll peel aw=
ay those big locks.
>>
>>   device_mmio_write:
>>     if register is involved in irq or timers or block layer or really an=
ything that matters:
>>       bql.acquire
>>     device.lock.acquire
>>     do stuff
>>     device.lock.release
>>     if that big condition from above was true:
>>       bql.release
>
> Looks simpler than it is as you cannot wrap complete handlers with that
> pattern. An example where it would fail (until we solved the locking
> issues above):
>
> mmio_write:
>   bql.conditional_lock
>   device.lock
>   device.check_state
>   issue_dma
>   device.update_state
>   update_irq, play_with_timers, etc.
>   device.unlock
>   bql.conditional_unlock
>
> If that DMA request hits an unconverted MMIO region or one that takes
> BQL conditionally as above, we will lock up (or bail out as our mutexes
> detect the error). E1000's start_xmit looks like this so far, and that's
> a pretty import service.
>
> Moreover, I prefer having a representative cut-through over enjoying to
> merge a first step that excludes some 80% of the problems. For that
> reason I would be even be inclined to start with addressing the IRQ
> injection topic first (patch-wise), then the other necessary backend
> services for the e1000 or whatever and convert some device(s) last.
>
> IOW: cut out anything from this series that touches e1000 until the
> building blocks for converting it reasonably are finished. Carrying
> experimental, partially broken conversion on top is fine, try to merge
> pieces of that not, IMHO.
>
Agreed.  Just want to take this opportunity to discuss what is the
next and what is nut.

Regards,
pingfan

> Jan
>
> --
> Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
> Corporate Competence Center Embedded Linux