From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57727)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <jsnow@redhat.com>) id 1fuK1g-0006H8-Gk
	for qemu-devel@nongnu.org; Mon, 27 Aug 2018 12:06:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <jsnow@redhat.com>) id 1fuK1c-00035w-B4
	for qemu-devel@nongnu.org; Mon, 27 Aug 2018 12:05:58 -0400
References: <20180629124052.331406-1-dplotnikov@virtuozzo.com>
	<4eca96cf-5520-fb82-2e23-afff3ba35077@virtuozzo.com>
	<c6f924f2-b70d-a460-f96c-7d40b9d06e22@redhat.com>
	<7da20590-58d7-4ef0-92b9-fcb955f28c9b@virtuozzo.com>
	<20180813163048.GO4323@localhost.localdomain>
	<ae864bb4-257e-240c-eeab-3c9c13462c0d@virtuozzo.com>
	<4dcf31f4-b2e0-e3cf-47ac-a9e745a9e2d8@virtuozzo.com>
From: John Snow <jsnow@redhat.com>
Message-ID: <eac4f402-4d67-acd3-8489-0253d93a62f3@redhat.com>
Date: Mon, 27 Aug 2018 12:05:37 -0400
MIME-Version: 1.0
In-Reply-To: <4dcf31f4-b2e0-e3cf-47ac-a9e745a9e2d8@virtuozzo.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [Qemu-block] [PATCH v0 0/2] Postponed actions
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Denis Plotnikov <dplotnikov@virtuozzo.com>, Kevin Wolf <kwolf@redhat.com>
Cc: famz@redhat.com, qemu-block@nongnu.org, qemu-devel@nongnu.org, qemu-stable@nongnu.org, stefanha@redhat.com, mreitz@redhat.com


On 08/27/2018 03:05 AM, Denis Plotnikov wrote:
> PING! PING!
>=20

Sorry, Kevin and Stefan are both on PTO right now, I think. I can't
promise I have the time to look soon, but you at least deserve an answer
for the radio silence the last week.

--js

> On 14.08.2018 10:08, Denis Plotnikov wrote:
>>
>>
>> On 13.08.2018 19:30, Kevin Wolf wrote:
>>> Am 13.08.2018 um 10:32 hat Denis Plotnikov geschrieben:
>>>> Ping ping!
>>>>
>>>> On 16.07.2018 21:59, John Snow wrote:
>>>>>
>>>>>
>>>>> On 07/16/2018 11:01 AM, Denis Plotnikov wrote:
>>>>>> Ping!
>>>>>>
>>>>>
>>>>> I never saw a reply to Stefan's question on July 2nd, did you reply
>>>>> off-list?
>>>>>
>>>>> --js
>>>> Yes, I did. I talked to Stefan why the patch set appeared.
>>>
>>> The rest of us still don't know the answer. I had the same question.
>>>
>>> Kevin
>> Yes, that's my fault. I should have post it earlier.
>>
>> I reviewed the problem once again and come up with the following
>> explanation.
>> Indeed, if the global lock has been taken by the main thread the vCPU
>> threads won't be able to execute mmio ide.
>> But, if the main thread will release the lock then nothing will preven=
t
>> vCPU threads form execution what they want, e.g writing to the block
>> device.
>>
>> In case of running the mirroring it is possible. Let's take a look
>> at the following snippet of mirror_run. This is a part the mirroring
>> completion part.
>>
>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
 bdrv_drained_begin(bs);
>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
 cnt =3D bdrv_get_dirty_count(s->dirty_bitmap);
>> =C2=A0>>>>>>=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (cnt > 0 || mirror_flush=
(s) < 0) {
>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0 bdrv_drained_end(bs);
>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
=C2=A0=C2=A0=C2=A0=C2=A0 continue;
>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=
 }
>>
>> (X) >>>>=C2=A0=C2=A0=C2=A0 assert(QLIST_EMPTY(&bs->tracked_requests));
>>
>> mirror_flush here can yield the current coroutine so nothing more can
>> be executed.
>> We could end up with the situation when the main loop have to revolve
>> to poll for another timer/bh to process. While revolving it releases
>> the global lock. If the global lock is waited for by a vCPU (any
>> other) thread, the waiting thread will get the lock and make what it
>> intends.
>>
>> This is something that I can observe:
>>
>> mirror_flush yields coroutine, the main thread revolves and locks
>> because a vCPU was waiting for the lock. Now the vCPU thread owns the
>> lock and the main thread waits for the lock releasing.
>> The vCPU thread does cmd_write_dma and releases the lock. Then, the ma=
in
>> thread gets the lock and continues to run eventually proceeding with
>> the coroutine yeiled.
>> If the vCPU requests aren't completed by the moment we will assert at
>> (X). If the vCPU requests are completed we won't even notice that we
>> had some writes while in the drained section.
>>
>> Denis
>>>
>>>>>> On 29.06.2018 15:40, Denis Plotnikov wrote:
>>>>>>> There are cases when a request to a block driver state shouldn't
>>>>>>> have
>>>>>>> appeared producing dangerous race conditions.
>>>>>>> This misbehaviour is usually happens with storage devices emulate=
d
>>>>>>> without eventfd for guest to host notifications like IDE.
>>>>>>>
>>>>>>> The issue arises when the context is in the "drained" section
>>>>>>> and doesn't expect the request to come, but request comes from th=
e
>>>>>>> device not using iothread and which context is processed by the m=
ain
>>>>>>> loop.
>>>>>>>
>>>>>>> The main loop apart of the iothread event loop isn't blocked by t=
he
>>>>>>> "drained" section.
>>>>>>> The request coming and processing while in "drained" section can
>>>>>>> spoil
>>>>>>> the
>>>>>>> block driver state consistency.
>>>>>>>
>>>>>>> This behavior can be observed in the following KVM-based case:
>>>>>>>
>>>>>>> 1. Setup a VM with an IDE disk.
>>>>>>> 2. Inside a VM start a disk writing load for the IDE device
>>>>>>> =C2=A0 =C2=A0=C2=A0 e.g: dd if=3D<file> of=3D<file> bs=3DX count=3D=
Y oflag=3Ddirect
>>>>>>> 3. On the host create a mirroring block job for the IDE device
>>>>>>> =C2=A0 =C2=A0=C2=A0 e.g: drive_mirror <your_IDE> <your_path>
>>>>>>> 4. On the host finish the block job
>>>>>>> =C2=A0 =C2=A0=C2=A0 e.g: block_job_complete <your_IDE>
>>>>>>> =C2=A0 =C2=A0 Having done the 4th action, you could get an assert=
:
>>>>>>> assert(QLIST_EMPTY(&bs->tracked_requests)) from mirror_run.
>>>>>>> On my setup, the assert is 1/3 reproducible.
>>>>>>>
>>>>>>> The patch series introduces the mechanism to postpone the request=
s
>>>>>>> until the BDS leaves "drained" section for the devices not using
>>>>>>> iothreads.
>>>>>>> Also, it modifies the asynchronous block backend infrastructure
>>>>>>> to use
>>>>>>> that mechanism to release the assert bug for IDE devices.
>>>>>>>
>>>>>>> Denis Plotnikov (2):
>>>>>>> =C2=A0 =C2=A0=C2=A0 async: add infrastructure for postponed actio=
ns
>>>>>>> =C2=A0 =C2=A0=C2=A0 block: postpone the coroutine executing if th=
e BDS's is drained
>>>>>>>
>>>>>>> =C2=A0 =C2=A0 block/block-backend.c | 58
>>>>>>> ++++++++++++++++++++++++++++++---------
>>>>>>> =C2=A0 =C2=A0 include/block/aio.h=C2=A0=C2=A0 | 63
>>>>>>> +++++++++++++++++++++++++++++++++++++++++++
>>>>>>> =C2=A0 =C2=A0 util/async.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=
=A0=C2=A0=C2=A0 | 33 +++++++++++++++++++++++
>>>>>>> =C2=A0 =C2=A0 3 files changed, 142 insertions(+), 12 deletions(-)
>>>>>>>
>>>>>>
>>>>
>>>> --=C2=A0
>>>> Best,
>>>> Denis
>>
>=20

--=20
=E2=80=94js