Date: Tue, 6 Sep 2016 05:22:03 +0300
From: "Michael S. Tsirkin"
Message-ID: <20160906051844-mutt-send-email-mst@kernel.org>
References: <1470842980-32481-4-git-send-email-mst@redhat.com>
 <20160812063828.GG2759@al.usersys.redhat.com>
 <20160814054808-mutt-send-email-mst@kernel.org>
 <09A27EBF-F644-45E7-949D-A5D55AE3BCB5@nutanix.com>
 <49c5ce42-8fa8-964d-920f-2f4126d6a229@redhat.com>
 <20160901164131-mutt-send-email-mst@kernel.org>
 <49f371cd-21e8-a567-0d1a-d71dca586b6d@redhat.com>
 <20160902202753-mutt-send-email-mst@kernel.org>
Subject: Re: [Qemu-devel] [PULL 3/3] vhost-user: Attempt to fix a race with set_mem_table.
To: Maxime Coquelin
Cc: Prerna Saxena, "marcandre.lureau@redhat.com", Peter Maydell,
 Fam Zheng, "qemu-devel@nongnu.org"

On Mon, Sep 05, 2016 at 03:06:09PM +0200, Maxime Coquelin wrote:
>
> On 09/02/2016 07:29 PM, Michael S. Tsirkin wrote:
> > On Fri, Sep 02, 2016 at 10:57:17AM +0200, Maxime Coquelin wrote:
> > >
> > > On 09/01/2016 03:46 PM, Michael S.
Tsirkin wrote:
> > > > On Wed, Aug 31, 2016 at 01:19:47PM +0200, Maxime Coquelin wrote:
> > > > >
> > > > > On 08/14/2016 11:42 AM, Prerna Saxena wrote:
> > > > > > On 14/08/16 8:21 am, "Michael S. Tsirkin" wrote:
> > > > > >
> > > > > > > On Fri, Aug 12, 2016 at 07:16:34AM +0000, Prerna Saxena wrote:
> > > > > > > >
> > > > > > > > On 12/08/16 12:08 pm, "Fam Zheng" wrote:
> > > > > > > >
> > > > > > > > > On Wed, 08/10 18:30, Michael S. Tsirkin wrote:
> > > > > > > > > > From: Prerna Saxena
> > > > > > > > > >
> > > > > > > > > > The set_mem_table command currently does not seek a reply. Hence, there is
> > > > > > > > > > no easy way for a remote application to notify QEMU when it has finished
> > > > > > > > > > setting up memory, or if there were errors doing so.
> > > > > > > > > >
> > > > > > > > > > As an example:
> > > > > > > > > > (1) QEMU sends a SET_MEM_TABLE to the backend (e.g., a vhost-user net
> > > > > > > > > >     application). SET_MEM_TABLE does not require a reply according to the spec.
> > > > > > > > > > (2) QEMU commits the memory to the guest.
> > > > > > > > > > (3) The guest issues an I/O operation over a new memory region which was
> > > > > > > > > >     configured in (1).
> > > > > > > > > > (4) The application has not yet remapped the memory, but it sees the I/O request.
> > > > > > > > > > (5) The application cannot satisfy the request because it does not know about
> > > > > > > > > >     those GPAs.
> > > > > > > > > >
> > > > > > > > > > While a guaranteed fix would require a protocol extension (committed separately),
> > > > > > > > > > a best-effort workaround for existing applications is to send a GET_FEATURES
> > > > > > > > > > message before completing the vhost_user_set_mem_table() call.
> > > > > > > > > > Since GET_FEATURES requires a reply, an application that processes vhost-user
> > > > > > > > > > messages synchronously would probably have completed the SET_MEM_TABLE
> > > > > > > > > > before replying.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Prerna Saxena
> > > > > > > > > > Reviewed-by: Michael S. Tsirkin
> > > > > > > > > > Signed-off-by: Michael S. Tsirkin
> > > > > > > > >
> > > > > > > > > Sporadic hangs are seen with test-vhost-user after this patch:
> > > > > > > > >
> > > > > > > > > https://travis-ci.org/qemu/qemu/builds
> > > > > > > > >
> > > > > > > > > Reverting seems to fix it for me.
> > > > > > > > >
> > > > > > > > > Is this a known problem?
> > > > > > > > >
> > > > > > > > > Fam
> > > > > > > >
> > > > > > > > Hi Fam,
> > > > > > > > Thanks for reporting the sporadic hangs. I had seen ‘make check’ pass on my
> > > > > > > > CentOS 6 environment, so I missed this.
> > > > > > > > I am setting up the docker test env to reproduce this, but I think I can
> > > > > > > > guess the problem:
> > > > > > > >
> > > > > > > > In tests/vhost-user-test.c:
> > > > > > > >
> > > > > > > > static void chr_read(void *opaque, const uint8_t *buf, int size)
> > > > > > > > {
> > > > > > > >     ..[snip]..
> > > > > > > >
> > > > > > > >     case VHOST_USER_SET_MEM_TABLE:
> > > > > > > >         /* received the mem table */
> > > > > > > >         memcpy(&s->memory, &msg.payload.memory,
> > > > > > > >                sizeof(msg.payload.memory));
> > > > > > > >         s->fds_num = qemu_chr_fe_get_msgfds(chr, s->fds,
> > > > > > > >                                             G_N_ELEMENTS(s->fds));
> > > > > > > >
> > > > > > > >         /* signal the test that it can continue */
> > > > > > > >         g_cond_signal(&s->data_cond);
> > > > > > > >         break;
> > > > > > > >     ..[snip]..
> > > > > > > > }
> > > > > > > >
> > > > > > > > The test seems to be marked complete as soon as the mem table is copied.
> > > > > > > > However, this patch 3/3 changes the behaviour of the SET_MEM_TABLE vhost
> > > > > > > > command implementation within QEMU. SET_MEM_TABLE now sends out a new
> > > > > > > > message, GET_FEATURES, and the call is only completed once it receives the
> > > > > > > > features from the remote application (or the test framework, as is the
> > > > > > > > case here).
> > > > > > >
> > > > > > > Hmm but why does it matter that data_cond is woken up?
> > > > > >
> > > > > > Michael, sorry, I didn’t quite understand that. Could you please explain?
> > > > > >
> > > > > > >
> > > > > > > > While the test itself can be modified (do not signal completion until we’ve
> > > > > > > > sent a follow-up response to GET_FEATURES), I am now wondering if this
> > > > > > > > patch may break existing vhost applications too? If so, reverting it is
> > > > > > > > possibly better.
> > > > > > >
> > > > > > > What bothers me is that the new feature might cause the same
> > > > > > > issue once we enable it in the test.
> > > > > >
> > > > > > No it won’t. The new feature is a protocol extension, and only works if it
> > > > > > has been negotiated. If not negotiated, that part of the code is never
> > > > > > executed.
> > > > > >
> > > > > > >
> > > > > > > How about a patch to tests/vhost-user-test.c adding the new
> > > > > > > protocol feature? I would be quite interested to see what
> > > > > > > is going on with it.
> > > > > >
> > > > > > Yes, that can be done. But you can see that the protocol extension patch
> > > > > > will not change the behaviour of the _existing_ test.
> > > > > >
> > > > > > >
> > > > > > > > What confuses me is why it doesn’t fail all the time, but only about 20%
> > > > > > > > to 30% of the time, as Fam reports.
> > > > > > >
> > > > > > > And succeeds every time on my systems :(
> > > > > >
> > > > > > +1 to that :( I have had no luck reproducing it.
> > > > > >
> > > > > > >
> > > > > > > > Thoughts: Michael, Fam, Marc-André?
> > > > >
> > > > > I have managed to reproduce the hang by adding some debug prints into
> > > > > vhost_user_get_features().
> > > > >
> > > > > Doing this, the issue is reproducible quite easily.
> > > > > Another way to reproduce it in one shot is to strace (with the
> > > > > follow-forks option) the vhost-user-test execution.
> > > > >
> > > > > So, by adding debug prints at vhost_user_get_features() entry and exit,
> > > > > we can see we never return from this function when the hang happens.
> > > > > Strace of the QEMU instance shows that its thread keeps retrying to
> > > > > receive the GET_FEATURES reply:
> > > > >
> > > > > write(1, "vhost_user_get_features IN: \n", 29) = 29
> > > > > sendmsg(11, {msg_name=NULL, msg_namelen=0,
> > > > >         msg_iov=[{iov_base="\1\0\0\0\1\0\0\0\0\0\0\0", iov_len=12}],
> > > > >         msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 12
> > > > > recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
> > > > > nanosleep({0, 100000}, 0x7fff29f8dd70) = 0
> > > > > ...
> > > > > recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
> > > > > nanosleep({0, 100000}, 0x7fff29f8dd70) = 0
> > > > >
> > > > > The reason is that vhost-user-test never replies to QEMU,
> > > > > because its thread handling the GET_FEATURES command is waiting for
> > > > > the s->data_mutex lock.
> > > > > This lock is held by the other vhost-user-test thread, executing
> > > > > read_guest_mem().
> > > > >
> > > > > The lock is never released because that thread is blocked in the read
> > > > > syscall, while read_guest_mem() is doing the readl().
> > > > >
> > > > > This is because on the QEMU side, the thread polling the qtest socket is
> > > > > waiting for the qemu_global_mutex (in os_host_main_loop_wait()), but
> > > > > the mutex is held by the thread trying to get the GET_FEATURES reply
> > > > > (the TCG one).
> > > > >
> > > > > So here is the deadlock.
> > > > >
> > > > > That said, I don't see a clean way to solve this.
> > > > > Any thoughts?
> > > > >
> > > > > Regards,
> > > > > Maxime
> > > >
> > > > My thought is that we really need to do what I said:
> > > > avoid doing GET_FEATURES (and setting reply_ack)
> > > > on the first set_mem, and I quote:
> > > >
> > > >     OK this all looks very reasonable (and I do like patch 1 too),
> > > >     but there's one source of waste here: we do not need to
> > > >     synchronize when we set up the device the first time,
> > > >     when hdev->memory_changed is false.
> > > >
> > > >     I think we should test that and skip the sync in both patches
> > > >     unless hdev->memory_changed is set.
> > > >
> > > > With that change the test will start passing.
> > >
> > > Actually, it looks like memory_changed is true even at the first
> > > SET_MEM_TABLE request.
> > >
> > > Thanks,
> > > Maxime
> >
> > Let's add another flag then? What we care about is that it's not
> > the first time we specify translations for a given address.
>
> I added a dedicated flag that skips the sync on two conditions:
> 1. First set_mem_table call
> 2. If only new regions are added
>
> It solves the hang seen with the vhost-user-test app, and I think the patch
> makes sense.
>
> But IMHO the problem is deeper than that, and could under some
> conditions still hang when running in TCG mode.
> Imagine QEMU sends a random GET_FEATURES request after the
> set_mem_table, and vhost-user-test's read_guest_mem() is executed just
> before this second call (let's say it was not scheduled for some time).
>
> In this case, the read_guest_mem() thread owns the data_mutex and starts
> doing readl() calls. On the QEMU side, as we are sending an update of the
> mem table, we own the qemu_global_mutex, and the deadlock happens again:
> - vhost-user-test
>   * read_guest_mem() thread: blocked in readl(), waiting for QEMU to
>     handle it (TCG mode only), owning the data_mutex lock.
>   * Command handler thread: received a GET_FEATURES event, but waits for
>     data_mutex ownership to handle it.
>
> - QEMU
>   * FDs polling thread: waits for qemu_global_mutex ownership, to be
>     able to handle the readl() request from vhost-user-test.
>   * TCG thread: owns the qemu_global_mutex, and polls to receive the
>     GET_FEATURES reply.
>
> Maybe the GET_FEATURES case is not realistic, but what about
> GET_VRING_BASE, which gets called by vhost_net_stop()?
>
> Thanks in advance,
> Maxime

I think most applications expect to be able to handle GET_VRING_BASE
at any time. If an application isn't ready to handle GET_VRING_BASE at
any time, we won't be able to stop the device.

This is not the case for this test, but that's because it's just a
unit test, so it's incomplete.

-- 
MST