Re: [Qemu-devel] [PULL 3/3] vhost-user: Attempt to fix a race with set_mem_table.

From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Prerna Saxena <prerna.saxena@nutanix.com>,
	"marcandre.lureau@redhat.com" <marcandre.lureau@redhat.com>,
	Peter Maydell <peter.maydell@linaro.org>,
	Fam Zheng <famz@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] [PULL 3/3] vhost-user: Attempt to fix a race with set_mem_table.
Date: Mon, 5 Sep 2016 15:06:09 +0200
Message-ID: <fcd2ec5a-117e-107a-61ad-a571e9935970@redhat.com>
In-Reply-To: <20160902202753-mutt-send-email-mst@kernel.org>

On 09/02/2016 07:29 PM, Michael S. Tsirkin wrote:
> On Fri, Sep 02, 2016 at 10:57:17AM +0200, Maxime Coquelin wrote:
>>
>>
>> On 09/01/2016 03:46 PM, Michael S. Tsirkin wrote:
>>> On Wed, Aug 31, 2016 at 01:19:47PM +0200, Maxime Coquelin wrote:
>>>>
>>>>
>>>> On 08/14/2016 11:42 AM, Prerna Saxena wrote:
>>>>> On 14/08/16 8:21 am, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>
>>>>>
>>>>>> On Fri, Aug 12, 2016 at 07:16:34AM +0000, Prerna Saxena wrote:
>>>>>>>
>>>>>>> On 12/08/16 12:08 pm, "Fam Zheng" <famz@redhat.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Wed, 08/10 18:30, Michael S. Tsirkin wrote:
>>>>>>>>> From: Prerna Saxena <prerna.saxena@nutanix.com>
>>>>>>>>>
>>>>>>>>> The set_mem_table command currently does not seek a reply. Hence, there is
>>>>>>>>> no easy way for a remote application to notify to QEMU when it finished
>>>>>>>>> setting up memory, or if there were errors doing so.
>>>>>>>>>
>>>>>>>>> As an example:
>>>>>>>>> (1) Qemu sends a SET_MEM_TABLE to the backend (eg, a vhost-user net
>>>>>>>>> application). SET_MEM_TABLE does not require a reply according to the spec.
>>>>>>>>> (2) Qemu commits the memory to the guest.
>>>>>>>>> (3) Guest issues an I/O operation over a new memory region which was configured on (1).
>>>>>>>>> (4) The application has not yet remapped the memory, but it sees the I/O request.
>>>>>>>>> (5) The application cannot satisfy the request because it does not know about those GPAs.
>>>>>>>>>
>>>>>>>>> While a guaranteed fix would require a protocol extension (committed separately),
>>>>>>>>> a best-effort workaround for existing applications is to send a GET_FEATURES
>>>>>>>>> message before completing the vhost_user_set_mem_table() call.
>>>>>>>>> Since GET_FEATURES requires a reply, an application that processes vhost-user
>>>>>>>>> messages synchronously would probably have completed the SET_MEM_TABLE before replying.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Prerna Saxena <prerna.saxena@nutanix.com>
>>>>>>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>
>>>>>>>> Sporadic hangs are seen with test-vhost-user after this patch:
>>>>>>>>
>>>>>>>> https://travis-ci.org/qemu/qemu/builds
>>>>>>>>
>>>>>>>> Reverting seems to fix it for me.
>>>>>>>>
>>>>>>>> Is this a known problem?
>>>>>>>>
>>>>>>>> Fam
>>>>>>>
>>>>>>> Hi Fam,
>>>>>>> Thanks for reporting the sporadic hangs. I had seen ‘make check’ pass on my Centos 6 environment, so missed this.
>>>>>>> I am setting up the docker test env to repro this, but I think I can guess the problem :
>>>>>>>
>>>>>>> In tests/vhost-user-test.c:
>>>>>>>
>>>>>>> static void chr_read(void *opaque, const uint8_t *buf, int size)
>>>>>>> {
>>>>>>> ..[snip]..
>>>>>>>
>>>>>>>     case VHOST_USER_SET_MEM_TABLE:
>>>>>>>        /* received the mem table */
>>>>>>>        memcpy(&s->memory, &msg.payload.memory, sizeof(msg.payload.memory));
>>>>>>>        s->fds_num = qemu_chr_fe_get_msgfds(chr, s->fds, G_N_ELEMENTS(s->fds));
>>>>>>>
>>>>>>>
>>>>>>>        /* signal the test that it can continue */
>>>>>>>        g_cond_signal(&s->data_cond);
>>>>>>>        break;
>>>>>>> ..[snip]..
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> The test seems to be marked complete as soon as mem_table is copied.
>>>>>>> However, this patch 3/3 changes the behaviour of the SET_MEM_TABLE vhost command implementation with qemu. SET_MEM_TABLE now sends out a new message GET_FEATURES, and the call is only completed once it receives features from the remote application. (or the test framework, as is the case here.)
>>>>>>
>>>>>> Hmm but why does it matter that data_cond is woken up?
>>>>>
>>>>> Michael, sorry, I didn’t quite understand that. Could you pls explain ?
>>>>>
>>>>>>
>>>>>>
>>>>>>> While the test itself can be modified (Do not signal completion until we’ve sent a follow-up response to GET_FEATURES), I am now wondering if this patch may break existing vhost applications too ? If so, reverting it possibly better.
>>>>>>
>>>>>> What bothers me is that the new feature might cause the same
>>>>>> issue once we enable it in the test.
>>>>>
>>>>> No it wont. The new feature is a protocol extension, and only works if it has been negotiated with. If not negotiated, that part of code is never executed.
>>>>>
>>>>>>
>>>>>> How about a patch to tests/vhost-user-test.c adding the new
>>>>>> protocol feature? I would be quite interested to see what
>>>>>> is going on with it.
>>>>>
>>>>> Yes that can be done. But you can see that the protocol extension patch will not change the behaviour of the _existing_ test.
>>>>>
>>>>>>
>>>>>>
>>>>>>> What confuses me is why it doesn’t fail all the time, but only about 20% to 30% time as Fam reports.
>>>>>>
>>>>>> And succeeds every time on my systems :(
>>>>>
>>>>> +1 to that :( I have had no luck repro’ing it.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thoughts : Michael, Fam, MarcAndre ?
>>>>
>>>> I have managed to reproduce the hang by adding some debug prints into
>>>> vhost_user_get_features().
>>>>
>>>> Doing this the issue is reproducible quite easily.
>>>> Another way to reproduce it in one shot is to strace (with following
>>>> forks option) vhost-user-test execution.
>>>>
>>>> So, by adding debug prints at vhost_user_get_features() entry and exit,
>>>> we can see we never return from this function when hang happens.
>>>> Strace of Qemu instance shows that its thread keeps retrying to receive
>>>> GET_FEATURE reply:
>>>>
>>>> write(1, "vhost_user_get_features IN: \n", 29) = 29
>>>> sendmsg(11, {msg_name=NULL, msg_namelen=0,
>>>>         msg_iov=[{iov_base="\1\0\0\0\1\0\0\0\0\0\0\0", iov_len=12}],
>>>>         msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 12
>>>> recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
>>>> nanosleep({0, 100000}, 0x7fff29f8dd70)  = 0
>>>> ...
>>>> recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
>>>> nanosleep({0, 100000}, 0x7fff29f8dd70)  = 0
>>>>
>>>> The reason is that vhost-user-test never replies to Qemu,
>>>> because its thread handling the GET_FEATURES command is waiting for
>>>> the s->data_mutex lock.
>>>> This lock is held by the other vhost-user-test thread, executing
>>>> read_guest_mem().
>>>>
>>>> The lock is never released because the thread is blocked in read
>>>> syscall, when read_guest_mem() is doing the readl().
>>>>
>>>> This is because on Qemu side, the thread polling the qtest socket is
>>>> waiting for the qemu_global_mutex (in os_host_main_loop_wait()), but
>>>> the mutex is held by the thread trying to get the GET_FEATURE reply
>>>> (the TCG one).
>>>>
>>>> So here is the deadlock.
>>>>
>>>> That said, I don't see a clean way to solve this.
>>>> Any thoughts?
>>>>
>>>> Regards,
>>>> Maxime
>>>
>>> My thought is that we really need to do what I said:
>>> avoid doing GET_FEATURES (and setting reply_ack)
>>> on the first set_mem, and I quote:
>>>
>>> 	OK this all looks very reasonable (and I do like patch 1 too)
>>> 	but there's one source of waste here: we do not need to
>>> 	synchronize when we set up device the first time
>>> 	when hdev->memory_changed is false.
>>>
>>> 	I think we should test that and skip synch in both patches
>>> 	unless  hdev->memory_changed is set.
>>>
>>> with that change test will start passing.
>>
>> Actually, it looks like memory_changed is true even at first
>> SET_MEM_TABLE request.
>>
>> Thanks,
>> Maxime
>
> Let's add another flag then? What we care about is that it's not
> the first time set specify translations for a given address.

I added a dedicated flag that skips the sync in two cases:
  1. The first set_mem_table call
  2. Only new regions are added
It solves the hang seen with the vhost-user-test app, and I think the
patch makes sense.
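
To make the two conditions concrete, here is a minimal sketch of the
decision. It is not the actual patch: struct mem_region,
only_regions_added() and must_sync_after_set_mem_table() are hypothetical
names, and treating "only new regions are added" as "every old region is
still present and unchanged" is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one memory table entry. */
struct mem_region {
    uint64_t guest_phys_addr;
    uint64_t size;
    uint64_t userspace_addr;
};

static bool same_region(const struct mem_region *a, const struct mem_region *b)
{
    return a->guest_phys_addr == b->guest_phys_addr &&
           a->size == b->size &&
           a->userspace_addr == b->userspace_addr;
}

/* True when every region of the old table is still present, unchanged,
 * in the new table (assumed meaning of "only new regions are added"). */
static bool only_regions_added(const struct mem_region *old_tab, size_t n_old,
                               const struct mem_region *new_tab, size_t n_new)
{
    assert(old_tab != NULL || n_old == 0);
    assert(new_tab != NULL || n_new == 0);

    for (size_t i = 0; i < n_old; i++) {
        bool found = false;
        for (size_t j = 0; j < n_new && !found; j++) {
            found = same_region(&old_tab[i], &new_tab[j]);
        }
        if (!found) {
            return false; /* an existing translation was removed or changed */
        }
    }
    return true;
}

/* Skip the sync on the first call, or when existing mappings are untouched. */
static bool must_sync_after_set_mem_table(bool first_call,
                                          const struct mem_region *old_tab,
                                          size_t n_old,
                                          const struct mem_region *new_tab,
                                          size_t n_new)
{
    if (first_call) {
        return false;
    }
    return !only_regions_added(old_tab, n_old, new_tab, n_new);
}
```

With this shape, the sync is only requested when a translation the backend
may already be using goes away or changes.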

But IMHO the problem is deeper than that, and Qemu could still hang under
some conditions when running in TCG mode.
Imagine Qemu sends a random "GET_FEATURES" request after the
set_mem_table, and vhost-user-test's read_guest_mem() is executed just
before this second call (let's say it was not scheduled for some time).

In this case, the read_guest_mem() thread owns the data_mutex and starts
doing readl() calls. On the Qemu side, as we are sending an update of the
mem table, we own the qemu_global_mutex, and the deadlock happens again:
  - Vhost-user-test
    * read_guest_mem() thread: Blocked in readl(), waiting for Qemu to
      handle it (TCG mode only), while owning the data_mutex lock.
    * Command handler thread: Received the GET_FEATURES request, but
      waits for data_mutex ownership to handle it.

  - Qemu
    * FDs polling thread: Waits for qemu_global_mutex ownership, to be
      able to handle the readl() request from vhost-user-test.
    * TCG thread: Owns the qemu_global_mutex, and polls to receive the
      GET_FEATURES reply.
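
The four threads above form a closed wait-for chain. A small sketch that
models it, purely as an illustration: the enum names and has_wait_cycle()
are hypothetical and exist in neither Qemu nor vhost-user-test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the four threads described above. */
enum actor {
    TEST_READER,      /* read_guest_mem(): holds data_mutex, blocked in readl() */
    TEST_CMD_HANDLER, /* has the GET_FEATURES request, wants data_mutex */
    QEMU_FD_POLLER,   /* would serve readl(), wants qemu_global_mutex */
    QEMU_TCG,         /* holds qemu_global_mutex, polls for the reply */
    N_ACTORS
};

/* waits_on[a] is the actor that must make progress before a can,
 * or -1 if a is not blocked. Returns true if following the chain
 * from start leads back to start. */
static bool has_wait_cycle(const int waits_on[], size_t n, int start)
{
    assert(start >= 0 && (size_t)start < n);

    int cur = start;
    for (size_t steps = 0; steps < n; steps++) {
        cur = waits_on[cur];
        if (cur < 0) {
            return false; /* chain ends at a runnable thread */
        }
        if (cur == start) {
            return true;  /* back where we started: deadlock */
        }
    }
    return false;
}
```

Filling waits_on[] with the edges listed above (reader -> FD poller ->
TCG -> command handler -> reader) closes the loop. Removing any single
edge, for example the TCG thread not blocking on the reply while it holds
the global mutex, breaks it.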

Maybe the GET_FEATURES case is not realistic, but what about
GET_VRING_BASE, which gets called by vhost_net_stop()?

Thanks in advance,
Maxime