From: Maxime Coquelin <maxime.coquelin@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Prerna Saxena <prerna.saxena@nutanix.com>,
	"marcandre.lureau@redhat.com" <marcandre.lureau@redhat.com>,
	Peter Maydell <peter.maydell@linaro.org>,
	Fam Zheng <famz@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] [PULL 3/3] vhost-user: Attempt to fix a race with set_mem_table.
Date: Mon, 5 Sep 2016 15:06:09 +0200
Message-ID: <fcd2ec5a-117e-107a-61ad-a571e9935970@redhat.com>
In-Reply-To: <20160902202753-mutt-send-email-mst@kernel.org>



On 09/02/2016 07:29 PM, Michael S. Tsirkin wrote:
> On Fri, Sep 02, 2016 at 10:57:17AM +0200, Maxime Coquelin wrote:
>>
>>
>> On 09/01/2016 03:46 PM, Michael S. Tsirkin wrote:
>>> On Wed, Aug 31, 2016 at 01:19:47PM +0200, Maxime Coquelin wrote:
>>>>
>>>>
>>>> On 08/14/2016 11:42 AM, Prerna Saxena wrote:
>>>>> On 14/08/16 8:21 am, "Michael S. Tsirkin" <mst@redhat.com> wrote:
>>>>>
>>>>>
>>>>>> On Fri, Aug 12, 2016 at 07:16:34AM +0000, Prerna Saxena wrote:
>>>>>>>
>>>>>>> On 12/08/16 12:08 pm, "Fam Zheng" <famz@redhat.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On Wed, 08/10 18:30, Michael S. Tsirkin wrote:
>>>>>>>>> From: Prerna Saxena <prerna.saxena@nutanix.com>
>>>>>>>>>
>>>>>>>>> The set_mem_table command currently does not seek a reply. Hence, there is
>>>>>>>>> no easy way for a remote application to notify to QEMU when it finished
>>>>>>>>> setting up memory, or if there were errors doing so.
>>>>>>>>>
>>>>>>>>> As an example:
>>>>>>>>> (1) Qemu sends a SET_MEM_TABLE to the backend (eg, a vhost-user net
>>>>>>>>> application). SET_MEM_TABLE does not require a reply according to the spec.
>>>>>>>>> (2) Qemu commits the memory to the guest.
>>>>>>>>> (3) Guest issues an I/O operation over a new memory region which was configured on (1).
>>>>>>>>> (4) The application has not yet remapped the memory, but it sees the I/O request.
>>>>>>>>> (5) The application cannot satisfy the request because it does not know about those GPAs.
>>>>>>>>>
>>>>>>>>> While a guaranteed fix would require a protocol extension (committed separately),
>>>>>>>>> a best-effort workaround for existing applications is to send a GET_FEATURES
>>>>>>>>> message before completing the vhost_user_set_mem_table() call.
>>>>>>>>> Since GET_FEATURES requires a reply, an application that processes vhost-user
>>>>>>>>> messages synchronously would probably have completed the SET_MEM_TABLE before replying.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Prerna Saxena <prerna.saxena@nutanix.com>
>>>>>>>>> Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>>>>
>>>>>>>> Sporadic hangs are seen with test-vhost-user after this patch:
>>>>>>>>
>>>>>>>> https://travis-ci.org/qemu/qemu/builds
>>>>>>>>
>>>>>>>> Reverting seems to fix it for me.
>>>>>>>>
>>>>>>>> Is this a known problem?
>>>>>>>>
>>>>>>>> Fam
>>>>>>>
>>>>>>> Hi Fam,
>>>>>>> Thanks for reporting the sporadic hangs. I had seen ‘make check’ pass on my Centos 6 environment, so missed this.
>>>>>>> I am setting up the docker test env to repro this, but I think I can guess the problem :
>>>>>>>
>>>>>>> In tests/vhost-user-test.c:
>>>>>>>
>>>>>>> static void chr_read(void *opaque, const uint8_t *buf, int size)
>>>>>>> {
>>>>>>> ..[snip]..
>>>>>>>
>>>>>>>     case VHOST_USER_SET_MEM_TABLE:
>>>>>>>        /* received the mem table */
>>>>>>>        memcpy(&s->memory, &msg.payload.memory, sizeof(msg.payload.memory));
>>>>>>>        s->fds_num = qemu_chr_fe_get_msgfds(chr, s->fds, G_N_ELEMENTS(s->fds));
>>>>>>>
>>>>>>>
>>>>>>>        /* signal the test that it can continue */
>>>>>>>        g_cond_signal(&s->data_cond);
>>>>>>>        break;
>>>>>>> ..[snip]..
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> The test seems to be marked complete as soon as mem_table is copied.
>>>>>>> However, this patch 3/3 changes the behaviour of the SET_MEM_TABLE vhost command implementation with qemu. SET_MEM_TABLE now sends out a new message GET_FEATURES, and the call is only completed once it receives features from the remote application. (or the test framework, as is the case here.)
>>>>>>
>>>>>> Hmm but why does it matter that data_cond is woken up?
>>>>>
>>>>> Michael, sorry, I didn’t quite understand that. Could you pls explain ?
>>>>>
>>>>>>
>>>>>>
>>>>>>> While the test itself can be modified (do not signal completion until we’ve sent a follow-up response to GET_FEATURES), I am now wondering if this patch may break existing vhost applications too? If so, reverting it is possibly better.
>>>>>>
>>>>>> What bothers me is that the new feature might cause the same
>>>>>> issue once we enable it in the test.
>>>>>
>>>>> No, it won't. The new feature is a protocol extension, and only works if it has been negotiated. If not negotiated, that part of the code is never executed.
>>>>>
>>>>>>
>>>>>> How about a patch to tests/vhost-user-test.c adding the new
>>>>>> protocol feature? I would be quite interested to see what
>>>>>> is going on with it.
>>>>>
>>>>> Yes that can be done. But you can see that the protocol extension patch will not change the behaviour of the _existing_ test.
>>>>>
>>>>>>
>>>>>>
>>>>>>> What confuses me is why it doesn’t fail all the time, but only about 20% to 30% time as Fam reports.
>>>>>>
>>>>>> And succeeds every time on my systems :(
>>>>>
>>>>> +1 to that :( I have had no luck repro’ing it.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thoughts : Michael, Fam, MarcAndre ?
>>>>
>>>> I have managed to reproduce the hang by adding some debug prints into
>>>> vhost_user_get_features().
>>>>
>>>> Doing this the issue is reproducible quite easily.
>>>> Another way to reproduce it in one shot is to strace (with following
>>>> forks option) vhost-user-test execution.
>>>>
>>>> So, by adding debug prints at vhost_user_get_features() entry and exit,
>>>> we can see we never return from this function when hang happens.
>>>> Strace of Qemu instance shows that its thread keeps retrying to receive
>>>> GET_FEATURE reply:
>>>>
>>>> write(1, "vhost_user_get_features IN: \n", 29) = 29
>>>> sendmsg(11, {msg_name=NULL, msg_namelen=0,
>>>>         msg_iov=[{iov_base="\1\0\0\0\1\0\0\0\0\0\0\0", iov_len=12}],
>>>>         msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 12
>>>> recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
>>>> nanosleep({0, 100000}, 0x7fff29f8dd70)  = 0
>>>> ...
>>>> recvmsg(11, {msg_namelen=0}, MSG_CMSG_CLOEXEC) = -1 EAGAIN
>>>> nanosleep({0, 100000}, 0x7fff29f8dd70)  = 0
>>>>
>>>> The reason is that vhost-user-test never replies to Qemu,
>>>> because its thread handling the GET_FEATURES command is waiting for
>>>> the s->data_mutex lock.
>>>> This lock is held by the other vhost-user-test thread, executing
>>>> read_guest_mem().
>>>>
>>>> The lock is never released because the thread is blocked in read
>>>> syscall, when read_guest_mem() is doing the readl().
>>>>
>>>> This is because on Qemu side, the thread polling the qtest socket is
>>>> waiting for the qemu_global_mutex (in os_host_main_loop_wait()), but
>>>> the mutex is held by the thread trying to get the GET_FEATURE reply
>>>> (the TCG one).
>>>>
>>>> So here is the deadlock.
>>>>
>>>> That said, I don't see a clean way to solve this.
>>>> Any thoughts?
>>>>
>>>> Regards,
>>>> Maxime
>>>
>>> My thought is that we really need to do what I said:
>>> avoid doing GET_FEATURES (and setting reply_ack)
>>> on the first set_mem, and I quote:
>>>
>>> 	OK this all looks very reasonable (and I do like patch 1 too)
>>> 	but there's one source of waste here: we do not need to
>>> 	synchronize when we set up device the first time
>>> 	when hdev->memory_changed is false.
>>>
>>> 	I think we should test that and skip synch in both patches
>>> 	unless  hdev->memory_changed is set.
>>>
>>> with that change test will start passing.
>>
>> Actually, it looks like memory_changed is true even at first
>> SET_MEM_TABLE request.
>>
>> Thanks,
>> Maxime
>
> Let's add another flag then? What we care about is that it's not
> the first time we specify translations for a given address.

I added a dedicated flag that skips the sync in two cases:
  1. On the first set_mem_table call
  2. If only new regions are added
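
To make the two cases concrete, here is a minimal sketch of the decision.
It is not the actual patch: struct mem_region, same_region() and
mem_table_needs_sync() are hypothetical names, and it assumes that "only new
regions are added" means every previously sent region is still present and
unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region descriptor for illustration; not QEMU's real struct. */
struct mem_region {
    uint64_t guest_phys_addr;
    uint64_t size;
    uint64_t userspace_addr;
};

static bool same_region(const struct mem_region *a, const struct mem_region *b)
{
    return a->guest_phys_addr == b->guest_phys_addr &&
           a->size == b->size &&
           a->userspace_addr == b->userspace_addr;
}

/*
 * Return true if the backend has to confirm the new table before Qemu
 * continues. prev/n_prev is the table sent last time (n_prev == 0 means
 * nothing was sent yet); next/n_next is the table about to be sent.
 */
static bool mem_table_needs_sync(const struct mem_region *prev, size_t n_prev,
                                 const struct mem_region *next, size_t n_next)
{
    if (n_prev == 0) {
        return false;               /* case 1: first set_mem_table call */
    }
    for (size_t i = 0; i < n_prev; i++) {
        bool kept = false;
        for (size_t j = 0; j < n_next; j++) {
            if (same_region(&prev[i], &next[j])) {
                kept = true;
                break;
            }
        }
        if (!kept) {
            return true;            /* a known region changed or went away */
        }
    }
    return false;                   /* case 2: only new regions were added */
}
```

With this reading, a sync is only requested when a translation the backend
may already be using is modified or removed.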

It solves the hang seen with the vhost-user-test app, and I think the patch
makes sense.

But IMHO the problem goes deeper than that, and Qemu could still hang under
some conditions when running in TCG mode.
Imagine Qemu sends an arbitrary GET_FEATURES request after the
set_mem_table, and vhost-user-test's read_guest_mem() is executed just
before this second call (let's say it was not scheduled for some time).

In this case, the read_guest_mem() thread owns the data_mutex and starts
doing readl() calls. On the Qemu side, as we are sending an update of the
mem table, we own the qemu_global_mutex, and the deadlock happens again:
  - Vhost-user-test
    * read_guest_mem() thread: Blocked in readl(), waiting for Qemu to
      handle it (TCG mode only), owning the data_mutex lock.
    * Command handler thread: Received the GET_FEATURES event, but waits
      for data_mutex ownership to handle it.

  - Qemu
    * FDs polling thread: Waits for qemu_global_mutex ownership, to be
      able to handle the readl() request from vhost-user-test.
    * TCG thread: Owns the qemu_global_mutex, and polls to receive the
      GET_FEATURES reply.
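
The four waits above form a cycle, which can be checked mechanically. The
following is only an illustrative model, not code from Qemu or
vhost-user-test: the enum names, the deadlock_scenario table and
has_deadlock() are all made up for this sketch, and it assumes each thread
waits on at most one other thread.

```c
#include <stdbool.h>

/* One node per thread in the scenario; NOBODY means "not waiting". */
enum {
    READ_GUEST_MEM,   /* vhost-user-test: holds data_mutex, blocked in readl() */
    CMD_HANDLER,      /* vhost-user-test: wants data_mutex to answer Qemu */
    FD_POLL,          /* Qemu: wants qemu_global_mutex to serve readl() */
    TCG,              /* Qemu: holds qemu_global_mutex, polls for the reply */
    N_THREADS,
    NOBODY = -1
};

/* waits_for[t] is the thread that must make progress before t can. */
static const int deadlock_scenario[N_THREADS] = {
    [READ_GUEST_MEM] = FD_POLL,        /* readl() is served by Qemu */
    [CMD_HANDLER]    = READ_GUEST_MEM, /* data_mutex owner */
    [FD_POLL]        = TCG,            /* qemu_global_mutex owner */
    [TCG]            = CMD_HANDLER,    /* sender of the GET_FEATURES reply */
};

/* Follow the wait chain from each thread; coming back to it is a deadlock. */
static bool has_deadlock(const int waits_for[], int n)
{
    for (int start = 0; start < n; start++) {
        int cur = start;
        for (int step = 0; step < n; step++) {
            cur = waits_for[cur];
            if (cur == NOBODY) {
                break;              /* chain ends, no cycle through start */
            }
            if (cur == start) {
                return true;        /* start transitively waits on itself */
            }
        }
    }
    return false;
}
```

Calling has_deadlock(deadlock_scenario, N_THREADS) reports the cycle;
breaking any single edge (for example, the TCG thread not holding the
qemu_global_mutex while it polls) makes the chain terminate.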

Maybe the GET_FEATURES case is not realistic, but what about
GET_VRING_BASE, which gets called by vhost_net_stop()?

Thanks in advance,
Maxime

