* [Qemu-devel] [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Vadim Rozenfeld @ 2010-01-11 7:40 UTC
To: qemu-devel; +Cc: Dor Laor, Avi Kivity
The following patch allows us to improve Windows virtio
block driver performance on small requests.
Additionally, it reduces cpu usage on write IOs.
repository: /home/vadimr/work/win7/qemu
branch: master
commit 68290c4e9c96f345d544ca5d2b89f27a1e67e27a
Author: Vadim Rozenfeld <vadimr@localhost.localdomain>
Date: Mon Jan 11 09:00:21 2010 +0200
[RFC][PATCH] small IOs performance for windows guests, running on top of
virtio block device
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index a2f0639..0e3a8d5 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -28,6 +28,7 @@ typedef struct VirtIOBlock
char serial_str[BLOCK_SERIAL_STRLEN + 1];
QEMUBH *bh;
size_t config_size;
+ unsigned int pending;
} VirtIOBlock;
static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -87,6 +88,8 @@ typedef struct VirtIOBlockReq
struct VirtIOBlockReq *next;
} VirtIOBlockReq;
+static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue
*vq);
+
static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
{
VirtIOBlock *s = req->dev;
@@ -95,6 +98,11 @@ static void virtio_blk_req_complete(VirtIOBlockReq
*req, int status)
virtqueue_push(s->vq, &req->elem, req->qiov.size +
sizeof(*req->in));
virtio_notify(&s->vdev, s->vq);
+ if(--s->pending == 0) {
+ virtio_queue_set_notification(s->vq, 1);
+ virtio_blk_handle_output(&s->vdev, s->vq);
+ }
+
qemu_free(req);
}
@@ -340,6 +348,9 @@ static void virtio_blk_handle_output(VirtIODevice
*vdev, VirtQueue *vq)
exit(1);
}
+ if(++s->pending == 1)
+ virtio_queue_set_notification(s->vq, 0);
+
req->out = (void *)req->elem.out_sg[0].iov_base;
req->in = (void *)req->elem.in_sg[req->elem.in_num -
1].iov_base;
* [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 8:30 UTC
To: Vadim Rozenfeld; +Cc: Dor Laor, qemu-devel
On 01/11/2010 09:40 AM, Vadim Rozenfeld wrote:
> The following patch allows us to improve Windows virtio
> block driver performance on small requests.
> Additionally, it reduces cpu usage on write IOs.
>
>
Note, this is not an improvement for Windows specifically.
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index a2f0639..0e3a8d5 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -28,6 +28,7 @@ typedef struct VirtIOBlock
> char serial_str[BLOCK_SERIAL_STRLEN + 1];
> QEMUBH *bh;
> size_t config_size;
> + unsigned int pending;
> } VirtIOBlock;
>
> static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
> @@ -87,6 +88,8 @@ typedef struct VirtIOBlockReq
> struct VirtIOBlockReq *next;
> } VirtIOBlockReq;
>
> +static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue
> *vq);
> +
> static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
> {
> VirtIOBlock *s = req->dev;
> @@ -95,6 +98,11 @@ static void virtio_blk_req_complete(VirtIOBlockReq
> *req, int status)
> virtqueue_push(s->vq, &req->elem, req->qiov.size +
> sizeof(*req->in));
> virtio_notify(&s->vdev, s->vq);
>
> + if(--s->pending == 0) {
> + virtio_queue_set_notification(s->vq, 1);
> + virtio_blk_handle_output(&s->vdev, s->vq);
> + }
> +
>
Coding style: space after if. See the CODING_STYLE file.
> @@ -340,6 +348,9 @@ static void virtio_blk_handle_output(VirtIODevice
> *vdev, VirtQueue *vq)
> exit(1);
> }
>
> + if(++s->pending == 1)
> + virtio_queue_set_notification(s->vq, 0);
> +
> req->out = (void *)req->elem.out_sg[0].iov_base;
> req->in = (void *)req->elem.in_sg[req->elem.in_num -
> 1].iov_base;
>
>
Coding style: space after if, braces after if.
Your patch is word wrapped, please send it correctly. Easiest using git
send-email.
The patch has potential to reduce performance on volumes with multiple
spindles. Consider two processes issuing sequential reads into a RAID
array. With this patch, the reads will be executed sequentially rather
than in parallel, so I think a follow-on patch to make the minimum depth
a parameter (set by the guest? the host?) would be helpful.
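A minimal sketch of what such a tunable might look like, assuming a
hypothetical notify_depth field in VirtIOBlock (the posted patch
effectively hardcodes a depth of 1):

    /* Suppress guest notifications only while enough requests are in
     * flight to keep all spindles busy. Called after s->pending has been
     * incremented (submission path) or decremented (completion path).
     * notify_depth is a hypothetical tunable, not part of this patch. */
    static void virtio_blk_update_notification(VirtIOBlock *s)
    {
        virtio_queue_set_notification(s->vq, s->pending < s->notify_depth);
    }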
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Dor Laor @ 2010-01-11 9:19 UTC
Cc: qemu-devel, Avi Kivity, Vadim Rozenfeld
On 01/11/2010 11:03 AM, Dor Laor wrote:
> On 01/11/2010 10:30 AM, Avi Kivity wrote:
>> On 01/11/2010 09:40 AM, Vadim Rozenfeld wrote:
>>> The following patch allows us to improve Windows virtio
>>> block driver performance on small requests.
>>> Additionally, it reduces cpu usage on write IOs.
>>>
>>
>> Note, this is not an improvement for Windows specifically.
>>
>>> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
>>> index a2f0639..0e3a8d5 100644
>>> --- a/hw/virtio-blk.c
>>> +++ b/hw/virtio-blk.c
>>> @@ -28,6 +28,7 @@ typedef struct VirtIOBlock
>>> char serial_str[BLOCK_SERIAL_STRLEN + 1];
>>> QEMUBH *bh;
>>> size_t config_size;
>>> + unsigned int pending;
>>> } VirtIOBlock;
>>>
>>> static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
>>> @@ -87,6 +88,8 @@ typedef struct VirtIOBlockReq
>>> struct VirtIOBlockReq *next;
>>> } VirtIOBlockReq;
>>>
>>> +static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue
>>> *vq);
>>> +
>>> static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
>>> {
>>> VirtIOBlock *s = req->dev;
>>> @@ -95,6 +98,11 @@ static void virtio_blk_req_complete(VirtIOBlockReq
>>> *req, int status)
>>> virtqueue_push(s->vq, &req->elem, req->qiov.size +
>>> sizeof(*req->in));
>>> virtio_notify(&s->vdev, s->vq);
>>>
>>> + if(--s->pending == 0) {
>>> + virtio_queue_set_notification(s->vq, 1);
>>> + virtio_blk_handle_output(&s->vdev, s->vq);
>
> The above line should be moved out of the 'if'.
>
> Attached results with rhel5.4 (qemu0.11) for win2k8 32bit guest. Note
> the drastic reduction in cpu consumption.
Attachment did not survive the email server, so you'll have to trust me
when I say that cpu consumption went down from 65% -> 40% for reads and
from 80% -> 30% for writes.
>
>>> + }
>>> +
>>
>> Coding style: space after if. See the CODING_STYLE file.
>>
>>> @@ -340,6 +348,9 @@ static void virtio_blk_handle_output(VirtIODevice
>>> *vdev, VirtQueue *vq)
>>> exit(1);
>>> }
>>>
>>> + if(++s->pending == 1)
>>> + virtio_queue_set_notification(s->vq, 0);
>>> +
>>> req->out = (void *)req->elem.out_sg[0].iov_base;
>>> req->in = (void *)req->elem.in_sg[req->elem.in_num -
>>> 1].iov_base;
>>>
>>
>> Coding style: space after if, braces after if.
>>
>> Your patch is word wrapped, please send it correctly. Easiest using git
>> send-email.
>>
>> The patch has potential to reduce performance on volumes with multiple
>> spindles. Consider two processes issuing sequential reads into a RAID
>> array. With this patch, the reads will be executed sequentially rather
>> than in parallel, so I think a follow-on patch to make the minimum depth
>> a parameter (set by the guest? the host?) would be helpful.
>>
>
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Christoph Hellwig @ 2010-01-11 13:11 UTC
To: Dor Laor; +Cc: Vadim Rozenfeld, qemu-devel, Avi Kivity
On Mon, Jan 11, 2010 at 11:19:21AM +0200, Dor Laor wrote:
> >Attached results with rhel5.4 (qemu0.11) for win2k8 32bit guest. Note
> >the drastic reduction in cpu consumption.
>
> Attachment did not survive the email server, so you'll have to trust me
> when I say that cpu consumption went down from 65% -> 40% for reads and
> from 80% -> 30% for writes.
For what kind of workload, with what qemu parameters, and is the cpu
consumption measured in the host or the guest? Either way this is an
awful lot; did you use kernel AIO on the host?
Any chance you could publish the benchmark, guest and host configs
so we have meaningful numbers?
FYI below is the manually applied patch without all the wrapping:
Index: qemu/hw/virtio-blk.c
===================================================================
--- qemu.orig/hw/virtio-blk.c 2010-01-11 14:05:09.619254004 +0100
+++ qemu/hw/virtio-blk.c 2010-01-11 14:06:54.385013004 +0100
@@ -28,6 +28,7 @@ typedef struct VirtIOBlock
char serial_str[BLOCK_SERIAL_STRLEN + 1];
QEMUBH *bh;
size_t config_size;
+ unsigned int pending;
} VirtIOBlock;
static VirtIOBlock *to_virtio_blk(VirtIODevice *vdev)
@@ -87,6 +88,8 @@ typedef struct VirtIOBlockReq
struct VirtIOBlockReq *next;
} VirtIOBlockReq;
+static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq);
+
static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
{
VirtIOBlock *s = req->dev;
@@ -95,6 +98,12 @@ static void virtio_blk_req_complete(Virt
virtqueue_push(s->vq, &req->elem, req->qiov.size + sizeof(*req->in));
virtio_notify(&s->vdev, s->vq);
+ if (--s->pending == 0) {
+ virtio_queue_set_notification(s->vq, 1);
+ virtio_blk_handle_output(&s->vdev, s->vq);
+ }
+
+
qemu_free(req);
}
@@ -340,6 +349,9 @@ static void virtio_blk_handle_output(Vir
exit(1);
}
+ if (++s->pending == 1)
+ virtio_queue_set_notification(s->vq, 0);
+
req->out = (void *)req->elem.out_sg[0].iov_base;
req->in = (void *)req->elem.in_sg[req->elem.in_num - 1].iov_base;
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 13:13 UTC
To: Christoph Hellwig; +Cc: Dor Laor, qemu-devel, Vadim Rozenfeld
On 01/11/2010 03:11 PM, Christoph Hellwig wrote:
> FYI below is the manually applied patch without all the wrapping:
> static void virtio_blk_req_complete(VirtIOBlockReq *req, int status)
> {
> VirtIOBlock *s = req->dev;
> @@ -95,6 +98,12 @@ static void virtio_blk_req_complete(Virt
> virtqueue_push(s->vq, &req->elem, req->qiov.size + sizeof(*req->in));
> virtio_notify(&s->vdev, s->vq);
>
> + if (--s->pending == 0) {
> + virtio_queue_set_notification(s->vq, 1);
> + virtio_blk_handle_output(&s->vdev, s->vq);
> + }
> +
>
As Dor points out, the call to virtio_blk_handle_output() wants to be
before the test for pending, so we scan the ring as early as possible.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Christoph Hellwig @ 2010-01-11 13:16 UTC
To: Avi Kivity; +Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, qemu-devel
On Mon, Jan 11, 2010 at 03:13:53PM +0200, Avi Kivity wrote:
> As Dor points out, the call to virtio_blk_handle_output() wants to be
> before the test for pending, so we scan the ring as early as possible.
I just reposted the patch in a form that applies cleanly, to share the
work I did when starting to review it.
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Christoph Hellwig @ 2010-01-11 13:42 UTC
To: Avi Kivity; +Cc: Dor Laor, Vadim Rozenfeld, qemu-devel
On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote:
> The patch has potential to reduce performance on volumes with multiple
> spindles. Consider two processes issuing sequential reads into a RAID
> array. With this patch, the reads will be executed sequentially rather
> than in parallel, so I think a follow-on patch to make the minimum depth
> a parameter (set by the guest? the host?) would be helpful.
Let's think about the life cycle of I/O requests a bit.
We have an idle virtqueue (aka one virtio-blk device). The first (read)
request comes in, we get the virtio notify from the guest, which calls
into virtio_blk_handle_output. With the new code we now disable the
notify once we start processing the first request. If the second
request hits the queue before we call into virtio_blk_get_request
the second time we're fine even with the new code as we keep picking it
up. If however it hits after we leave virtio_blk_handle_output, but
before we complete the first request we do indeed introduce additional
latency.
So instead of disabling notify while requests are active we might want
to only disable it while we are inside virtio_blk_handle_output.
Something like the following minimally tested patch:
Index: qemu/hw/virtio-blk.c
===================================================================
--- qemu.orig/hw/virtio-blk.c 2010-01-11 14:28:42.896010503 +0100
+++ qemu/hw/virtio-blk.c 2010-01-11 14:40:13.535256353 +0100
@@ -328,7 +328,15 @@ static void virtio_blk_handle_output(Vir
int num_writes = 0;
BlockDriverState *old_bs = NULL;
+ /*
+ * While we are processing requests there is no need to get further
+ * notifications from the guest - it'll just burn cpu cycles doing
+ * useless context switches into the host.
+ */
+ virtio_queue_set_notification(s->vq, 0);
+
while ((req = virtio_blk_get_request(s))) {
+handle_request:
if (req->elem.out_num < 1 || req->elem.in_num < 1) {
fprintf(stderr, "virtio-blk missing headers\n");
exit(1);
@@ -358,6 +366,18 @@ static void virtio_blk_handle_output(Vir
}
}
+ /*
+ * Once we're done processing all pending requests re-enable the queue
+ * notification. If there's an entry pending after we enabled
+ * notification again we hit a race and need to process it before
+ * returning.
+ */
+ virtio_queue_set_notification(s->vq, 1);
+ req = virtio_blk_get_request(s);
+ if (req) {
+ goto handle_request;
+ }
+
if (num_writes > 0) {
do_multiwrite(old_bs, blkreq, num_writes);
}
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Christoph Hellwig @ 2010-01-11 13:47 UTC
To: Avi Kivity; +Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, qemu-devel
On Mon, Jan 11, 2010 at 03:13:53PM +0200, Avi Kivity wrote:
> As Dor points out, the call to virtio_blk_handle_output() wants to be
> before the test for pending, so we scan the ring as early as possible.
It could cause a race window where we add an entry to the ring after
we run virtio_blk_handle_output, but before re-enabling the
notification. But I think my variant of the patch that I just posted
should deal with this in an even better way.
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-01-11 13:49 UTC
To: Christoph Hellwig; +Cc: qemu-devel, Dor Laor, Avi Kivity, Vadim Rozenfeld
On 01/11/2010 07:42 AM, Christoph Hellwig wrote:
> On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote:
>
>> The patch has potential to reduce performance on volumes with multiple
>> spindles. Consider two processes issuing sequential reads into a RAID
>> array. With this patch, the reads will be executed sequentially rather
>> than in parallel, so I think a follow-on patch to make the minimum depth
>> a parameter (set by the guest? the host?) would be helpful.
>>
> Let's think about the life cycle of I/O requests a bit.
>
> We have an idle virtqueue (aka one virtio-blk device). The first (read)
> request comes in, we get the virtio notify from the guest, which calls
> into virtio_blk_handle_output. With the new code we now disable the
> notify once we start processing the first request. If the second
> request hits the queue before we call into virtio_blk_get_request
> the second time we're fine even with the new code as we keep picking it
> up. If however it hits after we leave virtio_blk_handle_output, but
> before we complete the first request we do indeed introduce additional
> latency.
>
> So instead of disabling notify while requests are active we might want
> to only disable it while we are inside virtio_blk_handle_output.
> Something like the following minimally tested patch:
>
I'd suggest that we get even more aggressive and install an idle bottom
half that checks the queue for newly submitted requests. If we keep
getting requests submitted before a new one completes, we'll never take
an I/O exit.
The same approach is probably a good idea for virtio-net.
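A rough sketch of the idea, reusing the existing bh field in VirtIOBlock
and an assumed per-request helper (today the equivalent logic lives
inline in virtio_blk_handle_output), could look like:

    /* Untested illustration of idle bottom-half polling;
     * virtio_blk_process_request() is a hypothetical helper. */
    static void virtio_blk_idle_bh(void *opaque)
    {
        VirtIOBlock *s = opaque;
        VirtIOBlockReq *req;

        /* Pick up anything the guest queued since the last pass. */
        while ((req = virtio_blk_get_request(s))) {
            virtio_blk_process_request(s, req);
        }

        /* Keep polling on idle as long as requests are in flight. */
        if (s->pending) {
            qemu_bh_schedule_idle(s->bh);
        }
    }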
Regards,
Anthony Liguori
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-01-11 14:00 UTC
To: Christoph Hellwig; +Cc: qemu-devel, Dor Laor, Avi Kivity, Vadim Rozenfeld
On 01/11/2010 07:47 AM, Christoph Hellwig wrote:
> On Mon, Jan 11, 2010 at 03:13:53PM +0200, Avi Kivity wrote:
>
>> As Dor points out, the call to virtio_blk_handle_output() wants to be
>> before the test for pending, so we scan the ring as early as possible.
>>
> It could cause a race window where we add an entry to the ring after
> we run virtio_blk_handle_output, but before re-enabling the
> notification. But I think my variant of the patch that I just posted
> should deal with this in an even better way.
>
When we've disabled notifications, we should use any opportunity we have
in userspace to check the rings to see if anything was added.
Bottom halves are run at the very end of the event loop, which means that
they're guaranteed to be the last thing run. Idle bottom halves can be
rescheduled without causing an infinite loop and do not affect the
select timeout (which normal bottom halves do).
Regards,
Anthony Liguori
>
>
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 14:25 UTC
To: Christoph Hellwig; +Cc: Dor Laor, Vadim Rozenfeld, qemu-devel
On 01/11/2010 03:42 PM, Christoph Hellwig wrote:
> On Mon, Jan 11, 2010 at 10:30:53AM +0200, Avi Kivity wrote:
>
>> The patch has potential to reduce performance on volumes with multiple
>> spindles. Consider two processes issuing sequential reads into a RAID
>> array. With this patch, the reads will be executed sequentially rather
>> than in parallel, so I think a follow-on patch to make the minimum depth
>> a parameter (set by the guest? the host?) would be helpful.
>>
> Let's think about the life cycle of I/O requests a bit.
>
> We have an idle virtqueue (aka one virtio-blk device). The first (read)
> request comes in, we get the virtio notify from the guest, which calls
> into virtio_blk_handle_output. With the new code we now disable the
> notify once we start processing the first request. If the second
> request hits the queue before we call into virtio_blk_get_request
> the second time we're fine even with the new code as we keep picking it
> up. If however it hits after we leave virtio_blk_handle_output, but
> before we complete the first request we do indeed introduce additional
> latency.
>
> So instead of disabling notify while requests are active we might want
> to only disable it while we are inside virtio_blk_handle_output.
> Something like the following minimally tested patch:
>
>
> Index: qemu/hw/virtio-blk.c
> ===================================================================
> --- qemu.orig/hw/virtio-blk.c 2010-01-11 14:28:42.896010503 +0100
> +++ qemu/hw/virtio-blk.c 2010-01-11 14:40:13.535256353 +0100
> @@ -328,7 +328,15 @@ static void virtio_blk_handle_output(Vir
> int num_writes = 0;
> BlockDriverState *old_bs = NULL;
>
> + /*
> + * While we are processing requests there is no need to get further
> + * notifications from the guest - it'll just burn cpu cycles doing
> + * useless context switches into the host.
> + */
> + virtio_queue_set_notification(s->vq, 0);
> +
> while ((req = virtio_blk_get_request(s))) {
> +handle_request:
> if (req->elem.out_num < 1 || req->elem.in_num < 1) {
> fprintf(stderr, "virtio-blk missing headers\n");
> exit(1);
> @@ -358,6 +366,18 @@ static void virtio_blk_handle_output(Vir
> }
> }
>
> + /*
> + * Once we're done processing all pending requests re-enable the queue
> + * notification. If there's an entry pending after we enabled
> + * notification again we hit a race and need to process it before
> + * returning.
> + */
> + virtio_queue_set_notification(s->vq, 1);
> + req = virtio_blk_get_request(s);
> + if (req) {
> + goto handle_request;
> + }
> +
>
I don't think this will have much effect. First, the time spent in
virtio_blk_handle_output() is a small fraction of total guest time, so
the probability of the guest hitting the notifications-disabled window is
low. Second, while we're in that function, the vcpu that kicked us is
stalled, and other vcpus are likely to be locked out of the queue by the
guest.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 14:29 UTC
To: Anthony Liguori; +Cc: qemu-devel, Dor Laor, Christoph Hellwig, Vadim Rozenfeld
On 01/11/2010 03:49 PM, Anthony Liguori wrote:
>> So instead of disabling notify while requests are active we might want
>> to only disable it while we are inside virtio_blk_handle_output.
>> Something like the following minimally tested patch:
>
>
> I'd suggest that we get even more aggressive and install an idle
> bottom half that checks the queue for newly submitted requests. If we
> keep getting requests submitted before a new one completes, we'll
> never take an I/O exit.
>
That has the downside of bouncing a cache line on unrelated exits. It
probably doesn't matter with qemu as it is now, since it will bounce
qemu_mutex, but it will hurt with large guests (especially if they have
many rings).
IMO we should get things to work well without riding on unrelated exits,
especially as we're trying to reduce those exits.
> The same approach is probably a good idea for virtio-net.
With vhost-net you don't see exits.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-01-11 14:37 UTC
To: Avi Kivity; +Cc: qemu-devel, Dor Laor, Christoph Hellwig, Vadim Rozenfeld
On 01/11/2010 08:29 AM, Avi Kivity wrote:
> On 01/11/2010 03:49 PM, Anthony Liguori wrote:
>>> So instead of disabling notify while requests are active we might want
>>> to only disable it while we are inside virtio_blk_handle_output.
>>> Something like the following minimally tested patch:
>>
>>
>> I'd suggest that we get even more aggressive and install an idle
>> bottom half that checks the queue for newly submitted requests. If
>> we keep getting requests submitted before a new one completes, we'll
>> never take an I/O exit.
>>
>
> That has the downside of bouncing a cache line on unrelated exits.
The read and write sides of the ring are widely separated in physical
memory specifically to avoid cache line bouncing.
> It probably doesn't matter with qemu as it is now, since it will
> bounce qemu_mutex, but it will hurt with large guests (especially if
> they have many rings).
>
> IMO we should get things to work well without riding on unrelated
> exits, especially as we're trying to reduce those exits.
A block I/O request can potentially be very, very long lived. By
serializing requests like this, there's a high likelihood that it's
going to kill performance with anything capable of processing multiple
requests.
OTOH, if we aggressively poll the ring when we have an opportunity to,
there's very little down side to that and it addresses the serialization
problem.
>> The same approach is probably a good idea for virtio-net.
>
> With vhost-net you don't see exits.
The point is, when we've disabled notification, we should poll on the
ring for additional requests instead of waiting for one to complete
before looking at another one.
Even with vhost-net, this logic is still applicable although potentially
harder to achieve.
Regards,
Anthony Liguori
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 14:46 UTC
To: Anthony Liguori; +Cc: qemu-devel, Dor Laor, Christoph Hellwig, Vadim Rozenfeld
On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>> That has the downside of bouncing a cache line on unrelated exits.
>
>
> The read and write sides of the ring are widely separated in physical
> memory specifically to avoid cache line bouncing.
I meant, exits on random vcpus will cause the cacheline containing the
notification disable flag to bounce around. As it is, we read it on the
vcpu that owns the queue and write it on that vcpu or the I/O thread.
>> It probably doesn't matter with qemu as it is now, since it will
>> bounce qemu_mutex, but it will hurt with large guests (especially if
>> they have many rings).
>>
>> IMO we should get things to work well without riding on unrelated
>> exits, especially as we're trying to reduce those exits.
>
> A block I/O request can potentially be very, very long lived. By
> serializing requests like this, there's a high likelihood that it's
> going to kill performance with anything capable of processing multiple
> requests.
Right, that's why I suggested having a queue depth at which disabling
notification kicks in. The patch hardcodes this depth to 1, unpatched
qemu is infinite, a good value is probably spindle count + VAT.
> OTOH, if we aggressively poll the ring when we have an opportunity to,
> there's very little down side to that and it addresses the
> serialization problem.
But we can't guarantee that we'll get those opportunities, so it doesn't
address the problem in a general way. A guest that doesn't use hpet and
only has a single virtio-blk device will not have any reason to exit to
qemu.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-01-11 15:13 UTC
To: Avi Kivity; +Cc: qemu-devel, Dor Laor, Christoph Hellwig, Vadim Rozenfeld
On 01/11/2010 08:46 AM, Avi Kivity wrote:
> On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>>> That has the downside of bouncing a cache line on unrelated exits.
>>
>>
>> The read and write sides of the ring are widely separated in physical
>> memory specifically to avoid cache line bouncing.
>
> I meant, exits on random vcpus will cause the cacheline containing the
> notification disable flag to bounce around. As it is, we read it on
> the vcpu that owns the queue and write it on that vcpu or the I/O thread.
Bottom halves are always run from the IO thread.
>>> It probably doesn't matter with qemu as it is now, since it will
>>> bounce qemu_mutex, but it will hurt with large guests (especially if
>>> they have many rings).
>>>
>>> IMO we should get things to work well without riding on unrelated
>>> exits, especially as we're trying to reduce those exits.
>>
>> A block I/O request can potentially be very, very long lived. By
>> serializing requests like this, there's a high likelihood that it's
>> going to kill performance with anything capable of processing
>> multiple requests.
>
> Right, that's why I suggested having a queue depth at which disabling
> notification kicks in. The patch hardcodes this depth to 1, unpatched
> qemu is infinite, a good value is probably spindle count + VAT.
That means we would need a user-visible option, which is quite unfortunate.
Also, that logic only really makes sense with cache=off. With
cache=writethrough, you can get pathological cases where you have an
uncached access followed by cached accesses. In fact, with read-ahead,
this is probably not an uncommon scenario.
>> OTOH, if we aggressively poll the ring when we have an opportunity
>> to, there's very little down side to that and it addresses the
>> serialization problem.
>
> But we can't guarantee that we'll get those opportunities, so it
> doesn't address the problem in a general way. A guest that doesn't
> use hpet and only has a single virtio-blk device will not have any
> reason to exit to qemu.
We can mitigate this with a timer but honestly, we need to do perf
measurements to see. My feeling is that we will need some more
aggressive form of polling than just waiting for IO completion. I don't
think queue depth is enough because it assumes that all requests are
equal. When dealing with cache=off or even just storage with its own
cache, that's simply not the case.
Regards,
Anthony Liguori
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 15:19 UTC
To: Anthony Liguori; +Cc: qemu-devel, Dor Laor, Christoph Hellwig, Vadim Rozenfeld
On 01/11/2010 05:13 PM, Anthony Liguori wrote:
> On 01/11/2010 08:46 AM, Avi Kivity wrote:
>> On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>>>> That has the downside of bouncing a cache line on unrelated exits.
>>>
>>>
>>> The read and write sides of the ring are widely separated in
>>> physical memory specifically to avoid cache line bouncing.
>>
>> I meant, exits on random vcpus will cause the cacheline containing
>> the notification disable flag to bounce around. As it is, we read it
>> on the vcpu that owns the queue and write it on that vcpu or the I/O
>> thread.
>
> Bottom halves are always run from the IO thread.
Okay, so that won't be an issue.
>>>> It probably doesn't matter with qemu as it is now, since it will
>>>> bounce qemu_mutex, but it will hurt with large guests (especially
>>>> if they have many rings).
>>>>
>>>> IMO we should get things to work well without riding on unrelated
>>>> exits, especially as we're trying to reduce those exits.
>>>
>>> A block I/O request can potentially be very, very long lived. By
>>> serializing requests like this, there's a high likelihood that it's
>>> going to kill performance with anything capable of processing
>>> multiple requests.
>>
>> Right, that's why I suggested having a queue depth at which disabling
>> notification kicks in. The patch hardcodes this depth to 1,
>> unpatched qemu is infinite, a good value is probably spindle count +
>> VAT.
>
> That means we would need a user-visible option, which is quite
> unfortunate.
We could guess it, perhaps.
> Also, that logic only really makes sense with cache=off. With
> cache=writethrough, you can get pathological cases where you have an
> uncached access followed by cached accesses. In fact, with
> read-ahead, this is probably not an uncommon scenario.
So you'd increase the disable depth when cache!=none.
>>> OTOH, if we aggressively poll the ring when we have an opportunity
>>> to, there's very little down side to that and it addresses the
>>> serialization problem.
>>
>> But we can't guarantee that we'll get those opportunities, so it
>> doesn't address the problem in a general way. A guest that doesn't
>> use hpet and only has a single virtio-blk device will not have any
>> reason to exit to qemu.
>
> We can mitigate this with a timer but honestly, we need to do perf
> measurements to see. My feeling is that we will need some more
> aggressive form of polling than just waiting for IO completion. I
> don't think queue depth is enough because it assumes that all requests
> are equal. When dealing with cache=off or even just storage with its
> own cache, that's simply not the case.
Maybe we can adapt behaviour dynamically based on how fast results come in.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-01-11 15:22 UTC
To: Avi Kivity; +Cc: qemu-devel, Dor Laor, Christoph Hellwig, Vadim Rozenfeld
On 01/11/2010 09:19 AM, Avi Kivity wrote:
>>>> OTOH, if we aggressively poll the ring when we have an opportunity
>>>> to, there's very little down side to that and it addresses the
>>>> serialization problem.
>>>
>>> But we can't guarantee that we'll get those opportunities, so it
>>> doesn't address the problem in a general way. A guest that doesn't
>>> use hpet and only has a single virtio-blk device will not have any
>>> reason to exit to qemu.
>>
>> We can mitigate this with a timer but honestly, we need to do perf
>> measurements to see. My feeling is that we will need some more
>> aggressive form of polling than just waiting for IO completion. I
>> don't think queue depth is enough because it assumes that all
>> requests are equal. When dealing with cache=off or even just storage
>> with it's own cache, that's simply not the case.
>
> Maybe we can adapt behaviour dynamically based on how fast results
> come in.
Based on our experiences with virtio-net, what I'd suggest is to make a
lot of tunable options (ring size, various tx mitigation schemes,
timeout durations, etc) and then we can do some deep performance studies
to see how things interact with each other.
I think we should do that before making any changes because I'm deeply
concerned that we'll introduce significant performance regressions.
Regards,
Anthony Liguori
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 15:31 UTC
To: Anthony Liguori; +Cc: Vadim Rozenfeld, Dor Laor, qemu-devel, Christoph Hellwig
On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>
> Based on our experiences with virtio-net, what I'd suggest is to make
> a lot of tunable options (ring size, various tx mitigation schemes,
> timeout durations, etc) and then we can do some deep performance
> studies to see how things interact with each other.
>
> I think we should do that before making any changes because I'm deeply
> concerned that we'll introduce significant performance regressions.
>
I agree. We can start with this patch, with a tunable depth, defaulting
to current behaviour.
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-01-11 15:32 UTC
To: Avi Kivity; +Cc: Vadim Rozenfeld, Dor Laor, qemu-devel, Christoph Hellwig
On 01/11/2010 09:31 AM, Avi Kivity wrote:
> On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>>
>> Based on our experiences with virtio-net, what I'd suggest is to make
>> a lot of tunable options (ring size, various tx mitigation schemes,
>> timeout durations, etc) and then we can do some deep performance
>> studies to see how things interact with each other.
>>
>> I think we should do that before making any changes because I'm
>> deeply concerned that we'll introduce significant performance
>> regressions.
>>
>
> I agree. We can start with this patch, with a tunable depth,
> defaulting to current behaviour.
I wouldn't be opposed to that provided we made it clear that these
options were not supported long term. I don't want management tools
(like libvirt) to start relying on them.
Regards,
Anthony Liguori
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Avi Kivity @ 2010-01-11 15:35 UTC
To: Anthony Liguori; +Cc: Vadim Rozenfeld, Dor Laor, qemu-devel, Christoph Hellwig
On 01/11/2010 05:32 PM, Anthony Liguori wrote:
> On 01/11/2010 09:31 AM, Avi Kivity wrote:
>> On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>>>
>>> Based on our experiences with virtio-net, what I'd suggest is to
>>> make a lot of tunable options (ring size, various tx mitigation
>>> schemes, timeout durations, etc) and then we can do some deep
>>> performance studies to see how things interact with each other.
>>>
>>> I think we should do that before making any changes because I'm
>>> deeply concerned that we'll introduce significant performance
>>> regressions.
>>>
>>
>> I agree. We can start with this patch, with a tunable depth,
>> defaulting to current behaviour.
>
> I wouldn't be opposed to that provided we made it clear that these
> options were not supported long term. I don't want management tools
> (like libvirt) to start relying on them.
>
x-option-name for experimental options?
-device disk,if=virtio,x-queue-depth-suppress-notify=4
--
error compiling committee.c: too many arguments to function
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-01-11 15:38 UTC
To: Avi Kivity; +Cc: Vadim Rozenfeld, Dor Laor, qemu-devel, Christoph Hellwig
On 01/11/2010 09:35 AM, Avi Kivity wrote:
> On 01/11/2010 05:32 PM, Anthony Liguori wrote:
>> On 01/11/2010 09:31 AM, Avi Kivity wrote:
>>> On 01/11/2010 05:22 PM, Anthony Liguori wrote:
>>>>
>>>> Based on our experiences with virtio-net, what I'd suggest is to
>>>> make a lot of tunable options (ring size, various tx mitigation
>>>> schemes, timeout durations, etc) and then we can do some deep
>>>> performance studies to see how things interact with each other.
>>>>
>>>> I think we should do that before making any changes because I'm
>>>> deeply concerned that we'll introduce significant performance
>>>> regressions.
>>>>
>>>
>>> I agree. We can start with this patch, with a tunable depth,
>>> defaulting to current behaviour.
>>
>> I wouldn't be opposed to that provided we made it clear that these
>> options were not supported long term. I don't want management tools
>> (like libvirt) to start relying on them.
>>
>
> x-option-name for experimental options?
>
> -device disk,if=virtio,x-queue-depth-suppress-notify=4
Sounds reasonable to me.
Regards,
Anthony Liguori
* [Qemu-devel] Re: Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Michael S. Tsirkin @ 2010-01-11 18:20 UTC
To: Anthony Liguori
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Avi Kivity, qemu-devel
On Mon, Jan 11, 2010 at 08:37:10AM -0600, Anthony Liguori wrote:
> On 01/11/2010 08:29 AM, Avi Kivity wrote:
>> On 01/11/2010 03:49 PM, Anthony Liguori wrote:
>>>> So instead of disabling notify while requests are active we might want
>>>> to only disable it while we are inside virtio_blk_handle_output.
>>>> Something like the following minimally tested patch:
>>>
>>>
>>> I'd suggest that we get even more aggressive and install an idle
>>> bottom half that checks the queue for newly submitted requests. If
>>> we keep getting requests submitted before a new one completes, we'll
>>> never take an I/O exit.
>>>
>>
>> That has the downside of bouncing a cache line on unrelated exits.
>
> The read and write sides of the ring are widely separated in physical
> memory specifically to avoid cache line bouncing.
>
>> It probably doesn't matter with qemu as it is now, since it will
>> bounce qemu_mutex, but it will hurt with large guests (especially if
>> they have many rings).
>>
>> IMO we should get things to work well without riding on unrelated
>> exits, especially as we're trying to reduce those exits.
>
> A block I/O request can potentially be very, very long lived. By
> serializing requests like this, there's a high likelihood that it's
> going to kill performance with anything capable of processing multiple
> requests.
>
> OTOH, if we aggressively poll the ring when we have an opportunity to,
> there's very little down side to that and it addresses the serialization
> problem.
>
>>> The same approach is probably a good idea for virtio-net.
>>
>> With vhost-net you don't see exits.
>
> The point is, when we've disabled notification, we should poll on the
> ring for additional requests instead of waiting for one to complete
> before looking at another one.
>
> Even with vhost-net, this logic is still applicable although potentially
> harder to achieve.
>
> Regards,
>
> Anthony Liguori
>
vhost net does this already: it has a mode where it polls when skbs have
left the send queue: for tap this is when they have crossed the bridge;
for a packet socket, it is when they have been transmitted.
--
MST
* [Qemu-devel] Re: Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Michael S. Tsirkin @ 2010-01-11 18:22 UTC
To: Anthony Liguori
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Avi Kivity, qemu-devel
On Mon, Jan 11, 2010 at 09:13:23AM -0600, Anthony Liguori wrote:
> On 01/11/2010 08:46 AM, Avi Kivity wrote:
>> On 01/11/2010 04:37 PM, Anthony Liguori wrote:
>>>> That has the downside of bouncing a cache line on unrelated exits.
>>>
>>>
>>> The read and write sides of the ring are widely separated in physical
>>> memory specifically to avoid cache line bouncing.
>>
>> I meant, exits on random vcpus will cause the cacheline containing the
>> notification disable flag to bounce around. As it is, we read it on
>> the vcpu that owns the queue and write it on that vcpu or the I/O
>> thread.
>
> Bottom halves are always run from the IO thread.
>>>> It probably doesn't matter with qemu as it is now, since it will
>>>> bounce qemu_mutex, but it will hurt with large guests (especially
>>>> if they have many rings).
>>>>
>>>> IMO we should get things to work well without riding on unrelated
>>>> exits, especially as we're trying to reduce those exits.
>>>
>>> A block I/O request can potentially be very, very long lived. By
>>> serializing requests like this, there's a high likelihood that it's
>>> going to kill performance with anything capable of processing
>>> multiple requests.
>>
>> Right, that's why I suggested having a queue depth at which disabling
>> notification kicks in. The patch hardcodes this depth to 1, unpatched
>> qemu is infinite, a good value is probably spindle count + VAT.
>
> That means we would need a user-visible option, which is quite unfortunate.
>
> Also, that logic only really makes sense with cache=off. With
> cache=writethrough, you can get pathological cases where you have an
> uncached access followed by cached accesses. In fact, with read-ahead,
> this is probably not an uncommon scenario.
>
>>> OTOH, if we aggressively poll the ring when we have an opportunity
>>> to, there's very little down side to that and it addresses the
>>> serialization problem.
>>
>> But we can't guarantee that we'll get those opportunities, so it
>> doesn't address the problem in a general way. A guest that doesn't
>> use hpet and only has a single virtio-blk device will not have any
>> reason to exit to qemu.
>
> We can mitigate this with a timer but honestly, we need to do perf
> measurements to see. My feeling is that we will need some more
> aggressive form of polling than just waiting for IO completion. I don't
> think queue depth is enough because it assumes that all requests are
> equal. When dealing with cache=off or even just storage with its own
> cache, that's simply not the case.
>
> Regards,
>
> Anthony Liguori
>
BTW this is why vhost net uses queue depth in bytes.
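Translated to virtio-blk, a byte-based threshold would be a small
variation on the posted patch; both fields in this sketch are
hypothetical:

    /* Account the suppression threshold in bytes in flight rather than
     * in requests, so one huge request and many tiny ones are not
     * treated alike. bytes_in_flight and suppress_bytes are assumed
     * new VirtIOBlock fields. */
    s->bytes_in_flight += req->qiov.size;
    if (s->bytes_in_flight >= s->suppress_bytes) {
        virtio_queue_set_notification(s->vq, 0);
    }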
--
MST
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Paul Brook @ 2010-02-24 2:58 UTC
To: qemu-devel; +Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Avi Kivity
> Bottom halves are run at the very end of the event loop, which means that
> they're guaranteed to be the last thing run. Idle bottom halves can be
> rescheduled without causing an infinite loop and do not affect the
> select timeout (which normal bottom halves do).
Idle bottom halves (i.e. qemu_bh_schedule_idle) are just bugs waiting to
happen, and should never be used for anything.
Paul
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-02-24 14:59 UTC
To: Paul Brook
Cc: Vadim Rozenfeld, Avi Kivity, Dor Laor, qemu-devel, Christoph Hellwig
On 02/23/2010 08:58 PM, Paul Brook wrote:
>> Bottom halves are run at the very end of the event loop, which means that
>> they're guaranteed to be the last thing run. Idle bottom halves can be
>> rescheduled without causing an infinite loop and do not affect the
>> select timeout (which normal bottom halves do).
>>
> Idle bottom halves (i.e. qemu_bh_schedule_idle) are just bugs waiting to
> happen, and should never be used for anything.
>
Idle bottom halves make considerably more sense than normal bottom
halves.
The fact that rescheduling a bottom half within a bottom half results in
an infinite loop is absurd. It is equally absurd that bottom halves
alter the select timeout. The result of that is that if a bottom half
schedules another bottom half, and that bottom half schedules the
previous, you get a tight infinite loop. Since bottom halves are often
used deep within functions, the result is very subtle infinite
loops (that we've absolutely encountered in the past).
A main loop should have only a few characteristics. It should enable
timeouts (based on select), it should enable fd event dispatch, and it
should allow for idle functions to be registered. There should be no
guarantees on when idle functions are executed other than they'll
eventually be executed.
The way we use "bottom halves" today should be implemented in terms of a
relative timeout of 0 or an absolute timeout of now. The fact that we
can't do that in our main loop is due to the very strange dependencies
deep within various devices on io dispatch ordering. I would love to
eliminate this but I've not been able to spend any time on this recently.
Regards,
Anthony Liguori
> Paul
>
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Paul Brook @ 2010-02-25 15:06 UTC
To: qemu-devel; +Cc: Dor Laor, Christoph Hellwig, Vadim Rozenfeld, Avi Kivity
> > Idle bottom halves (i.e. qemu_bh_schedule_idle) are just bugs waiting to
> > happen, and should never be used for anything.
>
> Idle bottom halves make considerably more sense than normal bottom
> halves.
>
> The fact that rescheduling a bottom half within a bottom half results in
> an infinite loop is absurd. It is equally absurd that bottom halves
> alter the select timeout. The result of that is that if a bottom half
> schedules another bottom half, and that bottom half schedules the
> previous, you get a tight infinite loop. Since bottom halves are often
> used deep within functions, the result is very subtle infinite
> loops (that we've absolutely encountered in the past).
I disagree. The "select timeout" is a completely irrelevant implementation
detail. Anything that relies on it is just plain wrong. If you require a delay
then you should be using a timer. If scheduling a BH directly then you should
expect it to be processed without delay.
If a BH reschedules itself (directly or indirectly) without useful work
occurring then you absolutely should expect an infinite loop. Rescheduling
itself after doing useful work should never cause an infinite loop. The only
way it can loop infinitely is if we have an infinite amount of work to do, in
which case you lose either way. Looping over work via recursive BHs is
probably not the most efficient way to do things, but I guess it may sometimes
be the simplest in practice.
Interaction between multiple BHs is slightly trickier. By my reading, BHs are
processed in the order they are created. It may be reasonable to guarantee
that BH are processed in the order they are scheduled. However I'm reluctant
to even go that far without a good use-case. You could probably come up with
arguments for processing them in most-recently-scheduled order.
> A main loop should have only a few characteristics. It should enable
> timeouts (based on select), it should enable fd event dispatch, and it
> should allow for idle functions to be registered. There should be no
> guarantees on when idle functions are executed other than they'll
> eventually be executed.
If you don't provide any guarantees, then surely processing them immediately
must be an acceptable implementation. I don't believe there is a useful
definition of "idle".
All existing uses of qemu_bh_schedule_idle are in fact poorly implemented
periodic polling. Furthermore, these should not be using periodic polling;
they can and should be event driven. They only exist because no one has bothered to
fix the old code properly. Having arbitrary 10ms latency on DMA transfers is
just plain wrong.
> The way we use "bottom halves" today should be implemented in terms of a
> relative timeout of 0 or an absolute timeout of now. The fact that we
> can't do that in our main loop is due to the very strange dependencies
> deep within various devices on io dispatch ordering. I would love to
> eliminate this but I've not been able to spend any time on this recently.
I don't see how this helps. A self-triggering event with a timeout of "now" is
still an infinite loop. Any delay is a bug in the dispatch loop. "idle" BHs
are relying on this bug.
Paul
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
From: Anthony Liguori @ 2010-02-25 15:23 UTC
To: Paul Brook
Cc: Avi Kivity, Dor Laor, Christoph Hellwig, qemu-devel, Vadim Rozenfeld
On 02/25/2010 09:06 AM, Paul Brook wrote:
>>> Idle bottom halves (i.e. qemu_bh_schedule_idle) are just bugs waiting to
>>> happen, and should never be used for anything.
>>>
>> Idle bottom halves make considerably more sense than normal bottom
>> halves.
>>
>> The fact that rescheduling a bottom half within a bottom half results in
>> an infinite loop is absurd. It is equally absurd that bottom halves
>> alter the select timeout. The result of that is that if a bottom half
>> schedules another bottom half, and that bottom half schedules the
>> previous, you get a tight infinite loop. Since bottom halves are often
>> used deep within functions, the result is very subtle infinite
>> loops (that we've absolutely encountered in the past).
>>
> I disagree. The "select timeout" is a completely irrelevant implementation
> detail. Anything that relies on it is just plain wrong.
No, it's an extremely important detail because it's the difference
between an infinite loop and an idle function.
Very simply, without idle bottom halves, there's no way to implement
polling with the main loop. If we dropped idle bottom halves, we would
have to add explicit polling back to the main loop.
How would you implement polling?
> If you require a delay
> then you should be using a timer. If scheduling a BH directly then you should
> expect it to be processed without delay.
>
If you need something that doesn't require a delay, schedule a timer
with no delay. What's the point of BHs?
It's because there are very subtle differences between BHs and timers, and
because we don't adjust the select timeout for the next timer deadline,
there's a race between when we check for timer expiration and when we
invoke select.
Actually, that's probably fixed now because we've changed the SIGALRM
handler to write to a file descriptor to eliminate the signal/select
race. I'll give it a try; it might actually work now.
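For reference, that pattern is the classic self-pipe trick; a generic
sketch (not the exact qemu code):

#include <signal.h>
#include <unistd.h>
#include <fcntl.h>

static int alarm_pipe[2];

static void sigalrm_handler(int signo)
{
    char byte = 0;

    (void) signo;
    /* write() is async-signal-safe; errors are deliberately ignored */
    if (write(alarm_pipe[1], &byte, 1) < 0) {
        /* nothing we can do from a signal handler */
    }
}

static int alarm_notify_init(void)
{
    if (pipe(alarm_pipe) < 0)
        return -1;
    fcntl(alarm_pipe[0], F_SETFL, O_NONBLOCK);
    fcntl(alarm_pipe[1], F_SETFL, O_NONBLOCK);
    signal(SIGALRM, sigalrm_handler);

    /* The main loop adds alarm_pipe[0] to its select() read set; when
     * it becomes readable, it drains the pipe and runs expired timers.
     * A signal arriving just before select() now wakes it via the fd. */
    return 0;
}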
> If a BH reschedules itself (directly or indirectly) without useful work
> occurring then you absolutely should expect an infinite loop. Rescheduling
> itself after doing useful work should never cause an infinite loop. The only
> way it can loop infinitely is if we have an infinite amount of work to do, in
> which case you lose either way. Looping over work via recursive BHs is
> probably not the most efficient way to do things, but I guess it may sometimes
> be the simplest in practice.
>
I think the point is getting lost. My contention is that a main loop
needs three things: 1) an idle callback, 2) timers, and 3) io notification.
Today, bottom halves act as both a no-delay timer and an idle callback.
I agree, that's unfortunate. What we should do is remove idle bottom
halves, add proper idle callbacks, and make bottom halves implemented in
terms of no-delay timers.
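A sketch of that last step, using the qemu-timer API of the period (this
is the proposal, not code that exists; qemu_mallocz and rt_clock are
assumed as the period's helpers):

#include "qemu-timer.h"

typedef struct TimerBH {
    QEMUTimer *timer;
    void (*cb)(void *opaque);
    void *opaque;
} TimerBH;

static void timer_bh_expire(void *opaque)
{
    TimerBH *bh = opaque;

    bh->cb(bh->opaque);
}

static TimerBH *timer_bh_new(void (*cb)(void *opaque), void *opaque)
{
    TimerBH *bh = qemu_mallocz(sizeof(*bh));

    bh->cb = cb;
    bh->opaque = opaque;
    bh->timer = qemu_new_timer(rt_clock, timer_bh_expire, bh);
    return bh;
}

static void timer_bh_schedule(TimerBH *bh)
{
    /* expire "now": the next run of the timer code dispatches it */
    qemu_mod_timer(bh->timer, qemu_get_clock(rt_clock));
}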
> Interaction between multiple BH is slightly trickier. By my reading BH are
> processed in the order they are created. It may be reasonable to guarantee
> that BH are processed in the order they are scheduled. However I'm reluctant
> to even go that far without a good use-case. You could probably come up with
> arguments for processing them in most-recently-scheduled order.
>
It's no different than two timers that happen to fire at the same
deadline. You can process in terms of when the timers were created
initially or when they were last scheduled. Ultimately, it shouldn't
matter to any code that uses them.
>> A main loop should have only a few characteristics. It should enable
>> timeouts (based on select), it should enable fd event dispatch, and it
>> should allow for idle functions to be registered. There should be no
>> guarantees on when idle functions are executed other than they'll
>> eventually be executed.
>>
> If you don't provide any guarantees, then surely processing them immediately
> must be an acceptable implementation. I don't believe there is a useful
> definition of "idle".
>
Idle is necessary to implement polling. You can argue that polling
should never be necessary, but practically speaking, it almost always is.
> All existing uses of qemu_bh_schedule_idle are in fact poorly implemented
> periodic polling. Furthermore, these should not be using periodic polling; they
> can and should be event driven. They only exist because no one has bothered to
> fix the old code properly. Having an arbitrary 10ms latency on DMA transfers is
> just plain wrong.
>
I agree with you here :-)
But there are more legitimate uses of polling. For instance, with
virtio-net, we use a timer to implement tx mitigation. It's a huge
boost for throughput, but it tends to add latency to packet transmission
because the notification arrives later. We really just need
the notification as a means to re-enable interrupts, not necessarily to
slow down packet transmission.
So using an idle callback to transmit packets would certainly decrease
latency in many cases while still getting the benefits of throughput
improvement.
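As a sketch of that idea (qemu_register_idle_cb is a hypothetical
registration API; virtio_net_flush_tx and the field names follow
hw/virtio-net.c, but treat the details as assumptions):

static void virtio_net_tx_idle(void *opaque)
{
    VirtIONet *n = opaque;

    /* flush whatever the guest queued since the last pass; the
     * mitigation timer still bounds worst-case batching */
    virtio_net_flush_tx(n, n->tx_vq);
}

static void virtio_net_setup_idle(VirtIONet *n)
{
    /* hypothetical API: run this callback when the main loop goes idle */
    qemu_register_idle_cb(virtio_net_tx_idle, n);
}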
>> The way we use "bottom halves" today should be implemented in terms of a
>> relative timeout of 0 or an absolute timeout of now. The fact that we
>> can't do that in our main loop is due to the very strange dependencies
>> deep within various devices on io dispatch ordering. I would love to
>> eliminate this but I've not been able to spend any time on this recently.
>>
> I don't see how this helps. A self-triggering event with a timeout of "now" is
> still an infinite loop. Any delay is a bug in the dispatch loop. "idle" BHs
> are relying on this bug.
>
The main point is that BHs should not be implemented in the actual main
loop and that "idle" BHs are really the only type of BHs that should
exist as far as the main loop is concerned. s/"idle" BHs/idle
callbacks/g and I think we're at a more agreeable place.
Regards,
Anthony Liguori
> Paul
>
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-25 15:23 ` Anthony Liguori
@ 2010-02-25 16:48 ` Paul Brook
0 siblings, 0 replies; 36+ messages in thread
From: Paul Brook @ 2010-02-25 16:48 UTC (permalink / raw)
To: Anthony Liguori
Cc: Avi Kivity, Dor Laor, Christoph Hellwig, qemu-devel, Vadim Rozenfeld
> Very simply, without idle bottom halves, there's no way to implement
> polling with the main loop. If we dropped idle bottom halves, we would
> have to add explicit polling back to the main loop.
>
> How would you implement polling?
AFAICS any sort of polling is by definition time based, so use a timer.
Forcing the user to explicitly decide how often to poll is a feature. If they
don't know how often to poll, then they probably shouldn't be using polling.
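In other words, something like this (a sketch with the period's timer
API; device_poll and the interval are made up, and choosing the interval
is deliberately the caller's problem):

#include "qemu-timer.h"

#define POLL_INTERVAL_MS 10    /* an explicit, visible choice */

extern void device_poll(void *dev);   /* hypothetical poll routine */

static QEMUTimer *poll_timer;

static void poll_cb(void *opaque)
{
    device_poll(opaque);

    /* re-arm; rt_clock ticks in milliseconds in this era's API */
    qemu_mod_timer(poll_timer,
                   qemu_get_clock(rt_clock) + POLL_INTERVAL_MS);
}

static void poll_start(void *dev)
{
    poll_timer = qemu_new_timer(rt_clock, poll_cb, dev);
    qemu_mod_timer(poll_timer,
                   qemu_get_clock(rt_clock) + POLL_INTERVAL_MS);
}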
> > I don't see how this helps. A self-triggering event with a timeout of
> > "now" is still an infinite loop. Any delay is a bugs in the dispatch
> > loop. "idle" BHs are relying on this bug.
>
> The main point is that BHs should not be implemented in the actual main
> loop and that "idle" BHs are really the only type of BHs that should
> exist as far as the main loop is concerned. s/"idle" BHs/idle
> callbacks/g and I think we're at a more agreeable place.
Part of my difficulty is that I don't have a clear idea what "idle" means. It
certainly isn't what qemu_bh_schedule_idle implements.
The only vaguely sane definition I can come up with is once the main loop has
run out of useful things to do and is about to suspend itself. Typically no
significant guest code will be executed between requesting the idle callback
and the callback occurring. In an SMP host environment it may be possible for
guest CPUs to trigger or observe intermediate events, but this cannot be
relied upon. Given this definition, I'm unclear how useful this would be.
A BH is a deferred callback that is used to allow events to be processed. IMO
the important feature is that it is deferred until after the current event
has been processed, which avoids a whole set of reentrancy problems. Of course, if
you misuse them you can cause infinite loops, in the same way that misusing a
regular callback will lead to infinite recursion.
I'm not sure that replacing BHs with zero-interval timers actually gains us
anything. From a user (device) perspective I'd be more inclined to make timers
trigger a BH when they expire, like the ptimer code does. Idle events can then be
handled in exactly the same way: the user provides a BH which is triggered the
next time the idle event occurs.
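i.e. roughly this shape (a sketch; MyDevice and my_device_work are
hypothetical, while the BH/timer calls are the real API):

typedef struct MyDevice {
    QEMUBH *bh;
    QEMUTimer *timer;
} MyDevice;   /* hypothetical device */

static void my_device_work(void *opaque);   /* the real work, runs as a BH */

static void my_timer_cb(void *opaque)
{
    MyDevice *d = opaque;

    /* the timer does nothing but kick the BH, so the device-visible
     * callback always runs in BH context with the usual guarantees */
    qemu_bh_schedule(d->bh);
}

static void my_device_init(MyDevice *d)
{
    d->bh = qemu_bh_new(my_device_work, d);
    d->timer = qemu_new_timer(vm_clock, my_timer_cb, d);
}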
The exact source of a call to a BH routine is an implementation detail.
The important thing is that they will never be invoked from within, or
concurrently with, any other device callback.
Paul
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-25 15:06 ` Paul Brook
2010-02-25 15:23 ` Anthony Liguori
@ 2010-02-25 17:11 ` Avi Kivity
2010-02-25 17:15 ` Anthony Liguori
1 sibling, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2010-02-25 17:11 UTC (permalink / raw)
To: Paul Brook; +Cc: Dor Laor, Christoph Hellwig, qemu-devel, Vadim Rozenfeld
On 02/25/2010 05:06 PM, Paul Brook wrote:
>>> Idle bottom halves (i.e. qemu_bh_schedule_idle) are just bugs waiting to
>>> happen, and should never be used for anything.
>>>
>> Idle bottom halves make considerably more sense than the normal bottom
>> halves.
>>
>> The fact that rescheduling a bottom half within a bottom half results in
>> an infinite loop is absurd. It is equally absurd that bottom halves
>> alter the select timeout. The result of that is that if a bottom half
>> schedules another bottom half, and that bottom half schedules the
>> previous, you get a tight infinite loop. Since bottom halves are often
>> used deep within functions, the result is very subtle infinite
>> loops (that we've absolutely encountered in the past).
>>
> I disagree. The "select timeout" is a completely irrelevant implementation
> detail. Anything that relies on it is just plain wrong. If you require a delay
> then you should be using a timer. If scheduling a BH directly then you should
> expect it to be processed without delay.
>
I agree. Further, once we fine-grain device threading, the iothread
essentially disappears and is replaced by device-specific threads.
There's no "idle" anymore.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-25 17:11 ` Avi Kivity
@ 2010-02-25 17:15 ` Anthony Liguori
2010-02-25 17:33 ` Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Anthony Liguori @ 2010-02-25 17:15 UTC (permalink / raw)
To: Avi Kivity
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Paul Brook, qemu-devel
On 02/25/2010 11:11 AM, Avi Kivity wrote:
> On 02/25/2010 05:06 PM, Paul Brook wrote:
>>>> Idle bottom halves (i.e. qemu_bh_schedule_idle) are just bugs
>>>> waiting to
>>>> happen, and should never be used for anything.
>>> Idle bottom halves make considerably more sense than the normal bottom
>>> halves.
>>>
>>> The fact that rescheduling a bottom half within a bottom half
>>> results in
>>> an infinite loop is absurd. It is equally absurd that bottom halves
>>> alter the select timeout. The result of that is that if a bottom half
>>> schedules another bottom half, and that bottom half schedules the
>>> previous, you get a tight infinite loop. Since bottom halves are often
>>> used deep within functions, the result is very subtle infinite
>>> loops (that we've absolutely encountered in the past).
>> I disagree. The "select timeout" is a completely irrelevant
>> implementation
>> detail. Anything that relies on it is just plain wrong. If you
>> require a delay
>> then you should be using a timer. If scheduling a BH directly then
>> you should
>> expect it to be processed without delay.
>
> I agree. Further, once we fine-grain device threading, the iothread
> essentially disappears and is replaced by device-specific threads.
> There's no "idle" anymore.
That's a nice idea, but how is io dispatch handled? Is everything
synchronous or do we continue to program asynchronously?
It's very difficult to mix concepts. I personally don't anticipate
per-device threading but rather anticipate re-entrant device models. I
would expect all I/O to be dispatched within the I/O thread and the VCPU
threads to be able to execute device models simultaneously with the I/O
thread.
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-25 17:15 ` Anthony Liguori
@ 2010-02-25 17:33 ` Avi Kivity
2010-02-25 18:05 ` malc
2010-02-25 19:55 ` Anthony Liguori
0 siblings, 2 replies; 36+ messages in thread
From: Avi Kivity @ 2010-02-25 17:33 UTC (permalink / raw)
To: Anthony Liguori
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Paul Brook, qemu-devel
On 02/25/2010 07:15 PM, Anthony Liguori wrote:
>> I agree. Further, once we fine-grain device threading, the iothread
>> essentially disappears and is replaced by device-specific threads.
>> There's no "idle" anymore.
>
>
> That's a nice idea, but how is io dispatch handled? Is everything
> synchronous or do we continue to program asynchronously?
Simple stuff can be kept asynchronous, complex stuff (like qcow2) ought
to be made synchronous (it uses threads anyway, so we don't lose
anything). Stuff like vnc can go either way.
>
> It's very difficult to mix concepts.
We're complicated enough to have conflicting requirements and a large
code base with its own inertia, so no choice really.
> I personally don't anticipate per-device threading but rather
> anticipate re-entrant device models. I would expect all I/O to be
> dispatched within the I/O thread and the VCPU threads to be able to
> execute device models simultaneously with the I/O thread.
That means long-running operations on the iothread can lock out other
completions.
Candidates for own threads are:
- live migration
- block format drivers (except linux-aio, perhaps have a thread for the
aio completion handler)
- vnc
- sdl
- sound?
- hotplug, esp. memory
Each such thread could run the same loop as the iothread. Any pollable
fd or timer would be associated with a thread, so things continue as
normal more or less. Unassociated objects continue with the main iothread.
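A rough shape for such a per-subsystem loop (a plain pthreads/select
sketch, not actual qemu code; registration and dispatch are elided):

#include <pthread.h>
#include <sys/select.h>

typedef struct EventLoop {
    pthread_t thread;
    fd_set rfds;               /* fds registered with this loop */
    int maxfd;
    volatile int running;
    /* per-loop timer list, handler table, etc. elided */
} EventLoop;

static void *event_loop_run(void *opaque)
{
    EventLoop *loop = opaque;

    while (loop->running) {
        fd_set rfds = loop->rfds;
        struct timeval tv = { .tv_sec = 0, .tv_usec = 10000 };

        if (select(loop->maxfd + 1, &rfds, NULL, NULL, &tv) > 0) {
            /* dispatch handlers for fds owned by this loop */
        }
        /* run this loop's expired timers */
    }
    return NULL;
}

static void event_loop_start(EventLoop *loop)
{
    loop->running = 1;
    pthread_create(&loop->thread, NULL, event_loop_run, loop);
}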
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-25 17:33 ` Avi Kivity
@ 2010-02-25 18:05 ` malc
2010-02-25 19:55 ` Anthony Liguori
1 sibling, 0 replies; 36+ messages in thread
From: malc @ 2010-02-25 18:05 UTC (permalink / raw)
To: Avi Kivity
Cc: Dor Laor, qemu-devel, Paul Brook, Christoph Hellwig, Vadim Rozenfeld
On Thu, 25 Feb 2010, Avi Kivity wrote:
> On 02/25/2010 07:15 PM, Anthony Liguori wrote:
> > > I agree. Further, once we fine-grain device threading, the iothread
> > > essentially disappears and is replaced by device-specific threads.
> > > There's no "idle" anymore.
> >
> >
> > That's a nice idea, but how is io dispatch handled? Is everything
> > synchronous or do we continue to program asynchronously?
>
> Simple stuff can be kept asynchronous, complex stuff (like qcow2) ought to be
> made synchronous (it uses threads anyway, so we don't lose anything). Stuff
> like vnc can go either way.
>
> >
> > It's very difficult to mix concepts.
>
> We're complicated enough to have conflicting requirements and a large code
> base with its own inertia, so no choice really.
>
> > I personally don't anticipate per-device threading but rather anticipate
> > re-entrant device models. I would expect all I/O to be dispatched within
> > the I/O thread and the VCPU threads to be able to execute device models
> > simultaneously with the I/O thread.
>
> That means long-running operations on the iothread can lock out other
> completions.
>
> Candidates for own threads are:
> - live migration
> - block format drivers (except linux-aio, perhaps have a thread for the aio
> completion handler)
> - vnc
> - sdl
> - sound?
Had(ve?) that on a private branch, there's no point in that complication.
> - hotplug, esp. memory
>
> Each such thread could run the same loop as the iothread. Any pollable fd or
> timer would be associated with a thread, so things continue as normal more or
> less. Unassociated objects continue with the main iothread.
>
>
--
mailto:av1474@comtv.ru
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-25 17:33 ` Avi Kivity
2010-02-25 18:05 ` malc
@ 2010-02-25 19:55 ` Anthony Liguori
2010-02-26 8:47 ` Avi Kivity
1 sibling, 1 reply; 36+ messages in thread
From: Anthony Liguori @ 2010-02-25 19:55 UTC (permalink / raw)
To: Avi Kivity
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Paul Brook, qemu-devel
On 02/25/2010 11:33 AM, Avi Kivity wrote:
> On 02/25/2010 07:15 PM, Anthony Liguori wrote:
>>> I agree. Further, once we fine-grain device threading, the iothread
>>> essentially disappears and is replaced by device-specific threads.
>>> There's no "idle" anymore.
>>
>>
>> That's a nice idea, but how is io dispatch handled? Is everything
>> synchronous or do we continue to program asynchronously?
>
> Simple stuff can be kept asynchronous, complex stuff (like qcow2)
> ought to be made synchronous (it uses threads anyway, so we don't lose
> anything). Stuff like vnc can go either way.
We've discussed this before and I still contend that threads do not make
qcow2 any simpler.
>>
>> It's very difficult to mix concepts.
>
> We're complicated enough to have conflicting requirements and a large
> code base with its own inertia, so no choice really.
>
>> I personally don't anticipate per-device threading but rather
>> anticipate re-entrant device models. I would expect all I/O to be
>> dispatched within the I/O thread and the VCPU threads to be able to
>> execute device models simultaneously with the I/O thread.
>
> That means long-running operations on the iothread can lock out other
> completions.
>
> Candidates for own threads are:
> - live migration
> - block format drivers (except linux-aio, perhaps have a thread for
> the aio completion handler)
> - vnc
> - sdl
> - sound?
> - hotplug, esp. memory
>
> Each such thread could run the same loop as the iothread. Any
> pollable fd or timer would be associated with a thread, so things
> continue as normal more or less. Unassociated objects continue with
> the main iothread.
Is the point latency or increasing available CPU resources? If the
device models are re-entrant, that reduces a ton of the demand on the
qemu_mutex, which means the IO thread can run uncontended. While we
have evidence that the VCPU threads and IO threads are competing with
each other today, I don't think we have any evidence to suggest that the
IO thread is starving itself with long-running events.
With the device model, I'd like to see us move toward a very well
defined API for each device to use. Part of the reason for this is to
limit the scope of the devices in such a way that we can enforce this at
compile time. Then we can introduce locking within devices with some
level of guarantee that we've covered the API devices are actually
consuming.
For host services though, it's much more difficult to isolate them like
this. I'm not necessarily claiming that this will never be the right
thing to do, but I don't think we really have the evidence today to
suggest that we should focus on this in the short term.
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-25 19:55 ` Anthony Liguori
@ 2010-02-26 8:47 ` Avi Kivity
2010-02-26 14:36 ` Anthony Liguori
0 siblings, 1 reply; 36+ messages in thread
From: Avi Kivity @ 2010-02-26 8:47 UTC (permalink / raw)
To: Anthony Liguori
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Paul Brook, qemu-devel
On 02/25/2010 09:55 PM, Anthony Liguori wrote:
> On 02/25/2010 11:33 AM, Avi Kivity wrote:
>> On 02/25/2010 07:15 PM, Anthony Liguori wrote:
>>>> I agree. Further, once we fine-grain device threading, the
>>>> iothread essentially disappears and is replaced by device-specific
>>>> threads. There's no "idle" anymore.
>>>
>>>
>>> That's a nice idea, but how is io dispatch handled? Is everything
>>> synchronous or do we continue to program asynchronously?
>>
>> Simple stuff can be kept asynchronous, complex stuff (like qcow2)
>> ought to be made synchronous (it uses threads anyway, so we don't
>> lose anything). Stuff like vnc can go either way.
>
> We've discussed this before and I still contend that threads do not
> make qcow2 any simpler.
qcow2 is still not fully asynchronous. All the other format drivers
(except raw) are fully synchronous. If we had a threaded
infrastructure, we could convert them all in a day. As it is, you can
only use the other block format drivers in 'qemu-img convert'.
>>
>> Each such thread could run the same loop as the iothread. Any
>> pollable fd or timer would be associated with a thread, so things
>> continue as normal more or less. Unassociated objects continue with
>> the main iothread.
>
> Is the point latency or increasing available CPU resources?
Yes.
> If the device models are re-entrant, that reduces a ton of the demand
> on the qemu_mutex, which means the IO thread can run uncontended.
> While we have evidence that the VCPU threads and IO threads are
> competing with each other today, I don't think we have any evidence to
> suggest that the IO thread is starving itself with long-running
> events.
I agree we have no evidence and that this is all speculation. But
consider a 64-vcpu guest: it has a 1:64 ratio of vcpu time (initiations)
to iothread time (completions). If each vcpu generates 5000 initiations
per second, the iothread needs to handle 320,000 completions per
second. At that rate you will see some internal competition. That
thread will also have a hard time shuffling data since every
completion's data will reside in the wrong cpu cache.
Note, an alternative to multiple iothreads is to move completion
handling back to vcpus, provided we can steer the handler close to the
guest completion handler.
>
> With the device model, I'd like to see us move toward a very well
> defined API for each device to use. Part of the reason for this is to
> limit the scope of the devices in such a way that we can enforce this
> at compile time. Then we can introduce locking within devices with
> some level of guarantee that we've covered the API devices are
> actually consuming.
Yes. On the other hand, the shape of the API will be influenced by the
locking model, so we'll have to take iterative steps, unless someone
comes out with a brilliant design.
>
> For host services though, it's much more difficult to isolate them
> like this.
What do you mean by host services?
> I'm not necessarily claiming that this will never be the right thing
> to do, but I don't think we really have the evidence today to suggest
> that we should focus on this in the short term.
Agreed. We will start to see evidence (one way or the other) as fully
loaded 64-vcpu guests are benchmarked. Another driver may be real-time
guests; if a timer can be deferred by some block device initiation or
completion, then we can say goodbye to any realtime guarantees we want
to make.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-26 8:47 ` Avi Kivity
@ 2010-02-26 14:36 ` Anthony Liguori
2010-02-26 15:39 ` Avi Kivity
0 siblings, 1 reply; 36+ messages in thread
From: Anthony Liguori @ 2010-02-26 14:36 UTC (permalink / raw)
To: Avi Kivity
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Paul Brook, qemu-devel
On 02/26/2010 02:47 AM, Avi Kivity wrote:
> qcow2 is still not fully asynchronous. All the other format drivers
> (except raw) are fully synchronous. If we had a threaded
> infrastructure, we could convert them all in a day. As it is, you can
> only use the other block format drivers in 'qemu-img convert'.
I've got a healthy amount of scepticism that it's that easy. But I'm
happy to consider patches :-)
>
>>>
>>> Each such thread could run the same loop as the iothread. Any
>>> pollable fd or timer would be associated with a thread, so things
>>> continue as normal more or less. Unassociated objects continue with
>>> the main iothread.
>>
>> Is the point latency or increasing available CPU resources?
>
> Yes.
>
>> If the device models are re-entrant, that reduces a ton of the demand
>> on the qemu_mutex, which means the IO thread can run uncontended.
>> While we have evidence that the VCPU threads and IO threads are
>> competing with each other today, I don't think we have any evidence
>> to suggest that the IO thread is starving itself with
>> long-running events.
>
> I agree we have no evidence and that this is all speculation. But
> consider a 64-vcpu guest: it has a 1:64 ratio of vcpu time
> (initiations) to iothread time (completions). If each vcpu generates
> 5000 initiations per second, the iothread needs to handle 320,000
> completions per second. At that rate you will see some internal
> competition. That thread will also have a hard time shuffling data
> since every completion's data will reside in the wrong cpu cache.
Ultimately, it depends on what you're optimizing for. If you've got a
64-vcpu guest on a 128-way box, then sure, we want to have 64 IO threads
because that will absolutely increase throughput.
But realistically, it's more likely that if you've got a 64-vcpu guest,
you're on a 1024-way box and you've got 64 guests running at once.
Having 64 IO threads per VM means you've got 4k threads floating. It's
still just as likely that one completion will get delayed by something
less important. Now with all of these threads on a box like this, you
get nasty NUMA interactions too.
The difference between the two models is that with threads, we rely on
pre-emption to enforce fairness and the Linux scheduler to perform
scheduling. With a single IO thread, we're determining execution order
and priority.
A lot of main loops have a notion of priority for timer and idle
callbacks. For something that is latency sensitive, you absolutely
could introduce the concept of priority for bottom halves. It would
ensure that a +1 priority bottom half would get scheduled before
handling any lower priority I/O/BHs.
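A sketch of what that could look like (an invented structure; the real
BH list of the period was a single flat list with no priorities):

#define BH_PRIO_LEVELS 3       /* 0 = highest priority */

typedef struct PrioBH {
    void (*cb)(void *opaque);
    void *opaque;
    int scheduled;
    struct PrioBH *next;
} PrioBH;

static PrioBH *bh_lists[BH_PRIO_LEVELS];

static void prio_bh_poll(void)
{
    int level;

    /* drain strictly in priority order; a BH that reschedules itself
     * is not re-run in the same pass, which keeps it from spinning
     * this function forever */
    for (level = 0; level < BH_PRIO_LEVELS; level++) {
        PrioBH *bh;

        for (bh = bh_lists[level]; bh; bh = bh->next) {
            if (bh->scheduled) {
                bh->scheduled = 0;
                bh->cb(bh->opaque);
            }
        }
    }
}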
> Note, an alternative to multiple iothreads is to move completion
> handling back to vcpus, provided we can steer the handler close to the
> guest completion handler.
Looking at something like linux-aio, I think we might actually want to
do that. We can submit the request from the VCPU thread and we can
certainly program the signal to get delivered to that VCPU thread.
Maintaining affinity for the request is likely a benefit.
>>
>> For host services though, it's much more difficult to isolate them
>> like this.
>
> What do you mean by host services?
Things like VNC and live migration. Things that aren't directly related
to a guest's activity. One model I can imagine is to continue to
relegate these things to a single IO thread, but then move device driven
callbacks either back to the originating thread or to a dedicated device
callback thread. Host services generally have a much lower priority.
>> I'm not necessarily claiming that this will never be the right thing
>> to do, but I don't think we really have the evidence today to suggest
>> that we should focus on this in the short term.
>
> Agreed. We will start to see evidence (one way or the other) as fully
> loaded 64-vcpu guests are benchmarked. Another driver may be
> real-time guests; if a timer can be deferred by some block device
> initiation or completion, then we can say goodbye to any realtime
> guarantees we want to make.
I'm wary of making decisions based on performance of a 64-vcpu guest.
It's an important workload to characterize because it's an extreme case
but I think 64 1-vcpu guests will continue to be significantly more
important than 1 64-vcpu guest.
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [Qemu-devel] Re: [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device
2010-02-26 14:36 ` Anthony Liguori
@ 2010-02-26 15:39 ` Avi Kivity
0 siblings, 0 replies; 36+ messages in thread
From: Avi Kivity @ 2010-02-26 15:39 UTC (permalink / raw)
To: Anthony Liguori
Cc: Vadim Rozenfeld, Dor Laor, Christoph Hellwig, Paul Brook, qemu-devel
On 02/26/2010 04:36 PM, Anthony Liguori wrote:
> On 02/26/2010 02:47 AM, Avi Kivity wrote:
>> qcow2 is still not fully asynchronous. All the other format drivers
>> (except raw) are fully synchronous. If we had a threaded
>> infrastructure, we could convert them all in a day. As it is, you
>> can only use the other block format drivers in 'qemu-img convert'.
>
> I've got a healthy amount of scepticism that it's that easy. But I'm
> happy to consider patches :-)
I'd be happy to have time to write them.
>
>>> If the device models are re-entrant, that reduces a ton of the
>>> demand on the qemu_mutex, which means the IO thread can run
>>> uncontended. While we have evidence that the VCPU threads and IO
>>> threads are competing with each other today, I don't think we have
>>> any evidence to suggest that the IO thread is starving itself
>>> with long-running events.
>>
>> I agree we have no evidence and that this is all speculation. But
>> consider a 64-vcpu guest: it has a 1:64 ratio of vcpu time
>> (initiations) to iothread time (completions). If each vcpu generates
>> 5000 initiations per second, the iothread needs to handle 320,000
>> completions per second. At that rate you will see some internal
>> competition. That thread will also have a hard time shuffling data
>> since every completion's data will reside in the wrong cpu cache.
>
> Ultimately, it depends on what you're optimizing for. If you've got a
> 64-vcpu guest on a 128-way box, then sure, we want to have 64 IO
> threads because that will absolutely increase throughput.
>
> But realistically, it's more likely that if you've got a 64-vcpu
> guest, you're on a 1024-way box and you've got 64 guests running at
> once. Having 64 IO threads per VM means you've got 4k threads
> floating. It's still just as likely that one completion will get
> delayed by something less important. Now with all of these threads on
> a box like this, you get nasty NUMA interactions too.
I'm not suggesting scaling out - the number of vcpus (across all
guests) will usually be higher than the number of cpus. But if you have
multiple device threads, the scheduler has flexibility in placing them
around and filling bubbles. A single heavily loaded iothread is more
difficult to place well.
> The difference between the two models is that with threads, we rely on
> pre-emption to enforce fairness and the Linux scheduler to perform
> scheduling. With a single IO thread, we're determining execution
> order and priority.
We could define priorities with multiple threads as well (using thread
priorities), and we'd never have a short task delayed behind a long
task, unless the host is out of resources.
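Concretely, with plain pthreads (a sketch; real code would check the
return value and needs privileges for real-time scheduling classes):

#include <pthread.h>
#include <sched.h>

static int set_thread_priority(pthread_t thread, int prio)
{
    struct sched_param param = { .sched_priority = prio };

    /* SCHED_FIFO threads preempt SCHED_OTHER threads immediately, so
     * a short, high-priority completion handler never waits behind a
     * long-running low-priority task */
    return pthread_setschedparam(thread, SCHED_FIFO, &param);
}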
>
> A lot of main loops have a notion of priority for timer and idle
> callbacks. For something that is latency sensitive, you absolutely
> could introduce the concept of priority for bottom halves. It would
> ensure that a +1 priority bottom half would get scheduled before
> handling any lower priority I/O/BHs.
What if it becomes available after the low prio task has started to run?
>
>> Note, an alternative to multiple iothreads is to move completion
>> handling back to vcpus, provided we can steer the handler close to
>> the guest completion handler.
>
> Looking at something like linux-aio, I think we might actually want to
> do that. We can submit the request from the VCPU thread and we can
> certainly program the signal to get delivered to that VCPU thread.
> Maintaining affinity for the request is likely a benefit.
Likely to benefit when we have multiqueue virtio.
>
>>>
>>> For host services though, it's much more difficult to isolate them
>>> like this.
>>
>> What do you mean by host services?
>
> Things like VNC and live migration. Things that aren't directly
> related to a guest's activity. One model I can imagine is to continue
> to relegate these things to a single IO thread, but then move device
> driven callbacks either back to the originating thread or to a
> dedicated device callback thread. Host services generally have a much
> lower priority.
Or just 'a thread'. Nothing prevents vnc or live migration from running
in a thread, using the current code.
>
>>> I'm not necessarily claiming that this will never be the right thing
>>> to do, but I don't think we really have the evidence today to
>>> suggest that we should focus on this in the short term.
>>
>> Agreed. We will start to see evidence (one way or the other) as
>> fully loaded 64-vcpu guests are benchmarked. Another driver may be
>> real-time guests; if a timer can be deferred by some block device
>> initiation or completion, then we can say goodbye to any realtime
>> guarantees we want to make.
>
> I'm wary of making decisions based on performance of a 64-vcpu guest.
> It's an important workload to characterize because it's an extreme
> case but I think 64 1-vcpu guests will continue to be significantly
> more important than 1 64-vcpu guest.
Agreed. 64-vcpu guests will make the headlines and marketing
checklists, though.
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
^ permalink raw reply [flat|nested] 36+ messages in thread
end of thread
Thread overview: 36+ messages
2010-01-11 7:40 [Qemu-devel] [RFC][PATCH] performance improvement for windows guests, running on top of virtio block device Vadim Rozenfeld
2010-01-11 8:30 ` [Qemu-devel] " Avi Kivity
[not found] ` <4B4AE95D.7080305@redhat.com>
2010-01-11 9:19 ` Dor Laor
2010-01-11 13:11 ` Christoph Hellwig
2010-01-11 13:13 ` Avi Kivity
2010-01-11 13:16 ` Christoph Hellwig
2010-01-11 13:47 ` Christoph Hellwig
2010-01-11 14:00 ` Anthony Liguori
2010-02-24 2:58 ` Paul Brook
2010-02-24 14:59 ` Anthony Liguori
2010-02-25 15:06 ` Paul Brook
2010-02-25 15:23 ` Anthony Liguori
2010-02-25 16:48 ` Paul Brook
2010-02-25 17:11 ` Avi Kivity
2010-02-25 17:15 ` Anthony Liguori
2010-02-25 17:33 ` Avi Kivity
2010-02-25 18:05 ` malc
2010-02-25 19:55 ` Anthony Liguori
2010-02-26 8:47 ` Avi Kivity
2010-02-26 14:36 ` Anthony Liguori
2010-02-26 15:39 ` Avi Kivity
2010-01-11 13:42 ` Christoph Hellwig
2010-01-11 13:49 ` Anthony Liguori
2010-01-11 14:29 ` Avi Kivity
2010-01-11 14:37 ` Anthony Liguori
2010-01-11 14:46 ` Avi Kivity
2010-01-11 15:13 ` Anthony Liguori
2010-01-11 15:19 ` Avi Kivity
2010-01-11 15:22 ` Anthony Liguori
2010-01-11 15:31 ` Avi Kivity
2010-01-11 15:32 ` Anthony Liguori
2010-01-11 15:35 ` Avi Kivity
2010-01-11 15:38 ` Anthony Liguori
2010-01-11 18:22 ` [Qemu-devel] " Michael S. Tsirkin
2010-01-11 18:20 ` Michael S. Tsirkin
2010-01-11 14:25 ` [Qemu-devel] " Avi Kivity