* [Qemu-devel] [PATCH v2 0/4] block: fix I/O throttling oscillations
@ 2013-04-05  9:32 Stefan Hajnoczi
From: Stefan Hajnoczi @ 2013-04-05  9:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoît Canet, Stefan Hajnoczi, Zhi Yong Wu

Benoît Canet <benoit@irqsave.net> reported that QEMU I/O throttling can
oscillate under continuous I/O.  The test case runs 50 threads performing
random writes and a -drive iops=150 limit is used.

Since QEMU I/O throttling is implemented using 100 millisecond time slices,
we'd expect 150 +/- 15 IOPS.  Anything outside that range indicates a problem
with the I/O throttling algorithm.
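
The +/- 15 figure is simply one slice's worth of budget.  A minimal check
with illustrative numbers (not code from this series):

  /* 100 ms slices at iops=150 give a budget of 15 requests per slice, so
   * a one-second measurement should stay within one slice's budget. */
  #include <stdio.h>

  int main(void)
  {
      double slice_seconds = 0.1;     /* 100 ms throttling slice */
      double iops_limit    = 150.0;   /* -drive iops=150 */
      double ops_per_slice = iops_limit * slice_seconds;

      printf("%.0f ops per slice -> expect %.0f +/- %.0f IOPS\n",
             ops_per_slice, iops_limit, ops_per_slice);
      return 0;
  }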

It turned out that even a single thread performing sequential I/O continuously
is throttled outside the 150 +/- 15 IOPS range.  The continuous stream of I/O
slows down as time goes on but resets to 150 IOPS again when interrupted.  This
can be tested with:

  $ iostat -d 1 -x /dev/vdb &
  $ dd if=/dev/vdb of=/dev/null bs=4096 iflag=direct

This patch series addresses these problems as follows:

1. Account for I/O requests when they are submitted instead of completed.  This
   ensures that we do not exceed the budget for this slice.  Exceeding the
   budget leads to fluctuations since we have to make up for this later.

2. Use a constant 100 millisecond slice time.  Adjusting the slice time at
   run-time led to oscillations.  Since the reason for adjusting the slice
   time is not clear, drop this behavior (see the sketch below).
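
The sketch below is not the QEMU code; it is a minimal illustration of
charging requests at submission time against a fixed-length slice.  The
names (Throttle, throttle_submit, SLICE_NS) and the simplified budget
check are made up for illustration:

  #include <stdbool.h>
  #include <stdint.h>

  #define SLICE_NS 100000000LL                /* constant 100 ms slice */

  typedef struct {
      int64_t slice_start;                    /* ns */
      int64_t slice_end;                      /* ns */
      int64_t ios_submitted;                  /* ops charged this slice */
      int64_t iops_limit;                     /* ops per second */
  } Throttle;

  /* Charge the request at submission time; return false if it must wait. */
  static bool throttle_submit(Throttle *t, int64_t now)
  {
      if (now > t->slice_end) {               /* previous slice expired */
          t->slice_start   = now;
          t->slice_end     = now + SLICE_NS;
          t->ios_submitted = 0;
      }

      double slice_seconds = (t->slice_end - t->slice_start) / 1e9;
      if (t->ios_submitted + 1 > t->iops_limit * slice_seconds) {
          return false;                       /* over budget, caller waits */
      }

      t->ios_submitted++;                     /* account on submission */
      return true;
  }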

I have also included two code clean-up patches.

Tested-By: Benoit Canet <benoit@irqsave.net>

v2:
 * Account slice_submitted after both bps and iops checks pass [kwolf]

Stefan Hajnoczi (4):
  block: fix I/O throttling accounting blind spot
  block: keep I/O throttling slice time constant
  block: drop duplicated slice extension code
  block: clean up I/O throttling wait_time code

 block.c                   | 49 +++++++++++++++++++++--------------------------
 blockdev.c                |  1 -
 include/block/block_int.h |  3 +--
 3 files changed, 23 insertions(+), 30 deletions(-)

-- 
1.8.1.4

* [Qemu-devel] [PATCH v2 1/4] block: fix I/O throttling accounting blind spot
From: Stefan Hajnoczi @ 2013-04-05  9:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoît Canet, Stefan Hajnoczi, Zhi Yong Wu

I/O throttling relies on bdrv_acct_done() which is called when a request
completes.  This leaves a blind spot since we only charge for completed
requests, not submitted requests.

For example, if there is 1 operation remaining in this time slice, the
guest could submit 3 operations and all of them would be accepted, since
they don't actually get accounted for until they complete.

Originally we probably thought this was okay since the requests would be
accounted for when the time slice was extended.  In practice it causes
fluctuations, since the guest can exceed its I/O limit and is then
punished for it later on.

Account for I/O upon submission so that I/O limits are enforced
properly.
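
To illustrate the difference with two hypothetical helpers (not code from
this patch): with completion-time accounting, several in-flight requests
can pass the same budget check before any of them finishes; charging at
submission time closes that gap.

  #include <stdbool.h>

  /* Blind spot: only completed requests are visible to the check, so three
   * back-to-back submissions all see the same "1 op left" budget. */
  static bool may_submit_completion_accounted(int completed, int budget)
  {
      return completed + 1 <= budget;      /* in-flight ops are invisible */
  }

  /* Submission-time accounting: each accepted request is charged at once,
   * so the second and third requests have to wait. */
  static bool may_submit_submission_accounted(int *submitted, int budget)
  {
      if (*submitted + 1 > budget) {
          return false;
      }
      (*submitted)++;
      return true;
  }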

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-By: Benoit Canet <benoit@irqsave.net>
---
 block.c                   | 21 ++++++++++-----------
 include/block/block_int.h |  2 +-
 2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 0ae2e93..25976b5 100644
--- a/block.c
+++ b/block.c
@@ -141,7 +141,6 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
     bs->slice_start = 0;
     bs->slice_end   = 0;
     bs->slice_time  = 0;
-    memset(&bs->io_base, 0, sizeof(bs->io_base));
 }
 
 static void bdrv_block_timer(void *opaque)
@@ -1436,8 +1435,8 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
     bs_dest->slice_time         = bs_src->slice_time;
     bs_dest->slice_start        = bs_src->slice_start;
     bs_dest->slice_end          = bs_src->slice_end;
+    bs_dest->slice_submitted    = bs_src->slice_submitted;
     bs_dest->io_limits          = bs_src->io_limits;
-    bs_dest->io_base            = bs_src->io_base;
     bs_dest->throttled_reqs     = bs_src->throttled_reqs;
     bs_dest->block_timer        = bs_src->block_timer;
     bs_dest->io_limits_enabled  = bs_src->io_limits_enabled;
@@ -3768,9 +3767,9 @@ static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
     slice_time = bs->slice_end - bs->slice_start;
     slice_time /= (NANOSECONDS_PER_SECOND);
     bytes_limit = bps_limit * slice_time;
-    bytes_base  = bs->nr_bytes[is_write] - bs->io_base.bytes[is_write];
+    bytes_base  = bs->slice_submitted.bytes[is_write];
     if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
-        bytes_base += bs->nr_bytes[!is_write] - bs->io_base.bytes[!is_write];
+        bytes_base += bs->slice_submitted.bytes[!is_write];
     }
 
     /* bytes_base: the bytes of data which have been read/written; and
@@ -3828,9 +3827,9 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
     slice_time = bs->slice_end - bs->slice_start;
     slice_time /= (NANOSECONDS_PER_SECOND);
     ios_limit  = iops_limit * slice_time;
-    ios_base   = bs->nr_ops[is_write] - bs->io_base.ios[is_write];
+    ios_base   = bs->slice_submitted.ios[is_write];
     if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
-        ios_base += bs->nr_ops[!is_write] - bs->io_base.ios[!is_write];
+        ios_base += bs->slice_submitted.ios[!is_write];
     }
 
     if (ios_base + 1 <= ios_limit) {
@@ -3875,11 +3874,7 @@ static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
         bs->slice_start = now;
         bs->slice_end   = now + bs->slice_time;
 
-        bs->io_base.bytes[is_write]  = bs->nr_bytes[is_write];
-        bs->io_base.bytes[!is_write] = bs->nr_bytes[!is_write];
-
-        bs->io_base.ios[is_write]    = bs->nr_ops[is_write];
-        bs->io_base.ios[!is_write]   = bs->nr_ops[!is_write];
+        memset(&bs->slice_submitted, 0, sizeof(bs->slice_submitted));
     }
 
     elapsed_time  = now - bs->slice_start;
@@ -3907,6 +3902,10 @@ static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
         *wait = 0;
     }
 
+    bs->slice_submitted.bytes[is_write] += (int64_t)nb_sectors *
+                                           BDRV_SECTOR_SIZE;
+    bs->slice_submitted.ios[is_write]++;
+
     return false;
 }
 
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0986a2d..83941d8 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -256,7 +256,7 @@ struct BlockDriverState {
     int64_t slice_start;
     int64_t slice_end;
     BlockIOLimit io_limits;
-    BlockIOBaseValue  io_base;
+    BlockIOBaseValue slice_submitted;
     CoQueue      throttled_reqs;
     QEMUTimer    *block_timer;
     bool         io_limits_enabled;
-- 
1.8.1.4

* [Qemu-devel] [PATCH v2 2/4] block: keep I/O throttling slice time constant
From: Stefan Hajnoczi @ 2013-04-05  9:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoît Canet, Stefan Hajnoczi, Zhi Yong Wu

It is not necessary to adjust the slice time at runtime.  We already
extend the current slice in order to carry over accounting into the next
slice.  Changing the actual slice time value introduces oscillations.

The guest may experience large changes in throughput or IOPS from one
moment to the next when slice times are adjusted.
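
A rough sketch of the carry-over rule this leaves in place (names and
constants are illustrative, the real change is in the hunks below): the
slice length itself stays fixed, and when a request has to wait the slice
end is pushed out by whole slices.

  #include <stdint.h>

  #define SLICE_NS 100000000LL             /* fixed 100 ms slice, in ns */

  /* Extend the slice end by enough whole slices to cover wait_seconds,
   * instead of recomputing the slice time from the wait time. */
  static int64_t extend_slice_end(int64_t slice_end, double wait_seconds)
  {
      int64_t extension = (int64_t)(wait_seconds * 1000000000.0);

      extension = (extension + SLICE_NS - 1) / SLICE_NS * SLICE_NS;
      return slice_end + extension;
  }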

Reported-by: Benoît Canet <benoit@irqsave.net>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-By: Benoit Canet <benoit@irqsave.net>
---
 block.c                   | 19 +++++++++----------
 blockdev.c                |  1 -
 include/block/block_int.h |  1 -
 3 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/block.c b/block.c
index 25976b5..00eca27 100644
--- a/block.c
+++ b/block.c
@@ -140,7 +140,6 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 
     bs->slice_start = 0;
     bs->slice_end   = 0;
-    bs->slice_time  = 0;
 }
 
 static void bdrv_block_timer(void *opaque)
@@ -1432,7 +1431,6 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
     bs_dest->enable_write_cache = bs_src->enable_write_cache;
 
     /* i/o timing parameters */
-    bs_dest->slice_time         = bs_src->slice_time;
     bs_dest->slice_start        = bs_src->slice_start;
     bs_dest->slice_end          = bs_src->slice_end;
     bs_dest->slice_submitted    = bs_src->slice_submitted;
@@ -3749,6 +3747,7 @@ static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
                  bool is_write, double elapsed_time, uint64_t *wait)
 {
     uint64_t bps_limit = 0;
+    uint64_t extension;
     double   bytes_limit, bytes_base, bytes_res;
     double   slice_time, wait_time;
 
@@ -3796,8 +3795,10 @@ static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
      * info can be kept until the timer fire, so it is increased and tuned
      * based on the result of experiment.
      */
-    bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
-    bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
+    extension = wait_time * NANOSECONDS_PER_SECOND;
+    extension = DIV_ROUND_UP(extension, BLOCK_IO_SLICE_TIME) *
+                BLOCK_IO_SLICE_TIME;
+    bs->slice_end += extension;
     if (wait) {
         *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
     }
@@ -3848,8 +3849,8 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
         wait_time = 0;
     }
 
-    bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
-    bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
+    /* Exceeded current slice, extend it by another slice time */
+    bs->slice_end += BLOCK_IO_SLICE_TIME;
     if (wait) {
         *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
     }
@@ -3868,12 +3869,10 @@ static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
     now = qemu_get_clock_ns(vm_clock);
     if ((bs->slice_start < now)
         && (bs->slice_end > now)) {
-        bs->slice_end = now + bs->slice_time;
+        bs->slice_end = now + BLOCK_IO_SLICE_TIME;
     } else {
-        bs->slice_time  =  5 * BLOCK_IO_SLICE_TIME;
         bs->slice_start = now;
-        bs->slice_end   = now + bs->slice_time;
-
+        bs->slice_end   = now + BLOCK_IO_SLICE_TIME;
         memset(&bs->slice_submitted, 0, sizeof(bs->slice_submitted));
     }
 
diff --git a/blockdev.c b/blockdev.c
index 8cdc9ce..6dc999d 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1069,7 +1069,6 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd,
     }
 
     bs->io_limits = io_limits;
-    bs->slice_time = BLOCK_IO_SLICE_TIME;
 
     if (!bs->io_limits_enabled && bdrv_io_limits_enabled(bs)) {
         bdrv_io_limits_enable(bs);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 83941d8..9aa98b5 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -252,7 +252,6 @@ struct BlockDriverState {
     unsigned int copy_on_read_in_flight;
 
     /* the time for latest disk I/O */
-    int64_t slice_time;
     int64_t slice_start;
     int64_t slice_end;
     BlockIOLimit io_limits;
-- 
1.8.1.4

* [Qemu-devel] [PATCH v2 3/4] block: drop duplicated slice extension code
From: Stefan Hajnoczi @ 2013-04-05  9:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoît Canet, Stefan Hajnoczi, Zhi Yong Wu

The current slice is extended when an I/O request exceeds the limit.
There is no need to extend the slice every time we check a request.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-By: Benoit Canet <benoit@irqsave.net>
---
 block.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/block.c b/block.c
index 00eca27..aa16fc4 100644
--- a/block.c
+++ b/block.c
@@ -3867,10 +3867,7 @@ static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
     int      bps_ret, iops_ret;
 
     now = qemu_get_clock_ns(vm_clock);
-    if ((bs->slice_start < now)
-        && (bs->slice_end > now)) {
-        bs->slice_end = now + BLOCK_IO_SLICE_TIME;
-    } else {
+    if (now > bs->slice_end) {
         bs->slice_start = now;
         bs->slice_end   = now + BLOCK_IO_SLICE_TIME;
         memset(&bs->slice_submitted, 0, sizeof(bs->slice_submitted));
-- 
1.8.1.4

* [Qemu-devel] [PATCH v2 4/4] block: clean up I/O throttling wait_time code
From: Stefan Hajnoczi @ 2013-04-05  9:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Benoît Canet, Stefan Hajnoczi, Zhi Yong Wu

The wait_time variable is in seconds.  Reflect this in a comment and use
NANOSECONDS_PER_SECOND instead of BLOCK_IO_SLICE_TIME * 10 (which
happens to have the right value).
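
A quick unit check of why the old expression worked, assuming that
BLOCK_IO_SLICE_TIME is the 100 millisecond slice expressed in nanoseconds
(values shown here are illustrative):

  #include <assert.h>

  #define BLOCK_IO_SLICE_TIME    100000000LL   /* assumed: 100 ms in ns */
  #define NANOSECONDS_PER_SECOND 1000000000LL

  int main(void)
  {
      /* wait_time is in seconds, so multiplying by BLOCK_IO_SLICE_TIME * 10
       * only gave the right result because that product happens to equal
       * one second in nanoseconds. */
      assert(BLOCK_IO_SLICE_TIME * 10 == NANOSECONDS_PER_SECOND);
      return 0;
  }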

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-By: Benoit Canet <benoit@irqsave.net>
---
 block.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index aa16fc4..602d8a4 100644
--- a/block.c
+++ b/block.c
@@ -3800,7 +3800,7 @@ static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
                 BLOCK_IO_SLICE_TIME;
     bs->slice_end += extension;
     if (wait) {
-        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
+        *wait = wait_time * NANOSECONDS_PER_SECOND;
     }
 
     return true;
@@ -3841,7 +3841,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
         return false;
     }
 
-    /* Calc approx time to dispatch */
+    /* Calc approx time to dispatch, in seconds */
     wait_time = (ios_base + 1) / iops_limit;
     if (wait_time > elapsed_time) {
         wait_time = wait_time - elapsed_time;
@@ -3852,7 +3852,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
     /* Exceeded current slice, extend it by another slice time */
     bs->slice_end += BLOCK_IO_SLICE_TIME;
     if (wait) {
-        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
+        *wait = wait_time * NANOSECONDS_PER_SECOND;
     }
 
     return true;
-- 
1.8.1.4

* Re: [Qemu-devel] [PATCH v2 0/4] block: fix I/O throttling oscillations
From: Kevin Wolf @ 2013-04-05 12:32 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Zhi Yong Wu, qemu-devel, Benoît Canet

On 05.04.2013 at 11:32, Stefan Hajnoczi wrote:
> Benoît Canet <benoit@irqsave.net> reported that QEMU I/O throttling can
> oscillate under continuous I/O.  The test case runs 50 threads performing
> random writes and a -drive iops=150 limit is used.
> 
> Since QEMU I/O throttling is implemented using 100 millisecond time slices,
> we'd expect 150 +/- 15 IOPS.  Anything outside that range indicates a problem
> with the I/O throttling algorithm.
> 
> It turned out that even a single thread performing sequential I/O continuously
> is throttled outside the 150 +/- 15 IOPS range.  The continous stream of I/O
> slows down as time goes on but resets to 150 IOPS again when interrupted.  This
> can be tested with:
> 
>   $ iostat -d 1 -x /dev/vdb &
>   $ dd if=/dev/vdb of=/dev/null bs=4096 iflag=direct
> 
> This patch series addresses these problems as follows:
> 
> 1. Account for I/O requests when they are submitted instead of completed.  This
>    ensures that we do not exceed the budget for this slice.  Exceeding the
>    budget leads to fluctuations since we have to make up for this later.
> 
> 2. Use constant 100 millisecond slice time.  Adjusting the slice time at
>    run-time led to oscillations.  Since the reason for adjusting slice time is
>    not clear, drop this behavior.
> 
> I have also included two code clean-up patches.
> 
> Tested-By: Benoit Canet <benoit@irqsave.net>
> 
> v2:
>  * Account slice_submitted after both bps and iops checks pass [kwolf]

Thanks, applied all to the block branch.

Kevin
