* [Qemu-devel] block migration and MAX_IN_FLIGHT_IO
@ 2018-02-22 11:13 Peter Lieven
  2018-03-05 11:45 ` Stefan Hajnoczi
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Lieven @ 2018-02-22 11:13 UTC (permalink / raw)
  To: Stefan Hajnoczi, Dr. David Alan Gilbert, Fam Zheng,
	Juan Quintela, wency, qemu block, qemu-devel

Hi,


I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
and have fast storage on the destination VM.

In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?

Peter

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] block migration and MAX_IN_FLIGHT_IO
  2018-02-22 11:13 [Qemu-devel] block migration and MAX_IN_FLIGHT_IO Peter Lieven
@ 2018-03-05 11:45 ` Stefan Hajnoczi
  2018-03-05 14:37   ` Peter Lieven
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Hajnoczi @ 2018-03-05 11:45 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Dr. David Alan Gilbert, Fam Zheng, Juan Quintela, wency,
	qemu block, qemu-devel

On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
> and have fast storage on the destination VM.
> 
> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?

I don't know if benchmarks were run when choosing the value.  From the
commit description it sounds like the main purpose was to limit the
amount of memory that can be consumed.

16 MB also fulfills that criterion :), but why is the source VM more
responsive with a lower value?

Perhaps the issue is queue depth on the storage device - the block
migration code enqueues up to 512 MB worth of reads, and guest I/O has
to wait?
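
To put the queue-depth hypothesis into numbers (assuming the 1 MiB BLOCK_SIZE that migration/block.c used at the time, and a typical benchmark queue depth as a yardstick):

```python
# Rough scale of the current limit: MAX_INFLIGHT_IO counts migration
# blocks, so the 512 MB byte budget doubles as a request count far above
# a typical device queue depth.
BLOCK_SIZE = 1 << 20           # 1 MiB per migration block (assumed)
MAX_INFLIGHT_IO = 512          # the constant under discussion

budget_bytes = MAX_INFLIGHT_IO * BLOCK_SIZE
typical_queue_depth = 64       # upper end of common benchmark depths

print(budget_bytes >> 20, "MiB buffered;",
      MAX_INFLIGHT_IO // typical_queue_depth, "x a 64-deep device queue")
```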

Stefan


* Re: [Qemu-devel] block migration and MAX_IN_FLIGHT_IO
  2018-03-05 11:45 ` Stefan Hajnoczi
@ 2018-03-05 14:37   ` Peter Lieven
  2018-03-05 14:52     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Lieven @ 2018-03-05 14:37 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Dr. David Alan Gilbert, Fam Zheng, Juan Quintela, wency,
	qemu block, qemu-devel

Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
>> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
>> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
>> and have fast storage on the destination VM.
>>
>> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
>> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
> I don't know if benchmarks were run when choosing the value.  From the
> commit description it sounds like the main purpose was to limit the
> amount of memory that can be consumed.
>
> 16 MB also fulfills that criterion :), but why is the source VM more
> responsive with a lower value?
>
> Perhaps the issue is queue depth on the storage device - the block
> migration code enqueues up to 512 MB worth of reads, and guest I/O has
> to wait?

That is my guess. Especially if the destination storage is faster, we basically always have
512 I/Os in flight on the source storage.

Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?

Peter


* Re: [Qemu-devel] block migration and MAX_IN_FLIGHT_IO
  2018-03-05 14:37   ` Peter Lieven
@ 2018-03-05 14:52     ` Dr. David Alan Gilbert
  2018-03-06 16:07       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
  2018-03-06 16:14       ` [Qemu-devel] " Peter Lieven
  0 siblings, 2 replies; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2018-03-05 14:52 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Stefan Hajnoczi, Fam Zheng, Juan Quintela, wency, qemu block, qemu-devel

* Peter Lieven (pl@kamp.de) wrote:
> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
> > On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
> >> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
> >> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
> >> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
> >> and have fast storage on the destination VM.
> >>
> >> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
> >> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
> > I don't know if benchmarks were run when choosing the value.  From the
> > commit description it sounds like the main purpose was to limit the
> > amount of memory that can be consumed.
> >
> > 16 MB also fulfills that criterion :), but why is the source VM more
> > responsive with a lower value?
> >
> > Perhaps the issue is queue depth on the storage device - the block
> > migration code enqueues up to 512 MB worth of reads, and guest I/O has
> > to wait?
> 
> That is my guess. Especially if the destination storage is faster, we basically always have
> 512 I/Os in flight on the source storage.
> 
> Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?

We've got migration-parameters these days; you could connect it to one
of those fairly easily I think.
Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
already there.
Then you can set it to whatever you like.
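
As a sketch of what this would look like, an existing parameter of this kind is set over QMP as below (cpu-throttle-initial is the real parameter Dave mentions; a block-migration knob, if added, would presumably follow the same pattern):

```json
{ "execute": "migrate-set-parameters",
  "arguments": { "cpu-throttle-initial": 30 } }
```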

Dave

> Peter
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO
  2018-03-05 14:52     ` Dr. David Alan Gilbert
@ 2018-03-06 16:07       ` Stefan Hajnoczi
  2018-03-06 16:35         ` Peter Lieven
  2018-03-06 16:14       ` [Qemu-devel] " Peter Lieven
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Hajnoczi @ 2018-03-06 16:07 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Peter Lieven, Fam Zheng, wency, qemu block, Juan Quintela,
	qemu-devel, Stefan Hajnoczi

On Mon, Mar 05, 2018 at 02:52:16PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Lieven (pl@kamp.de) wrote:
> > Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
> > > On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
> > >> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
> > >> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
> > >> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
> > >> and have fast storage on the destination VM.
> > >>
> > >> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
> > >> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
> > > I don't know if benchmarks were run when choosing the value.  From the
> > > commit description it sounds like the main purpose was to limit the
> > > amount of memory that can be consumed.
> > >
> > > 16 MB also fulfills that criterion :), but why is the source VM more
> > > responsive with a lower value?
> > >
> > > Perhaps the issue is queue depth on the storage device - the block
> > > migration code enqueues up to 512 MB worth of reads, and guest I/O has
> > > to wait?
> > 
> > That is my guess. Especially if the destination storage is faster, we basically always have
> > 512 I/Os in flight on the source storage.
> > 
> > Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?
> 
> We've got migration-parameters these days; you could connect it to one
> of those fairly easily I think.
> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
> already there.
> Then you can set it to whatever you like.

It would be nice to solve the performance problem without adding a
tuneable.

On the other hand, QEMU has no idea what the queue depth of the device
is.  Therefore it cannot prioritize guest I/O over block migration I/O.

512 parallel requests is much too high.  Most parallel I/O benchmarking
is done at 32-64 queue depth.

I think that 16 parallel requests is a reasonable maximum number for a
background job.

We need to be clear though that the purpose of this change is unrelated
to the original 512 MB memory footprint goal.  It just happens to touch
the same constant but the goal is now to submit at most 16 I/O requests
in parallel to avoid monopolizing the I/O device.

Stefan


* Re: [Qemu-devel] block migration and MAX_IN_FLIGHT_IO
  2018-03-05 14:52     ` Dr. David Alan Gilbert
  2018-03-06 16:07       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2018-03-06 16:14       ` Peter Lieven
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Lieven @ 2018-03-06 16:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Stefan Hajnoczi, Fam Zheng, Juan Quintela, wency, qemu block, qemu-devel

Am 05.03.2018 um 15:52 schrieb Dr. David Alan Gilbert:
> * Peter Lieven (pl@kamp.de) wrote:
>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
>>>> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
>>>> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
>>>> and have fast storage on the destination VM.
>>>>
>>>> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
>>>> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
>>> I don't know if benchmarks were run when choosing the value.  From the
>>> commit description it sounds like the main purpose was to limit the
>>> amount of memory that can be consumed.
>>>
>>> 16 MB also fulfills that criterion :), but why is the source VM more
>>> responsive with a lower value?
>>>
>>> Perhaps the issue is queue depth on the storage device - the block
>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>> to wait?
>> That is my guess. Especially if the destination storage is faster, we basically always have
>> 512 I/Os in flight on the source storage.
>>
>> Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?
> We've got migration-parameters these days; you could connect it to one
> of those fairly easily I think.
> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
> already there.
> Then you can set it to whatever you like.

I will have a look at this.

Thank you,
Peter

>
> Dave
>
>> Peter
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO
  2018-03-06 16:07       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
@ 2018-03-06 16:35         ` Peter Lieven
  2018-03-07  7:55           ` Peter Lieven
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Lieven @ 2018-03-06 16:35 UTC (permalink / raw)
  To: Stefan Hajnoczi, Dr. David Alan Gilbert
  Cc: Fam Zheng, wency, qemu block, Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
> On Mon, Mar 05, 2018 at 02:52:16PM +0000, Dr. David Alan Gilbert wrote:
>> * Peter Lieven (pl@kamp.de) wrote:
>>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>>>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
>>>>> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
>>>>> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
>>>>> and have fast storage on the destination VM.
>>>>>
>>>>> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
>>>>> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
>>>> I don't know if benchmarks were run when choosing the value.  From the
>>>> commit description it sounds like the main purpose was to limit the
>>>> amount of memory that can be consumed.
>>>>
>>>> 16 MB also fulfills that criterion :), but why is the source VM more
>>>> responsive with a lower value?
>>>>
>>>> Perhaps the issue is queue depth on the storage device - the block
>>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>>> to wait?
>>> That is my guess. Especially if the destination storage is faster, we basically always have
>>> 512 I/Os in flight on the source storage.
>>>
>>> Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?
>> We've got migration-parameters these days; you could connect it to one
>> of those fairly easily I think.
>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>> already there.
>> Then you can set it to whatever you like.
> It would be nice to solve the performance problem without adding a
> tuneable.
>
> On the other hand, QEMU has no idea what the queue depth of the device
> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>
> 512 parallel requests is much too high.  Most parallel I/O benchmarking
> is done at 32-64 queue depth.
>
> I think that 16 parallel requests is a reasonable maximum number for a
> background job.
>
> We need to be clear though that the purpose of this change is unrelated
> to the original 512 MB memory footprint goal.  It just happens to touch
> the same constant but the goal is now to submit at most 16 I/O requests
> in parallel to avoid monopolizing the I/O device.

I think we should really look at this. The variables that control whether we stay in the while loop are incremented and decremented
at the following places:

mig_save_device_dirty:
mig_save_device_bulk:
    block_mig_state.submitted++;

blk_mig_read_cb:
    block_mig_state.submitted--;
    block_mig_state.read_done++;

flush_blks:
    block_mig_state.read_done--;

The condition of the while loop is:
(block_mig_state.submitted +
            block_mig_state.read_done) * BLOCK_SIZE <
           qemu_file_get_rate_limit(f) &&
           (block_mig_state.submitted +
            block_mig_state.read_done) <
           MAX_INFLIGHT_IO)

First, I wonder whether we ever reach the rate limit, because we put the read buffers onto f AFTER we exit the while loop.

And even if we reach the limit we constantly maintain 512 I/Os in parallel because we immediately decrement read_done
when we put the buffers to f in flush_blks. In the next iteration of the while loop we then read again until we have 512 in-flight I/Os.

And shouldn't we have a time limit on how long we stay in the while loop? I think we artificially delay sending data to f.

Peter


* Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO
  2018-03-06 16:35         ` Peter Lieven
@ 2018-03-07  7:55           ` Peter Lieven
  2018-03-07  8:06             ` [Qemu-devel] block migration and dirty bitmap reset Peter Lieven
  2018-03-07  9:47             ` [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO Stefan Hajnoczi
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Lieven @ 2018-03-07  7:55 UTC (permalink / raw)
  To: Stefan Hajnoczi, Dr. David Alan Gilbert
  Cc: Fam Zheng, wency, qemu block, Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 06.03.2018 um 17:35 schrieb Peter Lieven:
> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
>> On Mon, Mar 05, 2018 at 02:52:16PM +0000, Dr. David Alan Gilbert wrote:
>>> * Peter Lieven (pl@kamp.de) wrote:
>>>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>>>>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
>>>>>> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
>>>>>> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
>>>>>> and have fast storage on the destination VM.
>>>>>>
>>>>>> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
>>>>>> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
>>>>> I don't know if benchmarks were run when choosing the value.  From the
>>>>> commit description it sounds like the main purpose was to limit the
>>>>> amount of memory that can be consumed.
>>>>>
>>>>> 16 MB also fulfills that criterion :), but why is the source VM more
>>>>> responsive with a lower value?
>>>>>
>>>>> Perhaps the issue is queue depth on the storage device - the block
>>>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>>>> to wait?
>>>> That is my guess. Especially if the destination storage is faster, we basically always have
>>>> 512 I/Os in flight on the source storage.
>>>>
>>>> Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?
>>> We've got migration-parameters these days; you could connect it to one
>>> of those fairly easily I think.
>>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>>> already there.
>>> Then you can set it to whatever you like.
>> It would be nice to solve the performance problem without adding a
>> tuneable.
>>
>> On the other hand, QEMU has no idea what the queue depth of the device
>> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>>
>> 512 parallel requests is much too high.  Most parallel I/O benchmarking
>> is done at 32-64 queue depth.
>>
>> I think that 16 parallel requests is a reasonable maximum number for a
>> background job.
>>
>> We need to be clear though that the purpose of this change is unrelated
>> to the original 512 MB memory footprint goal.  It just happens to touch
>> the same constant but the goal is now to submit at most 16 I/O requests
>> in parallel to avoid monopolizing the I/O device.
> I think we should really look at this. The variables that control whether we stay in the while loop are incremented and decremented
> at the following places:
>
> mig_save_device_dirty:
> mig_save_device_bulk:
>     block_mig_state.submitted++;
>
> blk_mig_read_cb:
>     block_mig_state.submitted--;
>     block_mig_state.read_done++;
>
> flush_blks:
>     block_mig_state.read_done--;
>
> The condition of the while loop is:
> (block_mig_state.submitted +
>             block_mig_state.read_done) * BLOCK_SIZE <
>            qemu_file_get_rate_limit(f) &&
>            (block_mig_state.submitted +
>             block_mig_state.read_done) <
>            MAX_INFLIGHT_IO)
>
> First, I wonder whether we ever reach the rate limit, because we put the read buffers onto f AFTER we exit the while loop.
>
> And even if we reach the limit we constantly maintain 512 I/Os in parallel because we immediately decrement read_done
> when we put the buffers to f in flush_blks. In the next iteration of the while loop we then read again until we have 512 in-flight I/Os.
>
> And shouldn't we have a time limit on how long we stay in the while loop? I think we artificially delay sending data to f.

Thinking about it for a while I would propose the following:

a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
b) add MAX_PARALLEL_IO with a value of 16
c) compare qemu_file_get_rate_limit only with block_mig_state.read_done

This would yield the following condition for the while loop:

(block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
 (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
 block_mig_state.submitted < MAX_PARALLEL_IO)

Does that sound like a plan?
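
As a sanity check on the proposal, here is a toy model of the fill loop comparing the two conditions. The strict fill/complete/flush phases are a simplification of the real event loop, and an effectively unlimited rate limit models a fast link:

```python
# Compare peak read parallelism under the current and proposed while
# conditions. Completion and flushing are collapsed into discrete phases.
BLOCK_SIZE = 1 << 20
MAX_IO_BUFFERS = 512      # the renamed MAX_INFLIGHT_IO
MAX_PARALLEL_IO = 16
rate_limit = 1 << 40      # effectively unlimited migration speed

def peak_parallel_reads(keep_reading):
    submitted = read_done = 0
    peak = 0
    for _ in range(3):                      # a few outer iterations
        while keep_reading(submitted, read_done):
            submitted += 1                  # mig_save_device_bulk issues a read
            peak = max(peak, submitted)
        read_done += submitted              # blk_mig_read_cb: reads complete
        submitted = 0
        read_done = 0                       # flush_blks drains the buffers
    return peak

def current(s, r):
    return (s + r) * BLOCK_SIZE < rate_limit and (s + r) < MAX_IO_BUFFERS

def proposed(s, r):
    return (r * BLOCK_SIZE < rate_limit
            and (s + r) < MAX_IO_BUFFERS
            and s < MAX_PARALLEL_IO)

print(peak_parallel_reads(current), peak_parallel_reads(proposed))  # 512 16
```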

Peter


* [Qemu-devel] block migration and dirty bitmap reset
  2018-03-07  7:55           ` Peter Lieven
@ 2018-03-07  8:06             ` Peter Lieven
  2018-03-08  1:28               ` Fam Zheng
  2018-03-07  9:47             ` [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO Stefan Hajnoczi
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Lieven @ 2018-03-07  8:06 UTC (permalink / raw)
  To: Stefan Hajnoczi, Dr. David Alan Gilbert
  Cc: Fam Zheng, wency, qemu block, Juan Quintela, qemu-devel, Stefan Hajnoczi

Hi,

while looking at the code I wonder whether the order of blk_aio_preadv and bdrv_reset_dirty_bitmap in
mig_save_device_bulk must be swapped:

    qemu_mutex_lock_iothread();
    aio_context_acquire(blk_get_aio_context(bmds->blk));
    blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
                                0, blk_mig_read_cb, blk);

    bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
                            nr_sectors * BDRV_SECTOR_SIZE);
    aio_context_release(blk_get_aio_context(bmds->blk));
    qemu_mutex_unlock_iothread();

In mig_save_device_dirty we first reset the dirty bitmap and then read, which sounds like
a better idea. Maybe it doesn't matter while we hold the AioContext and the iothread lock...
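
A toy interleaving shows why the ordering could matter. Whether the iothread and AioContext locks actually permit a guest write in that window is exactly the open question here, so this sketch only models the bookkeeping:

```python
# One block, three events: the migration read, a guest write, and the
# dirty-bitmap reset. If the reset runs after the read and a write slips
# in between, the write's dirty bit is cleared although the captured
# data predates it, so the block would never be re-sent.
def consistent(reset_before_read):
    state = {"data": "v1", "dirty": True, "sent": None}
    read = lambda: state.update(sent=state["data"])
    reset = lambda: state.update(dirty=False)
    def guest_write():                 # interleaved guest I/O
        state.update(data="v2", dirty=True)

    if reset_before_read:              # mig_save_device_dirty ordering
        reset(); guest_write(); read()
    else:                              # mig_save_device_bulk ordering
        read(); guest_write(); reset()
    # OK if the new data was captured, or is still marked for re-send
    return state["sent"] == "v2" or state["dirty"]

print(consistent(True), consistent(False))  # True False
```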

Peter


* Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO
  2018-03-07  7:55           ` Peter Lieven
  2018-03-07  8:06             ` [Qemu-devel] block migration and dirty bitmap reset Peter Lieven
@ 2018-03-07  9:47             ` Stefan Hajnoczi
  2018-03-07 20:35               ` Peter Lieven
  1 sibling, 1 reply; 15+ messages in thread
From: Stefan Hajnoczi @ 2018-03-07  9:47 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Dr. David Alan Gilbert, Fam Zheng, Wen Congyang, qemu block,
	Juan Quintela, qemu-devel, Stefan Hajnoczi

On Wed, Mar 7, 2018 at 7:55 AM, Peter Lieven <pl@kamp.de> wrote:
> Am 06.03.2018 um 17:35 schrieb Peter Lieven:
>> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
>>> On Mon, Mar 05, 2018 at 02:52:16PM +0000, Dr. David Alan Gilbert wrote:
>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>>>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>>>>>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
>>>>>>> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
>>>>>>> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
>>>>>>> and have fast storage on the destination VM.
>>>>>>>
>>>>>>> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
>>>>>>> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
>>>>>> I don't know if benchmarks were run when choosing the value.  From the
>>>>>> commit description it sounds like the main purpose was to limit the
>>>>>> amount of memory that can be consumed.
>>>>>>
>>>>>> 16 MB also fulfills that criterion :), but why is the source VM more
>>>>>> responsive with a lower value?
>>>>>>
>>>>>> Perhaps the issue is queue depth on the storage device - the block
>>>>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>>>>> to wait?
>>>>> That is my guess. Especially if the destination storage is faster, we basically always have
>>>>> 512 I/Os in flight on the source storage.
>>>>>
>>>>> Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?
>>>> We've got migration-parameters these days; you could connect it to one
>>>> of those fairly easily I think.
>>>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>>>> already there.
>>>> Then you can set it to whatever you like.
>>> It would be nice to solve the performance problem without adding a
>>> tuneable.
>>>
>>> On the other hand, QEMU has no idea what the queue depth of the device
>>> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>>>
>>> 512 parallel requests is much too high.  Most parallel I/O benchmarking
>>> is done at 32-64 queue depth.
>>>
>>> I think that 16 parallel requests is a reasonable maximum number for a
>>> background job.
>>>
>>> We need to be clear though that the purpose of this change is unrelated
>>> to the original 512 MB memory footprint goal.  It just happens to touch
>>> the same constant but the goal is now to submit at most 16 I/O requests
>>> in parallel to avoid monopolizing the I/O device.
>> I think we should really look at this. The variables that control whether we stay in the while loop are incremented and decremented
>> at the following places:
>>
>> mig_save_device_dirty:
>> mig_save_device_bulk:
>>     block_mig_state.submitted++;
>>
>> blk_mig_read_cb:
>>     block_mig_state.submitted--;
>>     block_mig_state.read_done++;
>>
>> flush_blks:
>>     block_mig_state.read_done--;
>>
>> The condition of the while loop is:
>> (block_mig_state.submitted +
>>             block_mig_state.read_done) * BLOCK_SIZE <
>>            qemu_file_get_rate_limit(f) &&
>>            (block_mig_state.submitted +
>>             block_mig_state.read_done) <
>>            MAX_INFLIGHT_IO)
>>
>> First, I wonder whether we ever reach the rate limit, because we put the read buffers onto f AFTER we exit the while loop.
>>
>> And even if we reach the limit we constantly maintain 512 I/Os in parallel because we immediately decrement read_done
>> when we put the buffers to f in flush_blks. In the next iteration of the while loop we then read again until we have 512 in-flight I/Os.
>>
>> And shouldn't we have a time limit on how long we stay in the while loop? I think we artificially delay sending data to f.
>
> Thinking about it for a while I would propose the following:
>
> a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
> b) add MAX_PARALLEL_IO with a value of 16
> c) compare qemu_file_get_rate_limit only with block_mig_state.read_done
>
> This would yield the following condition for the while loop:
>
> (block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
>  (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
>  block_mig_state.submitted < MAX_PARALLEL_IO)
>
> Does that sound like a plan?

That sounds good to me.

Stefan


* Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO
  2018-03-07  9:47             ` [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO Stefan Hajnoczi
@ 2018-03-07 20:35               ` Peter Lieven
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Lieven @ 2018-03-07 20:35 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Dr. David Alan Gilbert, Fam Zheng, Wen Congyang, qemu block,
	Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 07.03.2018 um 10:47 schrieb Stefan Hajnoczi:
> On Wed, Mar 7, 2018 at 7:55 AM, Peter Lieven <pl@kamp.de> wrote:
>> Am 06.03.2018 um 17:35 schrieb Peter Lieven:
>>> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
>>>> On Mon, Mar 05, 2018 at 02:52:16PM +0000, Dr. David Alan Gilbert wrote:
>>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>>>>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>>>>>>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what the reason was
>>>>>>>> to choose 512MB as readahead. The issue is that I found the source VM gets very unresponsive, I/O-wise,
>>>>>>>> while the initial 512MB are read, and it furthermore seems to stay unresponsive if we choose a high migration speed
>>>>>>>> and have fast storage on the destination VM.
>>>>>>>>
>>>>>>>> In our environment I modified this value to 16MB, which seems to work much more smoothly. I wonder if we should make
>>>>>>>> this a user-configurable value, or at least define a different rate limit for the block transfer in the bulk stage?
>>>>>>> I don't know if benchmarks were run when choosing the value.  From the
>>>>>>> commit description it sounds like the main purpose was to limit the
>>>>>>> amount of memory that can be consumed.
>>>>>>>
>>>>>>> 16 MB also fulfills that criterion :), but why is the source VM more
>>>>>>> responsive with a lower value?
>>>>>>>
>>>>>>> Perhaps the issue is queue depth on the storage device - the block
>>>>>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>>>>>> to wait?
>>>>>> That is my guess. Especially if the destination storage is faster, we basically always have
>>>>>> 512 I/Os in flight on the source storage.
>>>>>>
>>>>>> Does anyone mind if we reduce that value to 16MB, or do we need a better mechanism?
>>>>> We've got migration-parameters these days; you could connect it to one
>>>>> of those fairly easily I think.
>>>>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>>>>> already there.
>>>>> Then you can set it to whatever you like.
>>>> It would be nice to solve the performance problem without adding a
>>>> tuneable.
>>>>
>>>> On the other hand, QEMU has no idea what the queue depth of the device
>>>> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>>>>
>>>> 512 parallel requests is much too high.  Most parallel I/O benchmarking
>>>> is done at 32-64 queue depth.
>>>>
>>>> I think that 16 parallel requests is a reasonable maximum number for a
>>>> background job.
>>>>
>>>> We need to be clear though that the purpose of this change is unrelated
>>>> to the original 512 MB memory footprint goal.  It just happens to touch
>>>> the same constant but the goal is now to submit at most 16 I/O requests
>>>> in parallel to avoid monopolizing the I/O device.
>>> I think we should really look at this. The variables that control whether we stay in the while loop are incremented and decremented
>>> at the following places:
>>>
>>> mig_save_device_dirty:
>>> mig_save_device_bulk:
>>>     block_mig_state.submitted++;
>>>
>>> blk_mig_read_cb:
>>>     block_mig_state.submitted--;
>>>     block_mig_state.read_done++;
>>>
>>> flush_blks:
>>>     block_mig_state.read_done--;
>>>
>>> The condition of the while loop is:
>>> (block_mig_state.submitted +
>>>             block_mig_state.read_done) * BLOCK_SIZE <
>>>            qemu_file_get_rate_limit(f) &&
>>>            (block_mig_state.submitted +
>>>             block_mig_state.read_done) <
>>>            MAX_INFLIGHT_IO)
>>>
>>> At first I wonder whether we ever reach the rate limit, because we put the read buffers onto f only AFTER we exit the while loop?
>>>
>>> And even if we reach the limit, we constantly maintain 512 parallel I/Os, because we immediately decrement read_done
>>> when we put the buffers onto f in flush_blks. In the next iteration of the while loop we then read again until we have 512 in-flight I/Os.
>>>
>>> And shouldn't we have a time limit on how long we stay in the while loop? I think we artificially delay sending data to f?
>> Thinking about it for a while I would propose the following:
>>
>> a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
>> b) add MAX_PARALLEL_IO with a value of 16
>> c) compare qemu_file_get_rate_limit only with block_mig_state.read_done
>>
>> This would yield the following condition for the while loop:
>>
>> (block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
>>  (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
>>  block_mig_state.submitted < MAX_PARALLEL_IO)
>>
>> Does that sound like a plan?
> That sounds good to me.

I will prepare patches for this.

Peter

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] block migration and dirty bitmap reset
  2018-03-07  8:06             ` [Qemu-devel] block migration and dirty bitmap reset Peter Lieven
@ 2018-03-08  1:28               ` Fam Zheng
  2018-03-08  8:57                 ` Peter Lieven
  0 siblings, 1 reply; 15+ messages in thread
From: Fam Zheng @ 2018-03-08  1:28 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Stefan Hajnoczi, Dr. David Alan Gilbert, wency, qemu block,
	Juan Quintela, qemu-devel, Stefan Hajnoczi

On Wed, 03/07 09:06, Peter Lieven wrote:
> Hi,
> 
> while looking at the code I wonder whether the order of blk_aio_preadv and bdrv_reset_dirty_bitmap
> must be swapped in mig_save_device_bulk:
> 
>     qemu_mutex_lock_iothread();
>     aio_context_acquire(blk_get_aio_context(bmds->blk));
>     blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
>                                 0, blk_mig_read_cb, blk);
> 
>     bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
>                             nr_sectors * BDRV_SECTOR_SIZE);
>     aio_context_release(blk_get_aio_context(bmds->blk));
>     qemu_mutex_unlock_iothread();
> 
> In mig_save_device_dirty we first reset the dirty bitmap and then read, which sounds like
> a better idea.

Yes, that sounds right to me.

Fam


* Re: [Qemu-devel] block migration and dirty bitmap reset
  2018-03-08  1:28               ` Fam Zheng
@ 2018-03-08  8:57                 ` Peter Lieven
  2018-03-08  9:01                   ` Fam Zheng
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Lieven @ 2018-03-08  8:57 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Stefan Hajnoczi, Dr. David Alan Gilbert, wency, qemu block,
	Juan Quintela, qemu-devel, Stefan Hajnoczi



> Am 08.03.2018 um 02:28 schrieb Fam Zheng <famz@redhat.com>:
> 
>> On Wed, 03/07 09:06, Peter Lieven wrote:
>> Hi,
>> 
>> while looking at the code I wonder whether the order of blk_aio_preadv and bdrv_reset_dirty_bitmap
>> must be swapped in mig_save_device_bulk:
>> 
>>    qemu_mutex_lock_iothread();
>>    aio_context_acquire(blk_get_aio_context(bmds->blk));
>>    blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
>>                                0, blk_mig_read_cb, blk);
>> 
>>    bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
>>                            nr_sectors * BDRV_SECTOR_SIZE);
>>    aio_context_release(blk_get_aio_context(bmds->blk));
>>    qemu_mutex_unlock_iothread();
>> 
>> In mig_save_device_dirty we first reset the dirty bitmap and then read, which sounds like
>> a better idea.
> 
> Yes, that sounds right to me.
> 
> Fam

You mean the order should be swapped in mig_save_device_bulk as well?

Peter


* Re: [Qemu-devel] block migration and dirty bitmap reset
  2018-03-08  8:57                 ` Peter Lieven
@ 2018-03-08  9:01                   ` Fam Zheng
  2018-03-08 10:33                     ` Peter Lieven
  0 siblings, 1 reply; 15+ messages in thread
From: Fam Zheng @ 2018-03-08  9:01 UTC (permalink / raw)
  To: Peter Lieven
  Cc: Stefan Hajnoczi, Dr. David Alan Gilbert, wency, qemu block,
	Juan Quintela, qemu-devel, Stefan Hajnoczi

On Thu, Mar 8, 2018 at 4:57 PM, Peter Lieven <pl@kamp.de> wrote:
>
>
>> Am 08.03.2018 um 02:28 schrieb Fam Zheng <famz@redhat.com>:
>>
>>> On Wed, 03/07 09:06, Peter Lieven wrote:
>>> Hi,
>>>
>>> while looking at the code I wonder whether the order of blk_aio_preadv and bdrv_reset_dirty_bitmap
>>> must be swapped in mig_save_device_bulk:
>>>
>>>    qemu_mutex_lock_iothread();
>>>    aio_context_acquire(blk_get_aio_context(bmds->blk));
>>>    blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
>>>                                0, blk_mig_read_cb, blk);
>>>
>>>    bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
>>>                            nr_sectors * BDRV_SECTOR_SIZE);
>>>    aio_context_release(blk_get_aio_context(bmds->blk));
>>>    qemu_mutex_unlock_iothread();
>>>
>>> In mig_save_device_dirty we first reset the dirty bitmap and then read, which sounds like
>>> a better idea.
>>
>> Yes, that sounds right to me.
>>
>> Fam
>
> You mean the order should be swapped in mig_save_device_bulk as well?

Yeah, resetting the dirty bitmap before reading makes sure we don't
miss any new data.

Fam

>
> Peter
>


* Re: [Qemu-devel] block migration and dirty bitmap reset
  2018-03-08  9:01                   ` Fam Zheng
@ 2018-03-08 10:33                     ` Peter Lieven
  0 siblings, 0 replies; 15+ messages in thread
From: Peter Lieven @ 2018-03-08 10:33 UTC (permalink / raw)
  To: Fam Zheng
  Cc: Stefan Hajnoczi, Dr. David Alan Gilbert, wency, qemu block,
	Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 08.03.2018 um 10:01 schrieb Fam Zheng:
> On Thu, Mar 8, 2018 at 4:57 PM, Peter Lieven <pl@kamp.de> wrote:
>>
>>> Am 08.03.2018 um 02:28 schrieb Fam Zheng <famz@redhat.com>:
>>>
>>>> On Wed, 03/07 09:06, Peter Lieven wrote:
>>>> Hi,
>>>>
>>>> while looking at the code I wonder whether the order of blk_aio_preadv and bdrv_reset_dirty_bitmap
>>>> must be swapped in mig_save_device_bulk:
>>>>
>>>>     qemu_mutex_lock_iothread();
>>>>     aio_context_acquire(blk_get_aio_context(bmds->blk));
>>>>     blk->aiocb = blk_aio_preadv(bb, cur_sector * BDRV_SECTOR_SIZE, &blk->qiov,
>>>>                                 0, blk_mig_read_cb, blk);
>>>>
>>>>     bdrv_reset_dirty_bitmap(bmds->dirty_bitmap, cur_sector * BDRV_SECTOR_SIZE,
>>>>                             nr_sectors * BDRV_SECTOR_SIZE);
>>>>     aio_context_release(blk_get_aio_context(bmds->blk));
>>>>     qemu_mutex_unlock_iothread();
>>>>
>>>> In mig_save_device_dirty we first reset the dirty bitmap and then read, which sounds like
>>>> a better idea.
>>> Yes, that sounds right to me.
>>>
>>> Fam
>> You mean the order should be swapped in mig_save_device_bulk as well?
> Yeah, resetting the dirty bitmap before reading makes sure we don't
> miss any new data.

I wonder if this can really happen during blk_aio_preadv?
Anyway, better safe than sorry. It at least will not hurt to swap the order.

Peter


end of thread, other threads:[~2018-03-08 10:33 UTC | newest]

Thread overview: 15+ messages
2018-02-22 11:13 [Qemu-devel] block migration and MAX_IN_FLIGHT_IO Peter Lieven
2018-03-05 11:45 ` Stefan Hajnoczi
2018-03-05 14:37   ` Peter Lieven
2018-03-05 14:52     ` Dr. David Alan Gilbert
2018-03-06 16:07       ` [Qemu-devel] [Qemu-block] " Stefan Hajnoczi
2018-03-06 16:35         ` Peter Lieven
2018-03-07  7:55           ` Peter Lieven
2018-03-07  8:06             ` [Qemu-devel] block migration and dirty bitmap reset Peter Lieven
2018-03-08  1:28               ` Fam Zheng
2018-03-08  8:57                 ` Peter Lieven
2018-03-08  9:01                   ` Fam Zheng
2018-03-08 10:33                     ` Peter Lieven
2018-03-07  9:47             ` [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO Stefan Hajnoczi
2018-03-07 20:35               ` Peter Lieven
2018-03-06 16:14       ` [Qemu-devel] " Peter Lieven
