Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO

From: Peter Lieven <pl@kamp.de>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Fam Zheng <famz@redhat.com>, Wen Congyang <wency@cn.fujitsu.com>,
	qemu block <qemu-block@nongnu.org>,
	Juan Quintela <quintela@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@redhat.com>
Subject: Re: [Qemu-devel] [Qemu-block] block migration and MAX_IN_FLIGHT_IO
Date: Wed, 7 Mar 2018 21:35:42 +0100	[thread overview]
Message-ID: <14f9de5c-5e99-1047-ace0-6ab0c7850945@kamp.de> (raw)
In-Reply-To: <CAJSP0QXTFz55WmF8M7Lc8hLjeD3YpSXCQx+Q8y9=9CN8NW8EJw@mail.gmail.com>

Am 07.03.2018 um 10:47 schrieb Stefan Hajnoczi:
> On Wed, Mar 7, 2018 at 7:55 AM, Peter Lieven <pl@kamp.de> wrote:
>> Am 06.03.2018 um 17:35 schrieb Peter Lieven:
>>> Am 06.03.2018 um 17:07 schrieb Stefan Hajnoczi:
>>>> On Mon, Mar 05, 2018 at 02:52:16PM +0000, Dr. David Alan Gilbert wrote:
>>>>> * Peter Lieven (pl@kamp.de) wrote:
>>>>>> Am 05.03.2018 um 12:45 schrieb Stefan Hajnoczi:
>>>>>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>>>>>>> I stumbled across the MAX_INFLIGHT_IO field that was introduced in 2015 and was curious what was the reason
>>>>>>>> to choose 512MB as readahead? The question is that I found that the source VM gets very unresponsive I/O wise
>>>>>>>> while the initial 512MB are read and furthermore seems to stay unreasponsive if we choose a high migration speed
>>>>>>>> and have a fast storage on the destination VM.
>>>>>>>>
>>>>>>>> In our environment I modified this value to 16MB which seems to work much smoother. I wonder if we should make
>>>>>>>> this a user configurable value or define a different rate limit for the block transfer in bulk stage at least?
>>>>>>> I don't know if benchmarks were run when choosing the value.  From the
>>>>>>> commit description it sounds like the main purpose was to limit the
>>>>>>> amount of memory that can be consumed.
>>>>>>>
>>>>>>> 16 MB also fulfills that criteria :), but why is the source VM more
>>>>>>> responsive with a lower value?
>>>>>>>
>>>>>>> Perhaps the issue is queue depth on the storage device - the block
>>>>>>> migration code enqueues up to 512 MB worth of reads, and guest I/O has
>>>>>>> to wait?
>>>>>> That is my guess. Especially if the destination storage is faster we basically alsways have
>>>>>> 512 I/Os in flight on the source storage.
>>>>>>
>>>>>> Does anyone mind if the reduce that value to 16MB or do we need a better mechanism?
>>>>> We've got migration-parameters these days; you could connect it to one
>>>>> of those fairly easily I think.
>>>>> Try: grep -i 'cpu[-_]throttle[-_]initial'  for an example of one that's
>>>>> already there.
>>>>> Then you can set it to whatever you like.
>>>> It would be nice to solve the performance problem without adding a
>>>> tuneable.
>>>>
>>>> On the other hand, QEMU has no idea what the queue depth of the device
>>>> is.  Therefore it cannot prioritize guest I/O over block migration I/O.
>>>>
>>>> 512 parallel requests is much too high.  Most parallel I/O benchmarking
>>>> is done at 32-64 queue depth.
>>>>
>>>> I think that 16 parallel requests is a reasonable maximum number for a
>>>> background job.
>>>>
>>>> We need to be clear though that the purpose of this change is unrelated
>>>> to the original 512 MB memory footprint goal.  It just happens to touch
>>>> the same constant but the goal is now to submit at most 16 I/O requests
>>>> in parallel to avoid monopolizing the I/O device.
>>> I think we should really look at this. The variables that control if we stay in the while loop or not are incremented and decremented
>>> at the following places:
>>>
>>> mig_save_device_dirty:
>>> mig_save_device_bulk:
>>>     block_mig_state.submitted++;
>>>
>>> blk_mig_read_cb:
>>>     block_mig_state.submitted--;
>>>     block_mig_state.read_done++;
>>>
>>> flush_blks:
>>>     block_mig_state.read_done--;
>>>
>>> The condition of the while loop is:
>>> (block_mig_state.submitted +
>>>             block_mig_state.read_done) * BLOCK_SIZE <
>>>            qemu_file_get_rate_limit(f) &&
>>>            (block_mig_state.submitted +
>>>             block_mig_state.read_done) <
>>>            MAX_INFLIGHT_IO)
>>>
>>> At first I wonder if we ever reach the rate-limit because we put the read buffers onto f AFTER we exit the while loop?
>>>
>>> And even if we reach the limit we constantly maintain 512 I/Os in parallel because we immediately decrement read_done
>>> when we put the buffers to f in flush_blks. In the next iteration of the while loop we then read again until we have 512 in-flight I/Os.
>>>
>>> And shouldn't we have a time limit to limit the time we stay in the while loop? I think we artificially delay sending data to f?
>> Thinking about it for a while I would propose the following:
>>
>> a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
>> b) add MAX_PARALLEL_IO with a value of 16
>> c) compare qemu_file_get_rate_limit only with block_mig_state.read_done
>>
>> This would yield in the following condition for the while loop:
>>
>> (block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
>>  (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
>>  block_mig_state.submitted < MAX_PARALLEL_IO)
>>
>> Sounds that like a plan?
> That sounds good to me.

I will prepare patches for this.

Peter