* [Qemu-devel] migration: broken ram_save_pending
From: Alexey Kardashevskiy @ 2014-02-04  7:15 UTC
  To: qemu-devel; +Cc: Paolo Bonzini, Alex Graf

Hi!

I hit a problem with migration which I would like to know how to fix.

I run the source QEMU as this:
./qemu-system-ppc64 -enable-kvm -m 1024 -machine pseries \
	-nographic -vga none

For the destination, I add "-incoming tcp:localhost:4000".

Both run on the same POWER8 machine, using the latest QEMU with my
"hpratio=1" and "fix SLB migration" patches applied. The host kernel is
3.12 with a 64K system page size (which does not matter much here).

Since the source QEMU is not given any kernel, disk or network device, it
stays at the SLOF prompt. A very simple config.

Now I do the migration. The first "bulk" iteration goes pretty quickly, all
good. When it is done, we enter the "while (pending()) iterate()" loop in
the migration_thread() function.

The idea is that once the number of changes becomes small enough to be
transferred within the "maximum downtime" timeout, the migration will finish.
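
For reference, the loop I mean looks roughly like this (simplified from
migration.c, quoting from memory, so the exact shape may be slightly off):

    while (s->state == MIG_STATE_ACTIVE) {
        if (!qemu_file_rate_limit(s->file)) {
            uint64_t pending_size = qemu_savevm_state_pending(s->file, max_size);
            if (pending_size && pending_size >= max_size) {
                qemu_savevm_state_iterate(s->file);   /* send another batch */
            } else {
                /* small enough: stop the guest, send the rest, complete */
                qemu_savevm_state_complete(s->file);
                break;
            }
        }
        /* once per BUFFER_DELAY: recompute bandwidth and max_size (see below) */
    }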

However, something a bit different happens in this configuration. For some
reason (which I do not really know; I would like to, but it is irrelevant
here) SLOF keeps dirtying a few pages (for example, 6), each 64K, which is
up to 96 4K pages in QEMU terms (393216 bytes). Every time ram_save_pending()
is called, 96 pages are dirty. This is not a huge number, but
ram_save_iterate() moves the migration file pointer only 287544 bytes
further, because that is what was actually transferred, and this number is
smaller due to the "is_zero_range(p, TARGET_PAGE_SIZE)" optimization.

So migration_thread() gets the number of dirty pages and tries to send them
in a loop, but every iteration resets the number of pages back to 96 and we
start again. After several tries we cross the BUFFER_DELAY timeout and
calculate a new @max_size; if the host machine is fast enough, it is bigger
than 393216 and the next loop will finally finish the migration.
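
The recalculation I am talking about is roughly this bit of migration_thread()
(again simplified, from memory; it runs once per BUFFER_DELAY, i.e. every
100ms):

    current_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
    if (current_time >= initial_time + BUFFER_DELAY) {
        uint64_t transferred_bytes = qemu_ftell(s->file) - initial_bytes;
        uint64_t time_spent = current_time - initial_time;          /* ms */
        double bandwidth = transferred_bytes / (double)time_spent;  /* bytes/ms */

        /* migrate_max_downtime() is in nanoseconds */
        max_size = bandwidth * migrate_max_downtime() / 1000000;

        qemu_file_reset_rate_limit(s->file);
        initial_time = current_time;
        initial_bytes = qemu_ftell(s->file);
    }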

How to fix this misbehavior?

I can only think of something simple like the patch below, and I am not sure
it does not break other things. I would expect ram_save_pending() to return
the correct number of bytes QEMU is going to send rather than the number of
pages multiplied by 4096, but checking whether all these pages are really
empty is not too cheap.

Thanks!


diff --git a/arch_init.c b/arch_init.c
index 2ba297e..90949b0 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -537,16 +537,17 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
                         acct_info.dup_pages++;
                     }
                 }
             } else if (is_zero_range(p, TARGET_PAGE_SIZE)) {
                 acct_info.dup_pages++;
                 bytes_sent = save_block_hdr(f, block, offset, cont,
                                             RAM_SAVE_FLAG_COMPRESS);
                 qemu_put_byte(f, 0);
+                qemu_update_position(f, TARGET_PAGE_SIZE);
                 bytes_sent++;
             } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,
                                               offset, cont, last_stage);
                 if (!last_stage) {
                     p = get_cached_data(XBZRLE.cache, current_addr);
                 }


-- 
Alexey

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Paolo Bonzini @ 2014-02-04 10:46 UTC
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Graf

On 04/02/2014 08:15, Alexey Kardashevskiy wrote:
> So. migration_thread() gets dirty pages number, tries to send them in a
> loop but every iteration resets the number of pages to 96 and we start
> again. After several tries we cross BUFFER_DELAY timeout and calculate new
> @max_size and if the host machine is fast enough it is bigger than 393216
> and next loop will finally finish the migration.

This should have happened pretty much immediately, because it's not 
while (pending()) but rather

             while (pending_size && pending_size >= max_size)

(it's an "if" in the code, but the idea is the same).  And max_size is 
the following:

             max_size = bandwidth * migrate_max_downtime() / 1000000;

With the default throttling of 32 MiB/s, bandwidth must be something 
like 33000 (expressed in bytes/ms) with the default settings, and then 
max_size should be 33000*3*10^9 / 10^6 = 6000000.  Where is my 
computation wrong?

Also, did you profile it to find the hotspot?  Perhaps the bitmap 
operations are taking a lot of time.  How big is the guest?  Juan's 
patches were optimizing the bitmaps but not all of them apply to your 
case because of hpratio.

> I can only think of something simple like below and not sure it does not
> break other things. I would expect ram_save_pending() to return correct
> number of bytes QEMU is going to send rather than number of pages
> multiplied by 4096 but checking if all these pages are really empty is not
> too cheap.

If you use qemu_update_position you will use very little bandwidth in 
the case where a lot of pages are zero.

What you mention in ram_save_pending() is not problematic just because 
of finding if the pages are empty, but also because you have to find the 
nonzero spots in the bitmap!

Paolo

> Thanks!
>
>
> diff --git a/arch_init.c b/arch_init.c
> index 2ba297e..90949b0 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -537,16 +537,17 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>                          acct_info.dup_pages++;
>                      }
>                  }
>              } else if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>                  acct_info.dup_pages++;
>                  bytes_sent = save_block_hdr(f, block, offset, cont,
>                                              RAM_SAVE_FLAG_COMPRESS);
>                  qemu_put_byte(f, 0);
> +                qemu_update_position(f, TARGET_PAGE_SIZE);
>                  bytes_sent++;
>              } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
>                  current_addr = block->offset + offset;
>                  bytes_sent = save_xbzrle_page(f, p, current_addr, block,
>                                                offset, cont, last_stage);
>                  if (!last_stage) {
>                      p = get_cached_data(XBZRLE.cache, current_addr);
>                  }
>
>

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Alexey Kardashevskiy @ 2014-02-04 11:59 UTC
  To: Paolo Bonzini, qemu-devel; +Cc: Alex Graf

On 02/04/2014 09:46 PM, Paolo Bonzini wrote:
> Il 04/02/2014 08:15, Alexey Kardashevskiy ha scritto:
>> So. migration_thread() gets dirty pages number, tries to send them in a
>> loop but every iteration resets the number of pages to 96 and we start
>> again. After several tries we cross BUFFER_DELAY timeout and calculate new
>> @max_size and if the host machine is fast enough it is bigger than 393216
>> and next loop will finally finish the migration.
> 
> This should have happened pretty much immediately, because it's not while
> (pending()) but rather
> 
>             while (pending_size && pending_size >= max_size)
> 
> (it's an "if" in the code, but the idea is the same).  And max_size is the
> following:
> 
>             max_size = bandwidth * migrate_max_downtime() / 1000000;
> 
> With the default throttling of 32 MiB/s, bandwidth must be something like
> 33000 (expressed in bytes/ms) with the default settings, and then max_size
> should be 33000*3*10^9 / 10^6 = 6000000.  Where is my computation wrong?


migrate_max_downtime() = 30000000 = 3*10^7.

When the migration is in the iterating stage, bandwidth is the speed over
the last 100ms, which is usually 5 blocks of 250KB each, i.e.
1250000/100 = 12500 bytes/ms, and max_size = 12500*30000000/10^6 = 375000,
which is less than the last chunk.
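
Spelling the same numbers out:

    bandwidth = 1250000 bytes / 100 ms           = 12500 bytes/ms
    max_size  = 12500 * 30000000 (ns) / 10^6     = 375000 bytes
    pending   = 96 pages * 4096                  = 393216 bytes  > max_size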


> 
> Also, did you profile it to find the hotspot?  Perhaps the bitmap
> operations are taking a lot of time.  How big is the guest?

1024MB.


>  Juan's patches
> were optimizing the bitmaps but not all of them apply to your case because
> of hpratio.

This I had to disable :)


>> I can only think of something simple like below and not sure it does not
>> break other things. I would expect ram_save_pending() to return correct
>> number of bytes QEMU is going to send rather than number of pages
>> multiplied by 4096 but checking if all these pages are really empty is not
>> too cheap.
> 
> If you use qemu_update_position you will use very little bandwidth in the
> case where a lot of pages are zero.

My guest migrates in a second or so. I guess in this case
qemu_file_rate_limit() limits the speed and it does not look at QEMUFile::pos.


> What you mention in ram_save_pending() is not problematic just because of
> finding if the pages are empty, but also because you have to find the
> nonzero spots in the bitmap!

Sure.


> 
> Paolo
> 
>> Thanks!
>>
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 2ba297e..90949b0 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -537,16 +537,17 @@ static int ram_save_block(QEMUFile *f, bool
>> last_stage)
>>                          acct_info.dup_pages++;
>>                      }
>>                  }
>>              } else if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>                  acct_info.dup_pages++;
>>                  bytes_sent = save_block_hdr(f, block, offset, cont,
>>                                              RAM_SAVE_FLAG_COMPRESS);
>>                  qemu_put_byte(f, 0);
>> +                qemu_update_position(f, TARGET_PAGE_SIZE);
>>                  bytes_sent++;
>>              } else if (!ram_bulk_stage && migrate_use_xbzrle()) {
>>                  current_addr = block->offset + offset;
>>                  bytes_sent = save_xbzrle_page(f, p, current_addr, block,
>>                                                offset, cont, last_stage);
>>                  if (!last_stage) {
>>                      p = get_cached_data(XBZRLE.cache, current_addr);
>>                  }
>>
>>
> 
> 


-- 
Alexey

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Paolo Bonzini @ 2014-02-04 12:07 UTC
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Graf

On 04/02/2014 12:59, Alexey Kardashevskiy wrote:
>> > With the default throttling of 32 MiB/s, bandwidth must be something like
>> > 33000 (expressed in bytes/ms) with the default settings, and then max_size
>> > should be 33000*3*10^9 / 10^6 = 6000000.  Where is my computation wrong?
>
> migrate_max_downtime() = 30000000 = 3*10^7.

Oops, that's the mistake.

> When the migration is in iterating stage, bandwidth is a speed in last
> 100ms which is usually 5 blocks 250KB each so it is
> 1250000/100=12500bytes/s and max_size=12500*30000000/10^6=375000 which is
> less than the last chunk is.
>
>

Perhaps our default maximum downtime is too low.  30 ms doesn't seem 
achievable in practice with 32 MiB/s bandwidth.  Just making it 300 ms 
or so should fix your problem.
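
For experimenting it can also be raised at run time from the HMP monitor, if
I recall the syntax correctly (the value is in seconds):

    (qemu) migrate_set_downtime 0.3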

Paolo

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Alexey Kardashevskiy @ 2014-02-04 12:16 UTC
  To: Paolo Bonzini, qemu-devel; +Cc: Alex Graf

On 02/04/2014 11:07 PM, Paolo Bonzini wrote:
> Il 04/02/2014 12:59, Alexey Kardashevskiy ha scritto:
>>> > With the default throttling of 32 MiB/s, bandwidth must be something like
>>> > 33000 (expressed in bytes/ms) with the default settings, and then
>>> max_size
>>> > should be 33000*3*10^9 / 10^6 = 6000000.  Where is my computation wrong?
>>
>> migrate_max_downtime() = 30000000 = 3*10^7.
> 
> Oops, that's the mistake.

Make a patch? :)


>> When the migration is in iterating stage, bandwidth is a speed in last
>> 100ms which is usually 5 blocks 250KB each so it is
>> 1250000/100=12500bytes/s and max_size=12500*30000000/10^6=375000 which is
>> less than the last chunk is.
>>
>>
> 
> Perhaps our default maximum downtime is too low.  30 ms doesn't seem
> achievable in practice with 32 MiB/s bandwidth.  Just making it 300 ms or
> so should fix your problem.

Well, it will fix it in my particular case, but in the long run this does
not feel like a real fix - there should be a way for migration_thread() to
know that ram_save_iterate() has sent all the dirty pages it had to send, no?



-- 
Alexey

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Paolo Bonzini @ 2014-02-04 14:00 UTC
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Graf

On 04/02/2014 13:16, Alexey Kardashevskiy wrote:
> On 02/04/2014 11:07 PM, Paolo Bonzini wrote:
>> Il 04/02/2014 12:59, Alexey Kardashevskiy ha scritto:
>>>>> With the default throttling of 32 MiB/s, bandwidth must be something like
>>>>> 33000 (expressed in bytes/ms) with the default settings, and then
>>>> max_size
>>>>> should be 33000*3*10^9 / 10^6 = 6000000.  Where is my computation wrong?
>>>
>>> migrate_max_downtime() = 30000000 = 3*10^7.
>>
>> Oops, that's the mistake.
>
> Make a patch? :)

I mean, my mistake. :)  I assumed 3000 ms = 3*10^9.

30 ms is too little, but 3000 ms is probably too much for a default.

>>> When the migration is in iterating stage, bandwidth is a speed in last
>>> 100ms which is usually 5 blocks 250KB each so it is
>>> 1250000/100=12500bytes/s and max_size=12500*30000000/10^6=375000 which is
>>> less than the last chunk is.
>>
>> Perhaps our default maximum downtime is too low.  30 ms doesn't seem
>> achievable in practice with 32 MiB/s bandwidth.  Just making it 300 ms or
>> so should fix your problem.
>
> Well, it will fix it in my particular case but in a long run this does not
> feel like a fix - there should be a way for migration_thread() to know that
> ram_save_iterate() sent all dirty pages it had to send, no?

No, because new pages might be dirtied while ram_save_iterate() was running.

Paolo

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Alexey Kardashevskiy @ 2014-02-04 22:17 UTC
  To: Paolo Bonzini, qemu-devel; +Cc: Alex Graf

On 02/05/2014 01:00 AM, Paolo Bonzini wrote:
> Il 04/02/2014 13:16, Alexey Kardashevskiy ha scritto:
>> On 02/04/2014 11:07 PM, Paolo Bonzini wrote:
>>> Il 04/02/2014 12:59, Alexey Kardashevskiy ha scritto:
>>>>>> With the default throttling of 32 MiB/s, bandwidth must be something
>>>>>> like
>>>>>> 33000 (expressed in bytes/ms) with the default settings, and then
>>>>> max_size
>>>>>> should be 33000*3*10^9 / 10^6 = 6000000.  Where is my computation wrong?
>>>>
>>>> migrate_max_downtime() = 30000000 = 3*10^7.
>>>
>>> Oops, that's the mistake.
>>
>> Make a patch? :)
> 
> I mean, my mistake. :)  I assumed 3000 ms = 3*10^9.
> 
> 30 ms is too little, but 3000 ms is probably too much for a default.

So - the default is bad and we need a patch (to make it 300ms), no?

> 
>>>> When the migration is in iterating stage, bandwidth is a speed in last
>>>> 100ms which is usually 5 blocks 250KB each so it is
>>>> 1250000/100=12500bytes/s and max_size=12500*30000000/10^6=375000 which is
>>>> less than the last chunk is.
>>>
>>> Perhaps our default maximum downtime is too low.  30 ms doesn't seem
>>> achievable in practice with 32 MiB/s bandwidth.  Just making it 300 ms or
>>> so should fix your problem.
>>
>> Well, it will fix it in my particular case but in a long run this does not
>> feel like a fix - there should be a way for migration_thread() to know that
>> ram_save_iterate() sent all dirty pages it had to send, no?
> 
> No, because new pages might be dirtied while ram_save_iterate() was running.


I do not get it, sorry. In my example ram_save_iterate() sends everything in
one go but its caller thinks that it did not and tries again. This is not
because something got dirty in between; it is only because zero pages are
sent as 8+1 byte chunks (not as 4096+1 bytes).


-- 
Alexey

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Paolo Bonzini @ 2014-02-05  7:18 UTC
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Graf

On 04/02/2014 23:17, Alexey Kardashevskiy wrote:
>>> >> Well, it will fix it in my particular case but in a long run this does not
>>> >> feel like a fix - there should be a way for migration_thread() to know that
>>> >> ram_save_iterate() sent all dirty pages it had to send, no?
>> >
>> > No, because new pages might be dirtied while ram_save_iterate() was running.
>
> I do not get it, sorry. In my example the ram_save_iterate() sends
> everything in one go but its caller thinks that it did not and tries again.

It's not that "the caller thinks that it did not".  The caller knows 
what happens, because migration_bitmap_find_and_reset_dirty updates the 
migration_dirty_pages count that ram_save_pending uses.  So 
migration_dirty_pages should be 0 when ram_save_pending is entered.

However, something gets dirty in between so remaining_size is again 
393216 when ram_save_pending returns, after the migration_bitmap_sync 
call.  Because of this the migration thread thinks that 
ram_save_iterate() _will_ not send everything in one go.
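
For reference, ram_save_pending() is essentially this (simplified, as I
remember arch_init.c):

    static uint64_t ram_save_pending(QEMUFile *f, void *opaque, uint64_t max_size)
    {
        uint64_t remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;

        if (remaining_size < max_size) {
            /* we look like we can converge: resync the dirty bitmap once more */
            qemu_mutex_lock_iothread();
            migration_bitmap_sync();
            qemu_mutex_unlock_iothread();
            remaining_size = ram_save_remaining() * TARGET_PAGE_SIZE;
        }
        return remaining_size;
    }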

At least, this is how I read the code.  Perhaps I'm wrong. ;)

Paolo

> This is is not because something got dirty in between, this is only because
> of sending zero pages as 8+1 bytes chunk (not as 4096+1 bytes).

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Dr. David Alan Gilbert @ 2014-02-05  9:09 UTC
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-devel, Alex Graf

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 04/02/2014 23:17, Alexey Kardashevskiy ha scritto:
> >>>>> Well, it will fix it in my particular case but in a long run this does not
> >>>>> feel like a fix - there should be a way for migration_thread() to know that
> >>>>> ram_save_iterate() sent all dirty pages it had to send, no?
> >>>
> >>> No, because new pages might be dirtied while ram_save_iterate() was running.
> >
> >I do not get it, sorry. In my example the ram_save_iterate() sends
> >everything in one go but its caller thinks that it did not and tries again.
> 
> It's not that "the caller thinks that it did not".  The caller knows
> what happens, because migration_bitmap_find_and_reset_dirty updates
> the migration_dirty_pages count that ram_save_pending uses.  So
> migration_dirty_pages should be 0 when ram_save_pending is entered.
> 
> However, something gets dirty in between so remaining_size is again
> 393216 when ram_save_pending returns, after the
> migration_bitmap_sync call.  Because of this the migration thread
> thinks that ram_save_iterate() _will_ not send everything in one go.
> 
> At least, this is how I read the code.  Perhaps I'm wrong. ;)

My reading was a bit different.

I think the case Alexey is hitting is:
   1 A few dirtied pages
   2 but because of the hpratio most of the data is actually zero
     - indeed most of the target-page sized chunks are zero
   3 Thus the data compresses very heavily
   4 When the bandwidth/delay calculation happens, we have spent a reasonable
     amount of time transferring a reasonable number of pages but not
     actually many bytes on the wire, so the estimate of the available
     bandwidth is lower than reality.
   5 The max-downtime calculation is a comparison of pending-dirty uncompressed
     bytes with compressed bandwidth

(5) is bound to fail if the compression ratio is particularly high, which,
because of the hpratio, it is when we're only dirtying one word in an entire
host page.
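
Putting rough numbers on (5), using the figures from earlier in the thread
(96 dirty target pages, a zero page going out as an 8+1 byte chunk):

    estimate used against max_size : 96 * 4096 = 393216 bytes (uncompressed)
    what actually hits the wire    : 96 * 9    =    864 bytes (all zero pages)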

What I'm not too sure of is this: you'd think that if only a few pages were
dirtied, the loop would complete quite quickly, so the delay would also be
small, and bytes-on-wire would be divided by a small value and thus not be
too bad.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Paolo Bonzini @ 2014-02-05 16:35 UTC
  To: Dr. David Alan Gilbert; +Cc: Alexey Kardashevskiy, qemu-devel, Alex Graf

On 05/02/2014 10:09, Dr. David Alan Gilbert wrote:
> I think the case Alexey is hitting is:
>    1 A few dirtied pages
>    2 but because of the hpratio most of the data is actually zero
>      - indeed most of the target-page sized chunks are zero
>    3 Thus the data compresses very heavily
>    4 When the bandwidth/delay calculation happens it's spent a reasonable
>      amount of time transferring a reasonable amount of pages but not
>      actually many bytes on the wire, so the estimate of the available
>      bandwidth available is lower than reality.
>    5 The max-downtime calculation is a comparison of pending-dirty uncompressed
>      bytes with compressed bandwidth
>
> (5) is bound to fail if the compression ratio is particularly high, which
> because of the hpratio it is if we're just dirtying one word in an entire
> host page.

So far so good, but why isn't pending-dirty (aka migration_dirty_pages 
in the code) zero?

Paolo

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Dr. David Alan Gilbert @ 2014-02-05 16:42 UTC
  To: Paolo Bonzini; +Cc: Alexey Kardashevskiy, qemu-devel, Alex Graf

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> Il 05/02/2014 10:09, Dr. David Alan Gilbert ha scritto:
> >I think the case Alexey is hitting is:
> >   1 A few dirtied pages
> >   2 but because of the hpratio most of the data is actually zero
> >     - indeed most of the target-page sized chunks are zero
> >   3 Thus the data compresses very heavily
> >   4 When the bandwidth/delay calculation happens it's spent a reasonable
> >     amount of time transferring a reasonable amount of pages but not
> >     actually many bytes on the wire, so the estimate of the available
> >     bandwidth available is lower than reality.
> >   5 The max-downtime calculation is a comparison of pending-dirty uncompressed
> >     bytes with compressed bandwidth
> >
> >(5) is bound to fail if the compression ratio is particularly high, which
> >because of the hpratio it is if we're just dirtying one word in an entire
> >host page.
> 
> So far so good, but why isn't pending-dirty (aka
> migration_dirty_pages in the code) zero?

Because:
    * the code is still running and keeps redirtying a small handful of pages
    * but because we've underestimated our available bandwidth we never stop
      it and just throw those pages across immediately

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Paolo Bonzini @ 2014-02-05 16:45 UTC
  To: Dr. David Alan Gilbert; +Cc: Alexey Kardashevskiy, qemu-devel, Alex Graf

On 05/02/2014 17:42, Dr. David Alan Gilbert wrote:
> Because:
>     * the code is still running and keeps redirtying a small handful of pages
>     * but because we've underestimated our available bandwidth we never stop
>       it and just throw those pages across immediately

Ok, I thought Alexey was saying we are not redirtying that handful of pages.

And in turn, this is because the max downtime we have is too low (especially
with the default 32 MB/sec bandwidth, which is also pretty low).

Paolo

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Alexey Kardashevskiy @ 2014-02-06  3:10 UTC
  To: Paolo Bonzini, Dr. David Alan Gilbert; +Cc: qemu-devel, Alex Graf

On 02/06/2014 03:45 AM, Paolo Bonzini wrote:
> Il 05/02/2014 17:42, Dr. David Alan Gilbert ha scritto:
>> Because:
>>     * the code is still running and keeps redirtying a small handful of
>> pages
>>     * but because we've underestimated our available bandwidth we never stop
>>       it and just throw those pages across immediately
> 
> Ok, I thought Alexey was saying we are not redirtying that handful of pages.


Every iteration we read the dirty map from KVM and send all dirty pages
across the stream.


> And in turn, this is because the max downtime we have is too low
> (especially for the default 32 MB/sec default bandwidth; that's also pretty
> low).


My understanding now is that in order to finish the migration, QEMU waits
for the earliest 100ms window (BUFFER_DELAY) of continuously low traffic,
but because those pages get dirty every time we read the dirty map, we
transfer more in these 100ms than we are actually allowed (>32MB/s, i.e.
>3200KB/100ms). So we transfer-transfer-transfer, detect that we transferred
too much, do the delay, and if max_size (calculated from the actual transfer
and downtime) for the next iteration happens (by luck) to be bigger than
those 96 pages (uncompressed) - we finish.

Increasing the speed and/or the downtime will help, but still - we would not
need that if migration did not expect all 96 pages to have to be sent, but
instead had some smart way to detect that many of them are empty (and
therefore compressed).

Literally, move is_zero_range() from ram_save_block() to
migration_bitmap_sync() and store this bit in some new pages_zero_map, for
example. But does it make a lot of sense?
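
Something along these lines (completely untested; everything below except
is_zero_range(), migration_dirty_pages and TARGET_PAGE_SIZE is a made-up
name, just to illustrate):

    /* in migration_bitmap_sync(), for every page we are about to mark dirty */
    if (is_zero_range(host_addr, TARGET_PAGE_SIZE)) {
        set_bit(page_nr, pages_zero_map);     /* hypothetical zero-page bitmap */
        zero_pages++;
    } else {
        clear_bit(page_nr, pages_zero_map);
    }

    /* then ram_save_pending() could count a zero page as ~9 bytes, not 4096 */
    remaining_size = (migration_dirty_pages - zero_pages) * TARGET_PAGE_SIZE
                     + zero_pages * 9;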


-- 
Alexey

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Dr. David Alan Gilbert @ 2014-02-06 11:24 UTC
  To: Alexey Kardashevskiy
  Cc: Paolo Bonzini, Alex Graf, Dr. David Alan Gilbert, qemu-devel

* Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> On 02/06/2014 03:45 AM, Paolo Bonzini wrote:
> > Il 05/02/2014 17:42, Dr. David Alan Gilbert ha scritto:
> >> Because:
> >>     * the code is still running and keeps redirtying a small handful of
> >> pages
> >>     * but because we've underestimated our available bandwidth we never stop
> >>       it and just throw those pages across immediately
> > 
> > Ok, I thought Alexey was saying we are not redirtying that handful of pages.
> 
> 
> Every iteration we read the dirty map from KVM and send all dirty pages
> across the stream.
> 
> 
> > And in turn, this is because the max downtime we have is too low
> > (especially for the default 32 MB/sec default bandwidth; that's also pretty
> > low).
> 
> 
> My understanding nooow is that in order to finish migration QEMU waits for
> the earliest 100ms (BUFFER_DELAY) of continuously low trafic but due to
> those pages getting dirty every time we read the dirty map, we transfer
> more in these 100ms than we are actually allowed (>32MB/s or 320KB/100ms).
> So we transfer-transfer-transfer, detect than we transfer too much, do
> delay() and if max_size (calculated from actual transfer and downtime) for
> the next iteration is less (by luck) than those 96 pages (uncompressed) -
> we finish.

How about turning on some of the debug in migration.c; I suggest not all of
it, but how about this one:

            DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
                    " bandwidth %g max_size %" PRId64 "\n",
                    transferred_bytes, time_spent, bandwidth, max_size);

and also the s->dirty_bytes_rate value.  It would help check our assumptions.

> Increasing speed or/and downtime will help but still - we would not need
> that if migration did not expect all 96 pages to have to be sent but did
> have some smart way to detect that many are empty (so - compressed).

I think the other way would be to keep track of the compression ratio;
if we knew how many pages we'd sent, and how much bandwidth that had used,
we could divide the pending_bytes by that to get a *different* approximation.
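
Something like (hypothetical, just to illustrate the idea - none of these
variables exist as such):

    double ratio = (double)bytes_on_wire / (pages_sent * TARGET_PAGE_SIZE);
    uint64_t estimated_wire_bytes = pending_size * ratio;
    /* compare estimated_wire_bytes rather than pending_size against max_size */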

However, the problem is that my understanding is we're trying to _guarantee_
a maximum downtime, and to do that we have to use the calculation that
assumes that all the pages we have are going to take the maximum time to
transfer, and only go into downtime then.

> Literally, move is_zero_range() from ram_save_block() to
> migration_bitmap_sync() and store this bit in some new pages_zero_map, for
> example. But does it make a lot of sense?

The problem is that means checking whether it's zero more often; at the moment
we check it's zero once during sending; to do what you're suggesting would
mean we'd have to check every page is zero, every time we sync, and I think
that's more often than we send.

Have you tried disabling the call to is_zero_range in arch_init.c's
ram_save_block so that (as long as you have XBZRLE off) we don't do any
compression? If the theory is right then your problem should go away.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Paolo Bonzini @ 2014-02-06 23:49 UTC
  To: Alexey Kardashevskiy, Dr. David Alan Gilbert; +Cc: qemu-devel, Alex Graf

On 06/02/2014 04:10, Alexey Kardashevskiy wrote:
>> > Ok, I thought Alexey was saying we are not redirtying that handful of pages.
> 
> Every iteration we read the dirty map from KVM and send all dirty pages
> across the stream.

But we never finish because qemu_savevm_state_pending is only called _after_
the g_usleep?  And thus there's time for the guest to redirty those pages.
Does something like this fix it (of course for a proper patch the goto
should be eliminated)?

diff --git a/migration.c b/migration.c
index 7235c23..804c3bd 100644
--- a/migration.c
+++ b/migration.c
@@ -589,6 +589,7 @@ static void *migration_thread(void *opaque)
             } else {
                 int ret;
 
+final_phase:
                 DPRINTF("done iterating\n");
                 qemu_mutex_lock_iothread();
                 start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
@@ -640,10 +641,16 @@ static void *migration_thread(void *opaque)
             qemu_file_reset_rate_limit(s->file);
             initial_time = current_time;
             initial_bytes = qemu_ftell(s->file);
-        }
-        if (qemu_file_rate_limit(s->file)) {
-            /* usleep expects microseconds */
-            g_usleep((initial_time + BUFFER_DELAY - current_time)*1000);
+        } else if (qemu_file_rate_limit(s->file)) {
+            pending_size = qemu_savevm_state_pending(s->file, max_size);
+            DPRINTF("pending size %" PRIu64 " max %" PRIu64 "\n",
+                    pending_size, max_size);
+            if (pending_size >= max_size) {
+                /* usleep expects microseconds */
+                g_usleep((initial_time + BUFFER_DELAY - current_time)*1000);
+	    } else {
+                goto final_phase;
+	    }
         }
     }
 

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Alexey Kardashevskiy @ 2014-02-07  5:39 UTC
  To: Dr. David Alan Gilbert; +Cc: Paolo Bonzini, qemu-devel, Alex Graf

On 02/06/2014 10:24 PM, Dr. David Alan Gilbert wrote:
> * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
>> On 02/06/2014 03:45 AM, Paolo Bonzini wrote:
>>> Il 05/02/2014 17:42, Dr. David Alan Gilbert ha scritto:
>>>> Because:
>>>>     * the code is still running and keeps redirtying a small handful of
>>>> pages
>>>>     * but because we've underestimated our available bandwidth we never stop
>>>>       it and just throw those pages across immediately
>>>
>>> Ok, I thought Alexey was saying we are not redirtying that handful of pages.
>>
>>
>> Every iteration we read the dirty map from KVM and send all dirty pages
>> across the stream.
>>
>>
>>> And in turn, this is because the max downtime we have is too low
>>> (especially for the default 32 MB/sec default bandwidth; that's also pretty
>>> low).
>>
>>
>> My understanding nooow is that in order to finish migration QEMU waits for
>> the earliest 100ms (BUFFER_DELAY) of continuously low trafic but due to
>> those pages getting dirty every time we read the dirty map, we transfer
>> more in these 100ms than we are actually allowed (>32MB/s or 320KB/100ms).
>> So we transfer-transfer-transfer, detect than we transfer too much, do
>> delay() and if max_size (calculated from actual transfer and downtime) for
>> the next iteration is less (by luck) than those 96 pages (uncompressed) -
>> we finish.
> 
> How about turning on some of the debug in migration.c; I suggest not all of
> it, but how about the :
> 
>             DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
>                     " bandwidth %g max_size %" PRId64 "\n",
>                     transferred_bytes, time_spent, bandwidth, max_size);
> 
> and also the s->dirty_bytes_rate value.  It would help check our assumptions.


It is always zero.


>> Increasing speed or/and downtime will help but still - we would not need
>> that if migration did not expect all 96 pages to have to be sent but did
>> have some smart way to detect that many are empty (so - compressed).
> 
> I think the other way would be to keep track of the compression ratio;
> if we knew how many pages we'd sent, and how much bandwidth that had used,
> we could divide the pending_bytes by that to get a *different* approximation.
> 
> However, the problem is that my understanding is we're trying to 
> _gurantee_ a maximum downtime, and to do that we have to use the calculation
> that assumes that all the pages we have are going to take the maximum time
> to transfer, and only go into downtime then.
> 
>> Literally, move is_zero_range() from ram_save_block() to
>> migration_bitmap_sync() and store this bit in some new pages_zero_map, for
>> example. But does it make a lot of sense?
> 
> The problem is that means checking whether it's zero more often; at the moment
> we check it's zero once during sending; to do what you're suggesting would
> mean we'd have to check every page is zero, every time we sync, and I think
> that's more often than we send.
> 
> Have you tried disabling the call to is_zero_range in arch_init.c's ram_block
> so that (as long as you have XBZRLE off) we don't do any compression; if 
> the theory is right then your problem should go away.


That was what I did first :)



> 
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 


-- 
Alexey

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Alexey Kardashevskiy @ 2014-02-07  5:42 UTC
  To: Paolo Bonzini, Dr. David Alan Gilbert; +Cc: qemu-devel, Alex Graf

On 02/07/2014 10:49 AM, Paolo Bonzini wrote:
> Il 06/02/2014 04:10, Alexey Kardashevskiy ha scritto:
>>>> Ok, I thought Alexey was saying we are not redirtying that handful of pages.
>>
>> Every iteration we read the dirty map from KVM and send all dirty pages
>> across the stream.
> 
> But we never finish because qemu_savevm_state_pending is only called _after_
> the g_usleep?  And thus there's time for the guest to redirty those pages.
> Does something like this fix it (of course for a proper pages the goto
> should be eliminated)?
> 
> diff --git a/migration.c b/migration.c
> index 7235c23..804c3bd 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -589,6 +589,7 @@ static void *migration_thread(void *opaque)
>              } else {
>                  int ret;
>  
> +final_phase:
>                  DPRINTF("done iterating\n");
>                  qemu_mutex_lock_iothread();
>                  start_time = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> @@ -640,10 +641,16 @@ static void *migration_thread(void *opaque)
>              qemu_file_reset_rate_limit(s->file);
>              initial_time = current_time;
>              initial_bytes = qemu_ftell(s->file);
> -        }
> -        if (qemu_file_rate_limit(s->file)) {
> -            /* usleep expects microseconds */
> -            g_usleep((initial_time + BUFFER_DELAY - current_time)*1000);
> +        } else if (qemu_file_rate_limit(s->file)) {
> +            pending_size = qemu_savevm_state_pending(s->file, max_size);
> +            DPRINTF("pending size %" PRIu64 " max %" PRIu64 "\n",
> +                    pending_size, max_size);
> +            if (pending_size >= max_size) {
> +                /* usleep expects microseconds */
> +                g_usleep((initial_time + BUFFER_DELAY - current_time)*1000);
> +	    } else {
> +                goto final_phase;
> +	    }
>          }
>      }


It does not make any difference, will have a closer look on Monday.


-- 
Alexey

* Re: [Qemu-devel] migration: broken ram_save_pending
From: Dr. David Alan Gilbert @ 2014-02-07  8:55 UTC
  To: Alexey Kardashevskiy; +Cc: Paolo Bonzini, qemu-devel, Alex Graf

* Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> On 02/06/2014 10:24 PM, Dr. David Alan Gilbert wrote:
> > * Alexey Kardashevskiy (aik@ozlabs.ru) wrote:
> >> On 02/06/2014 03:45 AM, Paolo Bonzini wrote:
> >>> Il 05/02/2014 17:42, Dr. David Alan Gilbert ha scritto:
> >>>> Because:
> >>>>     * the code is still running and keeps redirtying a small handful of
> >>>> pages
> >>>>     * but because we've underestimated our available bandwidth we never stop
> >>>>       it and just throw those pages across immediately
> >>>
> >>> Ok, I thought Alexey was saying we are not redirtying that handful of pages.
> >>
> >>
> >> Every iteration we read the dirty map from KVM and send all dirty pages
> >> across the stream.
> >>
> >>
> >>> And in turn, this is because the max downtime we have is too low
> >>> (especially for the default 32 MB/sec default bandwidth; that's also pretty
> >>> low).
> >>
> >>
> >> My understanding nooow is that in order to finish migration QEMU waits for
> >> the earliest 100ms (BUFFER_DELAY) of continuously low trafic but due to
> >> those pages getting dirty every time we read the dirty map, we transfer
> >> more in these 100ms than we are actually allowed (>32MB/s or 320KB/100ms).
> >> So we transfer-transfer-transfer, detect than we transfer too much, do
> >> delay() and if max_size (calculated from actual transfer and downtime) for
> >> the next iteration is less (by luck) than those 96 pages (uncompressed) -
> >> we finish.
> > 
> > How about turning on some of the debug in migration.c; I suggest not all of
> > it, but how about the :
> > 
> >             DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
> >                     " bandwidth %g max_size %" PRId64 "\n",
> >                     transferred_bytes, time_spent, bandwidth, max_size);
> > 
> > and also the s->dirty_bytes_rate value.  It would help check our assumptions.
> 
> 
> It is always zero.
> 
> 
> >> Increasing speed or/and downtime will help but still - we would not need
> >> that if migration did not expect all 96 pages to have to be sent but did
> >> have some smart way to detect that many are empty (so - compressed).
> > 
> > I think the other way would be to keep track of the compression ratio;
> > if we knew how many pages we'd sent, and how much bandwidth that had used,
> > we could divide the pending_bytes by that to get a *different* approximation.
> > 
> > However, the problem is that my understanding is we're trying to 
> > _gurantee_ a maximum downtime, and to do that we have to use the calculation
> > that assumes that all the pages we have are going to take the maximum time
> > to transfer, and only go into downtime then.
> > 
> >> Literally, move is_zero_range() from ram_save_block() to
> >> migration_bitmap_sync() and store this bit in some new pages_zero_map, for
> >> example. But does it make a lot of sense?
> > 
> > The problem is that means checking whether it's zero more often; at the moment
> > we check it's zero once during sending; to do what you're suggesting would
> > mean we'd have to check every page is zero, every time we sync, and I think
> > that's more often than we send.
> > 
> > Have you tried disabling the call to is_zero_range in arch_init.c's ram_block
> > so that (as long as you have XBZRLE off) we don't do any compression; if 
> > the theory is right then your problem should go away.
> 
> That was what I did first :)

Oh ok, then my theory on it is toast.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
