All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] dirty page count problem
@ 2017-07-21 17:28 Dr. David Alan Gilbert
  2017-07-21 19:07 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 3+ messages in thread
From: Dr. David Alan Gilbert @ 2017-07-21 17:28 UTC (permalink / raw)
  To: haozhong.zhang; +Cc: qemu-devel, alex.benee, peterx, lvivier, quintela

Hi,
  Git bisect is pointing to your patch 084140bd49:
  exec: fix access to ram_list.dirty_memory when sync dirty bitmap

trying to diagnose a bug I'm seeing; it looks like the dirty page count
is wrong for some reason.

Alex Bennée spotted a problem where the postcopy test would occasionally
fail under very heavy load;    attaching a debugger and it looks like
the problem is we have a migration_dirty_page count stuck at 2;
in the normal migration tests we don't spot this, because 2 pages is
smaller than the threshold to end migration and so an extra 2 pages
doesn't block it finishing.   However, with a very
small downtime setting (like we use in the postcopy test) and with
very low bandwidth (as when Alex ran the test on a very heavily loaded
machine) we end up never calling the bitmap sync again and never
completing the iteration.

I'm using the following addition to spot the problem:

diff --git a/migration/ram.c b/migration/ram.c
index e75f1050e4..3ddf884952 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         }
     } while (!pages && again);

+    if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
+    {
+        /* Should make this fail migration ? */
+        fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
+                __func__, rs->migration_dirty_pages);
+    }
+
     rs->last_seen_block = pss.block;
     rs->last_page = pss.page;

(which I might add as a test to fail a migration)

That test fails easily even on an unloaded machine:
tests/postcopy-test
/x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2
ram_find_and_save_block: no page found, yet dirty_pages=2
ram_find_and_save_block: no page found, yet dirty_pages=2
OK


I'll try and debug where our extra two pages are coming from.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] dirty page count problem
  2017-07-21 17:28 [Qemu-devel] dirty page count problem Dr. David Alan Gilbert
@ 2017-07-21 19:07 ` Dr. David Alan Gilbert
  2017-07-24 14:17   ` Alex Bennée
  0 siblings, 1 reply; 3+ messages in thread
From: Dr. David Alan Gilbert @ 2017-07-21 19:07 UTC (permalink / raw)
  To: haozhong.zhang; +Cc: qemu-devel, alex.bennee, peterx, lvivier, quintela

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> Hi,
>   Git bisect is pointing to your patch 084140bd49:
>   exec: fix access to ram_list.dirty_memory when sync dirty bitmap
> 
> trying to diagnose a bug I'm seeing; it looks like the dirty page count
> is wrong for some reason.
> 
> Alex Bennée spotted a problem where the postcopy test would occasionally
> fail under very heavy load;    attaching a debugger and it looks like
> the problem is we have a migration_dirty_page count stuck at 2;
> in the normal migration tests we don't spot this, because 2 pages is
> smaller than the threshold to end migration and so an extra 2 pages
> doesn't block it finishing.   However, with a very
> small downtime setting (like we use in the postcopy test) and with
> very low bandwidth (as when Alex ran the test on a very heavily loaded
> machine) we end up never calling the bitmap sync again and never
> completing the iteration.
> 
> I'm using the following addition to spot the problem:

I think I have the fix for this, or at least the reason; the alignment
check also needs fixing up, but the 08414 patch missed it.

  Alex: Please test the fix included at the bottom

> 
> diff --git a/migration/ram.c b/migration/ram.c
> index e75f1050e4..3ddf884952 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>          }
>      } while (!pages && again);
> 
> +    if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
> +    {
> +        /* Should make this fail migration ? */
> +        fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
> +                __func__, rs->migration_dirty_pages);
> +    }
> +
>      rs->last_seen_block = pss.block;
>      rs->last_page = pss.page;
> 
> (which I might add as a test to fail a migration)
> 
> That test fails easily even on an unloaded machine:
> tests/postcopy-test
> /x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2
> ram_find_and_save_block: no page found, yet dirty_pages=2
> ram_find_and_save_block: no page found, yet dirty_pages=2
> OK
> 
> 
> I'll try and debug where our extra two pages are coming from.

From b2c86818eaaaf790246f21ae21fead903c1d64f0 Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Date: Fri, 21 Jul 2017 19:59:01 +0100
Subject: [PATCH] cpu_physical_memory_sync_dirty_bitmap: Fix alignment check

This code has an optimised, word aligned version, and a boring
unaligned version.  Recently 084140bd498909 fixed a missing offset
addition from the core of both versions.  However, the offset isn't
necessarily aligned and thus the choice between the two versions
needs fixing up to also include the offset.

DANGER: Friday evening, post-dinner lightly tested patch.

Symptom:
  A few stuck unsent pages during migration; not normally noticed
unless under very low bandwidth.

Fixes: 084140bd498909
---
 include/exec/ram_addr.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index c04f4f6..c802f7f 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -378,15 +378,16 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
 {
     ram_addr_t addr;
     unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
+    unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
     uint64_t num_dirty = 0;
     unsigned long *dest = rb->bmap;
 
     /* start address is aligned at the start of a word? */
-    if (((page * BITS_PER_LONG) << TARGET_PAGE_BITS) == start) {
+    if (((word * BITS_PER_LONG) << TARGET_PAGE_BITS) ==
+         (start + rb->offset)) {
         int k;
         int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS);
         unsigned long * const *src;
-        unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
         unsigned long idx = (word * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE;
         unsigned long offset = BIT_WORD((word * BITS_PER_LONG) %
                                         DIRTY_MEMORY_BLOCK_SIZE);
-- 
1.8.3.1

> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] dirty page count problem
  2017-07-21 19:07 ` Dr. David Alan Gilbert
@ 2017-07-24 14:17   ` Alex Bennée
  0 siblings, 0 replies; 3+ messages in thread
From: Alex Bennée @ 2017-07-24 14:17 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: haozhong.zhang, qemu-devel, peterx, lvivier, quintela


Dr. David Alan Gilbert <dgilbert@redhat.com> writes:

> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>> Hi,
>>   Git bisect is pointing to your patch 084140bd49:
>>   exec: fix access to ram_list.dirty_memory when sync dirty bitmap
>>
>> trying to diagnose a bug I'm seeing; it looks like the dirty page count
>> is wrong for some reason.
>>
>> Alex Bennée spotted a problem where the postcopy test would occasionally
>> fail under very heavy load;    attaching a debugger and it looks like
>> the problem is we have a migration_dirty_page count stuck at 2;
>> in the normal migration tests we don't spot this, because 2 pages is
>> smaller than the threshold to end migration and so an extra 2 pages
>> doesn't block it finishing.   However, with a very
>> small downtime setting (like we use in the postcopy test) and with
>> very low bandwidth (as when Alex ran the test on a very heavily loaded
>> machine) we end up never calling the bitmap sync again and never
>> completing the iteration.
>>
>> I'm using the following addition to spot the problem:
>
> I think I have the fix for this, or at least the reason; the alignment
> check also needs fixing up, but the 08414 patch missed it.
>
>   Alex: Please test the fix included at the bottom
>
>>
>> diff --git a/migration/ram.c b/migration/ram.c
>> index e75f1050e4..3ddf884952 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1350,6 +1350,13 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
>>          }
>>      } while (!pages && again);
>>
>> +    if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
>> +    {
>> +        /* Should make this fail migration ? */
>> +        fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
>> +                __func__, rs->migration_dirty_pages);
>> +    }
>> +
>>      rs->last_seen_block = pss.block;
>>      rs->last_page = pss.page;
>>
>> (which I might add as a test to fail a migration)
>>
>> That test fails easily even on an unloaded machine:
>> tests/postcopy-test
>> /x86_64/postcopy: ram_find_and_save_block: no page found, yet dirty_pages=2
>> ram_find_and_save_block: no page found, yet dirty_pages=2
>> ram_find_and_save_block: no page found, yet dirty_pages=2
>> OK
>>
>>
>> I'll try and debug where our extra two pages are coming from.
>
> From b2c86818eaaaf790246f21ae21fead903c1d64f0 Mon Sep 17 00:00:00 2001
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Date: Fri, 21 Jul 2017 19:59:01 +0100
> Subject: [PATCH] cpu_physical_memory_sync_dirty_bitmap: Fix alignment check
>
> This code has an optimised, word aligned version, and a boring
> unaligned version.  Recently 084140bd498909 fixed a missing offset
> addition from the core of both versions.  However, the offset isn't
> necessarily aligned and thus the choice between the two versions
> needs fixing up to also include the offset.
>
> DANGER: Friday evening, post-dinner lightly tested patch.
>
> Symptom:
>   A few stuck unsent pages during migration; not normally noticed
> unless under very low bandwidth.
>
> Fixes: 084140bd498909
> ---
>  include/exec/ram_addr.h | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index c04f4f6..c802f7f 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -378,15 +378,16 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(RAMBlock *rb,
>  {
>      ram_addr_t addr;
>      unsigned long page = BIT_WORD(start >> TARGET_PAGE_BITS);
> +    unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>      uint64_t num_dirty = 0;
>      unsigned long *dest = rb->bmap;
>
>      /* start address is aligned at the start of a word? */
> -    if (((page * BITS_PER_LONG) << TARGET_PAGE_BITS) == start) {
> +    if (((word * BITS_PER_LONG) << TARGET_PAGE_BITS) ==
> +         (start + rb->offset)) {
>          int k;
>          int nr = BITS_TO_LONGS(length >> TARGET_PAGE_BITS);
>          unsigned long * const *src;
> -        unsigned long word = BIT_WORD((start + rb->offset) >> TARGET_PAGE_BITS);
>          unsigned long idx = (word * BITS_PER_LONG) / DIRTY_MEMORY_BLOCK_SIZE;
>          unsigned long offset = BIT_WORD((word * BITS_PER_LONG) %
>                                          DIRTY_MEMORY_BLOCK_SIZE);
> --
> 1.8.3.1
>
>> Dave
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

I saw no hangs, there where 4 times it failed but that might be another
test crashing. I'm investigating that so a cautious:

Tested-by: Alex Bennée <alex.bennee@linaro.org>

--
Alex Bennée

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2017-07-24 14:17 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-21 17:28 [Qemu-devel] dirty page count problem Dr. David Alan Gilbert
2017-07-21 19:07 ` Dr. David Alan Gilbert
2017-07-24 14:17   ` Alex Bennée

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.