* [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
[not found] <200906221549.n5MFn3Qd015389@d03av02.boulder.ibm.com>
@ 2009-06-22 16:12 ` Avi Kivity
2009-06-22 16:25 ` Anthony Liguori
0 siblings, 1 reply; 13+ messages in thread
From: Avi Kivity @ 2009-06-22 16:12 UTC (permalink / raw)
To: Anthony Liguori; +Cc: qemu-devel
On 06/22/2009 06:51 PM, Anthony Liguori wrote:
> From: Anthony Liguori<aliguori@us.ibm.com>
>
> Otherwise, after migration, we end up with a much larger RSS size then we
> ought to have.
>
>
We have the same issue on the migration source node. I don't see a
simple way to solve it, though.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 16:12 ` [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away Avi Kivity
@ 2009-06-22 16:25 ` Anthony Liguori
2009-06-22 16:38 ` Avi Kivity
0 siblings, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2009-06-22 16:25 UTC (permalink / raw)
To: Avi Kivity; +Cc: Anthony Liguori, qemu-devel
Avi Kivity wrote:
> On 06/22/2009 06:51 PM, Anthony Liguori wrote:
>> From: Anthony Liguori<aliguori@us.ibm.com>
>>
>> Otherwise, after migration, we end up with a much larger RSS size
>> then we
>> ought to have.
>>
>>
>
> We have the same issue on the migration source node. I don't see a
> simple way to solve it, though.
I don't follow. In this case, the issue is:
1) Start a guest with 1024, balloon down to 128MB. RSS size is now ~128MB
2) Live migrate to a different node
3) RSS on different node jumps to ~1GB
4) Weep at all your lost memory
Xen had a similar issue. This ends up biting people who overcommit
their VMs via ballooning, live migration, and badness ensues. At least
for us, the error is swapping but madvise also avoids the issue by never
consuming that memory to begin with.
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 16:25 ` Anthony Liguori
@ 2009-06-22 16:38 ` Avi Kivity
2009-06-22 16:58 ` Anthony Liguori
2009-06-22 17:03 ` Anthony Liguori
0 siblings, 2 replies; 13+ messages in thread
From: Avi Kivity @ 2009-06-22 16:38 UTC (permalink / raw)
To: Anthony Liguori; +Cc: Anthony Liguori, qemu-devel
On 06/22/2009 07:25 PM, Anthony Liguori wrote:
> Avi Kivity wrote:
>> On 06/22/2009 06:51 PM, Anthony Liguori wrote:
>>> From: Anthony Liguori<aliguori@us.ibm.com>
>>>
>>> Otherwise, after migration, we end up with a much larger RSS size
>>> then we
>>> ought to have.
>>>
>>
>> We have the same issue on the migration source node. I don't see a
>> simple way to solve it, though.
>
> I don't follow. In this case, the issue is:
>
> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now
> ~128MB
> 2) Live migrate to a different node
> 3) RSS on different node jumps to ~1GB
3.5) RSS on source node jumps to ~1GB, since reading the page
instantiates the pte
> 4) Weep at all your lost memory
4.5) And at the swapping going on in the source node
>
> Xen had a similar issue. This ends up biting people who overcommit
> their VMs via ballooning, live migration, and badness ensues. At
> least for us, the error is swapping but madvise also avoids the issue
> by never consuming that memory to begin with.
Right. I'd love to do madvise() on the source node as well if we fault
in a page and find out it's zero, but the guest (and aio) is still
running and we might drop live data. We need a
madvise(MADV_DONTNEED_IFZERO), or a mincore() flag that tells us if the
page exists (vs. swapped). ksm would also do this, but it is overkill
for some applications.
Note that the patch contains a small bug -- the kernel is allowed to
ignore the advise according to the manual page, so it's better to
memset() the memory before dropping it.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 16:38 ` Avi Kivity
@ 2009-06-22 16:58 ` Anthony Liguori
2009-06-22 17:12 ` Avi Kivity
2009-06-22 17:03 ` Anthony Liguori
1 sibling, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2009-06-22 16:58 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
Avi Kivity wrote:
> On 06/22/2009 07:25 PM, Anthony Liguori wrote:
>
>>
>> Xen had a similar issue. This ends up biting people who overcommit
>> their VMs via ballooning, live migration, and badness ensues. At
>> least for us, the error is swapping but madvise also avoids the issue
>> by never consuming that memory to begin with.
>
> Right. I'd love to do madvise() on the source node as well if we
> fault in a page and find out it's zero, but the guest (and aio) is
> still running and we might drop live data. We need a
> madvise(MADV_DONTNEED_IFZERO), or a mincore() flag that tells us if
> the page exists (vs. swapped). ksm would also do this, but it is
> overkill for some applications.
>
> Note that the patch contains a small bug -- the kernel is allowed to
> ignore the advise according to the manual page, so it's better to
> memset() the memory before dropping it.
Hrm, that's not quite how I interpreted the man page.
"This call
does not influence the semantics of the application (except in the case
of MADV_DONTNEED), but may influence its performance. The kernel is
free to ignore the advice."
MADV_DONTNEED is called out as changing the application semantics.
Specifically, I think the kernel has to zero-fill even if it choose to
ignore the advice.
I limited the guard to Linux specifically because I was unsure about
that behavior but it would be good to clarify if anyone knows how.
--
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 16:38 ` Avi Kivity
2009-06-22 16:58 ` Anthony Liguori
@ 2009-06-22 17:03 ` Anthony Liguori
2009-06-22 17:20 ` Avi Kivity
2009-06-22 19:38 ` Paul Brook
1 sibling, 2 replies; 13+ messages in thread
From: Anthony Liguori @ 2009-06-22 17:03 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
Avi Kivity wrote:
> On 06/22/2009 07:25 PM, Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> On 06/22/2009 06:51 PM, Anthony Liguori wrote:
>>>> From: Anthony Liguori<aliguori@us.ibm.com>
>>>>
>>>> Otherwise, after migration, we end up with a much larger RSS size
>>>> then we
>>>> ought to have.
>>>>
>>>
>>> We have the same issue on the migration source node. I don't see a
>>> simple way to solve it, though.
>>
>> I don't follow. In this case, the issue is:
>>
>> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now
>> ~128MB
>> 2) Live migrate to a different node
>> 3) RSS on different node jumps to ~1GB
>
> 3.5) RSS on source node jumps to ~1GB, since reading the page
> instantiates the pte
Surely we can do better here...
For TCG, we always know when memory is dirty and we can check it
atomically. So we know whether a page has changed since we knew it was
last zero. We basically need a ZERO_DIRTY bit. All memory initially
carries this bit and ballooning also sets the bit. During live
migration, we can check the dirty bit first.
For KVM, we would have to enable dirty tracking always to keep
ZERO_DIRTY up to date. Since write faults are going to happen anyway at
start up, perhaps the cost of doing this wouldn't be so bad?
>
> Right. I'd love to do madvise() on the source node as well if we
> fault in a page and find out it's zero, but the guest (and aio) is
> still running and we might drop live data. We need a
> madvise(MADV_DONTNEED_IFZERO), or a mincore() flag that tells us if
> the page exists (vs. swapped). ksm would also do this, but it is
> overkill for some applications.
For KVM, we could just have an KVM_IOCTL_MADVISE_IF_NOT_DIRTY, but
that's a bad solution. That's more or less the desired semantics though.
--
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 16:58 ` Anthony Liguori
@ 2009-06-22 17:12 ` Avi Kivity
0 siblings, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2009-06-22 17:12 UTC (permalink / raw)
To: Anthony Liguori; +Cc: qemu-devel
On 06/22/2009 07:58 PM, Anthony Liguori wrote:
>> Note that the patch contains a small bug -- the kernel is allowed to
>> ignore the advise according to the manual page, so it's better to
>> memset() the memory before dropping it.
>
>
> Hrm, that's not quite how I interpreted the man page.
>
> "This call
> does not influence the semantics of the application (except in the case
> of MADV_DONTNEED), but may influence its performance. The kernel is
> free to ignore the advice."
>
> MADV_DONTNEED is called out as changing the application semantics.
> Specifically, I think the kernel has to zero-fill even if it choose to
> ignore the advice.
>
> I limited the guard to Linux specifically because I was unsure about
> that behavior but it would be good to clarify if anyone knows how.
This is not posix (there is a POSIX_MADV_DONTNEED which appears to be
non-destructive (and a better complement to the other advices)), so the
only references are the manual page and the code. I don't see Linux
ever avoiding brake brake brake the kernel certainly can ignore the advice:
static long madvise_dontneed(struct vm_area_struct * vma,
struct vm_area_struct ** prev,
unsigned long start, unsigned long end)
{
*prev = vma;
if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
return -EINVAL;
if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
struct zap_details details = {
.nonlinear_vma = vma,
.last_index = ULONG_MAX,
};
zap_page_range(vma, start, end - start, &details);
} else
zap_page_range(vma, start, end - start, NULL);
return 0;
}
it won't do it silently, but we don't check the return code either.
Let's take the safe path on this and zero the page.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 17:03 ` Anthony Liguori
@ 2009-06-22 17:20 ` Avi Kivity
2009-06-22 17:37 ` Anthony Liguori
2009-06-22 17:44 ` Anthony Liguori
2009-06-22 19:38 ` Paul Brook
1 sibling, 2 replies; 13+ messages in thread
From: Avi Kivity @ 2009-06-22 17:20 UTC (permalink / raw)
To: Anthony Liguori; +Cc: qemu-devel
On 06/22/2009 08:03 PM, Anthony Liguori wrote:
>>>
>>> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now
>>> ~128MB
>>> 2) Live migrate to a different node
>>> 3) RSS on different node jumps to ~1GB
>>
>> 3.5) RSS on source node jumps to ~1GB, since reading the page
>> instantiates the pte
> I don't follow. In this case, the issue is:
>
> Surely we can do better here...
>
> For TCG, we always know when memory is dirty and we can check it
> atomically. So we know whether a page has changed since we knew it
> was last zero. We basically need a ZERO_DIRTY bit. All memory
> initially carries this bit and ballooning also sets the bit. During
> live migration, we can check the dirty bit first.
You mean, a NONZERO bit which is cleared by ballooning and set on any
write. This will work naturally with the qemu dirty bytemap.
>
> For KVM, we would have to enable dirty tracking always to keep
> ZERO_DIRTY up to date. Since write faults are going to happen anyway
> at start up, perhaps the cost of doing this wouldn't be so bad?
You need to do this on the source node. Unfortunately, there's no way
to initialize the values racelessly when you start live migration
without introducing a new ioctl. I'd like a more general solution
rather than something that targets this specific problem.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 17:20 ` Avi Kivity
@ 2009-06-22 17:37 ` Anthony Liguori
2009-06-22 18:01 ` Avi Kivity
2009-06-22 17:44 ` Anthony Liguori
1 sibling, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2009-06-22 17:37 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
Avi Kivity wrote:
> You mean, a NONZERO bit which is cleared by ballooning and set on any
> write. This will work naturally with the qemu dirty bytemap.
Yes.
>>
>> For KVM, we would have to enable dirty tracking always to keep
>> ZERO_DIRTY up to date. Since write faults are going to happen anyway
>> at start up, perhaps the cost of doing this wouldn't be so bad?
>
> You need to do this on the source node. Unfortunately, there's no way
> to initialize the values racelessly when you start live migration
> without introducing a new ioctl. I'd like a more general solution
> rather than something that targets this specific problem.
I'm saying, always enable dirty tracking from start-of-day with KVM.
Then the QEMU dirty bitmap is always accurate. The trick is to never
start resetting it until you need to do live migration.
The idea being that once the dirty bits have been set, the overhead
(hopefully) should be zero.
--
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 17:20 ` Avi Kivity
2009-06-22 17:37 ` Anthony Liguori
@ 2009-06-22 17:44 ` Anthony Liguori
2009-06-22 18:04 ` Avi Kivity
1 sibling, 1 reply; 13+ messages in thread
From: Anthony Liguori @ 2009-06-22 17:44 UTC (permalink / raw)
To: Avi Kivity; +Cc: qemu-devel
[-- Attachment #1: Type: text/plain, Size: 46 bytes --]
See attached.
--
Regards,
Anthony Liguori
[-- Attachment #2: zero-mem-madvise.patch --]
[-- Type: text/x-patch, Size: 1243 bytes --]
commit a70787dc8200dad03f3d3a85fdb40497f336f12b
Author: Anthony Liguori <aliguori@us.ibm.com>
Date: Mon Jun 22 12:39:00 2009 -0500
Make sure to zero out memory before calling madvise to increase robustness
Avi pointed out that it's not entirely safe to rely on madvise zeroing out
memory. So let's do it explicitly before calling madvise.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
diff --git a/vl.c b/vl.c
index 60a00e1..1c077b4 100644
--- a/vl.c
+++ b/vl.c
@@ -3358,13 +3358,13 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
if (flags & RAM_SAVE_FLAG_COMPRESS) {
uint8_t ch = qemu_get_byte(f);
-#if defined(__linux__)
+ memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE);
+#ifndef _WIN32
if (ch == 0 &&
(!kvm_enabled() || kvm_has_sync_mmu())) {
madvise(qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE, MADV_DONTNEED);
- } else
+ }
#endif
- memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE);
} else if (flags & RAM_SAVE_FLAG_PAGE)
qemu_get_buffer(f, qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE);
} while (!(flags & RAM_SAVE_FLAG_EOS));
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 17:37 ` Anthony Liguori
@ 2009-06-22 18:01 ` Avi Kivity
0 siblings, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2009-06-22 18:01 UTC (permalink / raw)
To: Anthony Liguori; +Cc: qemu-devel
On 06/22/2009 08:37 PM, Anthony Liguori wrote:
> I'm saying, always enable dirty tracking from start-of-day with KVM.
> Then the QEMU dirty bitmap is always accurate. The trick is to never
> start resetting it until you need to do live migration.
That disables large pages and incurs a performance penalty (needlessly
setting bits).
> The idea being that once the dirty bits have been set, the overhead
> (hopefully) should be zero.
>
It's not zero. You also need to clear the bits somehow when you balloon.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 17:44 ` Anthony Liguori
@ 2009-06-22 18:04 ` Avi Kivity
0 siblings, 0 replies; 13+ messages in thread
From: Avi Kivity @ 2009-06-22 18:04 UTC (permalink / raw)
To: Anthony Liguori; +Cc: qemu-devel
On 06/22/2009 08:44 PM, Anthony Liguori wrote:
> See attached.
> Author: Anthony Liguori<aliguori@us.ibm.com>
> Date: Mon Jun 22 12:39:00 2009 -0500
>
> Make sure to zero out memory before calling madvise to increase robustness
>
> Avi pointed out that it's not entirely safe to rely on madvise zeroing out
> memory. So let's do it explicitly before calling madvise.
>
> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>
> diff --git a/vl.c b/vl.c
> index 60a00e1..1c077b4 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -3358,13 +3358,13 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>
> if (flags& RAM_SAVE_FLAG_COMPRESS) {
> uint8_t ch = qemu_get_byte(f);
> -#if defined(__linux__)
> + memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE);
> +#ifndef _WIN32
> if (ch == 0&&
> (!kvm_enabled() || kvm_has_sync_mmu())) {
> madvise(qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE, MADV_DONTNEED);
> - } else
> + }
> #endif
> - memset(qemu_get_ram_ptr(addr), ch, TARGET_PAGE_SIZE);
> } else if (flags& RAM_SAVE_FLAG_PAGE)
> qemu_get_buffer(f, qemu_get_ram_ptr(addr), TARGET_PAGE_SIZE);
> } while (!(flags& RAM_SAVE_FLAG_EOS));
>
Pretty similar to my December patch... which had another case, is it
missing?
http://article.gmane.org/gmane.comp.emulators.qemu/34523
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 17:03 ` Anthony Liguori
2009-06-22 17:20 ` Avi Kivity
@ 2009-06-22 19:38 ` Paul Brook
2009-06-22 19:49 ` Anthony Liguori
1 sibling, 1 reply; 13+ messages in thread
From: Paul Brook @ 2009-06-22 19:38 UTC (permalink / raw)
To: qemu-devel; +Cc: Anthony Liguori, Avi Kivity
> >>> We have the same issue on the migration source node. I don't see a
> >>> simple way to solve it, though.
> >>
> >> I don't follow. In this case, the issue is:
> >>
> >> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now
> >> ~128MB
> >> 2) Live migrate to a different node
> >> 3) RSS on different node jumps to ~1GB
> >
> > 3.5) RSS on source node jumps to ~1GB, since reading the page
> > instantiates the pte
>
> Surely we can do better here...
Huh, I'm surprised we have to. I was expecting linux to populate the region
with CoW mappings of a single zero page, but apparently not. I guess reading a
zero page without writing to it is a fairly rare case, and it's usually better
to instantiate a writable page on the first access.
Paul
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away
2009-06-22 19:38 ` Paul Brook
@ 2009-06-22 19:49 ` Anthony Liguori
0 siblings, 0 replies; 13+ messages in thread
From: Anthony Liguori @ 2009-06-22 19:49 UTC (permalink / raw)
To: Paul Brook; +Cc: qemu-devel, Avi Kivity
Paul Brook wrote:
>>>>> We have the same issue on the migration source node. I don't see a
>>>>> simple way to solve it, though.
>>>>>
>>>> I don't follow. In this case, the issue is:
>>>>
>>>> 1) Start a guest with 1024, balloon down to 128MB. RSS size is now
>>>> ~128MB
>>>> 2) Live migrate to a different node
>>>> 3) RSS on different node jumps to ~1GB
>>>>
>>> 3.5) RSS on source node jumps to ~1GB, since reading the page
>>> instantiates the pte
>>>
>> Surely we can do better here...
>>
>
> Huh, I'm surprised we have to. I was expecting linux to populate the region
> with CoW mappings of a single zero page, but apparently not. I guess reading a
> zero page without writing to it is a fairly rare case, and it's usually better
> to instantiate a writable page on the first access.
>
I did too. If you mmap(/dev/zero) you'll get that behavior but not with
a normal allocation.
Regards,
Anthony Liguori
> Paul
>
--
Regards,
Anthony Liguori
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2009-06-22 19:49 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200906221549.n5MFn3Qd015389@d03av02.boulder.ibm.com>
2009-06-22 16:12 ` [Qemu-devel] Re: [Qemu-commits] [COMMIT 3086844] Instead of writing a zero page, madvise it away Avi Kivity
2009-06-22 16:25 ` Anthony Liguori
2009-06-22 16:38 ` Avi Kivity
2009-06-22 16:58 ` Anthony Liguori
2009-06-22 17:12 ` Avi Kivity
2009-06-22 17:03 ` Anthony Liguori
2009-06-22 17:20 ` Avi Kivity
2009-06-22 17:37 ` Anthony Liguori
2009-06-22 18:01 ` Avi Kivity
2009-06-22 17:44 ` Anthony Liguori
2009-06-22 18:04 ` Avi Kivity
2009-06-22 19:38 ` Paul Brook
2009-06-22 19:49 ` Anthony Liguori
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.