* [Qemu-devel] Performance regression using KVM/ARM
@ 2016-04-21 16:23 Christoffer Dall
  2016-04-21 19:50 ` Alexander Graf
  2016-04-21 21:58 ` Laszlo Ersek
  0 siblings, 2 replies; 12+ messages in thread
From: Christoffer Dall @ 2016-04-21 16:23 UTC (permalink / raw)
  To: Peter Maydell, Michael S. Tsirkin, qemu-devel
  Cc: Marc Zyngier, Alexander Graf

Hi,

Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
2015-09-10) had the unfortunate side effect that memory slots registered
with KVM no longer contain a userspace address that is aligned to a 2M
boundary, causing the use of THP to fail in the kernel.

I fail to see where in the QEMU code we should be asking for a 2M
alignment of our memory region.  Can someone help point me to the
right place to fix this or suggest a patch?

This causes a performance regression of hackbench on KVM/ARM of about 62%
compared to the workload running with THP.

We have verified that this is indeed the cause of the failure by adding
various prints to QEMU and the kernel, but unfortunately my QEMU
knowledge is not sufficient for me to fix it myself.
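
For illustration, the property in question is simply whether the userspace
address that ends up in the KVM memory slot is 2M-aligned; a hypothetical
debug helper (not the actual prints referred to above) might look like:

#include <inttypes.h>
#include <stdio.h>

#define THP_ALIGN (2UL * 1024 * 1024)   /* assumed 2M huge page size */

/* Print whether addr sits on a 2M boundary, and its offset if not. */
void report_thp_alignment(const char *name, void *addr)
{
    uintptr_t offset = (uintptr_t)addr & (THP_ALIGN - 1);

    printf("%s: %p is %s2M-aligned (offset 0x%" PRIxPTR ")\n",
           name, addr, offset ? "NOT " : "", offset);
}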

Any help would be much appreciated!

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-21 16:23 [Qemu-devel] Performance regression using KVM/ARM Christoffer Dall
@ 2016-04-21 19:50 ` Alexander Graf
  2016-04-22 10:01   ` Christoffer Dall
  2016-04-21 21:58 ` Laszlo Ersek
  1 sibling, 1 reply; 12+ messages in thread
From: Alexander Graf @ 2016-04-21 19:50 UTC (permalink / raw)
  To: Christoffer Dall, Peter Maydell, Michael S. Tsirkin, qemu-devel
  Cc: Marc Zyngier



On 21.04.16 18:23, Christoffer Dall wrote:
> Hi,
> 
> Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
> 2015-09-10) had the unfortunate side effect that memory slots registered
> with KVM no longer contain a userspace address that is aligned to a 2M
> boundary, causing the use of THP to fail in the kernel.
> 
> I fail to see where in the QEMU code we should be asking for a 2M
> alignment of our memory region.  Can someone help point me to the
> right place to fix this or suggest a patch?
> 
> This causes a performance regression of hackbench on KVM/ARM of about 62%
> compared to the workload running with THP.
> 
> We have verified that this is indeed the cause of the failure by adding
> various prints to QEMU and the kernel, but unfortunately my QEMU
> knowledge is not sufficient for me to fix it myself.
> 
> Any help would be much appreciated!

The code changed quite heavily since I last looked at it, but could you
please try whether the (untested) patch below makes a difference?


Alex

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 0b4cc7f..24e73b1 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -36,7 +36,7 @@ size_t qemu_fd_getpagesize(int fd)
     }
 #endif

-    return getpagesize();
+    return 2 * 1024 * 1024;
 }

 void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
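
For context, the alignment handling under discussion boils down to the
reserve-then-map trick sketched below (a simplification, not the actual
util/mmap-alloc.c code; it ignores the fd and the ppc64 special case):

#include <stdint.h>
#include <sys/mman.h>

void *ram_mmap_aligned(size_t size, size_t align)
{
    /* Over-reserve PROT_NONE space so that an align-ed start address is
     * guaranteed to exist inside the reservation. */
    size_t total = size + align;
    void *guard = mmap(NULL, total, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    uintptr_t start;
    void *ptr;

    if (guard == MAP_FAILED) {
        return MAP_FAILED;
    }
    /* Map the usable RAM at the first align-ed address inside the
     * reservation; the remaining PROT_NONE pages act as guard pages. */
    start = ((uintptr_t)guard + align - 1) & ~(uintptr_t)(align - 1);
    ptr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (ptr == MAP_FAILED) {
        munmap(guard, total);
    }
    return ptr;
}

With align == 2M the returned pointer is 2M-aligned; with align equal to
the base page size it usually is not, which is the situation being
debugged in this thread.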

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-21 16:23 [Qemu-devel] Performance regression using KVM/ARM Christoffer Dall
  2016-04-21 19:50 ` Alexander Graf
@ 2016-04-21 21:58 ` Laszlo Ersek
  2016-04-22 10:02   ` Christoffer Dall
  1 sibling, 1 reply; 12+ messages in thread
From: Laszlo Ersek @ 2016-04-21 21:58 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Peter Maydell, Michael S. Tsirkin, qemu-devel, Marc Zyngier,
	Alexander Graf

[-- Attachment #1: Type: text/plain, Size: 1154 bytes --]

On 04/21/16 18:23, Christoffer Dall wrote:
> Hi,
> 
> Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
> 2015-09-10) had the unfortunate side effect that memory slots registered
> with KVM no longer contain a userspace address that is aligned to a 2M
> boundary, causing the use of THP to fail in the kernel.
> 
> I fail to see where in the QEMU code we should be asking for a 2M
> alignment of our memory region.  Can someone help point me to the
> right place to fix this or suggest a patch?
> 
> This causes a performance regression of hackbench on KVM/ARM of about 62%
> compared to the workload running with THP.
> 
> We have verified that this is indeed the cause of the failure by adding
> various prints to QEMU and the kernel, but unfortunately my QEMU
> knowledge is not sufficient for me to fix it myself.
> 
> Any help would be much appreciated!

Can you please test the attached series?

(Note that I'm only interested in solving this problem as a productive
distraction, so if the patches don't work, or require a lot of massaging
for merging, I'll just drop them (or, preferably, give them to someone
else).)

Thanks
Laszlo

[-- Attachment #2: 0001-util-mmap-alloc-take-alignment-in-qemu_ram_munmap.patch --]
[-- Type: text/x-patch, Size: 2798 bytes --]

From b6afd850a8edec1da7eafcbb4705ec752ad6021b Mon Sep 17 00:00:00 2001
From: Laszlo Ersek <lersek@redhat.com>
Date: Thu, 21 Apr 2016 22:50:40 +0200
Subject: [PATCH 1/3] util/mmap-alloc: take alignment in qemu_ram_munmap()

This parameter will become necessary later in the series. At this point,
it is not used for anything.

The alignment passed to qemu_ram_munmap() must be identical to the
alignment that was used earlier for allocating the area (with
qemu_ram_mmap()) that is now being released. There are two callers:

- qemu_anon_ram_free(): its peer is qemu_anon_ram_alloc(), which uses
  QEMU_VMALLOC_ALIGN as alignment,

- reclaim_ramblock(), via qemu_ram_free() -> call_rcu(), when
  (block->fd>=0): in this case its peer is file_ram_alloc(), which uses
  qemu_fd_getpagesize() as alignment.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
 include/qemu/mmap-alloc.h | 2 +-
 exec.c                    | 3 ++-
 util/mmap-alloc.c         | 2 +-
 util/oslib-posix.c        | 2 +-
 4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/qemu/mmap-alloc.h b/include/qemu/mmap-alloc.h
index 0899b2f01e97..47ce60e6623c 100644
--- a/include/qemu/mmap-alloc.h
+++ b/include/qemu/mmap-alloc.h
@@ -7,6 +7,6 @@ size_t qemu_fd_getpagesize(int fd);
 
 void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared);
 
-void qemu_ram_munmap(void *ptr, size_t size);
+void qemu_ram_munmap(void *ptr, size_t size, size_t align);
 
 #endif
diff --git a/exec.c b/exec.c
index c4f9036184d8..d043cc4e496e 100644
--- a/exec.c
+++ b/exec.c
@@ -1762,7 +1762,8 @@ static void reclaim_ramblock(RAMBlock *block)
         xen_invalidate_map_cache_entry(block->host);
 #ifndef _WIN32
     } else if (block->fd >= 0) {
-        qemu_ram_munmap(block->host, block->max_length);
+        qemu_ram_munmap(block->host, block->max_length,
+                        qemu_fd_getpagesize(block->fd));
         close(block->fd);
 #endif
     } else {
diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 0b4cc7f7f117..63bb1e215c6e 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -101,7 +101,7 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
     return ptr;
 }
 
-void qemu_ram_munmap(void *ptr, size_t size)
+void qemu_ram_munmap(void *ptr, size_t size, size_t align)
 {
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index 20ca141dec11..c1a196d71f02 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -142,7 +142,7 @@ void qemu_vfree(void *ptr)
 void qemu_anon_ram_free(void *ptr, size_t size)
 {
     trace_qemu_anon_ram_free(ptr, size);
-    qemu_ram_munmap(ptr, size);
+    qemu_ram_munmap(ptr, size, QEMU_VMALLOC_ALIGN);
 }
 
 void qemu_set_block(int fd)
-- 
1.8.3.1


[-- Attachment #3: 0002-util-mmap-alloc-factor-out-size_with_guard_pages.patch --]
[-- Type: text/x-patch, Size: 2715 bytes --]

From 09d67ba76cb5027f69e9635c07b899546d121b56 Mon Sep 17 00:00:00 2001
From: Laszlo Ersek <lersek@redhat.com>
Date: Thu, 21 Apr 2016 23:11:22 +0200
Subject: [PATCH 2/3] util/mmap-alloc: factor out size_with_guard_pages()

The qemu_ram_mmap() and qemu_ram_munmap() functions currently hard-code
the number of guard pages as 1. Expressed in bytes:

  final_size = size + getpagesize();

Factor out this formula to a static helper function, so that we can
customize it in the next patch. For now there is no change in observable
behavior.

Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
 util/mmap-alloc.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 63bb1e215c6e..41e36f74d7be 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -39,13 +39,19 @@ size_t qemu_fd_getpagesize(int fd)
     return getpagesize();
 }
 
+static size_t size_with_guard_pages(size_t size, size_t align)
+{
+    return size + getpagesize();
+}
+
 void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
 {
     /*
      * Note: this always allocates at least one extra page of virtual address
      * space, even if size is already aligned.
      */
-    size_t total = size + align;
+    size_t final_size = size_with_guard_pages(size, align);
+    size_t total = final_size + (align - getpagesize());
 #if defined(__powerpc64__) && defined(__linux__)
     /* On ppc64 mappings in the same segment (aka slice) must share the same
      * page size. Since we will be re-allocating part of this segment
@@ -91,11 +97,14 @@ void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
     }
 
     /*
-     * Leave a single PROT_NONE page allocated after the RAM block, to serve as
-     * a guard page guarding against potential buffer overflows.
+     * "final_size" includes a customized number of PROT_NONE pages allocated
+     * after the RAM block, to serve as guard pages guarding against potential
+     * buffer overflows. Here we chip off any leftover PROT_NONE pages from the
+     * end: those that we had to allocate initially for guaranteeing the
+     * alignment, but no longer need now.
      */
-    if (total > size + getpagesize()) {
-        munmap(ptr + size + getpagesize(), total - size - getpagesize());
+    if (total > final_size) {
+        munmap(ptr + final_size, total - final_size);
     }
 
     return ptr;
@@ -105,6 +114,6 @@ void qemu_ram_munmap(void *ptr, size_t size, size_t align)
 {
     if (ptr) {
         /* Unmap both the RAM block and the guard page */
-        munmap(ptr, size + getpagesize());
+        munmap(ptr, size_with_guard_pages(size, align));
     }
 }
-- 
1.8.3.1


[-- Attachment #4: 0003-util-mmap-alloc-preserve-size-alignment-with-guard-p.patch --]
[-- Type: text/x-patch, Size: 1626 bytes --]

From 8e7cd9425417189f5fc894039a8af956ca2e19dd Mon Sep 17 00:00:00 2001
From: Laszlo Ersek <lersek@redhat.com>
Date: Thu, 21 Apr 2016 22:19:16 +0200
Subject: [PATCH 3/3] util/mmap-alloc: preserve size alignment with guard pages
 on ARM

Commit 9fac18f03a90 ("oslib: allocate PROT_NONE pages on top of RAM")
introduced a guard page after the user-requested area.

(Commit 794e8f301a17 ("exec: factor out duplicate mmap code") factored out
this logic, preserving the behavior of 9fac18f03a90.)

Christoffer reports that 9fac18f03a90 renders the KVM/ARM performance
optimization added in 2e07b297e0b4 ("oslib-posix: Align to permit
transparent hugepages on ARM Linux") ineffective, because the single guard
page makes the size of the region unaligned, preventing the application of
THP.

Restore 2e07b297e0b4 to working state by aligning the full area size --
consisting of user requested and guard pages -- on ARM.

Ref: http://thread.gmane.org/gmane.comp.emulators.qemu/407833
Reported-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
---
 util/mmap-alloc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/util/mmap-alloc.c b/util/mmap-alloc.c
index 41e36f74d7be..153a586cec63 100644
--- a/util/mmap-alloc.c
+++ b/util/mmap-alloc.c
@@ -41,7 +41,11 @@ size_t qemu_fd_getpagesize(int fd)
 
 static size_t size_with_guard_pages(size_t size, size_t align)
 {
+#if defined(__arm__)
+    return QEMU_ALIGN_UP(size + getpagesize(), align);
+#else
     return size + getpagesize();
+#endif
 }
 
 void *qemu_ram_mmap(int fd, size_t size, size_t align, bool shared)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-21 19:50 ` Alexander Graf
@ 2016-04-22 10:01   ` Christoffer Dall
  2016-04-22 10:06     ` Alexander Graf
  0 siblings, 1 reply; 12+ messages in thread
From: Christoffer Dall @ 2016-04-22 10:01 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Michael S. Tsirkin, qemu-devel, Marc Zyngier

On Thu, Apr 21, 2016 at 09:50:05PM +0200, Alexander Graf wrote:
> 
> 
> On 21.04.16 18:23, Christoffer Dall wrote:
> > Hi,
> > 
> > Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
> > 2015-09-10) had the unfortunate side effect that memory slots registered
> > with KVM no longer contain a userspace address that is aligned to a 2M
> > boundary, causing the use of THP to fail in the kernel.
> > 
> > I fail to see where in the QEMU code we should be asking for a 2M
> > alignment of our memory region.  Can someone help point me to the
> > right place to fix this or suggest a patch?
> > 
> > This causes a performance regression of hackbench on KVM/ARM of about 62%
> > compared to the workload running with THP.
> > 
> > We have verified that this is indeed the cause of the failure by adding
> > various prints to QEMU and the kernel, but unfortunately my QEMU
> > knowledge is not sufficient for me to fix it myself.
> > 
> > Any help would be much appreciated!
> 
> The code changed quite heavily since I last looked at it, but could you
> please try whether the (untested) patch below makes a difference?
> 
> 
Unfortunately this doesn't make any difference.  It feels to me like
we're missing specifying a 2M alignment in QEMU somewhere, but I can't
properly understand the links between the actual allocation, registering
mem slots with the KVM part of QEMU, and actually setting up KVM user
memory regions.

What has to happen is that the resulting struct kvm_userspace_memory_region
has the same offset from a 2M boundary (the huge page size) in its
->guest_phys_addr and ->userspace_addr fields.
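
In other words (a rough sketch of the condition, not the exact kernel
code), THP backing is only possible when something like the following
holds:

#include <stdbool.h>
#include <stdint.h>

#define PMD_SIZE_2M (2UL * 1024 * 1024)

/* The host virtual address and the guest physical address must share the
 * same offset within a 2M region, otherwise the stage-2 table cannot use
 * a 2M block mapping for the slot. */
bool memslot_can_use_thp(uint64_t userspace_addr, uint64_t guest_phys_addr)
{
    return ((userspace_addr ^ guest_phys_addr) & (PMD_SIZE_2M - 1)) == 0;
}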

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-21 21:58 ` Laszlo Ersek
@ 2016-04-22 10:02   ` Christoffer Dall
  0 siblings, 0 replies; 12+ messages in thread
From: Christoffer Dall @ 2016-04-22 10:02 UTC (permalink / raw)
  To: Laszlo Ersek
  Cc: Peter Maydell, Michael S. Tsirkin, qemu-devel, Marc Zyngier,
	Alexander Graf

Hi Laszlo,

On Thu, Apr 21, 2016 at 11:58:07PM +0200, Laszlo Ersek wrote:
> On 04/21/16 18:23, Christoffer Dall wrote:
> > Hi,
> > 
> > Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
> > 2015-09-10) had the unfortunate side effect that memory slots registered
> > with KVM no longer contain a userspace address that is aligned to a 2M
> > boundary, causing the use of THP to fail in the kernel.
> > 
> > I fail to see where in the QEMU code we should be asking for a 2M
> > alignment of our memory region.  Can someone help point me to the
> > right place to fix this or suggest a patch?
> > 
> > This causes a performance regression of hackbench on KVM/ARM of about 62%
> > compared to the workload running with THP.
> > 
> > We have verified that this is indeed the cause of the failure by adding
> > various prints to QEMU and the kernel, but unfortunately my QEMU
> > knowledge is not sufficient for me to fix it myself.
> > 
> > Any help would be much appreciated!
> 
> Can you please test the attached series?
> 
> (Note that I'm only interested in solving this problem as a productive
> distraction, so if the patches don't work, or require a lot of massaging
> for merging, I'll just drop them (or, preferably, give them to someone
> else).)
> 

I like your procrastination methods!

Unfortunately this fix wasn't the right one either.

-Christoffer

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-22 10:01   ` Christoffer Dall
@ 2016-04-22 10:06     ` Alexander Graf
  2016-04-22 10:15       ` Christoffer Dall
  0 siblings, 1 reply; 12+ messages in thread
From: Alexander Graf @ 2016-04-22 10:06 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Peter Maydell, Michael S. Tsirkin, qemu-devel, Marc Zyngier,
	Paolo Bonzini

On 04/22/2016 12:01 PM, Christoffer Dall wrote:
> On Thu, Apr 21, 2016 at 09:50:05PM +0200, Alexander Graf wrote:
>>
>> On 21.04.16 18:23, Christoffer Dall wrote:
>>> Hi,
>>>
>>> Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
>>> 2015-09-10) had the unfortunate side effect that memory slots registered
>>> with KVM no longer contain a userspace address that is aligned to a 2M
>>> boundary, causing the use of THP to fail in the kernel.
>>>
>>> I fail to see where in the QEMU code we should be asking for a 2M
>>> alignment of our memory region.  Can someone help point me to the
>>> right place to fix this or suggest a patch?
>>>
>>> This causes a performance regression of hackbench on KVM/ARM of about 62%
>>> compared to the workload running with THP.
>>>
>>> We have verified that this is indeed the cause of the failure by adding
>>> various prints to QEMU and the kernel, but unfortunately my QEMU
>>> knowledge is not sufficient for me to fix it myself.
>>>
>>> Any help would be much appreciated!
>> The code changed quite heavily since I last looked at it, but could you
>> please try whether the (untested) patch below makes a difference?
>>
>>
> Unfortunately this doesn't make any difference.  It feels to me like
> we're missing specifying a 2M alignment in QEMU somewhere, but I can't
> properly understand the links between the actual allocation, registering
> mem slots with the KVM part of QEMU, and actually setting up KVM user
> memory regions.
>
> What has to happen is that the resulting struct kvm_userspace_memory_region
> has the same offset from a 2M boundary (the huge page size) in its
> ->guest_phys_addr and ->userspace_addr fields.

Well, I would expect that the guest address space is always aligned to a 
very large boundary - and definitely to at least 2MB - so we're safe there.

That means we only need to align the qemu virtual address. There used to 
be a memalign() call for that, but it got replaced with direct mmap() 
and then a lot of code changed on top. Looking at the logs, I'm sure 
Paolo knows the answer though :)


Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-22 10:06     ` Alexander Graf
@ 2016-04-22 10:15       ` Christoffer Dall
  2016-04-22 10:17         ` Peter Maydell
  2016-06-13 14:53         ` Auger Eric
  0 siblings, 2 replies; 12+ messages in thread
From: Christoffer Dall @ 2016-04-22 10:15 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Peter Maydell, Michael S. Tsirkin, qemu-devel, Marc Zyngier,
	Paolo Bonzini

On Fri, Apr 22, 2016 at 12:06:52PM +0200, Alexander Graf wrote:
> On 04/22/2016 12:01 PM, Christoffer Dall wrote:
> >On Thu, Apr 21, 2016 at 09:50:05PM +0200, Alexander Graf wrote:
> >>
> >>On 21.04.16 18:23, Christoffer Dall wrote:
> >>>Hi,
> >>>
> >>>Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
> >>>2015-09-10) had the unfortunate side effect that memory slots registered
> >>>with KVM no longer contain a userspace address that is aligned to a 2M
> >>>boundary, causing the use of THP to fail in the kernel.
> >>>
> >>>I fail to see where in the QEMU code we should be asking for a 2M
> >>>alignment of our memory region.  Can someone help point me to the
> >>>right place to fix this or suggest a patch?
> >>>
> >>>This causes a performance regression of hackbench on KVM/ARM of about 62%
> >>>compared to the workload running with THP.
> >>>
> >>>We have verified that this is indeed the cause of the failure by adding
> >>>various prints to QEMU and the kernel, but unfortunately my QEMU
> >>>knowledge is not sufficient for me to fix it myself.
> >>>
> >>>Any help would be much appreciated!
> >>The code changed quite heavily since I last looked at it, but could you
> >>please try whether the (untested) patch below makes a difference?
> >>
> >>
> >Unfortunately this doesn't make any difference.  It feels to me like
> >we're missing specifying a 2M alignment in QEMU somewhere, but I can't
> >properly understand the links between the actual allocation, registering
> >mem slots with the KVM part of QEMU, and actually setting up KVM user
> >memory regions.
> >
> >What has to happen is that the resulting struct kvm_userspace_memory_region
> >has the same offset from a 2M boundary (the huge page size) in its
> >->guest_phys_addr and ->userspace_addr fields.
> 
> Well, I would expect that the guest address space is always aligned to a
> very large boundary - and definitely to at least 2MB - so we're safe there.
> 
> That means we only need to align the qemu virtual address. There
> used to be a memalign() call for that, but it got replaced with
> direct mmap() and then a lot of code changed on top. Looking at the
> logs, I'm sure Paolo knows the answer though :)
> 
Peter just pointed me to a change I remember doing for ARM, so perhaps
this fix is the right one?


diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index d25f671..a36e734 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -35,7 +35,7 @@
 extern int daemon(int, int);
 #endif
 
-#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
+#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
    /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
       Valgrind does not support alignments larger than 1 MiB,
       therefore we need special code which handles running on Valgrind. */


Thanks,
-Christoffer

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-22 10:15       ` Christoffer Dall
@ 2016-04-22 10:17         ` Peter Maydell
  2016-04-22 10:26           ` Christoffer Dall
  2016-06-13 14:53         ` Auger Eric
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2016-04-22 10:17 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Alexander Graf, Michael S. Tsirkin, QEMU Developers,
	Marc Zyngier, Paolo Bonzini

On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> Peter just pointed me to a change I remember doing for ARM, so perhaps
> this fix is the right one?
>
>
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index d25f671..a36e734 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -35,7 +35,7 @@
>  extern int daemon(int, int);
>  #endif
>
> -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
>     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
>        Valgrind does not support alignments larger than 1 MiB,
>        therefore we need special code which handles running on Valgrind. */

I hadn't realised AArch64 didn't define __arm__. Your extra clause
wants to be inside the parens for the ||, not outside.
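
(Concretely, the corrected condition would presumably read

#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__) || defined(__aarch64__))

so that the 2 MiB alignment case also covers AArch64 hosts.)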

So was the problem just that we weren't passing 2MB as the align
parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
right thing ?

thanks
-- PMM

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-22 10:17         ` Peter Maydell
@ 2016-04-22 10:26           ` Christoffer Dall
  2016-04-22 11:16             ` Andrew Jones
  0 siblings, 1 reply; 12+ messages in thread
From: Christoffer Dall @ 2016-04-22 10:26 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Alexander Graf, Michael S. Tsirkin, QEMU Developers,
	Marc Zyngier, Paolo Bonzini

On Fri, Apr 22, 2016 at 11:17:47AM +0100, Peter Maydell wrote:
> On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > Peter just pointed me to a change I remember doing for ARM, so perhaps
> > this fix is the right one?
> >
> >
> > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > index d25f671..a36e734 100644
> > --- a/util/oslib-posix.c
> > +++ b/util/oslib-posix.c
> > @@ -35,7 +35,7 @@
> >  extern int daemon(int, int);
> >  #endif
> >
> > -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> > +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
> >     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
> >        Valgrind does not support alignments larger than 1 MiB,
> >        therefore we need special code which handles running on Valgrind. */
> 
> I hadn't realised AArch64 didn't define __arm__. Your extra clause
> wants to be inside the parens for the ||, not outside.
> 
> So was the problem just that we weren't passing 2MB as the align
> parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
> right thing ?
> 
Yes, that was essentially the problem.  I'll send a proper patch.

However, Marc pointed out that we should probably think about improving
on this in the future, because if you're running on a 64K page system
and want huge pages, you have to align at a higher boundary; on the
other hand, I'm not sure such a boundary is always practical on a 64K
system.  This suggests that we either need to:

1) Probe the kernel for the page size and always align to something that
allows huge pages, or

2) Let the user specify an option that says it wants to be able to use
THP and only then align to the huge page boundary.

Not sure...
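
A sketch of option (1) could look like the helper below; it assumes the
sysfs file named here is available and falls back to the base page size
otherwise (this is not existing QEMU code):

#include <stdio.h>
#include <unistd.h>

long host_thp_size(void)
{
    long size = getpagesize();   /* fallback when no THP size is advertised */
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

    if (f) {
        long val;

        if (fscanf(f, "%ld", &val) == 1 && val > 0) {
            size = val;
        }
        fclose(f);
    }
    return size;
}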

-Christoffer

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-22 10:26           ` Christoffer Dall
@ 2016-04-22 11:16             ` Andrew Jones
  2016-04-22 11:24               ` Alexander Graf
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Jones @ 2016-04-22 11:16 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Peter Maydell, QEMU Developers, Marc Zyngier, Paolo Bonzini,
	Alexander Graf, Michael S. Tsirkin

On Fri, Apr 22, 2016 at 12:26:52PM +0200, Christoffer Dall wrote:
> On Fri, Apr 22, 2016 at 11:17:47AM +0100, Peter Maydell wrote:
> > On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > > Peter just pointed me to a change I remember doing for ARM, so perhaps
> > > this fix is the right one?
> > >
> > >
> > > diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> > > index d25f671..a36e734 100644
> > > --- a/util/oslib-posix.c
> > > +++ b/util/oslib-posix.c
> > > @@ -35,7 +35,7 @@
> > >  extern int daemon(int, int);
> > >  #endif
> > >
> > > -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> > > +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
> > >     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
> > >        Valgrind does not support alignments larger than 1 MiB,
> > >        therefore we need special code which handles running on Valgrind. */
> > 
> > I hadn't realised AArch64 didn't define __arm__. Your extra clause
> > wants to be inside the parens for the ||, not outside.
> > 
> > So was the problem just that we weren't passing 2MB as the align
> > parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
> > right thing ?
> > 
> Yes, that was essentially the problem.  I'll send a proper patch.
> 
> However, Marc pointed out that we should probably think about improving
> on this in the future, because if you're running on a 64K page system
> and want huge pages, you have to align at a higher boundary; on the
> other hand, I'm not sure such a boundary is always practical on a 64K
> system.  This suggests that we either need to:
> 
> 1) Probe the kernel for the page size and always align to something that
> allows huge pages, or
> 
> 2) Let the user specify an option that says it wants to be able to use
> THP and only then align to the huge page boundary.
> 
> Not sure...

Probably (1), otherwise the T in THP becomes t, i.e. the use of THP
gets slightly less transparent. Though I agree that using THP on 64k
page systems isn't always a good choice, I think, in those cases,
that users would likely just disable THP for the whole system.

drew

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-22 11:16             ` Andrew Jones
@ 2016-04-22 11:24               ` Alexander Graf
  0 siblings, 0 replies; 12+ messages in thread
From: Alexander Graf @ 2016-04-22 11:24 UTC (permalink / raw)
  To: Andrew Jones, Christoffer Dall
  Cc: Peter Maydell, QEMU Developers, Marc Zyngier, Paolo Bonzini,
	Michael S. Tsirkin

On 04/22/2016 01:16 PM, Andrew Jones wrote:
> On Fri, Apr 22, 2016 at 12:26:52PM +0200, Christoffer Dall wrote:
>> On Fri, Apr 22, 2016 at 11:17:47AM +0100, Peter Maydell wrote:
>>> On 22 April 2016 at 11:15, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>> Peter just pointed me to a change I remember doing for ARM, so perhaps
>>>> this fix is the right one?
>>>>
>>>>
>>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>>> index d25f671..a36e734 100644
>>>> --- a/util/oslib-posix.c
>>>> +++ b/util/oslib-posix.c
>>>> @@ -35,7 +35,7 @@
>>>>   extern int daemon(int, int);
>>>>   #endif
>>>>
>>>> -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
>>>> +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
>>>>      /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
>>>>         Valgrind does not support alignments larger than 1 MiB,
>>>>         therefore we need special code which handles running on Valgrind. */
>>> I hadn't realised AArch64 didn't define __arm__. Your extra clause
>>> wants to be inside the parens for the ||, not outside.
>>>
>>> So was the problem just that we weren't passing 2MB as the align
>>> parameter to qemu_ram_mmap(), and if we do pass 2MB then it does the
>>> right thing ?
>>>
>> Yes, that was essentially the problem.  I'll send a proper patch.
>>
>> However, Marc pointed out that we should probably think about improving
>> on this in the future, because if you're running on a 64K page system
>> and want huge pages, you have to align at a higher boundary; on the
>> other hand, I'm not sure such a boundary is always practical on a 64K
>> system.  This suggests that we either need to:
>>
>> 1) Probe the kernel for the page size and always align to something that
>> allows huge pages, or
>>
>> 2) Let the user specify an option that says it wants to be able to use
>> THP and only then align to the huge page boundary.
>>
>> Not sure...
> Probably (1), otherwise the T in THP becomes t, i.e. the use of THP
> gets slightly less transparent. Though I agree that using THP on 64k
> page systems isn't always a good choice, I think, in those cases,
> that users would likely just disable THP for the whole system.

I'd say that THP on 64k systems really only makes sense with support for 
combined pages that allows you to use 2MB huge pages on a 64k PAGE_SIZE 
system. So in that case aligning to 2MB is fine again.


Alex

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Performance regression using KVM/ARM
  2016-04-22 10:15       ` Christoffer Dall
  2016-04-22 10:17         ` Peter Maydell
@ 2016-06-13 14:53         ` Auger Eric
  1 sibling, 0 replies; 12+ messages in thread
From: Auger Eric @ 2016-06-13 14:53 UTC (permalink / raw)
  To: Christoffer Dall, Alexander Graf
  Cc: Marc Zyngier, Peter Maydell, Paolo Bonzini, qemu-devel,
	Michael S. Tsirkin, Alex Williamson

Hi,

Le 22/04/2016 à 12:15, Christoffer Dall a écrit :
> On Fri, Apr 22, 2016 at 12:06:52PM +0200, Alexander Graf wrote:
>> On 04/22/2016 12:01 PM, Christoffer Dall wrote:
>>> On Thu, Apr 21, 2016 at 09:50:05PM +0200, Alexander Graf wrote:
>>>>
>>>> On 21.04.16 18:23, Christoffer Dall wrote:
>>>>> Hi,
>>>>>
>>>>> Commit 9fac18f (oslib: allocate PROT_NONE pages on top of RAM,
>>>>> 2015-09-10) had the unfortunate side effect that memory slots registered
>>>>> with KVM no longer contain a userspace address that is aligned to a 2M
>>>>> boundary, causing the use of THP to fail in the kernel.
>>>>>
>>>>> I fail to see where in the QEMU code we should be asking for a 2M
>>>>> alignment of our memory region.  Can someone help point me to the
>>>>> right place to fix this or suggest a patch?
>>>>>
>>>>> This causes a performance regression of hackbench on KVM/ARM of about 62%
>>>>> compared to the workload running with THP.
>>>>>
>>>>> We have verified that this is indeed the cause of the failure by adding
>>>>> various prints to QEMU and the kernel, but unfortunately my QEMU
>>>>> knowledge is not sufficient for me to fix it myself.
>>>>>
>>>>> Any help would be much appreciated!
>>>> The code changed quite heavily since I last looked at it, but could you
>>>> please try whether the (untested) patch below makes a difference?
>>>>
>>>>
>>> Unfortunately this doesn't make any difference.  It feels to me like
>>> we're missing specifying a 2M alignment in QEMU somewhere, but I can't
>>> properly understand the links between the actual allocation, registering
>>> mem slots with the KVM part of QEMU, and actually setting up KVM user
>>> memory regions.
>>>
>>> What has to happen is that the resulting struct kvm_userspace_memory_region
>>> has the same offset from a 2M boundary (the huge page size) in its
>>> ->guest_phys_addr and ->userspace_addr fields.
>>
>> Well, I would expect that the guest address space is always aligned to a
>> very large boundary - and definitely to at least 2MB - so we're safe there.
>>
>> That means we only need to align the qemu virtual address. There
>> used to be a memalign() call for that, but it got replaced with
>> direct mmap() and then a lot of code changed on top. Looking at the
>> logs, I'm sure Paolo knows the answer though :)
>>
> Peter just pointed me to a change I remember doing for ARM, so perhaps
> this fix is the right one?
As shared with Christoffer, the patch below also affects assignment of
multiple VFIO platform devices (found via a bisect). I guess this relates
to the GPA allocation on the platform bus, but I need to investigate
further. I plan to work on a QEMU fix this week.

Thanks

Eric
> 
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index d25f671..a36e734 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -35,7 +35,7 @@
>  extern int daemon(int, int);
>  #endif
>  
> -#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__))
> +#if defined(__linux__) && (defined(__x86_64__) || defined(__arm__)) || defined(__aarch64__)
>     /* Use 2 MiB alignment so transparent hugepages can be used by KVM.
>        Valgrind does not support alignments larger than 1 MiB,
>        therefore we need special code which handles running on Valgrind. */
> 
> 
> Thanks,
> -Christoffer
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

