All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] page table bugfix for s390/kvm
@ 2015-04-23 11:08 Christian Borntraeger
  2015-04-23 11:08 ` [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM Christian Borntraeger
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Borntraeger @ 2015-04-23 11:08 UTC (permalink / raw)
  To: Paolo Bonzini, Alexander Graf
  Cc: KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky,
	Christian Borntraeger

Paolo, Alex,

we plan to submit this fixup via Martins s390 tree (as it is all in
common s390 memory management).

It fixes a fundamental design bug in our page table handling. Some
background: Normal page tables are 2kb. For KVM we need a special page
table extension that creates another 2k after the page table (pgste).
As there are some workloads which have a high page table footprint
(e.g. data  bases with thousands of processes on shared memory), we
want to miminize the impact of the page table extensions to just KVM
processes. Now: our approach of replacing the page table on CREATE_VM
or ENABLE_SIE has a fundamental race to code that gets page table
pointers or ptl locks without holding the pmd lock or page table lock.
So here is another approach: Have a sysctl (with a KCONFIG default)
that decides if we need 4k page tables or 2k page tables.

KVM then needs this sysctl to be set, otherwise CREATE_VM will
return EINVAL.


Martin Schwidefsky (1):
  KVM: s390: remove delayed reallocation of page tables for KVM

 arch/s390/include/asm/mmu.h         |   4 +-
 arch/s390/include/asm/mmu_context.h |   3 +-
 arch/s390/include/asm/pgalloc.h     |   1 +
 arch/s390/include/asm/pgtable.h     |   9 +++
 arch/s390/kvm/Kconfig               |  16 ++++
 arch/s390/mm/pgtable.c              | 142 +++++++++++-------------------------
 6 files changed, 74 insertions(+), 101 deletions(-)

-- 
2.3.0


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 11:08 [PATCH] page table bugfix for s390/kvm Christian Borntraeger
@ 2015-04-23 11:08 ` Christian Borntraeger
  2015-04-23 11:37   ` Alexander Graf
  0 siblings, 1 reply; 15+ messages in thread
From: Christian Borntraeger @ 2015-04-23 11:08 UTC (permalink / raw)
  To: Paolo Bonzini, Alexander Graf
  Cc: KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky,
	Christian Borntraeger

From: Martin Schwidefsky <schwidefsky@de.ibm.com>

Replacing a 2K page table with a 4K page table while a VMA is active
for the affected memory region is fundamentally broken. Rip out the
page table reallocation code and replace it with a simple system
control 'vm.allocate_pgste'. If the system control is set the page
tables for all processes are allocated as full 4K pages, even for
processes that do not need it.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
---
 arch/s390/include/asm/mmu.h         |   4 +-
 arch/s390/include/asm/mmu_context.h |   3 +-
 arch/s390/include/asm/pgalloc.h     |   1 +
 arch/s390/include/asm/pgtable.h     |   9 +++
 arch/s390/kvm/Kconfig               |  16 ++++
 arch/s390/mm/pgtable.c              | 142 +++++++++++-------------------------
 6 files changed, 74 insertions(+), 101 deletions(-)

diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index a5e6562..d29ad95 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -14,7 +14,9 @@ typedef struct {
 	unsigned long asce_bits;
 	unsigned long asce_limit;
 	unsigned long vdso_base;
-	/* The mmu context has extended page tables. */
+	/* The mmu context allocates 4K page tables. */
+	unsigned int alloc_pgste:1;
+	/* The mmu context uses extended page tables. */
 	unsigned int has_pgste:1;
 	/* The mmu context uses storage keys. */
 	unsigned int use_skey:1;
diff --git a/arch/s390/include/asm/mmu_context.h b/arch/s390/include/asm/mmu_context.h
index 8fb3802..8b91128 100644
--- a/arch/s390/include/asm/mmu_context.h
+++ b/arch/s390/include/asm/mmu_context.h
@@ -21,9 +21,10 @@ static inline int init_new_context(struct task_struct *tsk,
 	mm->context.asce_bits = _ASCE_TABLE_LENGTH | _ASCE_USER_BITS;
 #ifdef CONFIG_64BIT
 	mm->context.asce_bits |= _ASCE_TYPE_REGION3;
-#endif
+	mm->context.alloc_pgste = page_table_allocate_pgste;
 	mm->context.has_pgste = 0;
 	mm->context.use_skey = 0;
+#endif
 	mm->context.asce_limit = STACK_TOP_MAX;
 	crst_table_init((unsigned long *) mm->pgd, pgd_entry_type(mm));
 	return 0;
diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 3009c2b..e0c5834 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -21,6 +21,7 @@ void crst_table_free(struct mm_struct *, unsigned long *);
 unsigned long *page_table_alloc(struct mm_struct *);
 void page_table_free(struct mm_struct *, unsigned long *);
 void page_table_free_rcu(struct mmu_gather *, unsigned long *, unsigned long);
+extern int page_table_allocate_pgste;
 
 int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
 			  unsigned long key, bool nq);
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index e08ec38..4aaea1d 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -492,6 +492,15 @@ static inline int mm_has_pgste(struct mm_struct *mm)
 	return 0;
 }
 
+static inline int mm_alloc_pgste(struct mm_struct *mm)
+{
+#ifdef CONFIG_PGSTE
+	if (unlikely(mm->context.alloc_pgste))
+		return 1;
+#endif
+	return 0;
+}
+
 /*
  * In the case that a guest uses storage keys
  * faults should no longer be backed by zero pages
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index 5fce52c..031b5db 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -51,6 +51,22 @@ config KVM_S390_UCONTROL
 
 	  If unsure, say N.
 
+config KVM_S390_PGSTE_DEFAULT
+	bool "Allocate page status table extension (PGSTE) by default"
+	depends on KVM
+	---help---
+	  To start a KVM guest the page tables for the process need
+	  to be allocated with the page status table extension.
+
+	  The system control 'vm.allocate_pgste' is queried at process
+	  creation if 4K page tables with the PGSTE are required or if
+	  2K page tables are sufficient.
+
+	  This option sets the default for 'vm.allocate_pgste'. If
+	  you compile a kernel to be used for a KVM host, say Y.
+
+	  If unsure, say N.
+
 # OK, it's a little counter-intuitive to do this, but it puts it neatly under
 # the virtualization menu.
 source drivers/vhost/Kconfig
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index b2c1542..d0612d6 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -18,6 +18,7 @@
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
 #include <linux/swapops.h>
+#include <linux/sysctl.h>
 #include <linux/ksm.h>
 #include <linux/mman.h>
 
@@ -928,6 +929,40 @@ unsigned long get_guest_storage_key(struct mm_struct *mm, unsigned long addr)
 }
 EXPORT_SYMBOL(get_guest_storage_key);
 
+static int page_table_allocate_pgste_min = 0;
+static int page_table_allocate_pgste_max = 1;
+int page_table_allocate_pgste = IS_ENABLED(CONFIG_KVM_S390_PGSTE_DEFAULT);
+EXPORT_SYMBOL(page_table_allocate_pgste);
+
+static struct ctl_table page_table_sysctl[] = {
+	{
+		.procname	= "allocate_pgste",
+		.data		= &page_table_allocate_pgste,
+		.maxlen		= sizeof(int),
+		.mode		= S_IRUGO | S_IWUSR,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &page_table_allocate_pgste_min,
+		.extra2		= &page_table_allocate_pgste_max,
+	},
+	{ }
+};
+
+static struct ctl_table page_table_sysctl_dir[] = {
+	{
+		.procname	= "vm",
+		.maxlen		= 0,
+		.mode		= 0555,
+		.child		= page_table_sysctl,
+	},
+	{ }
+};
+
+static int __init page_table_register_sysctl(void)
+{
+	return register_sysctl_table(page_table_sysctl_dir) ? 0 : -ENOMEM;
+}
+__initcall(page_table_register_sysctl);
+
 #else /* CONFIG_PGSTE */
 
 static inline int page_table_with_pgste(struct page *page)
@@ -971,7 +1006,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 	struct page *uninitialized_var(page);
 	unsigned int mask, bit;
 
-	if (mm_has_pgste(mm))
+	if (mm_alloc_pgste(mm))
 		return page_table_alloc_pgste(mm);
 	/* Allocate fragments of a 4K page as 1K/2K page table */
 	spin_lock_bh(&mm->context.list_lock);
@@ -1173,116 +1208,25 @@ static inline void thp_split_mm(struct mm_struct *mm)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static unsigned long page_table_realloc_pmd(struct mmu_gather *tlb,
-				struct mm_struct *mm, pud_t *pud,
-				unsigned long addr, unsigned long end)
-{
-	unsigned long next, *table, *new;
-	struct page *page;
-	spinlock_t *ptl;
-	pmd_t *pmd;
-
-	pmd = pmd_offset(pud, addr);
-	do {
-		next = pmd_addr_end(addr, end);
-again:
-		if (pmd_none_or_clear_bad(pmd))
-			continue;
-		table = (unsigned long *) pmd_deref(*pmd);
-		page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
-		if (page_table_with_pgste(page))
-			continue;
-		/* Allocate new page table with pgstes */
-		new = page_table_alloc_pgste(mm);
-		if (!new)
-			return -ENOMEM;
-
-		ptl = pmd_lock(mm, pmd);
-		if (likely((unsigned long *) pmd_deref(*pmd) == table)) {
-			/* Nuke pmd entry pointing to the "short" page table */
-			pmdp_flush_lazy(mm, addr, pmd);
-			pmd_clear(pmd);
-			/* Copy ptes from old table to new table */
-			memcpy(new, table, PAGE_SIZE/2);
-			clear_table(table, _PAGE_INVALID, PAGE_SIZE/2);
-			/* Establish new table */
-			pmd_populate(mm, pmd, (pte_t *) new);
-			/* Free old table with rcu, there might be a walker! */
-			page_table_free_rcu(tlb, table, addr);
-			new = NULL;
-		}
-		spin_unlock(ptl);
-		if (new) {
-			page_table_free_pgste(new);
-			goto again;
-		}
-	} while (pmd++, addr = next, addr != end);
-
-	return addr;
-}
-
-static unsigned long page_table_realloc_pud(struct mmu_gather *tlb,
-				   struct mm_struct *mm, pgd_t *pgd,
-				   unsigned long addr, unsigned long end)
-{
-	unsigned long next;
-	pud_t *pud;
-
-	pud = pud_offset(pgd, addr);
-	do {
-		next = pud_addr_end(addr, end);
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		next = page_table_realloc_pmd(tlb, mm, pud, addr, next);
-		if (unlikely(IS_ERR_VALUE(next)))
-			return next;
-	} while (pud++, addr = next, addr != end);
-
-	return addr;
-}
-
-static unsigned long page_table_realloc(struct mmu_gather *tlb, struct mm_struct *mm,
-					unsigned long addr, unsigned long end)
-{
-	unsigned long next;
-	pgd_t *pgd;
-
-	pgd = pgd_offset(mm, addr);
-	do {
-		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		next = page_table_realloc_pud(tlb, mm, pgd, addr, next);
-		if (unlikely(IS_ERR_VALUE(next)))
-			return next;
-	} while (pgd++, addr = next, addr != end);
-
-	return 0;
-}
-
 /*
  * switch on pgstes for its userspace process (for kvm)
  */
 int s390_enable_sie(void)
 {
-	struct task_struct *tsk = current;
-	struct mm_struct *mm = tsk->mm;
-	struct mmu_gather tlb;
+	struct mm_struct *mm = current->mm;
 
 	/* Do we have pgstes? if yes, we are done */
-	if (mm_has_pgste(tsk->mm))
+	if (mm_has_pgste(mm))
 		return 0;
-
+	/* Fail if the page tables are 2K */
+	if (!mm_alloc_pgste(mm))
+		return -EINVAL;
 	down_write(&mm->mmap_sem);
+	mm->context.has_pgste = 1;
 	/* split thp mappings and disable thp for future mappings */
 	thp_split_mm(mm);
-	/* Reallocate the page tables with pgstes */
-	tlb_gather_mmu(&tlb, mm, 0, TASK_SIZE);
-	if (!page_table_realloc(&tlb, mm, 0, TASK_SIZE))
-		mm->context.has_pgste = 1;
-	tlb_finish_mmu(&tlb, 0, TASK_SIZE);
 	up_write(&mm->mmap_sem);
-	return mm->context.has_pgste ? 0 : -ENOMEM;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(s390_enable_sie);
 
-- 
2.3.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 11:08 ` [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM Christian Borntraeger
@ 2015-04-23 11:37   ` Alexander Graf
  2015-04-23 11:43     ` Christian Borntraeger
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2015-04-23 11:37 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky



> Am 23.04.2015 um 13:08 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
> 
> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
> 
> Replacing a 2K page table with a 4K page table while a VMA is active
> for the affected memory region is fundamentally broken. Rip out the
> page table reallocation code and replace it with a simple system
> control 'vm.allocate_pgste'. If the system control is set the page
> tables for all processes are allocated as full 4K pages, even for
> processes that do not need it.
> 
> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>

Couldn't you make this a hidden kconfig option that gets automatically selected when kvm is enabled? Or is there a non-kvm case that needs it too?


Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 11:37   ` Alexander Graf
@ 2015-04-23 11:43     ` Christian Borntraeger
  2015-04-23 12:01       ` Alexander Graf
  2015-04-23 12:07       ` Paolo Bonzini
  0 siblings, 2 replies; 15+ messages in thread
From: Christian Borntraeger @ 2015-04-23 11:43 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky

Am 23.04.2015 um 13:37 schrieb Alexander Graf:
> 
> 
>> Am 23.04.2015 um 13:08 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
>>
>> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>
>> Replacing a 2K page table with a 4K page table while a VMA is active
>> for the affected memory region is fundamentally broken. Rip out the
>> page table reallocation code and replace it with a simple system
>> control 'vm.allocate_pgste'. If the system control is set the page
>> tables for all processes are allocated as full 4K pages, even for
>> processes that do not need it.
>>
>> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
> 
> Couldn't you make this a hidden kconfig option that gets automatically selected when kvm is enabled? Or is there a non-kvm case that needs it too?

For things like RHEV the default could certainly be "enabled", but for normal
distros like SLES/RHEL, the idea was to NOT enable that by default, as the non-KVM
case is more common and might suffer from the additional memory consumption of
the page tables. (big databases come to mind)

We could think about having rpms like kvm to provide a sysctl file that sets it if we
want to minimize the impact. Other ideas?

Christian


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 11:43     ` Christian Borntraeger
@ 2015-04-23 12:01       ` Alexander Graf
  2015-04-23 12:08         ` Christian Borntraeger
  2015-04-23 12:13         ` Martin Schwidefsky
  2015-04-23 12:07       ` Paolo Bonzini
  1 sibling, 2 replies; 15+ messages in thread
From: Alexander Graf @ 2015-04-23 12:01 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky



> Am 23.04.2015 um 13:43 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
> 
>> Am 23.04.2015 um 13:37 schrieb Alexander Graf:
>> 
>> 
>>> Am 23.04.2015 um 13:08 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
>>> 
>>> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>> 
>>> Replacing a 2K page table with a 4K page table while a VMA is active
>>> for the affected memory region is fundamentally broken. Rip out the
>>> page table reallocation code and replace it with a simple system
>>> control 'vm.allocate_pgste'. If the system control is set the page
>>> tables for all processes are allocated as full 4K pages, even for
>>> processes that do not need it.
>>> 
>>> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>> 
>> Couldn't you make this a hidden kconfig option that gets automatically selected when kvm is enabled? Or is there a non-kvm case that needs it too?
> 
> For things like RHEV the default could certainly be "enabled", but for normal
> distros like SLES/RHEL, the idea was to NOT enable that by default, as the non-KVM
> case is more common and might suffer from the additional memory consumption of
> the page tables. (big databases come to mind)
> 
> We could think about having rpms like kvm to provide a sysctl file that sets it if we
> want to minimize the impact. Other ideas?

Oh, I'm sorry, I misread the ifdef. I don't think it makes sense to have a config option for the default value then, just rely only on sysctl.conf for changed defaults.

As far as mechanisms to change it go, every distribution has their own ways of dealing with this. RH has a "profile" thing, we don't really have anything central, but individual sysctl.d files for example that a kvm package could provide.

Either way, the default choosing shouldn't happen in .config ;). Also, please add some helpful error message in qemu to guide users to the sysctl.

As far as alternative approaches go, I don't have a great idea otoh. We could have an elf flag indicating that this process needs 4k page tables to limit the impact to a single process. In fact, could we maybe still limit the scope to non-global? A personality may work as well. Or ulimit?


Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 11:43     ` Christian Borntraeger
  2015-04-23 12:01       ` Alexander Graf
@ 2015-04-23 12:07       ` Paolo Bonzini
  2015-04-23 13:57         ` Cole Robinson
  1 sibling, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2015-04-23 12:07 UTC (permalink / raw)
  To: Christian Borntraeger, Alexander Graf
  Cc: KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky, Cole Robinson



On 23/04/2015 13:43, Christian Borntraeger wrote:
>> > 
>> > Couldn't you make this a hidden kconfig option that gets automatically selected when kvm is enabled? Or is there a non-kvm case that needs it too?
> For things like RHEV the default could certainly be "enabled", but for normal
> distros like SLES/RHEL, the idea was to NOT enable that by default, as the non-KVM
> case is more common and might suffer from the additional memory consumption of
> the page tables. (big databases come to mind)
> 
> We could think about having rpms like kvm to provide a sysctl file that sets it if we
> want to minimize the impact. Other ideas?

I can say what _won't_ work which is tying it to the KVM module.
Nowadays it is loaded automatically via udev on the first /dev/kvm
access, and that's already too late because qemu-kvm's page tables have
been created already.  Right?

With my Fedora hat on, adding a sysctl file to the userspace RPMs (e.g.
qemu) would work.  CCing Cole Robinson who is the main maintainer of the
Fedora virt packages.

Paolo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 12:01       ` Alexander Graf
@ 2015-04-23 12:08         ` Christian Borntraeger
  2015-04-27 13:49           ` Alexander Graf
  2015-04-23 12:13         ` Martin Schwidefsky
  1 sibling, 1 reply; 15+ messages in thread
From: Christian Borntraeger @ 2015-04-23 12:08 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky

Am 23.04.2015 um 14:01 schrieb Alexander Graf:
> 
> 
>> Am 23.04.2015 um 13:43 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
>>
>>> Am 23.04.2015 um 13:37 schrieb Alexander Graf:
>>>
>>>
>>>> Am 23.04.2015 um 13:08 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
>>>>
>>>> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>>>
>>>> Replacing a 2K page table with a 4K page table while a VMA is active
>>>> for the affected memory region is fundamentally broken. Rip out the
>>>> page table reallocation code and replace it with a simple system
>>>> control 'vm.allocate_pgste'. If the system control is set the page
>>>> tables for all processes are allocated as full 4K pages, even for
>>>> processes that do not need it.
>>>>
>>>> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>
>>> Couldn't you make this a hidden kconfig option that gets automatically selected when kvm is enabled? Or is there a non-kvm case that needs it too?
>>
>> For things like RHEV the default could certainly be "enabled", but for normal
>> distros like SLES/RHEL, the idea was to NOT enable that by default, as the non-KVM
>> case is more common and might suffer from the additional memory consumption of
>> the page tables. (big databases come to mind)
>>
>> We could think about having rpms like kvm to provide a sysctl file that sets it if we
>> want to minimize the impact. Other ideas?
> 
> Oh, I'm sorry, I misread the ifdef. I don't think it makes sense to have a config option for the default value then, just rely only on sysctl.conf for changed defaults.
> 
> As far as mechanisms to change it go, every distribution has their own ways of dealing with this. RH has a "profile" thing, we don't really have anything central, but individual sysctl.d files for example that a kvm package could provide.
> Either way, the default choosing shouldn't happen in .config ;).

So you vote for getting rid of the Kconfig?

Also, please add some helpful error message in qemu to guide users to the sysctl.

Yes, we will provide a qemu patch (cc stable) after this hits the kernel.

> As far as alternative approaches go, I don't have a great idea otoh. We could have an elf flag indicating that this process needs 4k page tables to limit the impact to a single process.

This approach was actually Martins first fix. The problem is that the decision takes place on execve,
but we need an answer at fork time. So we always started with 4k page tables and freed the 2nd halv on
execve. Now this did not work for processes that only fork (without execve).

> In fact, could we maybe still limit the scope to non-global? A personality may work as well. Or ulimit?

I think we will go for now with the sysctl and see if we can come up with some automatic way as additional
patch later on.

Christian




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 12:01       ` Alexander Graf
  2015-04-23 12:08         ` Christian Borntraeger
@ 2015-04-23 12:13         ` Martin Schwidefsky
  2015-04-27 13:48           ` Alexander Graf
  1 sibling, 1 reply; 15+ messages in thread
From: Martin Schwidefsky @ 2015-04-23 12:13 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Christian Borntraeger, Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann

On Thu, 23 Apr 2015 14:01:23 +0200
Alexander Graf <agraf@suse.de> wrote:

> As far as alternative approaches go, I don't have a great idea otoh.
> We could have an elf flag indicating that this process needs 4k page
> tables to limit the impact to a single process. In fact, could we
> maybe still limit the scope to non-global? A personality may work
> as well. Or ulimit?

I tried the ELF flag approach, does not work. The trouble is that
allocate_mm() has to create the page tables with 4K tables if you
want to change the page table layout later on. We have learned the
hard way that the direction 2K to 4K does not work due to races
in the mm.

Now there are two major cases: 1) fork + execve and 2) fork only.
The ELF flag can be used to reduce from 4K to 2K for 1) but not 2).
2) is required for apps that use lots of forking, e.g. database or
web servers. Same goes for the approach with a personality flag or
ulimit.

We would have to distinguish the two cases for allocate_mm(),
if the new mm is allocated for a fork the current mm decides
2K vs. 4K. If the new mm is allocated by binfmt_elf, then start
with 4K and do the downgrade after the ELF flag has been evaluated.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 12:07       ` Paolo Bonzini
@ 2015-04-23 13:57         ` Cole Robinson
  0 siblings, 0 replies; 15+ messages in thread
From: Cole Robinson @ 2015-04-23 13:57 UTC (permalink / raw)
  To: Paolo Bonzini, Christian Borntraeger, Alexander Graf
  Cc: KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky

On 04/23/2015 08:07 AM, Paolo Bonzini wrote:
> 
> 
> On 23/04/2015 13:43, Christian Borntraeger wrote:
>>>>
>>>> Couldn't you make this a hidden kconfig option that gets automatically selected when kvm is enabled? Or is there a non-kvm case that needs it too?
>> For things like RHEV the default could certainly be "enabled", but for normal
>> distros like SLES/RHEL, the idea was to NOT enable that by default, as the non-KVM
>> case is more common and might suffer from the additional memory consumption of
>> the page tables. (big databases come to mind)
>>
>> We could think about having rpms like kvm to provide a sysctl file that sets it if we
>> want to minimize the impact. Other ideas?
> 
> I can say what _won't_ work which is tying it to the KVM module.
> Nowadays it is loaded automatically via udev on the first /dev/kvm
> access, and that's already too late because qemu-kvm's page tables have
> been created already.  Right?
> 
> With my Fedora hat on, adding a sysctl file to the userspace RPMs (e.g.
> qemu) would work.  CCing Cole Robinson who is the main maintainer of the
> Fedora virt packages.
> 

>From a packaging POV that sounds fine to me

- Cole

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 12:13         ` Martin Schwidefsky
@ 2015-04-27 13:48           ` Alexander Graf
  2015-04-27 13:52             ` Paolo Bonzini
  2015-04-27 13:57             ` Martin Schwidefsky
  0 siblings, 2 replies; 15+ messages in thread
From: Alexander Graf @ 2015-04-27 13:48 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christian Borntraeger, Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann

On 04/23/2015 02:13 PM, Martin Schwidefsky wrote:
> On Thu, 23 Apr 2015 14:01:23 +0200
> Alexander Graf <agraf@suse.de> wrote:
>
>> As far as alternative approaches go, I don't have a great idea otoh.
>> We could have an elf flag indicating that this process needs 4k page
>> tables to limit the impact to a single process. In fact, could we
>> maybe still limit the scope to non-global? A personality may work
>> as well. Or ulimit?
> I tried the ELF flag approach, does not work. The trouble is that
> allocate_mm() has to create the page tables with 4K tables if you
> want to change the page table layout later on. We have learned the
> hard way that the direction 2K to 4K does not work due to races
> in the mm.
>
> Now there are two major cases: 1) fork + execve and 2) fork only.
> The ELF flag can be used to reduce from 4K to 2K for 1) but not 2).
> 2) is required for apps that use lots of forking, e.g. database or
> web servers. Same goes for the approach with a personality flag or
> ulimit.
>
> We would have to distinguish the two cases for allocate_mm(),
> if the new mm is allocated for a fork the current mm decides
> 2K vs. 4K. If the new mm is allocated by binfmt_elf, then start
> with 4K and do the downgrade after the ELF flag has been evaluated.

Well, you could also make it a personality flag for example, no? Then 
every new process below a certain one always gets 4k page tables until 
they drop the personality, at which point each child would only get 2k 
page tables again.

I'm mostly concerned that people will end up mixing VMs and other 
workloads on the same LPAR, so I don't think there's a one-shoe-fits-all 
solution.


Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-23 12:08         ` Christian Borntraeger
@ 2015-04-27 13:49           ` Alexander Graf
  0 siblings, 0 replies; 15+ messages in thread
From: Alexander Graf @ 2015-04-27 13:49 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann, Martin Schwidefsky

On 04/23/2015 02:08 PM, Christian Borntraeger wrote:
> Am 23.04.2015 um 14:01 schrieb Alexander Graf:
>>
>>> Am 23.04.2015 um 13:43 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
>>>
>>>> Am 23.04.2015 um 13:37 schrieb Alexander Graf:
>>>>
>>>>
>>>>> Am 23.04.2015 um 13:08 schrieb Christian Borntraeger <borntraeger@de.ibm.com>:
>>>>>
>>>>> From: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>>>>
>>>>> Replacing a 2K page table with a 4K page table while a VMA is active
>>>>> for the affected memory region is fundamentally broken. Rip out the
>>>>> page table reallocation code and replace it with a simple system
>>>>> control 'vm.allocate_pgste'. If the system control is set the page
>>>>> tables for all processes are allocated as full 4K pages, even for
>>>>> processes that do not need it.
>>>>>
>>>>> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
>>>>> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>> Couldn't you make this a hidden kconfig option that gets automatically selected when kvm is enabled? Or is there a non-kvm case that needs it too?
>>> For things like RHEV the default could certainly be "enabled", but for normal
>>> distros like SLES/RHEL, the idea was to NOT enable that by default, as the non-KVM
>>> case is more common and might suffer from the additional memory consumption of
>>> the page tables. (big databases come to mind)
>>>
>>> We could think about having rpms like kvm to provide a sysctl file that sets it if we
>>> want to minimize the impact. Other ideas?
>> Oh, I'm sorry, I misread the ifdef. I don't think it makes sense to have a config option for the default value then, just rely only on sysctl.conf for changed defaults.
>>
>> As far as mechanisms to change it go, every distribution has their own ways of dealing with this. RH has a "profile" thing, we don't really have anything central, but individual sysctl.d files for example that a kvm package could provide.
>> Either way, the default choosing shouldn't happen in .config ;).
> So you vote for getting rid of the Kconfig?
>
> Also, please add some helpful error message in qemu to guide users to the sysctl.
>
> Yes, we will provide a qemu patch (cc stable) after this hits the kernel.
>
>> As far as alternative approaches go, I don't have a great idea otoh. We could have an elf flag indicating that this process needs 4k page tables to limit the impact to a single process.
> This approach was actually Martins first fix. The problem is that the decision takes place on execve,
> but we need an answer at fork time. So we always started with 4k page tables and freed the 2nd halv on
> execve. Now this did not work for processes that only fork (without execve).
>
>> In fact, could we maybe still limit the scope to non-global? A personality may work as well. Or ulimit?
> I think we will go for now with the sysctl and see if we can come up with some automatic way as additional
> patch later on.

Sounds perfectly reasonable to me. You can for example also just set the 
sysctl bit in libvirtd :).


Alex


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-27 13:48           ` Alexander Graf
@ 2015-04-27 13:52             ` Paolo Bonzini
  2015-04-27 13:57             ` Martin Schwidefsky
  1 sibling, 0 replies; 15+ messages in thread
From: Paolo Bonzini @ 2015-04-27 13:52 UTC (permalink / raw)
  To: Alexander Graf, Martin Schwidefsky
  Cc: Christian Borntraeger, KVM, Cornelia Huck, Jens Freimann



On 27/04/2015 15:48, Alexander Graf wrote:
> Well, you could also make it a personality flag for example, no? Then
> every new process below a certain one always gets 4k page tables until
> they drop the personality, at which point each child would only get 2k
> page tables again.

But if you have to make (say) bash get the personality, this means that
you need a full userspace recompile to run VMs.

Paolo

> I'm mostly concerned that people will end up mixing VMs and other
> workloads on the same LPAR, so I don't think there's a one-shoe-fits-all
> solution.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-27 13:48           ` Alexander Graf
  2015-04-27 13:52             ` Paolo Bonzini
@ 2015-04-27 13:57             ` Martin Schwidefsky
  2015-04-27 14:03               ` Alexander Graf
  1 sibling, 1 reply; 15+ messages in thread
From: Martin Schwidefsky @ 2015-04-27 13:57 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Christian Borntraeger, Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann

On Mon, 27 Apr 2015 15:48:42 +0200
Alexander Graf <agraf@suse.de> wrote:

> On 04/23/2015 02:13 PM, Martin Schwidefsky wrote:
> > On Thu, 23 Apr 2015 14:01:23 +0200
> > Alexander Graf <agraf@suse.de> wrote:
> >
> >> As far as alternative approaches go, I don't have a great idea otoh.
> >> We could have an elf flag indicating that this process needs 4k page
> >> tables to limit the impact to a single process. In fact, could we
> >> maybe still limit the scope to non-global? A personality may work
> >> as well. Or ulimit?
> > I tried the ELF flag approach, does not work. The trouble is that
> > allocate_mm() has to create the page tables with 4K tables if you
> > want to change the page table layout later on. We have learned the
> > hard way that the direction 2K to 4K does not work due to races
> > in the mm.
> >
> > Now there are two major cases: 1) fork + execve and 2) fork only.
> > The ELF flag can be used to reduce from 4K to 2K for 1) but not 2).
> > 2) is required for apps that use lots of forking, e.g. database or
> > web servers. Same goes for the approach with a personality flag or
> > ulimit.
> >
> > We would have to distinguish the two cases for allocate_mm(),
> > if the new mm is allocated for a fork the current mm decides
> > 2K vs. 4K. If the new mm is allocated by binfmt_elf, then start
> > with 4K and do the downgrade after the ELF flag has been evaluated.
> 
> Well, you could also make it a personality flag for example, no? Then 
> every new process below a certain one always gets 4k page tables until 
> they drop the personality, at which point each child would only get 2k 
> page tables again.
> 
> I'm mostly concerned that people will end up mixing VMs and other 
> workloads on the same LPAR, so I don't think there's a one-shoe-fits-all 
> solution.

If I add an argument to mm_init() to indicate if this context
is for fork() or execve() then the ELF header flag approach works.

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-27 13:57             ` Martin Schwidefsky
@ 2015-04-27 14:03               ` Alexander Graf
  2015-04-27 14:08                 ` Christian Borntraeger
  0 siblings, 1 reply; 15+ messages in thread
From: Alexander Graf @ 2015-04-27 14:03 UTC (permalink / raw)
  To: Martin Schwidefsky
  Cc: Christian Borntraeger, Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann

On 04/27/2015 03:57 PM, Martin Schwidefsky wrote:
> On Mon, 27 Apr 2015 15:48:42 +0200
> Alexander Graf <agraf@suse.de> wrote:
>
>> On 04/23/2015 02:13 PM, Martin Schwidefsky wrote:
>>> On Thu, 23 Apr 2015 14:01:23 +0200
>>> Alexander Graf <agraf@suse.de> wrote:
>>>
>>>> As far as alternative approaches go, I don't have a great idea otoh.
>>>> We could have an elf flag indicating that this process needs 4k page
>>>> tables to limit the impact to a single process. In fact, could we
>>>> maybe still limit the scope to non-global? A personality may work
>>>> as well. Or ulimit?
>>> I tried the ELF flag approach, does not work. The trouble is that
>>> allocate_mm() has to create the page tables with 4K tables if you
>>> want to change the page table layout later on. We have learned the
>>> hard way that the direction 2K to 4K does not work due to races
>>> in the mm.
>>>
>>> Now there are two major cases: 1) fork + execve and 2) fork only.
>>> The ELF flag can be used to reduce from 4K to 2K for 1) but not 2).
>>> 2) is required for apps that use lots of forking, e.g. database or
>>> web servers. Same goes for the approach with a personality flag or
>>> ulimit.
>>>
>>> We would have to distinguish the two cases for allocate_mm(),
>>> if the new mm is allocated for a fork the current mm decides
>>> 2K vs. 4K. If the new mm is allocated by binfmt_elf, then start
>>> with 4K and do the downgrade after the ELF flag has been evaluated.
>> Well, you could also make it a personality flag for example, no? Then
>> every new process below a certain one always gets 4k page tables until
>> they drop the personality, at which point each child would only get 2k
>> page tables again.
>>
>> I'm mostly concerned that people will end up mixing VMs and other
>> workloads on the same LPAR, so I don't think there's a one-shoe-fits-all
>> solution.
> If I add an argument to mm_init() to indicate if this context
> is for fork() or execve() then the ELF header flag approach works.

So you don't need the sysctl?


Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM
  2015-04-27 14:03               ` Alexander Graf
@ 2015-04-27 14:08                 ` Christian Borntraeger
  0 siblings, 0 replies; 15+ messages in thread
From: Christian Borntraeger @ 2015-04-27 14:08 UTC (permalink / raw)
  To: Alexander Graf, Martin Schwidefsky
  Cc: Paolo Bonzini, KVM, Cornelia Huck, Jens Freimann

Am 27.04.2015 um 16:03 schrieb Alexander Graf:
> On 04/27/2015 03:57 PM, Martin Schwidefsky wrote:
>> On Mon, 27 Apr 2015 15:48:42 +0200
>> Alexander Graf <agraf@suse.de> wrote:
>>
>>> On 04/23/2015 02:13 PM, Martin Schwidefsky wrote:
>>>> On Thu, 23 Apr 2015 14:01:23 +0200
>>>> Alexander Graf <agraf@suse.de> wrote:
>>>>
>>>>> As far as alternative approaches go, I don't have a great idea otoh.
>>>>> We could have an elf flag indicating that this process needs 4k page
>>>>> tables to limit the impact to a single process. In fact, could we
>>>>> maybe still limit the scope to non-global? A personality may work
>>>>> as well. Or ulimit?
>>>> I tried the ELF flag approach, does not work. The trouble is that
>>>> allocate_mm() has to create the page tables with 4K tables if you
>>>> want to change the page table layout later on. We have learned the
>>>> hard way that the direction 2K to 4K does not work due to races
>>>> in the mm.
>>>>
>>>> Now there are two major cases: 1) fork + execve and 2) fork only.
>>>> The ELF flag can be used to reduce from 4K to 2K for 1) but not 2).
>>>> 2) is required for apps that use lots of forking, e.g. database or
>>>> web servers. Same goes for the approach with a personality flag or
>>>> ulimit.
>>>>
>>>> We would have to distinguish the two cases for allocate_mm(),
>>>> if the new mm is allocated for a fork the current mm decides
>>>> 2K vs. 4K. If the new mm is allocated by binfmt_elf, then start
>>>> with 4K and do the downgrade after the ELF flag has been evaluated.
>>> Well, you could also make it a personality flag for example, no? Then
>>> every new process below a certain one always gets 4k page tables until
>>> they drop the personality, at which point each child would only get 2k
>>> page tables again.
>>>
>>> I'm mostly concerned that people will end up mixing VMs and other
>>> workloads on the same LPAR, so I don't think there's a one-shoe-fits-all
>>> solution.
>> If I add an argument to mm_init() to indicate if this context
>> is for fork() or execve() then the ELF header flag approach works.
> 
> So you don't need the sysctl?

It would not be enough to enable old userspaces that do not have the
ELF header flag. So we need both to enable old userspace - new kernel.

Christian


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-04-27 14:08 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-23 11:08 [PATCH] page table bugfix for s390/kvm Christian Borntraeger
2015-04-23 11:08 ` [PATCH] KVM: s390: remove delayed reallocation of page tables for KVM Christian Borntraeger
2015-04-23 11:37   ` Alexander Graf
2015-04-23 11:43     ` Christian Borntraeger
2015-04-23 12:01       ` Alexander Graf
2015-04-23 12:08         ` Christian Borntraeger
2015-04-27 13:49           ` Alexander Graf
2015-04-23 12:13         ` Martin Schwidefsky
2015-04-27 13:48           ` Alexander Graf
2015-04-27 13:52             ` Paolo Bonzini
2015-04-27 13:57             ` Martin Schwidefsky
2015-04-27 14:03               ` Alexander Graf
2015-04-27 14:08                 ` Christian Borntraeger
2015-04-23 12:07       ` Paolo Bonzini
2015-04-23 13:57         ` Cole Robinson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.