Re: [PATCH RFC 0/2] KVM: s390: avoid having to enable vm.alloc_pgste

From: Christian Borntraeger <borntraeger@de.ibm.com>
To: Martin Schwidefsky <schwidefsky@de.ibm.com>,
	David Hildenbrand <david@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Thomas Huth <thuth@redhat.com>
Subject: Re: [PATCH RFC 0/2] KVM: s390: avoid having to enable vm.alloc_pgste
Date: Thu, 1 Jun 2017 13:24:26 +0200
Message-ID: <51ccff8c-bb09-3dc4-4d75-bf1b86ca75a9@de.ibm.com>
In-Reply-To: <20170601124651.3e7969ab@mschwideX1>

On 06/01/2017 12:46 PM, Martin Schwidefsky wrote:
> Hi David,
> 
> it is nice to see that you are still working on s390 related topics.
> 
> On Mon, 29 May 2017 18:32:00 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Having to enable vm.alloc_pgste globally might not be the best solution.
>> 4k page tables are created for all processes and running QEMU KVM guests
>> is more complicated than it should be.
> 
> To run KVM guests you need to issue a single sysctl to set vm.allocate_pgste,
> this is the best solution we found so far.

SUSE and Ubuntu seem to ship a sysctl.conf file in the qemu-kvm package that
flips this switch globally.
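
For anyone who wants to check whether that global switch is on, here is a
rough user-space sketch. It is not part of the patch. It assumes the sysctl
is exposed as /proc/sys/vm/allocate_pgste (the name Martin uses above), and
read_pgste_switch() is a name made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed procfs location of the s390 sysctl; absent on other architectures. */
#define ALLOCATE_PGSTE_PATH "/proc/sys/vm/allocate_pgste"

/*
 * Hypothetical helper: returns 1 if the switch is on, 0 if it is off,
 * and -1 if the file cannot be opened or does not contain an integer.
 */
static int read_pgste_switch(const char *path)
{
	FILE *f;
	int value;
	int ok;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = (fscanf(f, "%d", &value) == 1);
	fclose(f);
	if (!ok)
		return -1;

	return value != 0;
}
```

Calling read_pgste_switch(ALLOCATE_PGSTE_PATH) on a non-s390 machine should
simply return -1, since the file does not exist there.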

> 
>> Unfortunately, converting all page tables to 4k pgste page tables is
>> not possible without provoking various race conditions.
> 
> That is one approach we tried, and it was found to be buggy. The point is that
> you are not allowed to reallocate a page table while a VMA exists that is
> in the address range of that page table.
> 
> Another approach we tried is to use an ELF flag on the qemu executable.
> That does not work either because fs/exec.c allocates and populates the
> new mm struct for the argument pages before fs/binfmt_elf.c comes into
> play.
> 
>> However, we
>> might be able to let 2k and 4k page tables co-exist. We only need
>> 4k page tables whenever we want to expose such memory to a guest. So
>> turning on 4k page table allocation at one point and only allowing such
>> memory to go into our gmap (guest mapping) might be a solution.
>> User space tools like QEMU that create the VM before mmap-ing any memory
>> that will belong to the guest can simply use the new VM type. Proper 4k
>> page tables will be created for any memory mmap-ed afterwards. And these
>> can be used in the gmap without problems. Existing user space tools
>> will work as before - having to enable vm.alloc_pgste explicitly.
> 
> I cannot say that I like this approach. Right now a process either uses
> 2K page tables or 4K page tables. With your patch it is basically per page
> table page. Memory areas that existed before the switch to allocate
> 4K page tables cannot be mapped to the guest's gmap anymore. There might
> be hidden pitfalls, e.g. with guest migration.
> 
>> This should play fine with vSIE, as vSIE code works completely on the gmap.
>> So if only page tables with pgste go into our gmap, we should be fine.
>>
>> Not sure if this breaks important concepts, has some serious performance
>> problems or I am missing important cases. If so, I guess there is really
>> no way to avoid setting vm.alloc_pgste.
>>
>> Possible modifications:
>> - Enable this option via an ioctl (like KVM_S390_ENABLE_SIE) instead of
>>   a new VM type
>> - Remember if we have mixed pgtables. If !mixed, we can make maybe faster
>>   decisions (if that is really a problem).
> 
> What I do not like in particular is this function:
> 
> static inline int pgtable_has_pgste(struct mm_struct *mm, unsigned long addr)
> {
> 	struct page *page;
> 
> 	if (!mm_has_pgste(mm))
> 		return 0;
> 
> 	page = pfn_to_page(addr >> PAGE_SHIFT);
> 	return atomic_read(&page->_mapcount) & 0x4U;
> }

The good thing about this approach is that the first condition keeps non-KVM
processes as fast as before. In fact, given that the sysctl is set everywhere,
this patch might actually move non-KVM processes back to 2k page tables, so it
would improve those.
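
To illustrate the fast path I mean, here is a rough user-space model of the
quoted function. It is not kernel code and not taken from David's patch. The
names model_mm, model_pgtable_page and model_slow_lookups are made up for
this sketch; only the 0x4U mask comes from the function quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's real structures. */
struct model_mm {
	bool has_pgste;		/* models the per-mm test behind mm_has_pgste() */
};

struct model_pgtable_page {
	unsigned int mapcount;	/* models page->_mapcount used as a flag word */
};

#define MODEL_PGSTE_BIT 0x4U	/* same mask as in the quoted function */

/* Counts how often the per-page lookup is actually reached. */
static unsigned long model_slow_lookups;

/*
 * Unlike the quoted function, which returns the masked bit itself,
 * this model normalizes the result to 0 or 1.
 */
static int model_pgtable_has_pgste(const struct model_mm *mm,
				   const struct model_pgtable_page *page)
{
	assert(mm != NULL && page != NULL);

	/* Fast path: an mm without pgstes never looks at the page. */
	if (!mm->has_pgste)
		return 0;

	/* Slow path: only an mm that may hold mixed page tables pays for this. */
	model_slow_lookups++;
	return (page->mapcount & MODEL_PGSTE_BIT) != 0;
}
```

In this model a non-KVM mm returns after the first test and leaves
model_slow_lookups untouched; only an mm with has_pgste set goes on to the
per-page check that Martin is concerned about.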

> 
> The check for pgstes got more complicated, it used to be a test-under-mask
> of a bit in the mm struct and a branch. Now we have an additional pfn_to_page,
> an atomic_read and a bit test. That is done multiple times for every ptep_xxx
> operation. 
> 
> Is the operational simplification of not having to set vm.allocate_pgste really
> that important?
>