Subject: Re: [PATCH RFC 0/2] KVM: s390: avoid having to enable vm.alloc_pgste
From: David Hildenbrand <david@redhat.com>
To: Martin Schwidefsky
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Heiko Carstens,
    Thomas Huth, Christian Borntraeger
Date: Thu, 1 Jun 2017 13:27:28 +0200
In-Reply-To: <20170601124651.3e7969ab@mschwideX1>
References: <20170529163202.13077-1-david@redhat.com>
    <20170601124651.3e7969ab@mschwideX1>

Hi Martin,

thanks for having a look at this!

>> However, we
>> might be able to let 2K and 4K page tables co-exist. We only need
>> 4K page tables whenever we want to expose such memory to a guest. So
>> turning on 4K page table allocation at one point and only allowing such
>> memory to go into our gmap (guest mapping) might be a solution.
>> User space tools like QEMU that create the VM before mmap-ing any memory
>> that will belong to the guest can simply use the new VM type. Proper 4K
>> page tables will be created for any memory mmap-ed afterwards. And these
>> can be used in the gmap without problems. Existing user space tools
>> will work as before, i.e. they have to enable vm.alloc_pgste explicitly.
>
> I cannot say that I like this approach. Right now a process either uses
> 2K page tables or 4K page tables. With your patch it is basically per
> page table page. Memory areas that existed before the switch to
> allocating 4K page tables cannot be mapped into the guest's gmap
> anymore. There might be hidden pitfalls, e.g. with guest migration.

As long as QEMU knows what it is doing, I don't see problems with
migration: even for migration, all memory is mmap-ed after the VM has
been created. But yes, we have to think about this carefully. (That's
why QEMU has to switch this behavior on explicitly, so we don't break
existing user space that mmaps its memory differently.)

>> This should play fine with vSIE, as the vSIE code works completely on
>> the gmap. So if only page tables with pgstes go into our gmap, we
>> should be fine.
>>
>> Not sure if this breaks important concepts, has some serious
>> performance problems, or I am missing important cases. If so, I guess
>> there is really no way to avoid setting vm.alloc_pgste.
>>
>> Possible modifications:
>> - Enable this option via an ioctl (like KVM_S390_ENABLE_SIE) instead
>>   of a new VM type
>> - Remember if we have mixed pgtables.
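For that second idea, I am thinking of something along these lines. This
is only a rough sketch, not part of this RFC; the has_mixed_pgtables
flag in mm->context is made up here for illustration:

static inline int pgtable_has_pgste(struct mm_struct *mm, unsigned long addr)
{
	struct page *page;

	/* Ordinary processes: a single bit test, as before. */
	if (!mm_has_pgste(mm))
		return 0;

	/*
	 * Hypothetical flag: no 2K page tables were ever allocated in
	 * this mm, so every page table has pgstes and the per-page
	 * lookup can be skipped.
	 */
	if (!mm->context.has_mixed_pgtables)
		return 1;

	/* Mixed case: check the flag stored in the page table's page. */
	page = pfn_to_page(addr >> PAGE_SHIFT);
	return atomic_read(&page->_mapcount) & 0x4U;
}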
>>   If !mixed, we can maybe make faster decisions (if that is really a
>>   problem).
>
> What I do not like in particular is this function:
>
> static inline int pgtable_has_pgste(struct mm_struct *mm, unsigned long addr)
> {
> 	struct page *page;
>
> 	if (!mm_has_pgste(mm))
> 		return 0;
>
> 	page = pfn_to_page(addr >> PAGE_SHIFT);
> 	return atomic_read(&page->_mapcount) & 0x4U;
> }
>
> The check for pgstes got more complicated: it used to be a
> test-under-mask of a bit in the mm struct and a branch. Now we have an
> additional pfn_to_page, an atomic_read and a bit test. That is done
> multiple times for every ptep_xxx operation.

The "heavy overhead" only applies to processes using KVM; ordinary
processes still pay just a single bit test (the !mm_has_pgste(mm)
check, exactly as before).

Can you judge how much overhead this amounts to in practice?
(Especially: can it actually be noticed?)

We could of course tune the functions to reduce the number of
pgtable_has_pgste() calls, as in the sketch above, but we won't get
below at least one pfn_to_page in the mixed case. And the question is
whether even a single such test degrades performance noticeably.

Unfortunately, the z/VM test system I have is not really a good fit for
performance tests.

> Is the operational simplification of not having to set
> vm.allocate_pgste really that important?

The problem is that once we have a package that installs QEMU, and we
don't want to completely confuse the user (i.e. make them care about
vm.allocate_pgste), we have to set this automatically. See e.g.
https://bugzilla.redhat.com/show_bug.cgi?id=1290589:

a) install a config file that sets it on every boot (as Fedora does)
b) set vm.allocate_pgste=1 right away, so QEMU can be used without a
   reboot

Now, this is possible, but as you know it can implicitly influence
other workloads (double the size of page tables for everything),
especially on systems that are not pure KVM hosts. So by simply
installing the QEMU package, we change system behavior, even if QEMU
will only be used occasionally on that host.

So what we actually want is: PGSTEs only for processes that actually
need them, with the details handled internally. I don't really see too
many ways to do this differently. The more we can hide such "internal
specialties", the better.

E.g. think about guests that are someday based completely on huge
pages: the concept of pgstes does not apply there. Still, pgstes would
have to be enabled for the complete system, and there isn't too much we
can do about it (s390_enable_sie can't know that QEMU will only use
huge pages, so we would need yet another VM type in QEMU to handle
guests that can only map huge pages).

An alternative: have some process that enables PGSTEs for all of its
future children and then fork+execv's QEMU. However, such an
intermediate process will most likely confuse tooling like libvirt when
it comes to process ids.

Can you think of any other way to handle this? Setting
vm.allocate_pgste=1 should really be avoided if possible.

Thanks a lot again!

-- 

Thanks,

David