From: Christian Borntraeger <borntraeger@de.ibm.com>
To: Martin Schwidefsky <schwidefsky@de.ibm.com>,
	David Hildenbrand <david@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Heiko Carstens <heiko.carstens@de.ibm.com>,
	Thomas Huth <thuth@redhat.com>
Subject: Re: [PATCH RFC 0/2] KVM: s390: avoid having to enable vm.alloc_pgste
Date: Thu, 1 Jun 2017 13:24:26 +0200
Message-ID: <51ccff8c-bb09-3dc4-4d75-bf1b86ca75a9@de.ibm.com>
In-Reply-To: <20170601124651.3e7969ab@mschwideX1>

On 06/01/2017 12:46 PM, Martin Schwidefsky wrote:
> Hi David,
> 
> it is nice to see that you are still working on s390 related topics.
> 
> On Mon, 29 May 2017 18:32:00 +0200
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Having to enable vm.alloc_pgste globally might not be the best solution.
>> 4k page tables are created for all processes and running QEMU KVM guests
>> is more complicated than it should be.
> 
> To run KVM guests you need to issue a single sysctl to set vm.allocate_pgste,
> this is the best solution we found so far.

SUSE and Ubuntu seem to ship a sysctl.conf file in the qemu-kvm package that
sets this switch globally.


> 
>> Unfortunately, converting all page tables to 4k pgste page tables is
>> not possible without provoking various race conditions.
> 
> That is one approach we tried and was found to be buggy. The point is that
> you are not allowed to reallocate a page table while a VMA exists that is
> in the address range of that page table.
> 
> Another approach we tried is to use an ELF flag on the qemu executable.
> That does not work either because fs/exec.c allocates and populates the
> new mm struct for the argument pages before fs/binfmt_elf.c comes into
> play.
> 
>> However, we
>> might be able to let 2k and 4k page tables co-exist. We only need
>> 4k page tables whenever we want to expose such memory to a guest. So
>> turning on 4k page table allocation at one point and only allowing such
>> memory to go into our gmap (guest mapping) might be a solution.
>> User space tools like QEMU that create the VM before mmap-ing any memory
>> that will belong to the guest can simply use the new VM type. Proper 4k
>> page tables will be created for any memory mmap-ed afterwards. And these
>> can be used in the gmap without problems. Existing user space tools
>> will work as before - having to enable vm.alloc_pgste explicitly.
> 
> I can not say that I like this approach. Right now a process either uses
> 2K page tables or 4K page tables. With your patch it is basically per page
> table page. Memory areas that existed before the switch to allocate
> 4K page tables can not be mapped to the guests gmap anymore. There might
> be hidden pitfalls e.g. with guest migration.
> 
>> This should play fine with vSIE, as vSIE code works completely on the gmap.
>> So if only page tables with pgste go into our gmap, we should be fine.
>>
>> Not sure if this breaks important concepts, has some serious performance
>> problems or I am missing important cases. If so, I guess there is really
>> no way to avoid setting vm.alloc_pgste.
>>
>> Possible modifications:
>> - Enable this option via an ioctl (like KVM_S390_ENABLE_SIE) instead of
>>   a new VM type
>> - Remember if we have mixed pgtables. If !mixed, we can make maybe faster
>>   decisions (if that is really a problem).
> 
> What I do not like in particular is this function:
> 
> static inline int pgtable_has_pgste(struct mm_struct *mm, unsigned long addr)
> {
> 	struct page *page;
> 
> 	if (!mm_has_pgste(mm))
> 		return 0;
> 
> 	page = pfn_to_page(addr >> PAGE_SHIFT);
> 	return atomic_read(&page->_mapcount) & 0x4U;
> }

The good thing about this approach is that the first condition keeps non-KVM
processes as fast as before. In fact, given that the sysctl ends up being set
everywhere, this patch might actually move non-KVM processes back to 2k page
tables and thereby improve them.
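
To illustrate the point about the first condition, here is a minimal
user-space sketch. It is not the kernel code: struct model_mm,
struct model_page, MODEL_PGSTE_BIT and model_page_lookups are hypothetical
stand-ins invented for this example. It only models that an mm without pgste
returns before any per-page lookup happens.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm flag tested by mm_has_pgste(mm). */
struct model_mm {
	bool has_pgste;
};

/* Hypothetical stand-in for struct page; mapcount models page->_mapcount. */
struct model_page {
	unsigned int mapcount;
};

/* The quoted function tests bit 0x4 to identify a 4k pgste page table. */
#define MODEL_PGSTE_BIT 0x4U

/* Counts how often the slower per-page check is actually reached. */
static unsigned long model_page_lookups;

/*
 * Models the quoted pgtable_has_pgste(): the per-mm test comes first, so
 * an mm without pgste never touches the page. Unlike the quoted function,
 * the result is normalized to 0 or 1.
 */
static int model_pgtable_has_pgste(const struct model_mm *mm,
				   const struct model_page *page)
{
	if (!mm->has_pgste)
		return 0;

	model_page_lookups++;
	return (page->mapcount & MODEL_PGSTE_BIT) != 0;
}
```

With has_pgste unset, the counter stays at zero no matter what the page
looks like; only an mm that has pgste pays for the extra lookup and bit test.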


> 
> The check for pgstes got more complicated, it used to be a test-under-mask
> of a bit in the mm struct and a branch. Now we have an additional pfn_to_page,
> an atomic_read and a bit test. That is done multiple times for every ptep_xxx
> operation. 
> 
> Is the operational simplification of not having to set vm.allocate_pgste really
> that important ?
> 
