Re: [PATCH v3] Optimise TLB flush for kernel mm in UML

From: Anton Ivanov <anton.ivanov@cambridgegreys.com>
To: linux-um@lists.infradead.org
Cc: Richard Weinberger <richard@nod.at>
Subject: Re: [PATCH v3] Optimise TLB flush for kernel mm in UML
Date: Thu, 6 Dec 2018 08:19:02 +0000
Message-ID: <ea57230c-efe3-8c57-092b-83849896a506@cambridgegreys.com>
In-Reply-To: <0603f0a8-c824-b783-7708-fefd5be68f57@kot-begemot.co.uk>

On 12/4/18 11:19 AM, Anton Ivanov wrote:
>
> On 10/6/18 10:15 PM, Richard Weinberger wrote:
>> On Saturday, 6 October 2018, 23:04:08 CEST, Anton Ivanov wrote:
>>> On 06/10/2018 21:38, Richard Weinberger wrote:
>>>> Anton,
>>>>
>>>> On Thursday, 4 October 2018, 19:25:10 CEST,
>>>> anton.ivanov@cambridgegreys.com wrote:
>>>>> From: Anton Ivanov <anton.ivanov@cambridgegreys.com>
>>>>>
>>>>> This patch introduces bulking up memory ranges to be passed to
>>>>> mmap/munmap/mprotect instead of doing everything one page at a time.
>>>>>
>>>>> This is already being done for the userspace UML portion, this
>>>>> adds a simplified version of it for the kernel mm.
>>>>>
>>>>> This results in speed up of up to 10%+ in some areas (sequential
>>>>> disk read measured with dd, etc).
>>>> Nice!
>>>> Do you also have data on how many fewer memory mappings get installed?
>>>
>>> Not proper statistics. I had some debug printks early on and instead of
>>> single pages I was seeing a few hundred Kbytes at a time being mapped in
>>> places. I can try a few trial runs with some debug printks to collect
>>> stats.
>>>
>>>>> Add further speed-up by removing a mandatory tlb force flush
>>>>> for swapless kernel.
>>>> It is also not entirely clear to me why swap is a problem here,
>>>> can you please elaborate?
>>> I asked this question on the list a while back.
>>>
>>> One of the main remaining huge performance bugbears in UML which
>>> accounts for most of its "fame" of being slow is the fact that there is
>>> a full TLB flush every time a fork happens in the UML userspace. It is
>>> also executed with force = 1.
>>>
>>> You pointed me to an old commit from the days svn was being used which
>>> was fixing exactly that by introducing the force parameter.
>>>
>>> I tested force on/off and the condition that commit is trying to cure
>>> still stands. If swap is enabled the tlb flush on fork/exec needs to
>>> have force=1. If, however, there is no swap in the system the force is
>>> not needed. It happily works without it.
>>>
>>> Why - dunno. I do not fully understand some of that code.
>> Okay, I hoped you had figured it out in the meantime.
>> Seems like we need to dig deeper in the history.
>
> I am going to split this into two patches.
>
> The tlb mapping acceleration for the kernel is logically independent 
> from the change which makes force_all "soft" if there is no swap.
>
> While the merging of areas is fairly clear and its advantages are well 
> defined, the second is not something I understand fully.
>
> It looks like it implements the following observation which I cannot 
> judge as valid or invalid out of hand:
>
> "On UML, in the absence of swap the memory map after a fork does not 
> need to be updated. The new process is OK with whatever is already 
> mapped/unmapped as starting point".

I have figured it out. After a fork, the "mappings in" are valid, but the 
"mappings out", which trigger the faults needed for paging, are not 
necessarily so. That is why a "softer" flush works fine without swap 
and fails with swap: pages which have been paged out are not marked 
as such. As a result, when they are accessed, "interesting things" 
happen instead of a page fault.

Thus, the tlb flush after a fork has to ensure that all unmaps and 
pending protection changes have been applied. That is what is actually 
going on here, and there is a possible optimization: mmaps for anything 
besides new pages can be skipped. Skipping munmaps and mprotects in any 
shape or form is not advisable, as it may lead to security/information 
leaks.
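
To make the rule above concrete, here is a minimal sketch in C. It is not 
the code from the patch: the enum, the function name and the page_is_new 
flag are all made up for illustration, and how "new page" is detected is 
left open.

```c
#include <stdbool.h>

/* Hypothetical op kinds, named for this sketch only. */
enum sketch_host_op {
	SKETCH_OP_MMAP,
	SKETCH_OP_MUNMAP,
	SKETCH_OP_MPROTECT
};

/*
 * Decide whether an operation has to be issued during the flush that
 * follows a fork.
 *
 * page_is_new is an assumption of this sketch: it is true when the
 * page has no mapping in the new process yet.
 */
bool sketch_must_apply_after_fork(enum sketch_host_op op, bool page_is_new)
{
	switch (op) {
	case SKETCH_OP_MUNMAP:
	case SKETCH_OP_MPROTECT:
		/* Never skipped: a stale mapping or stale protection
		 * could leak data or hide a fault needed for paging. */
		return true;
	case SKETCH_OP_MMAP:
		/* Existing "mappings in" are still valid after fork,
		 * so only genuinely new pages need an mmap. */
		return page_is_new;
	}
	/* Unknown op: be conservative and apply it. */
	return true;
}
```

In this sketch an mmap for an already mapped page is the only thing that 
gets dropped; everything else goes through unchanged.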

As with all other tlb optimizations, this improves things predominantly 
by decreasing the number of "interruptions" in the memory operation 
sequences executed by do_host_ops().
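
As a rough illustration of why fewer "interruptions" help, the sketch 
below queues operations and merges each new one into the previous one 
when both have the same type and protection and are adjacent in memory. 
This is a simplified stand-in and not the code in tlb.c; the struct 
layout, the queue size and all names are assumptions of the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_OPS 16	/* arbitrary queue size for this sketch */

enum sketch_op_type { SKETCH_MUNMAP, SKETCH_MPROTECT };

struct sketch_op {
	enum sketch_op_type type;
	unsigned long addr;
	unsigned long len;
	int prot;		/* pass 0 for SKETCH_MUNMAP */
};

struct sketch_queue {
	struct sketch_op ops[SKETCH_MAX_OPS];
	size_t count;
};

/*
 * Queue one operation. If it continues the previous one (same type,
 * same prot, starts where the previous one ends), extend the previous
 * op instead of adding a new entry.
 *
 * Returns false when the queue is full; a caller would then issue
 * what is queued and start again.
 */
bool sketch_add_op(struct sketch_queue *q, enum sketch_op_type type,
		   unsigned long addr, unsigned long len, int prot)
{
	if (q->count > 0) {
		struct sketch_op *last = &q->ops[q->count - 1];

		if (last->type == type && last->prot == prot &&
		    last->addr + last->len == addr) {
			last->len += len;	/* merged, no new entry */
			return true;
		}
	}

	if (q->count == SKETCH_MAX_OPS)
		return false;

	q->ops[q->count].type = type;
	q->ops[q->count].addr = addr;
	q->ops[q->count].len = len;
	q->ops[q->count].prot = prot;
	q->count++;
	return true;
}
```

With this, a run of page-sized unmaps over a contiguous range ends up as 
a single entry, so the host sees one call instead of one per page. Any 
operation that breaks the run (different type, different protection, or 
a gap) starts a new entry.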

I will post the patch this afternoon along with benchmarks to 
demonstrate it. So far the boot speed improvement is ~15%. It is stable 
with and without swap and I am now clear on what it does. Once I have 
run my other "heavy" tests, for example recompiling UML itself, I will 
post those results as well.

The dream scenario here would be to somehow have a set of fully valid 
tables after fork, making the flush unnecessary. That would speed up 
UML for day-to-day use by an order of magnitude. There is no solution 
for this in tlb.c though. If this is possible in the first place, the 
way to do it should be somewhere down in the guts of skas, which I do 
not fully understand.

A.

>
> This does not seem to be the case if there are swapped pages and I 
> actually do not understand why.
>
> A.
>
>>
>> Thanks,
>> //richard
>>
>>
>>
>> _______________________________________________
>> linux-um mailing list
>> linux-um@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-um
>>
-- 
Anton R. Ivanov

Cambridge Greys Limited, England and Wales company No 10273661
http://www.cambridgegreys.com/
