From mboxrd@z Thu Jan 1 00:00:00 1970
From: Khalid Aziz
Subject: Re: [RFC PATCH v9 12/13] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)
Date: Thu, 4 Apr 2019 16:55:48 -0600
Message-ID: <91f1dbce-332e-25d1-15f6-0e9cfc8b797b@oracle.com>
References: <4495dda4bfc4a06b3312cc4063915b306ecfaecb.1554248002.git.khalid.aziz@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Language: en-US
Sender: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Errors-To: iommu-bounces-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
To: Andy Lutomirski
Cc: Catalin Marinas, Will Deacon, steven.sistare-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
 Christoph Hellwig, Tycho Andersen, Andi Kleen,
 aneesh.kumar-tEXmvtCZX7AybS5Ee8rs3A@public.gmane.org, James Morris,
 David Rientjes, anthony.yznaga-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
 Rik van Riel, Nicholas Piggin, Mike Rapoport, Thomas Gleixner, Greg KH,
 Randy Dunlap, LKML, Souptick Joarder, Jiri Kosina, Joe Perches,
 arunks-sgV2jX0FEOL9JmXXK+q4OQ@public.gmane.org, Andrew Morton,
 "Woodhouse, David", Mark Rutland, "open list:DOCUMENTATION"
List-Id: iommu@lists.linux-foundation.org

On 4/3/19 10:10 PM, Andy Lutomirski wrote:
> On Wed, Apr 3, 2019 at 10:36 AM Khalid Aziz wrote:
>>
>> XPFO flushes kernel space TLB entries for pages that are now mapped
>> in userspace, not only on the current CPU but on all other CPUs as
>> well, synchronously. Processes on each core allocating pages cause a
>> flood of IPI messages to all other cores to flush TLB entries.
>> Many of these messages flush the entire TLB on a core, whenever the
>> number of entries being flushed from the local core exceeds
>> tlb_single_page_flush_ceiling.
>> The cost of the TLB flushes caused by
>> unmapping pages from physmap goes up dramatically on machines with
>> high core counts.
>>
>> This patch flushes the relevant TLB entries for the current process,
>> or the entire TLB, depending upon the number of entries for the
>> current CPU, and posts a pending TLB flush on all other CPUs when a
>> page is unmapped from kernel space and mapped in userspace. Each core
>> checks its pending TLB flush flag on every context switch, flushes
>> its TLB if the flag is set, and clears the flag.
>> This patch potentially aggregates multiple TLB flushes into one.
>> This has a very significant impact, especially on machines with large
>> core counts.
>
> Why is this a reasonable strategy?

Ideally, when pages are unmapped from physmap, all CPUs would be sent an IPI synchronously to flush the TLB entries for those pages immediately. That may be ideal from a correctness and consistency point of view, but it also results in an IPI storm and repeated TLB flushes on all processors. Any time a page is allocated to userspace, we go through this, and it is very expensive. On a 96-core server, the performance degradation is 26x!!

When xpfo unmaps a page from physmap only (after mapping the page in userspace in response to an allocation request from userspace) on one processor, there is a small window of opportunity for a ret2dir attack on the other cpus until the stale physmap TLB entries for the unmapped page on those cpus are cleared. Forcing that to happen synchronously is the expensive part. Many of these requests can come in over a very short time across multiple processors, resulting in every cpu asking every other cpu to flush its TLB just to close this small window of vulnerability in the kernel. If each request is processed synchronously, each CPU ends up doing multiple TLB flushes in short order. If we instead consolidate these TLB flush requests and do one TLB flush on each cpu at the time of context switch, we can reduce the performance impact significantly.
This bears out in real life when measuring the system time of a parallel kernel build on a large server. Without this batching, system time on a 96-core server doing "make -j60 all" went up 26x. With this optimization, the impact went down to 1.44x.

The trade-off with this strategy is that the kernel on a cpu is vulnerable for a short time if the currently running process is malicious. Is that an acceptable trade-off?

I am open to other ideas on reducing the performance impact due to xpfo.

>
>> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
>> +{
>> +	struct cpumask tmp_mask;
>> +
>> +	/*
>> +	 * Balance as user space task's flush, a bit conservative.
>> +	 * Do a local flush immediately and post a pending flush on all
>> +	 * other CPUs. Local flush can be a range flush or full flush
>> +	 * depending upon the number of entries to be flushed. Remote
>> +	 * flushes will be done by individual processors at the time of
>> +	 * context switch and this allows multiple flush requests from
>> +	 * other CPUs to be batched together.
>> +	 */
>
> I don't like this function at all. A core function like this is a
> contract of sorts between the caller and the implementation. There is
> no such thing as an "xpfo" flush, and this function's behavior isn't
> at all well defined. For flush_tlb_kernel_range(), I can tell you
> exactly what that function does, and the implementation is either
> correct or incorrect. With this function, I have no idea what is
> actually required, and I can't possibly tell whether it's correct.
>
> As far as I can see, xpfo_flush_tlb_kernel_range() actually means
> "flush this range on this CPU right now, and flush it on remote CPUs
> eventually". It would be valid, but probably silly, to flush locally
> and to never flush at all on remote CPUs. This makes me wonder what
> the point is.
>

I would restate that as "flush this range on this cpu right now, and flush it on remote cpus at the next context switch".
A better name for the routine and a better description are reasonable changes to make.

Thanks,
Khalid