From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757492AbbKFQEP (ORCPT ); Fri, 6 Nov 2015 11:04:15 -0500
Received: from foss.arm.com ([217.140.101.70]:42700 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1757398AbbKFQEM (ORCPT ); Fri, 6 Nov 2015 11:04:12 -0500
Date: Fri, 6 Nov 2015 16:04:08 +0000
From: Catalin Marinas
To: Arnd Bergmann
Cc: linux-arm-kernel@lists.infradead.org, Linus Torvalds,
	Linux Kernel Mailing List, Will Deacon
Subject: Re: [GIT PULL] arm64 updates for 4.4
Message-ID: <20151106160407.GX7637@e104818-lin.cambridge.arm.com>
References: <20151104182508.GA28726@e104818-lin.cambridge.arm.com>
	<20151105182718.GV7637@e104818-lin.cambridge.arm.com>
	<6512377.hSX37CgtMC@wuerfel>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <6512377.hSX37CgtMC@wuerfel>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Nov 06, 2015 at 10:57:58AM +0100, Arnd Bergmann wrote:
> On Thursday 05 November 2015 18:27:18 Catalin Marinas wrote:
> > On Wed, Nov 04, 2015 at 02:55:01PM -0800, Linus Torvalds wrote:
> > > On Wed, Nov 4, 2015 at 10:25 AM, Catalin Marinas wrote:
> > > It's good for single-process loads - if you do a lot of big fortran
> > > jobs, or a lot of big database loads, and nothing else, you're fine.
> >
> > These are some of the arguments from the server camp: specific
> > workloads.
>
> I think (a little overgeneralized), you want 4KB pages for any file
> based mappings,

In general, yes, but if the main/only workload on your server is mapping
large db files, the memory usage cost may be amortised. For general
purpose stuff like compiling a Linux kernel, I did some tests
(kernbench) and the page cache usage went from ~2.5GB with 4KB pages to
~6.6GB with 64KB pages, so clearly not suitable.
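
To illustrate why the page cache grows with the page size on a workload
made of many small files, here is a minimal sketch of the rounding
involved. The file sizes are made up for illustration and are not taken
from the kernbench run; the helper names are hypothetical. It only
models whole-file caching rounded up to full pages, nothing else the
kernel does.

```python
KB = 1024


def cached_bytes(file_size: int, page_size: int) -> int:
    """Page cache bytes needed to hold a whole file, rounded up to full pages."""
    pages = -(-file_size // page_size)  # ceiling division
    return pages * page_size


def footprint(file_sizes, page_size: int) -> int:
    """Total page cache bytes for a set of fully cached files."""
    return sum(cached_bytes(size, page_size) for size in file_sizes)


# Hypothetical mix of small source files (assumption, not measured data).
sizes = [700, 3 * KB, 5 * KB, 20 * KB, 130 * KB]

small = footprint(sizes, 4 * KB)
large = footprint(sizes, 64 * KB)
print(f"4KB pages:  {small // KB} KB cached")
print(f"64KB pages: {large // KB} KB cached")
print(f"ratio: {large / small:.2f}")
```

The smaller the typical file relative to the page size, the larger the
share of each page that holds no file data.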

Unfortunately I couldn't get any meaningful performance numbers as the
test was done over slow NFS. I'm not recommending 64KB pages but I'm
closely following how they are used and any performance figures.

In terms of TLB, there are two aspects that larger pages try to address
(to the detriment of memory usage):

1. A reduction in TLB misses
2. A reduction in the cost of a TLB miss by having fewer page table
   levels (42-bit VA with 2 levels vs 3 or even 4 with 4KB).

Of course, Linus' point about making the TLB faster is always a good
idea, but even on x86 people are looking to improve things (otherwise we
might not have had THP/hugetlb supported on this architecture).

> but larger (in some cases much larger) for anonymous
> memory. The last time this came up, I theorized about a way to change
> do_anonymous_page() to always operate on 64KB units on a normal
> 4KB page based kernel, and use the ARM64 contiguous page hint
> to get 64KB TLBs for those pages.

We have transparent huge pages for this, though at the much larger 2MB
size. This would also reduce the cost of a TLB miss by walking one
fewer level (point 2 above). I've seen patches for THP on file maps but
I'm not sure what the status is.

As a test, we could fake a 64KB THP by using a dummy PMD that contains
16 PTE entries, just to see how the performance goes. But this would
only address point 1 above.

> This could be done compile-time, system-wide, or per-process if
> desired, and should provide performance as good as the current
> 64KB page kernels for almost any server workloads, and in
> some cases much better than that, as long as the hints are
> actually interpreted by the CPU implementation.

Apart from anonymous mappings, could the file page cache be optimised?
Not all file accesses use mmap() (e.g. gcc compilation seems to do
sequential accesses for the C files the compiler reads), so you don't
always need a full page cache page for a file.
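
Going back to the level counts in point 2 and the 16-entry figure for
the contiguous hint, the arithmetic can be checked with a short sketch.
It assumes 8-byte table descriptors, so a table that fills one page
resolves (page shift - 3) address bits per level. The 39-bit and 48-bit
VA sizes used for the 4KB case are assumptions for illustration; only
the 42-bit figure is stated above. The function name is hypothetical.

```python
def table_levels(va_bits: int, page_shift: int, desc_shift: int = 3) -> int:
    """Translation table levels needed to resolve va_bits of virtual address.

    Assumes each table occupies one page and each descriptor is
    2**desc_shift bytes (8 bytes by default).
    """
    bits_per_level = page_shift - desc_shift  # log2(entries per table)
    index_bits = va_bits - page_shift         # bits left after the page offset
    return -(-index_bits // bits_per_level)   # ceiling division


print(table_levels(42, 16))  # 64KB pages, 42-bit VA -> 2
print(table_levels(39, 12))  # 4KB pages, 39-bit VA (assumed) -> 3
print(table_levels(48, 12))  # 4KB pages, 48-bit VA (assumed) -> 4

# A run of 16 contiguous 4KB PTEs covers one 64KB range.
contig_span = 16 * (1 << 12)
print(contig_span // 1024, "KB")  # -> 64 KB
```

This only counts levels; it says nothing about how a given CPU caches
intermediate table entries, which also affects the cost of a miss.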

We could have a feature to allow sharing of partially filled page cache
pages and only break them up if mmap'ed to user. A less optimal
implementation based on the current kernel infrastructure could be
something like a cleancache driver able to store partially filled page
cache pages more efficiently (together with a more aggressive eviction
of such pages from the page cache into the cleancache).

--
Catalin