From: William Kucharski
Subject: [LSF/MM TOPIC][LSF/MM ATTEND] Read-only Mapping of Program Text using Large THP Pages
Date: Wed, 20 Feb 2019 04:17:13 -0700
To: lsf-pc@lists.linux-foundation.org, Linux-MM, linux-fsdevel@vger.kernel.org
Message-Id: <379F21DD-006F-4E33-9BD5-F81F9BA75C10@oracle.com>
For the past year or so I have been working on further developing my
original prototype supporting the mapping of read-only program text using
large THP pages. The prototype, described below, is still a work in
progress; the major unsolved issues involve page cache integration and
filesystem support.

At present, the conventional approach of reading a single base page and
relying on readahead to fill in additional pages isn't useful, as the
entire PMD-sized page (in my prototype) needs to be read in before the
page can be mapped. At that point it is unclear whether readahead of
additional PMD-sized pages would be of benefit or too costly.

Additionally, there are no good interfaces at present to tell filesystem
layers that content is desired in chunks larger than a hardcoded limit of
64K, or to read disk blocks in chunks appropriate for PMD-sized pages.

I very briefly discussed some of this work with Kirill in the past, and am
currently somewhat blocked on progress with my prototype due to issues
with multiorder page size support in the radix tree page cache. I don't
feel it is worth the time to debug those issues since the radix tree page
cache is dead, and it's much more useful to help Matthew Wilcox get
multiorder page support for XArray tested and approved upstream.

The following is a backgrounder on the work I have done to date and some
performance numbers.

Since it's just a prototype, I am unsure whether it would make a good
topic for a discussion talk per se, but should I be invited to attend it
could certainly engender a good amount of discussion as a
BOF/cross-discipline topic between the MM and FS tracks.

Thanks,
    William Kucharski

========================================

One of the downsides of THP as currently implemented is that it only
supports large page mappings for anonymous pages.

I embarked upon this prototype on the theory that it would be advantageous
to be able to map large ranges of read-only text pages using THP as well.

The idea is that the kernel will attempt to allocate and map the range
using a PMD-sized THP page upon first fault; if the allocation is
successful, the page will be populated (at present via a call to
kernel_read()) and mapped at the PMD level. If memory allocation fails,
the page fault routines will drop through to the conventional
PAGESIZE-oriented routines to map the faulting page. (A simplified sketch
of this fault path appears below.)

Since this approach will map a PMD-sized block of the memory map at a
time, we should see a slight uptick in time spent in disk I/O but a
substantial drop in page faults, as well as a reduction in iTLB misses as
address ranges will be mapped with the larger page. Analysis of a test
program that consists of a very large text area (483,138,032 bytes) that
thrashes D$ and I$ shows this does occur, along with a slight reduction in
program execution time.
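To make the mechanism above concrete, here is a simplified sketch of the
fault path as I think of it. alloc_pages(), kernel_read(), HPAGE_PMD_ORDER,
and VM_FAULT_FALLBACK are the real kernel interfaces; populate_pmd_page()
and map_pmd_page() are placeholder names of mine, not names from the
actual patch:

    /*
     * Illustrative sketch only -- not the prototype itself.
     * Try to satisfy a fault on read-only text with one PMD-sized
     * (2M on x64) THP; fall back to the normal PAGESIZE path if
     * allocation or I/O fails.
     */
    static vm_fault_t huge_text_fault(struct vm_fault *vmf)
    {
            struct page *page;

            /* Attempt the PMD-order allocation, but don't try too hard. */
            page = alloc_pages(GFP_TRANSHUGE | __GFP_NORETRY, HPAGE_PMD_ORDER);
            if (!page)
                    return VM_FAULT_FALLBACK;       /* take the 4K path */

            /* Fill the entire PMD-sized page; wraps kernel_read(). */
            if (populate_pmd_page(vmf, page)) {
                    __free_pages(page, HPAGE_PMD_ORDER);
                    return VM_FAULT_FALLBACK;
            }

            /* Install a single PMD entry covering the whole range. */
            return map_pmd_page(vmf, page);
    }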
The text segment as seen from readelf:

    LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                   0x000000001ccc19f0 0x000000001ccc19f0  R E    0x200000

As currently implemented for test purposes, the prototype will only use
large pages to map an executable with a particular filename ("testr"),
enabling easy comparison of the same executable using 4K and 2M (x64)
pages on the same kernel. It is understood that this is just a
proof-of-concept implementation and that much more work on enabling the
feature and on overall system use of it would be needed before it could be
submitted as a kernel patch. However, I felt it was worthwhile to send it
out as an RFC to find out whether there are huge objections from the
community to doing this at all, or to gain a better understanding of the
major concerns that must be assuaged before it would even be considered.
I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the equivalent of
"always" and bypass some checks for anonymous pages by simply #ifdefing
the code out; obviously I would need to determine the right thing to do in
those cases.

Current comparisons of 4K vs. 2M pages as generated by
"perf stat -d -d -d -r10" follow; the 4K pagesize program was named "foo"
and the 2M pagesize program "testr" (as noted above). Please note that
these numbers do vary from run to run, but the orders of magnitude of the
differences between the two versions remain relatively constant:

4K Pages:
=========

 Performance counter stats for './foo' (10 runs):

      307054.450421      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.21% )
                  0      context-switches:u        #    0.000 K/sec
                  0      cpu-migrations:u          #    0.000 K/sec
              7,728      page-faults:u             #    0.025 K/sec                    ( +-  0.00% )
  1,401,295,823,265      cycles:u                  #    4.564 GHz                      ( +-  0.19% )  (30.77%)
    562,704,668,718      instructions:u            #    0.40  insn per cycle           ( +-  0.00% )  (38.46%)
     20,100,243,102      branches:u                #   65.461 M/sec                    ( +-  0.00% )  (38.46%)
          2,628,944      branch-misses:u           #    0.01% of all branches          ( +-  3.32% )  (38.46%)
    180,885,880,185      L1-dcache-loads:u         #  589.100 M/sec                    ( +-  0.00% )  (38.46%)
     40,374,420,279      L1-dcache-load-misses:u   #   22.32% of all L1-dcache hits    ( +-  0.01% )  (38.46%)
        232,184,583      LLC-loads:u               #    0.756 M/sec                    ( +-  1.48% )  (30.77%)
         23,990,082      LLC-load-misses:u         #   10.33% of all LL-cache hits     ( +-  1.48% )  (30.77%)
                         L1-icache-loads:u
     74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
    180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
            707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
          5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
      1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
                         L1-dcache-prefetches:u
                         L1-dcache-prefetch-misses:u

      307.093088771 seconds time elapsed                                               ( +-  0.20% )

2M Pages:
=========

 Performance counter stats for './testr' (10 runs):

      289504.209769      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.19% )
                  0      context-switches:u        #    0.000 K/sec
                  0      cpu-migrations:u          #    0.000 K/sec
                598      page-faults:u             #    0.002 K/sec                    ( +-  0.03% )
  1,323,835,488,984      cycles:u                  #    4.573 GHz                      ( +-  0.19% )  (30.77%)
    562,658,682,055      instructions:u            #    0.43  insn per cycle           ( +-  0.00% )  (38.46%)
     20,099,662,528      branches:u                #   69.428 M/sec                    ( +-  0.00% )  (38.46%)
          2,877,086      branch-misses:u           #    0.01% of all branches          ( +-  4.52% )  (38.46%)
    180,899,297,017      L1-dcache-loads:u         #  624.859 M/sec                    ( +-  0.00% )  (38.46%)
     40,209,140,089      L1-dcache-load-misses:u   #   22.23% of all L1-dcache hits    ( +-  0.00% )  (38.46%)
        135,968,232      LLC-loads:u               #    0.470 M/sec                    ( +-  1.56% )  (30.77%)
          6,704,890      LLC-load-misses:u         #    4.93% of all LL-cache hits     ( +-  1.92% )  (30.77%)
                         L1-icache-loads:u
     74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
    180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
                835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
          6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
         51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
                         L1-dcache-prefetches:u
                         L1-dcache-prefetch-misses:u

      289.551551387 seconds time elapsed                                               ( +-  0.20% )
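Distilling the runs above, the headline deltas for the 2M version versus
the 4K version are:

    page-faults:              7,728  ->            598   (~92% fewer)
    LLC-load-misses:     23,990,082  ->      6,704,890   (~72% fewer)
    iTLB-load-misses: 1,219,514,499  ->     51,929,869   (~96% fewer)
    elapsed time:          307.09 s  ->       289.55 s   (~5.7% faster)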
A check of /proc/meminfo with the test program running shows the large
mappings:

    ShmemPmdMapped:   471040 kB

The obvious problem with this first swipe at things is that the large
pages are not placed into the page cache, so, for example, multiple
concurrent executions of the test program allocate and map the large pages
each time.

A greater architectural issue is the best way to support large pages in
the page cache, which is something Matthew Wilcox's multiorder page
support in XArray should solve. (A rough sketch of such an insertion
appears at the end of this note.)

Some questions:

* What is the best approach to deal with large pages when PAGESIZE
  mappings exist? At present, the prototype evicts PAGESIZE pages from the
  page cache, replacing them with a mapping for the large page; future
  mappings of a PAGESIZE range should map using an offset into the
  PMD-sized physical page used to map the PMD-sized virtual page.

* Do we need to create per-filesystem routines to handle large pages, or
  can we delay that? (Ideally we would want to be able to read in the
  contents of large pages without having to read_iter however many
  PAGESIZE pages we need.)

I am happy to take whatever approach is best to add large pages to the
page cache, but it seems useful and crucial that a way be provided for the
system to automatically use THP to map large text pages if so desired;
read-only to begin with, but eventually read/write to accommodate
applications that self-modify code, such as databases and Java.
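For concreteness, here is a rough sketch of what inserting a single
PMD-order page cache entry could look like once multiorder XArray support
lands. Treat the exact calls as assumptions: the names follow the
XA_STATE_ORDER()/xas_store()/xas_nomem() pattern in current XArray code,
but the multiorder interface is precisely what is still being settled
upstream:

    /*
     * Rough sketch, assuming multiorder XArray support: insert one
     * PMD-order (2M) page into the page cache as a single entry
     * rather than as 512 PAGESIZE entries.
     */
    XA_STATE_ORDER(xas, &mapping->i_pages, index, HPAGE_PMD_ORDER);

    do {
            xas_lock_irq(&xas);
            xas_store(&xas, page);          /* one entry spans 2M of file */
            if (!xas_error(&xas))
                    mapping->nrpages += HPAGE_PMD_NR;
            xas_unlock_irq(&xas);
    } while (xas_nomem(&xas, GFP_KERNEL));  /* retry after node alloc */

========================================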