Subject: [PATCH v2 0/9] Introduce Data Access MONitor (DAMON)
Date: Tue, 28 Jan 2020 09:57:33 +0100
Message-ID: <20200128085742.14566-1-sjpark@amazon.com>

From: SeongJae Park

This patchset introduces a new kernel module for practical monitoring of
data accesses, namely DAMON.

The patches are organized in the following sequence.  The first four
patches implement the core logic of DAMON one by one.  After that, the
fifth patch implements DAMON's debugfs interface for users.  To provide a
minimal reference to the low level interface and for more convenient
use/tests of DAMON, the sixth patch implements a user space tool.  The
seventh patch adds a document for administrators of DAMON, and the eighth
patch provides DAMON's kunit tests.  Finally, the ninth patch implements a
tracepoint for DAMON.  As the tracepoint prints every monitoring result,
it can be easily integrated with other tracers supporting tracepoints,
including perf.

The patches are based on v5.5.  You can also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damon/patches/v2

The tree is also available on the web:
https://github.com/sjp38/linux/releases/tag/damon/patches/v2


Patch History
-------------

Changes from v1
(https://lore.kernel.org/linux-mm/20200120162757.32375-1-sjpark@amazon.com/)
 - Rebase on v5.5
 - Add a tracepoint for integration with other tracers (Kirill A.
   Shutemov)
 - document: Add more description for the user space tool (Brendan Higgins)
 - unittest: Improve readability (Brendan Higgins)
 - unittest: Use consistent names and helper functions (Brendan Higgins)
 - Update PG_Young to avoid reclaim logic interference (Yunjae Lee)

Changes from RFC
(https://lore.kernel.org/linux-mm/20200110131522.29964-1-sjpark@amazon.com/)
 - Specify an ambiguous plan of access pattern based mm optimizations
 - Support loadable module build
 - Cleanup code

----

DAMON is a kernel module that allows users to monitor the actual memory
access pattern of specific user-space processes.  It aims to be 1) accurate
enough to be useful for performance-centric domains, and 2) sufficiently
light-weight so that it can be applied online.

To achieve these goals, DAMON utilizes its two core mechanisms, called
region-based sampling and adaptive regions adjustment.  The region-based
sampling allows users to make their own trade-off between the quality and
the overhead of the monitoring and to set the upper bound of the
monitoring overhead.  Further, the adaptive regions adjustment mechanism
lets DAMON maximize the quality and minimize the overhead on a best-effort
basis while preserving the user-configured trade-off.

Please note that the term 'memory' in this document means 'main memory'.
It also assumes that main memory usually consists of mid-speed memory
devices such as DRAMs or NVRAMs.  CPU caches and storage devices are not
our concern, as those are too fast or too slow to be in DAMON's scope.


Background
==========

For performance-centric analysis and optimizations of memory management
schemes (either in kernel space or in user space), the actual data access
pattern of the workloads is highly useful.  The information needs only to
be reasonable rather than strictly correct, because some level of
incorrectness can be handled in many performance-centric domains.
It also needs to be obtained within a reasonably short time and with only
light-weight overhead.

Manually extracting such data is difficult and time consuming if the
target workload is huge and complex, even for the developers of the
programs.  There is a range of tools and techniques developed for general
memory access investigations, and some of those could be partially used
for this purpose.  However, most of those are impractical or do not scale,
mainly because they are designed with no consideration of the trade-off
between the accuracy of the output and the overhead.

The memory access instrumentation techniques that are applied in many
tools such as Intel PIN are essential for cases that require correctness,
such as invalid memory access bug detection.  However, those usually incur
a high overhead which is unacceptable for many of the performance-centric
domains.  Periodic access checks based on H/W or S/W access counting
features (e.g., the Accessed bits of PTEs or the PG_Idle flags of pages)
can dramatically decrease the overhead by giving up some of the quality,
compared to the instrumentation based techniques.  The reduced quality is
still reasonable for many of the domains, but the overhead can arbitrarily
increase as the size of the target workload grows.  Miniature-like static
region based sampling can set the upper bound of the overhead, but it will
in turn decrease the quality of the output as the size of the workload
grows.


Related Works
=============

There are a number of studies[1,2,3,4,5,6] optimizing memory management
mechanisms based on the actual memory access patterns that show impressive
results.  However, most of those give no deep consideration to the
monitoring of the accesses itself.  Some of those focus on the overhead of
the monitoring, but do not consider the accuracy scalability[6] or have
additional dependencies[7].
Indeed, one recent study[5] about proactive reclamation was also
proposed[8] to the kernel community, but the monitoring overhead was
considered a main problem.

[1] Subramanya R Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram,
    Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan.
    2016.  Data tiering in heterogeneous memory systems.  In Proceedings of
    the 11th European Conference on Computer Systems (EuroSys).  ACM, 15.
[2] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and
    Emmett Witchel.  2016.  Coordinated and efficient huge page management
    with ingens.  In 12th USENIX Symposium on Operating Systems Design and
    Implementation (OSDI).  705–721.
[3] Harald Servat, Antonio J Peña, Germán Llort, Estanislao Mercadal,
    HansChristian Hoppe, and Jesús Labarta.  2017.  Automating the
    application data placement in hybrid memory systems.  In 2017 IEEE
    International Conference on Cluster Computing (CLUSTER).  IEEE,
    126–136.
[4] Vlad Nitu, Boris Teabe, Alain Tchana, Canturk Isci, and Daniel
    Hagimont.  2018.  Welcome to zombieland: practical and energy-efficient
    memory disaggregation in a datacenter.  In Proceedings of the 13th
    European Conference on Computer Systems (EuroSys).  ACM, 16.
[5] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal,
    Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan
    Deng, Junaid Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and
    Parthasarathy Ranganathan.  2019.  Software-Defined Far Memory in
    Warehouse-Scale Computers.  In Proceedings of the 24th International
    Conference on Architectural Support for Programming Languages and
    Operating Systems (ASPLOS).  ACM, New York, NY, USA, 317–330.
    DOI:https://doi.org/10.1145/3297858.3304053
[6] Carl Waldspurger, Trausti Saemundsson, Irfan Ahmad, and Nohhyun Park.
    2017.  Cache Modeling and Optimization using Miniature Simulations.  In
    2017 USENIX Annual Technical Conference (ATC).
    USENIX Association, Santa Clara, CA, 487–498.
    https://www.usenix.org/conference/atc17/technical-sessions/
[7] Haojie Wang, Jidong Zhai, Xiongchao Tang, Bowen Yu, Xiaosong Ma, and
    Wenguang Chen.  2018.  Spindle: Informed Memory Access Monitoring.  In
    2018 USENIX Annual Technical Conference (ATC).  USENIX Association,
    Boston, MA, 561–574.
    https://www.usenix.org/conference/atc18/presentation/wang-haojie
[8] Jonathan Corbet.  2019.  Proactively reclaiming idle memory.  (2019).
    https://lwn.net/Articles/787611/.


Expected Use-cases
==================

A straightforward use case of DAMON would be program behavior analysis.
With the DAMON output, users can confirm whether the program is running as
intended or not.  This will be useful for debugging and for testing design
points.

The monitored results can also be useful for counting the dynamic working
set size of workloads.  This will be useful for the administration of
memory overcommitted systems or for the selection of the environments
(e.g., containers providing different amounts of memory) for your
workloads.

If you are a programmer, you can optimize your program by managing the
memory based on the actual data access pattern.  For example, you can
identify the dynamic hotness of your data using DAMON and call ``mlock()``
to keep your hot data in DRAM, or call ``madvise()`` with ``MADV_PAGEOUT``
to proactively reclaim cold data.  Even if your program is guaranteed not
to encounter memory pressure, you can still improve the performance by
using the DAMON outputs to decide where to call ``madvise()`` with
``MADV_HUGEPAGE`` and ``MADV_NOHUGEPAGE``.  More creative optimizations
would be possible.  Our evaluations of DAMON include a straightforward
optimization using ``mlock()``.  Please refer to the Evaluations section
below for more detail.

As DAMON incurs very low overhead, such optimizations can be applied not
only offline, but also online.
Also, there is no reason to limit such optimizations to the user space.
Several parts of the kernel's memory management mechanisms could also be
optimized using DAMON.  The reclamation, the THP (de)promotion decisions,
and the compaction would be such candidates.  Nevertheless, the current
version of DAMON is not highly optimized for the online/in-kernel uses.


A Future Plan: Data Access Based Optimizations Support
------------------------------------------------------

As described in the above section, DAMON could be helpful for actual
access based memory management optimizations.  Nevertheless, users who
want to do such optimizations should run DAMON, read the traced data
(either online or offline), analyze it, plan a new memory management
scheme, and apply the new scheme by themselves.  This is easier than
before, but could still require some level of effort.  In its next
development stage, DAMON will reduce some of that effort by allowing users
to specify some access based memory management rules for their specific
processes.

Because this is just a plan, the specific interface is not fixed yet, but
for example, users will be allowed to write their desired memory
management rules to a special file in a DAMON specific format.  The rules
will be something like 'if a memory region of size in a range is keeping a
range of hotness for more than a duration, apply specific memory
management rule using madvise() or mlock() to the region'.  For example,
we can imagine rules like below:

    # format is:

    # if a region of a size keeps a very high access frequency for more
    # than 100ms, lock the region in the main memory (call mlock()).  But,
    # if the region is larger than 500 MiB, skip it.  The exception might
    # be helpful if the system has only, say, 600 MiB of DRAM; a region of
    # size larger than 600 MiB cannot be locked in the DRAM at all.
    na 500M 90 99 100ms mlock

    # if a region keeps a high access frequency for more than 100ms, put
    # the region on the head of the LRU list (call madvise() with
    # MADV_WILLNEED).
    na na 80 90 100ms madv_willneed

    # if a region keeps a low access frequency for more than 100ms, put
    # the region on the tail of the LRU list (call madvise() with
    # MADV_COLD).
    na na 10 20 100ms madv_cold

    # if a region keeps a very low access frequency for more than 100ms,
    # swap out the region immediately (call madvise() with MADV_PAGEOUT).
    na na 0 10 100ms madv_pageout

    # if a region of a size bigger than 2MB keeps a very high access
    # frequency for more than 100ms, let the region use huge pages (call
    # madvise() with MADV_HUGEPAGE).
    2M na 90 99 100ms madv_hugepage

    # if a region of a size bigger than 2MB keeps no high access frequency
    # for more than 100ms, prevent the region from using huge pages (call
    # madvise() with MADV_NOHUGEPAGE).
    2M na 0 25 100ms madv_nohugepage


Mechanisms of DAMON
===================


Basic Access Check
------------------

DAMON basically reports which pages are accessed and how frequently.  The
report is passed to users in binary format via a ``result file`` whose
path users can set.  Note that the frequency is not an absolute number of
accesses, but a relative frequency among the pages of the target
workloads.

Users can also control the resolution of the reports by setting two time
intervals, ``sampling interval`` and ``aggregation interval``.  In detail,
DAMON checks access to each page per ``sampling interval``, aggregates the
results (counts the number of the accesses to each page), and reports the
aggregated results per ``aggregation interval``.  For the access check of
each page, DAMON uses the Accessed bits of PTEs.

This is thus similar to the previously mentioned periodic access check
based mechanisms, whose overhead increases as the size of the target
process grows.
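To illustrate how the two intervals interact, the loop described above can
be sketched in user-space Python as below.  This is only a toy model under
stated assumptions, not DAMON's code: ``basic_access_check()`` and the
``was_accessed`` callback are hypothetical names, and the callback merely
stands in for reading and clearing the Accessed bit of a PTE.

```python
from typing import Callable, Dict, List


def basic_access_check(
    pages: List[int],
    was_accessed: Callable[[int, int], bool],
    sampling_interval_ms: int,
    aggregation_interval_ms: int,
    duration_ms: int,
) -> List[Dict[int, int]]:
    """Toy model of the per-page access check (not the kernel code).

    `was_accessed(page, now_ms)` is a hypothetical stand-in for testing
    the Accessed bit of the PTE mapping `page`.
    """
    samples_per_aggregation = aggregation_interval_ms // sampling_interval_ms
    reports: List[Dict[int, int]] = []
    counts = {page: 0 for page in pages}
    nr_samples = 0

    for now_ms in range(0, duration_ms, sampling_interval_ms):
        # One check per page per sampling interval: the cost of this inner
        # loop grows linearly with the number of monitored pages.
        for page in pages:
            if was_accessed(page, now_ms):
                counts[page] += 1
        nr_samples += 1

        # Report and reset the counters once per aggregation interval.
        if nr_samples == samples_per_aggregation:
            reports.append(dict(counts))
            counts = {page: 0 for page in pages}
            nr_samples = 0

    return reports
```

With a 5ms sampling interval and a 100ms aggregation interval, each report
aggregates 20 samples per page, so a reported count is a relative
frequency in the range 0-20 rather than an absolute number of accesses.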

Region Based Sampling
---------------------

To avoid the unbounded increase of the overhead, DAMON groups a number of
adjacent pages that are assumed to have the same access frequencies into a
region.  As long as the assumption (pages in a region have the same access
frequencies) is kept, only one page in the region is required to be
checked.  Thus, for each ``sampling interval``, DAMON randomly picks one
page in each region and clears its Accessed bit.  After one more
``sampling interval``, DAMON reads the Accessed bit of the page and
increases the access frequency of the region if the bit has been set
meanwhile.  Therefore, the monitoring overhead is controllable by setting
the number of regions.  DAMON allows users to set the minimum and maximum
number of regions for the trade-off.

Except for the assumption, this is almost the same as the above-mentioned
miniature-like static region based sampling.  In other words, this scheme
cannot preserve the quality of the output if the assumption is not
guaranteed.


Adaptive Regions Adjustment
---------------------------

At the beginning of the monitoring, DAMON constructs the initial regions
by evenly splitting the memory mapped address space of the process into
the user-specified minimum number of regions.  In this initial state, the
assumption is normally not kept and thus the quality could be low.  To
keep the assumption as much as possible, DAMON adaptively merges and
splits each region.  For each ``aggregation interval``, it compares the
access frequencies of adjacent regions and merges those if the frequency
difference is small.  Then, after it reports and clears the aggregated
access frequency of each region, it splits each region into two regions if
the total number of regions is smaller than half of the user-specified
maximum number of regions.

In this way, DAMON provides its best-effort quality and minimal overhead
while keeping the bounds users set for their trade-off.
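The sampling and the merge/split steps above can be sketched in
user-space Python as below.  This is a simplified model, not the code in
mm/damon.c: the ``Region`` type, the function names, the 4 KiB page size,
the merge ``threshold`` parameter, the size-weighted averaging on merge,
and the random split point are all assumptions made for illustration, and
the ``accessed`` callback stands in for the clear-then-read cycle on the
Accessed bit.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass
class Region:
    start: int            # inclusive start address
    end: int              # exclusive end address
    nr_accesses: int = 0  # aggregated access frequency


def init_regions(start: int, end: int, min_nr_regions: int) -> List[Region]:
    """Evenly split [start, end) into the minimum number of regions."""
    size = (end - start) // min_nr_regions
    regions = [Region(start + i * size, start + (i + 1) * size)
               for i in range(min_nr_regions)]
    regions[-1].end = end  # the last region absorbs any remainder
    return regions


def check_accesses(regions: List[Region],
                   accessed: Callable[[int], bool],
                   rng: random.Random) -> None:
    """Check one randomly picked page per region (one sampling step)."""
    for r in regions:
        nr_pages = max(1, (r.end - r.start) // PAGE_SIZE)
        addr = r.start + rng.randrange(nr_pages) * PAGE_SIZE
        if accessed(addr):
            r.nr_accesses += 1


def merge_regions(regions: List[Region], threshold: int) -> List[Region]:
    """Merge adjacent regions whose frequencies differ by <= threshold."""
    merged: List[Region] = []
    for r in regions:
        last = merged[-1] if merged else None
        if (last is not None and last.end == r.start
                and abs(last.nr_accesses - r.nr_accesses) <= threshold):
            sz_last, sz_r = last.end - last.start, r.end - r.start
            # Assumption: keep a size-weighted average frequency.
            last.nr_accesses = ((last.nr_accesses * sz_last
                                 + r.nr_accesses * sz_r)
                                // (sz_last + sz_r))
            last.end = r.end
        else:
            merged.append(Region(r.start, r.end, r.nr_accesses))
    return merged


def split_regions(regions: List[Region], max_nr_regions: int,
                  rng: random.Random) -> List[Region]:
    """Split each region in two if fewer than max/2 regions exist.

    Counts are assumed to be already reported, so they restart from zero.
    """
    if len(regions) * 2 >= max_nr_regions:
        return [Region(r.start, r.end) for r in regions]
    out: List[Region] = []
    for r in regions:
        nr_pages = (r.end - r.start) // PAGE_SIZE
        if nr_pages < 2:
            out.append(Region(r.start, r.end))
            continue
        # Assumption: the split point is a random page boundary.
        cut = r.start + rng.randrange(1, nr_pages) * PAGE_SIZE
        out.extend([Region(r.start, cut), Region(cut, r.end)])
    return out
```

Because a split doubles the number of regions at most, splitting only
while the count is below half of the maximum keeps the total within the
user-specified upper bound, and each sampling step costs one check per
region regardless of how many pages the regions cover.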

Applying Dynamic Memory Mappings
--------------------------------

Only a few small parts of the huge virtual address space of a process are
mapped to physical memory and accessed.  Thus, tracking the unmapped
address regions is just wasteful.  However, tracking every memory mapping
change might incur a high overhead.  For this reason, DAMON applies the
dynamic memory mapping changes to the tracking regions only once per
user-specified time interval (``regions update interval``).


Evaluations
===========

A prototype of DAMON has been evaluated on an Intel Xeon E7-8837 machine
using 20 benchmarks picked from the SPEC CPU 2006, NAS, Tensorflow
Benchmark, SPLASH-2X, and PARSEC 3 benchmark suites.  Nonetheless, this
section provides only a summary of the results.  For more detail, please
refer to the slides used for the introduction of DAMON at the Linux
Plumbers Conference 2019[1] or the MIDDLEWARE'19 industrial track
paper[2].


Quality
-------

We first traced and visualized the data access pattern of each workload.
We were able to confirm that the visualized results are reasonably
accurate by manually comparing those with the source code of the
workloads.

To see the usefulness of the monitoring, we optimized 9 memory intensive
workloads among them for memory pressure situations using the DAMON
outputs.  In detail, we identified frequently accessed memory regions in
each workload based on the DAMON results and protected them with
``mlock()`` system calls.  The optimized versions consistently show a
speedup (2.55x in the best case, 1.65x on average) under memory pressure.


Overhead
--------

We also measured the overhead of DAMON.  It was not only under the upper
bound we set, but was much lower (0.6 percent of the bound in the best
case, 13.288 percent of the bound on average).  This reduction of the
overhead mainly results from the adaptive regions adjustment.
We also compared the overhead with that of the straightforward periodic
Accessed bit check-based monitoring, which checks the access of every page
frame.  DAMON's overhead was much smaller than that of the straightforward
mechanism: by 94,242.42x in the best case and 3,159.61x on average.


References
==========

Prototypes of DAMON have been introduced by an LPC kernel summit track
talk[1] and two academic papers[2,3].  Please refer to those for more
detailed information, especially the evaluations.

[1] SeongJae Park, Tracing Data Access Pattern with Bounded Overhead and
    Best-effort Accuracy.  In The Linux Kernel Summit, September 2019.
    https://linuxplumbersconf.org/event/4/contributions/548/
[2] SeongJae Park, Yunjae Lee, Heon Y. Yeom, Profiling Dynamic Data Access
    Patterns with Controlled Overhead and Quality.  In 20th ACM/IFIP
    International Middleware Conference Industry, December 2019.
    https://dl.acm.org/doi/10.1145/3366626.3368125
[3] SeongJae Park, Yunjae Lee, Yunhee Kim, Heon Y. Yeom, Profiling Dynamic
    Data Access Patterns with Bounded Overhead and Accuracy.  In IEEE
    International Workshop on Foundations and Applications of Self-
    Systems (FAS 2019), June 2019.

SeongJae Park (9):
  mm: Introduce Data Access MONitor (DAMON)
  mm/damon: Implement region based sampling
  mm/damon: Adaptively adjust regions
  mm/damon: Apply dynamic memory mapping changes
  mm/damon: Add debugfs interface
  mm/damon: Add minimal user-space tools
  Documentation/admin-guide/mm: Add a document for DAMON
  mm/damon: Add kunit tests
  mm/damon: Add a tracepoint for result buffer writing

 .../admin-guide/mm/data_access_monitor.rst |  401 +++++
 Documentation/admin-guide/mm/index.rst     |    1 +
 MAINTAINERS                                |   10 +
 include/trace/events/damon.h               |   32 +
 mm/Kconfig                                 |   23 +
 mm/Makefile                                |    1 +
 mm/damon-test.h                            |  571 ++++++++
 mm/damon.c                                 | 1297 +++++++++++++++++
 tools/damon/.gitignore                     |    1 +
 tools/damon/_dist.py                       |   35 +
 tools/damon/bin2txt.py                     |   64 +
 tools/damon/damo                           |   37 +
 tools/damon/heats.py                       |  358 +++++
 tools/damon/nr_regions.py                  |   88 ++
 tools/damon/record.py                      |  194 +++
 tools/damon/report.py                      |   45 +
 tools/damon/wss.py                         |   94 ++
 17 files changed, 3252 insertions(+)
 create mode 100644 Documentation/admin-guide/mm/data_access_monitor.rst
 create mode 100644 include/trace/events/damon.h
 create mode 100644 mm/damon-test.h
 create mode 100644 mm/damon.c
 create mode 100644 tools/damon/.gitignore
 create mode 100644 tools/damon/_dist.py
 create mode 100644 tools/damon/bin2txt.py
 create mode 100755 tools/damon/damo
 create mode 100644 tools/damon/heats.py
 create mode 100644 tools/damon/nr_regions.py
 create mode 100644 tools/damon/record.py
 create mode 100644 tools/damon/report.py
 create mode 100644 tools/damon/wss.py

--
2.17.1