From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-1.0 required=3.0 tests=HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_PASS,UNPARSEABLE_RELAY,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 776FAC4360F for ; Fri, 5 Apr 2019 00:32:30 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 1C3EA217D4 for ; Fri, 5 Apr 2019 00:32:29 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1C3EA217D4 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id DED7B6B000D; Thu, 4 Apr 2019 20:32:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D9D566B000E; Thu, 4 Apr 2019 20:32:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C3F706B026B; Thu, 4 Apr 2019 20:32:28 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from mail-pg1-f200.google.com (mail-pg1-f200.google.com [209.85.215.200]) by kanga.kvack.org (Postfix) with ESMTP id 83CA76B000D for ; Thu, 4 Apr 2019 20:32:28 -0400 (EDT) Received: by mail-pg1-f200.google.com with SMTP id e12so2690386pgh.2 for ; Thu, 04 Apr 2019 17:32:28 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-original-authentication-results:x-gm-message-state:subject:to:cc :references:from:message-id:date:user-agent:mime-version:in-reply-to :content-transfer-encoding:content-language; bh=kq1gZ1oSWL74k0ayeZ53XFkfMVe0yxXtlQqveMy/Y0s=; b=ktGa77mtEh1jxydpHLQ2RW2GH1Qy+uknn6D3ySCRUHR8yOekyy2x4/4E9G63npodod HHSksZu76YW+bUU7dZ7t7xDY7HUHgu6mYM0+8y7p2nz7NwllDy1X7i6hiJWAPEZLlcVD 5DLUFki2mRlxgpcJ0x0U0XeMLonpKdT73W6N5YmHJQLf0KCBH7NItiXskxQQAj/ywThW 6kAIhAtVYXgOfkA7yFbRcQNsQ9ThcE5WSqoC4ChSbnDOSuOsqFQ6DAmKAPniYg4/Y4eo z3hLVTiSyDnaFx+4/45QYKAld4QvCzYnmbfAvAbzUnq/B4gmgWK2p6tj4q/7GoFgMUuK u8FQ== X-Original-Authentication-Results: mx.google.com; spf=pass (google.com: domain of yang.shi@linux.alibaba.com designates 115.124.30.45 as permitted sender) smtp.mailfrom=yang.shi@linux.alibaba.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=alibaba.com X-Gm-Message-State: APjAAAXiOrRN951THMLgMFjF7PDeNgd2hBI1z33i0XOQ6O70ZYnRbGg0 RXyggi28jJxlJI8IhTPmjYt0oVsFzYgs6tg3wtYY+9RqJWFybSEZvj+l7T+Q/u8JGJMK4UTDpaS PcRr4bWiPGdtdENP84dy6rs41ZzCCv9mrJWHzEvbY2HC/DZvw+zDoaoqhto75DhVH1Q== X-Received: by 2002:a63:e554:: with SMTP id z20mr8795702pgj.234.1554424348094; Thu, 04 Apr 2019 17:32:28 -0700 (PDT) X-Google-Smtp-Source: APXvYqwaQ60J4vYbDFfJxN+cFiN/TDzrXIXEdP2uQVy5EEdo/eXV2qXhEbT5NNF+/AiUcXjXuC0g X-Received: by 2002:a63:e554:: with SMTP id z20mr8795603pgj.234.1554424346932; Thu, 04 Apr 2019 17:32:26 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1554424346; cv=none; d=google.com; s=arc-20160816; b=HrFonlNI/HMEAW82pEUMXulM+6z5pjjI3wpdkAH27GLOFdA8AckvqoJ94mS+U+KDg9 U8KGNcrs4w1TlByduS9xbYOlpXeR5W0QD9KcZhoJ9HQq9h4eXZPkln/pjvdL2qs96uWa f4PWzzE2xlrQQPUFalQSjwV/A+RygS62V9XGxD8KceTL8+rgX/ruWegruNEyNnlzH5e5 9bu5Y0fEmTlHAswT+8GaxZmSlnmIH7YvyLZswKIfMLeAjvVI2eBD/1Xm+A54czi9J9QN 4EQIbbtZbc6EPsuoPRgMYeWXtGPLlkJN+6YAlbXWGOA4NBPdooo8ueCD6EmHitd2a+KI YV5Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-language:content-transfer-encoding:in-reply-to:mime-version :user-agent:date:message-id:from:references:cc:to:subject; bh=kq1gZ1oSWL74k0ayeZ53XFkfMVe0yxXtlQqveMy/Y0s=; b=CQ6sXPPDjs8yCn51a6LoPci9vOH+NsbY8EDUNyzh5zDqHn7uYEZly25SExMU7pzk6t 84VTqEWmzHPnT/7uir2T9X+fXHduGdW60+mfIhcVNOJ4o2X0HdVlbmkT4yLz2vcFzJbI cGpdUqUscXcFxTb1kNC61vjYlHrjMfy2dpKhYPXEDWVR4pHmp9uW++NMv/VU4ScxoBYB TLF9aj4kURTi4dgw/vx4SSFVXA+q6atxtlL8+4R8o5lQke7rsjfc+hUlog7FWTCZ/50T 8q1FxAPdoGEgLs87QoU2WlZsCX4pCR2x2c4dAz6JpuWL948K04TWR57Mue0L+OzZJVTz LAOA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of yang.shi@linux.alibaba.com designates 115.124.30.45 as permitted sender) smtp.mailfrom=yang.shi@linux.alibaba.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: from out30-45.freemail.mail.aliyun.com (out30-45.freemail.mail.aliyun.com. [115.124.30.45]) by mx.google.com with ESMTPS id f7si19501172pfa.50.2019.04.04.17.32.26 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 04 Apr 2019 17:32:26 -0700 (PDT) Received-SPF: pass (google.com: domain of yang.shi@linux.alibaba.com designates 115.124.30.45 as permitted sender) client-ip=115.124.30.45; Authentication-Results: mx.google.com; spf=pass (google.com: domain of yang.shi@linux.alibaba.com designates 115.124.30.45 as permitted sender) smtp.mailfrom=yang.shi@linux.alibaba.com; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=alibaba.com X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R161e4;CH=green;DM=||false|;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e07487;MF=yang.shi@linux.alibaba.com;NM=1;PH=DS;RN=17;SR=0;TI=SMTPD_---0TOWDP3r_1554424335; Received: from US-143344MP.local(mailfrom:yang.shi@linux.alibaba.com fp:SMTPD_---0TOWDP3r_1554424335) by smtp.aliyun-inc.com(127.0.0.1); Fri, 05 Apr 2019 08:32:20 +0800 Subject: Re: [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management To: ziy@nvidia.com, Dave Hansen , Keith Busch , Fengguang Wu , linux-mm@kvack.org, linux-kernel@vger.kernel.org Cc: Daniel Jordan , Michal Hocko , "Kirill A . Shutemov" , Andrew Morton , Vlastimil Babka , Mel Gorman , John Hubbard , Mark Hairgrove , Nitin Gupta , Javier Cabezas , David Nellans References: <20190404020046.32741-1-zi.yan@sent.com> From: Yang Shi Message-ID: Date: Thu, 4 Apr 2019 17:32:15 -0700 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.7.0 MIME-Version: 1.0 In-Reply-To: <20190404020046.32741-1-zi.yan@sent.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 4/3/19 7:00 PM, Zi Yan wrote: > From: Zi Yan > > Thanks to Dave Hansen's patches, which make PMEM as part of memory as NUMA nodes. > How to use PMEM along with normal DRAM remains an open problem. There are > several patchsets posted on the mailing list, proposing to use page migration to > move pages between PMEM and DRAM using Linux page replacement policy [1,2,3]. > There are some important problems not addressed in these patches: > 1. The page migration in Linux does not provide high enough throughput for us to > fully exploit PMEM or other use cases. > 2. Linux page replacement is running too infrequent to distinguish hot and cold > pages. > > I am trying to attack the problems with this patch series. This is not a final > solution, but I would like to gather more feedback and comments from the mailing > list. > > Page migration throughput problem > ==== > > For example, in my recent email [4], I gave the page migration throughput numbers > for different page migrations, none of which can achieve > 2.5GB/s throughput > (the throughput is measured around kernel functions: migrate_pages() and > migrate_page_copy()): > > | migrate_pages() | migrate_page_copy() > migrating single 4KB page: | 0.312GB/s | 1.385GB/s > migrating 512 4KB pages: | 0.854GB/s | 1.983GB/s > migrating single 2MB THP: | 2.387GB/s | 2.481GB/s > > In reality, microbenchmarks show that Intel PMEM can provide ~65GB/s read > throughput and ~16GB/s write throughput [5], which are much higher than > the throughput achieved by Linux page migration. > > In addition, it is also desirable to use page migration to move data > between high-bandwidth memory and DRAM, like IBM Summit, which exposes > high-performance GPU memories as NUMA nodes [6]. This requires even higher page > migration throughput. > > In this patch series, I propose four different ways of improving page migration > throughput (mostly on 2MB THP migration): > 1. multi-threaded page migration: Patch 03 to 06. > 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08. > 3. concurrent (batched) page migration: Patch 09, 10, and 11. > 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7]) > > Here are some throughput numbers showing clear throughput improvements on > a two-socket NUMA machine with two Xeon E5-2650 v3 @ 2.30GHz and a 19.2GB/s > bandwidth QPI link (the same machine as mentioned in [4]): > > | migrate_pages() | migrate_page_copy() > => migrating single 2MB THP | 2.387GB/s | 2.481GB/s > 2-thread single THP migration | 3.478GB/s | 3.704GB/s > 4-thread single THP migration | 5.474GB/s | 6.054GB/s > 8-thread single THP migration | 7.846GB/s | 9.029GB/s > 16-thread single THP migration | 7.423GB/s | 8.464GB/s > 16-ch. DMA single THP migration | 4.322GB/s | 4.536GB/s > > 2-thread 16-THP migration | 3.610GB/s | 3.838GB/s > 2-thread 16-THP batched migration | 4.138GB/s | 4.344GB/s > 4-thread 16-THP migration | 6.385GB/s | 7.031GB/s > 4-thread 16-THP batched migration | 7.382GB/s | 8.072GB/s > 8-thread 16-THP migration | 8.039GB/s | 9.029GB/s > 8-thread 16-THP batched migration | 9.023GB/s | 10.056GB/s > 16-thread 16-THP migration | 8.137GB/s | 9.137GB/s > 16-thread 16-THP batched migration | 9.907GB/s | 11.175GB/s > > 1-thread 16-THP exchange | 4.135GB/s | 4.225GB/s > 2-thread 16-THP batched exchange | 7.061GB/s | 7.325GB/s > 4-thread 16-THP batched exchange | 9.729GB/s | 10.237GB/s > 8-thread 16-THP batched exchange | 9.992GB/s | 10.533GB/s > 16-thread 16-THP batched exchange | 9.520GB/s | 10.056GB/s > > => migrating 512 4KB pages | 0.854GB/s | 1.983GB/s > 1-thread 512-4KB batched exchange | 1.271GB/s | 3.433GB/s > 2-thread 512-4KB batched exchange | 1.240GB/s | 3.190GB/s > 4-thread 512-4KB batched exchange | 1.255GB/s | 3.823GB/s > 8-thread 512-4KB batched exchange | 1.336GB/s | 3.921GB/s > 16-thread 512-4KB batched exchange | 1.334GB/s | 3.897GB/s > > Concerns were raised on how to avoid CPU resource competition between > page migration and user applications and have power awareness. > Daniel Jordan recently posted a multi-threaded ktask patch series could be > a solution [8]. > > > Infrequent page list update problem > ==== > > Current page lists are updated by calling shrink_list() when memory pressure > comes, which might not be frequent enough to keep track of hot and cold pages. > Because all pages are on active lists at the first time shrink_list() is called > and the reference bit on the pages might not reflect the up to date access status > of these pages. But we also do not want to periodically shrink the global page > lists, which adds unnecessary overheads to the whole system. So I propose to > actively shrink page lists on the memcg we are interested in. > > Patch 18 to 25 add a new system call to shrink page lists on given application's > memcg and migrate pages between two NUMA nodes. It isolates the impact from the > rest of the system. To share DRAM among different applications, Patch 18 and 19 > add per-node memcg size limit, so you can limit the memory usage for particular > NUMA node(s). This sounds a little bit confusing to me. Is it totally user's decision about when to call the syscall to shrink page lists? But, how would user know when is a good timing? Could you please elaborate the usecase? Thanks, Yang > > > Patch structure > ==== > 1. multi-threaded page migration: Patch 01 to 06. > 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08. > 3. concurrent (batched) page migration: Patch 09, 10, and 11. > 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7]) > 5. per-node size limit in memcg: Patch 18 and 19. > 6. actively shrink page lists and perform page migration in given memcg: Patch 20 to 25. > > > Any comment is welcome. > > [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/ > [2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/ > [3]: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/ > [4]: https://lore.kernel.org/linux-mm/6A903D34-A293-4056-B135-6FA227DE1828@nvidia.com/ > [5]: https://www.storagereview.com/supermicro_superserver_with_intel_optane_dc_persistent_memory_first_look_review > [6]: https://www.ibm.com/thought-leadership/summit-supercomputer/ > [7]: https://lore.kernel.org/linux-mm/20190215220856.29749-1-zi.yan@sent.com/ > [8]: https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > Zi Yan (25): > mm: migrate: Change migrate_mode to support combination migration > modes. > mm: migrate: Add mode parameter to support future page copy routines. > mm: migrate: Add a multi-threaded page migration function. > mm: migrate: Add copy_page_multithread into migrate_pages. > mm: migrate: Add vm.accel_page_copy in sysfs to control page copy > acceleration. > mm: migrate: Make the number of copy threads adjustable via sysctl. > mm: migrate: Add copy_page_dma to use DMA Engine to copy pages. > mm: migrate: Add copy_page_dma into migrate_page_copy. > mm: migrate: Add copy_page_lists_dma_always to support copy a list of > pages. > mm: migrate: copy_page_lists_mt() to copy a page list using > multi-threads. > mm: migrate: Add concurrent page migration into move_pages syscall. > exchange pages: new page migration mechanism: exchange_pages() > exchange pages: add multi-threaded exchange pages. > exchange pages: concurrent exchange pages. > exchange pages: exchange anonymous page and file-backed page. > exchange page: Add THP exchange support. > exchange page: Add exchange_page() syscall. > memcg: Add per node memory usage&max stats in memcg. > mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit. > memory manage: Add memory manage syscall. > mm: move update_lru_sizes() to mm_inline.h for broader use. > memory manage: active/inactive page list manipulation in memcg. > memory manage: page migration based page manipulation between NUMA > nodes. > memory manage: limit migration batch size. > memory manage: use exchange pages to memory manage to improve > throughput. > > arch/x86/entry/syscalls/syscall_64.tbl | 2 + > fs/aio.c | 12 +- > fs/f2fs/data.c | 6 +- > fs/hugetlbfs/inode.c | 4 +- > fs/iomap.c | 4 +- > fs/ubifs/file.c | 4 +- > include/linux/cgroup-defs.h | 1 + > include/linux/exchange.h | 27 + > include/linux/highmem.h | 3 + > include/linux/ksm.h | 4 + > include/linux/memcontrol.h | 67 ++ > include/linux/migrate.h | 12 +- > include/linux/migrate_mode.h | 8 + > include/linux/mm_inline.h | 21 + > include/linux/sched/coredump.h | 1 + > include/linux/sched/sysctl.h | 3 + > include/linux/syscalls.h | 10 + > include/uapi/linux/mempolicy.h | 9 +- > kernel/sysctl.c | 47 + > mm/Makefile | 5 + > mm/balloon_compaction.c | 2 +- > mm/compaction.c | 22 +- > mm/copy_page.c | 708 +++++++++++++++ > mm/exchange.c | 1560 ++++++++++++++++++++++++++++++++ > mm/exchange_page.c | 228 +++++ > mm/internal.h | 113 +++ > mm/ksm.c | 35 + > mm/memcontrol.c | 80 ++ > mm/memory_manage.c | 649 +++++++++++++ > mm/mempolicy.c | 38 +- > mm/migrate.c | 621 ++++++++++++- > mm/vmscan.c | 115 +-- > mm/zsmalloc.c | 2 +- > 33 files changed, 4261 insertions(+), 162 deletions(-) > create mode 100644 include/linux/exchange.h > create mode 100644 mm/copy_page.c > create mode 100644 mm/exchange.c > create mode 100644 mm/exchange_page.c > create mode 100644 mm/memory_manage.c > > -- > 2.7.4