Subject: Re: [RFC PATCH 00/25] Accelerate page migration and use memcg for PMEM management
From: Yang Shi
Date: Thu, 4 Apr 2019 17:32:15 -0700
To: ziy@nvidia.com, Dave Hansen, Keith Busch, Fengguang Wu, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Daniel Jordan, Michal Hocko, "Kirill A. Shutemov", Andrew Morton, Vlastimil Babka, Mel Gorman, John Hubbard, Mark Hairgrove, Nitin Gupta, Javier Cabezas, David Nellans
In-Reply-To: <20190404020046.32741-1-zi.yan@sent.com>
References: <20190404020046.32741-1-zi.yan@sent.com>

On 4/3/19 7:00 PM, Zi Yan wrote:
> From: Zi Yan
>
> Thanks to Dave Hansen's patches, PMEM can now be used as part of system
> memory in the form of NUMA nodes. How to use PMEM along with normal DRAM
> remains an open problem. There are several patchsets posted on the mailing
> list proposing to use page migration to move pages between PMEM and DRAM
> using the Linux page replacement policy [1,2,3]. There are some important
> problems not addressed in these patches:
> 1. Page migration in Linux does not provide high enough throughput for us
>    to fully exploit PMEM or other use cases.
> 2. Linux page replacement runs too infrequently to distinguish hot and
>    cold pages.
>
> I am trying to attack these problems with this patch series. This is not a
> final solution, but I would like to gather more feedback and comments from
> the mailing list.
>
> Page migration throughput problem
> ====
>
> For example, in my recent email [4], I gave the page migration throughput
> numbers for different page migrations, none of which can achieve more than
> 2.5GB/s (the throughput is measured around the kernel functions
> migrate_pages() and migrate_page_copy()):
>
>                              | migrate_pages() | migrate_page_copy()
> migrating single 4KB page:   | 0.312GB/s       | 1.385GB/s
> migrating 512 4KB pages:     | 0.854GB/s       | 1.983GB/s
> migrating single 2MB THP:    | 2.387GB/s       | 2.481GB/s
>
> In reality, microbenchmarks show that Intel PMEM can provide ~65GB/s read
> throughput and ~16GB/s write throughput [5], which are much higher than
> the throughput achieved by Linux page migration.
>
> In addition, it is also desirable to use page migration to move data
> between high-bandwidth memory and DRAM on systems like IBM Summit, which
> exposes high-performance GPU memory as NUMA nodes [6]. This requires even
> higher page migration throughput.
>
> In this patch series, I propose four different ways of improving page
> migration throughput (mostly for 2MB THP migration):
> 1. multi-threaded page migration: Patch 03 to 06.
> 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
> 3. concurrent (batched) page migration: Patch 09, 10, and 11.
> 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
>
> Here are some throughput numbers showing clear improvements on a
> two-socket NUMA machine with two Xeon E5-2650 v3 @ 2.30GHz and a 19.2GB/s
> bandwidth QPI link (the same machine as mentioned in [4]):
>
>                                     | migrate_pages() | migrate_page_copy()
> => migrating single 2MB THP         | 2.387GB/s       | 2.481GB/s
> 2-thread single THP migration       | 3.478GB/s       | 3.704GB/s
> 4-thread single THP migration       | 5.474GB/s       | 6.054GB/s
> 8-thread single THP migration       | 7.846GB/s       | 9.029GB/s
> 16-thread single THP migration      | 7.423GB/s       | 8.464GB/s
> 16-ch. DMA single THP migration     | 4.322GB/s       | 4.536GB/s
>
> 2-thread 16-THP migration           | 3.610GB/s       | 3.838GB/s
> 2-thread 16-THP batched migration   | 4.138GB/s       | 4.344GB/s
> 4-thread 16-THP migration           | 6.385GB/s       | 7.031GB/s
> 4-thread 16-THP batched migration   | 7.382GB/s       | 8.072GB/s
> 8-thread 16-THP migration           | 8.039GB/s       | 9.029GB/s
> 8-thread 16-THP batched migration   | 9.023GB/s       | 10.056GB/s
> 16-thread 16-THP migration          | 8.137GB/s       | 9.137GB/s
> 16-thread 16-THP batched migration  | 9.907GB/s       | 11.175GB/s
>
> 1-thread 16-THP exchange            | 4.135GB/s       | 4.225GB/s
> 2-thread 16-THP batched exchange    | 7.061GB/s       | 7.325GB/s
> 4-thread 16-THP batched exchange    | 9.729GB/s       | 10.237GB/s
> 8-thread 16-THP batched exchange    | 9.992GB/s       | 10.533GB/s
> 16-thread 16-THP batched exchange   | 9.520GB/s       | 10.056GB/s
>
> => migrating 512 4KB pages          | 0.854GB/s       | 1.983GB/s
> 1-thread 512-4KB batched exchange   | 1.271GB/s       | 3.433GB/s
> 2-thread 512-4KB batched exchange   | 1.240GB/s       | 3.190GB/s
> 4-thread 512-4KB batched exchange   | 1.255GB/s       | 3.823GB/s
> 8-thread 512-4KB batched exchange   | 1.336GB/s       | 3.921GB/s
> 16-thread 512-4KB batched exchange  | 1.334GB/s       | 3.897GB/s
>
> Concerns were raised about how to avoid CPU resource competition between
> page migration and user applications, and about power awareness. The
> multi-threaded ktask patch series that Daniel Jordan recently posted could
> be a solution [8].
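
Just to make sure I understand the multi-threaded copy idea correctly: is it
essentially splitting one 2MB copy into chunks and copying the chunks in
parallel, roughly like the userspace sketch below? (This is only my own
illustration with pthreads to confirm my understanding, built with
gcc -pthread; the chunk struct and thread plumbing are mine, and I assume the
actual patches dispatch the chunks to kernel work items instead.)

/*
 * Userspace sketch ONLY: split one 2MB copy into NR_THREADS chunks and
 * copy them in parallel.  Not the kernel code from the patches.
 */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define THP_SIZE   (2UL << 20)	/* 2MB, one THP */
#define NR_THREADS 4

struct chunk { char *dst; const char *src; size_t len; };

static void *copy_chunk(void *arg)
{
	struct chunk *c = arg;

	memcpy(c->dst, c->src, c->len);	/* each thread copies its slice */
	return NULL;
}

int main(void)
{
	char *src = aligned_alloc(4096, THP_SIZE);
	char *dst = aligned_alloc(4096, THP_SIZE);
	pthread_t tid[NR_THREADS];
	struct chunk c[NR_THREADS];
	size_t step = THP_SIZE / NR_THREADS;
	int i;

	memset(src, 1, THP_SIZE);
	for (i = 0; i < NR_THREADS; i++) {
		c[i] = (struct chunk){ dst + i * step, src + i * step, step };
		pthread_create(&tid[i], NULL, copy_chunk, &c[i]);
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tid[i], NULL);	/* whole page copied once all join */

	free(src);
	free(dst);
	return 0;
}

If that is the case, the flattening of the 16-thread numbers would make sense,
since presumably the copy becomes limited by QPI/memory bandwidth rather than
by CPU at that point.
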
>
> Infrequent page list update problem
> ====
>
> Current page lists are updated by calling shrink_list() when memory
> pressure comes, which might not be frequent enough to keep track of hot
> and cold pages: all pages are on the active lists the first time
> shrink_list() is called, and the reference bit on these pages might not
> reflect their up-to-date access status. But we also do not want to
> periodically shrink the global page lists, which would add unnecessary
> overhead to the whole system. So I propose to actively shrink page lists
> on the memcg we are interested in.
>
> Patches 18 to 25 add a new system call to shrink the page lists of a given
> application's memcg and migrate pages between two NUMA nodes. This
> isolates the impact from the rest of the system. To share DRAM among
> different applications, Patches 18 and 19 add a per-node memcg size limit,
> so you can limit the memory usage for particular NUMA node(s).

This sounds a little bit confusing to me. Is it entirely the user's decision
when to call the syscall to shrink page lists? If so, how would the user know
when the timing is right? Could you please elaborate on the use case? (I put
a rough sketch of how I imagine userspace would drive this after the patch
list below.)

Thanks,
Yang

>
>
> Patch structure
> ====
> 1. multi-threaded page migration: Patch 01 to 06.
> 2. DMA-based (using Intel IOAT DMA) page migration: Patch 07 and 08.
> 3. concurrent (batched) page migration: Patch 09, 10, and 11.
> 4. exchange pages: Patch 12 to 17. (This is a repost of part of [7])
> 5. per-node size limit in memcg: Patch 18 and 19.
> 6. actively shrink page lists and perform page migration in a given memcg: Patch 20 to 25.
>
>
> Any comment is welcome.
>
> [1]: https://lore.kernel.org/linux-mm/20181226131446.330864849@intel.com/
> [2]: https://lore.kernel.org/linux-mm/20190321200157.29678-1-keith.busch@intel.com/
> [3]: https://lore.kernel.org/linux-mm/1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com/
> [4]: https://lore.kernel.org/linux-mm/6A903D34-A293-4056-B135-6FA227DE1828@nvidia.com/
> [5]: https://www.storagereview.com/supermicro_superserver_with_intel_optane_dc_persistent_memory_first_look_review
> [6]: https://www.ibm.com/thought-leadership/summit-supercomputer/
> [7]: https://lore.kernel.org/linux-mm/20190215220856.29749-1-zi.yan@sent.com/
> [8]: https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/
>
> Zi Yan (25):
>   mm: migrate: Change migrate_mode to support combination migration modes.
>   mm: migrate: Add mode parameter to support future page copy routines.
>   mm: migrate: Add a multi-threaded page migration function.
>   mm: migrate: Add copy_page_multithread into migrate_pages.
>   mm: migrate: Add vm.accel_page_copy in sysfs to control page copy acceleration.
>   mm: migrate: Make the number of copy threads adjustable via sysctl.
>   mm: migrate: Add copy_page_dma to use DMA Engine to copy pages.
>   mm: migrate: Add copy_page_dma into migrate_page_copy.
>   mm: migrate: Add copy_page_lists_dma_always to support copy a list of pages.
>   mm: migrate: copy_page_lists_mt() to copy a page list using multi-threads.
>   mm: migrate: Add concurrent page migration into move_pages syscall.
>   exchange pages: new page migration mechanism: exchange_pages()
>   exchange pages: add multi-threaded exchange pages.
>   exchange pages: concurrent exchange pages.
>   exchange pages: exchange anonymous page and file-backed page.
>   exchange page: Add THP exchange support.
>   exchange page: Add exchange_page() syscall.
>   memcg: Add per node memory usage&max stats in memcg.
>   mempolicy: add MPOL_F_MEMCG flag, enforcing memcg memory limit.
>   memory manage: Add memory manage syscall.
>   mm: move update_lru_sizes() to mm_inline.h for broader use.
>   memory manage: active/inactive page list manipulation in memcg.
>   memory manage: page migration based page manipulation between NUMA nodes.
>   memory manage: limit migration batch size.
>   memory manage: use exchange pages to memory manage to improve throughput.
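
To make my question above more concrete, here is roughly how I imagine a
userspace daemon would drive the new syscall. The syscall name, number, and
argument list below are purely my guesses from the cover letter, not the real
interface added in the "memory manage" patches, so please correct me if it
looks different:

/*
 * Hypothetical sketch ONLY: __NR_mm_manage and its arguments are made up
 * for illustration; they are not the interface from this patch series.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#ifndef __NR_mm_manage
#define __NR_mm_manage	1000	/* placeholder number, made up */
#endif

int main(void)
{
	pid_t pid = 1234;	/* the managed application */
	int dram_node = 0;	/* fast node that should keep hot pages */
	int pmem_node = 2;	/* slow node that receives cold pages */

	for (;;) {
		/*
		 * Guess at the semantics: shrink the page lists of the
		 * task's memcg, then migrate cold pages from DRAM to PMEM
		 * (and hot pages the other way).
		 */
		long ret = syscall(__NR_mm_manage, pid, dram_node, pmem_node, 0);
		if (ret < 0)
			perror("mm_manage");

		sleep(5);	/* how should userspace pick this period? */
	}
	return 0;
}

Even in this toy form it is not obvious how the polling interval or the
hot/cold threshold should be chosen, which is why I am asking about the
intended use case.
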
>
>  arch/x86/entry/syscalls/syscall_64.tbl |    2 +
>  fs/aio.c                               |   12 +-
>  fs/f2fs/data.c                         |    6 +-
>  fs/hugetlbfs/inode.c                   |    4 +-
>  fs/iomap.c                             |    4 +-
>  fs/ubifs/file.c                        |    4 +-
>  include/linux/cgroup-defs.h            |    1 +
>  include/linux/exchange.h               |   27 +
>  include/linux/highmem.h                |    3 +
>  include/linux/ksm.h                    |    4 +
>  include/linux/memcontrol.h             |   67 ++
>  include/linux/migrate.h                |   12 +-
>  include/linux/migrate_mode.h           |    8 +
>  include/linux/mm_inline.h              |   21 +
>  include/linux/sched/coredump.h         |    1 +
>  include/linux/sched/sysctl.h           |    3 +
>  include/linux/syscalls.h               |   10 +
>  include/uapi/linux/mempolicy.h         |    9 +-
>  kernel/sysctl.c                        |   47 +
>  mm/Makefile                            |    5 +
>  mm/balloon_compaction.c                |    2 +-
>  mm/compaction.c                        |   22 +-
>  mm/copy_page.c                         |  708 +++++++++++++++
>  mm/exchange.c                          | 1560 ++++++++++++++++++++++++++++++++
>  mm/exchange_page.c                     |  228 +++++
>  mm/internal.h                          |  113 +++
>  mm/ksm.c                               |   35 +
>  mm/memcontrol.c                        |   80 ++
>  mm/memory_manage.c                     |  649 +++++++++++++
>  mm/mempolicy.c                         |   38 +-
>  mm/migrate.c                           |  621 ++++++++++++-
>  mm/vmscan.c                            |  115 +--
>  mm/zsmalloc.c                          |    2 +-
>  33 files changed, 4261 insertions(+), 162 deletions(-)
>  create mode 100644 include/linux/exchange.h
>  create mode 100644 mm/copy_page.c
>  create mode 100644 mm/exchange.c
>  create mode 100644 mm/exchange_page.c
>  create mode 100644 mm/memory_manage.c
>
> --
> 2.7.4