From: Fengguang Wu <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Linux Memory Management List <linux-mm@kvack.org>
Cc: kvm@vger.kernel.org
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Fan Du <fan.du@intel.com>
Cc: Yao Yuan <yuan.yao@intel.com>
Cc: Peng Dong <dongx.peng@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Liu Jingqi <jingqi.liu@intel.com>
Cc: Dong Eddie <eddie.dong@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Zhang Yi <yi.z.zhang@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Subject: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Date: Wed, 26 Dec 2018 21:14:46 +0800
Message-ID: <20181226131446.330864849@intel.com>

This is an attempt to use NVDIMM/PMEM as volatile NUMA memory that's
transparent to normal applications and virtual machines.

The code is still in active development. It's provided for early design review.

Key functionalities:

1) create and describe a PMEM NUMA node for NVDIMM memory
2) a dumb /proc/PID/idle_pages interface for user space driven hot page accounting
3) passive kernel cold page migration in the page reclaim path
4) an improved move_pages() for active user space hot/cold page migration

(1) is the foundation for transparent use of NVDIMM by normal apps and virtual
machines. (2-4) enable automatic placement of hot pages in DRAM for better
performance. A user space migration daemon is being built on top of this kernel
patchset to complete the full vertical solution.
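
To make the intent of (2-4) concrete, here is a minimal sketch of the kind of
placement decision a user space daemon might make once it has per-page access
counts. It is not the daemon under development: struct page_stat,
classify_pages() and the "hottest pages first, up to DRAM capacity" policy are
all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-page record kept by a user space daemon. */
struct page_stat {
	unsigned long vaddr;    /* page-aligned virtual address */
	unsigned int accessed;  /* number of scans that saw the page accessed */
	int in_dram;            /* output: 1 = place in DRAM, 0 = place in PMEM */
};

/* qsort() comparator: most frequently accessed pages first. */
static int hotter_first(const void *a, const void *b)
{
	const struct page_stat *pa = a, *pb = b;

	if (pa->accessed != pb->accessed)
		return pa->accessed > pb->accessed ? -1 : 1;
	/* Tie-break on address so the resulting order is deterministic. */
	if (pa->vaddr != pb->vaddr)
		return pa->vaddr < pb->vaddr ? -1 : 1;
	return 0;
}

/*
 * Sort pages by hotness and mark the hottest dram_pages of them for DRAM;
 * everything else is marked for PMEM. Returns the number marked for DRAM.
 */
size_t classify_pages(struct page_stat *pages, size_t n, size_t dram_pages)
{
	size_t hot, i;

	assert(pages != NULL || n == 0);
	if (n == 0)
		return 0;

	qsort(pages, n, sizeof(*pages), hotter_first);
	hot = n < dram_pages ? n : dram_pages;
	for (i = 0; i < n; i++)
		pages[i].in_dram = i < hot;
	return hot;
}
```

A real daemon would feed the result into page migration and would likely age
the counters between scans; both are left out here.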

The base kernel is v4.20. The patches are not suitable for upstreaming in the
near future: some are quick hacks, and others need more work. However, they are
complete enough to demo the kernel changes necessary for the proposed app- and
VM-transparent volatile NVDIMM use model.

The interfaces are far from finalized. They illustrate what would be necessary
for building a user space driven solution. The exact forms will need more
thought and input. We may adopt an HMAT-based solution for the NUMA node
related interfaces when it is ready. The /proc/PID/idle_pages interface is
standalone but non-trivial. Before it can be upstreamed, we expect it will take
a long time to collect real use cases and feedback in order to refine and
stabilize the format.

Create PMEM NUMA node

	[PATCH 01/21] e820: cheat PMEM as DRAM

Mark NUMA node as DRAM/PMEM

	[PATCH 02/21] acpi/numa: memorize NUMA node type from SRAT table
	[PATCH 03/21] x86/numa_emulation: fix fake NUMA in uniform case
	[PATCH 04/21] x86/numa_emulation: pass numa node type to fake nodes
	[PATCH 05/21] mmzone: new pgdat flags for DRAM and PMEM
	[PATCH 06/21] x86,numa: update numa node type
	[PATCH 07/21] mm: export node type {pmem|dram} under /sys/bus/node

Point neighbor DRAM/PMEM to each other

	[PATCH 08/21] mm: introduce and export pgdat peer_node
	[PATCH 09/21] mm: avoid duplicate peer target node

Standalone zonelist for DRAM and PMEM nodes

	[PATCH 10/21] mm: build separate zonelist for PMEM and DRAM node

Keep page table pages in DRAM

	[PATCH 11/21] kvm: allocate page table pages from DRAM
	[PATCH 12/21] x86/pgtable: allocate page table pages from DRAM

/proc/PID/idle_pages interface for virtual machine and normal tasks

	[PATCH 13/21] x86/pgtable: dont check PMD accessed bit
	[PATCH 14/21] kvm: register in mm_struct
	[PATCH 15/21] ept-idle: EPT walk for virtual machine
	[PATCH 16/21] mm-idle: mm_walk for normal task
	[PATCH 17/21] proc: introduce /proc/PID/idle_pages
	[PATCH 18/21] kvm-ept-idle: enable module

Mark hot pages

	[PATCH 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag

Kernel DRAM=>PMEM migration

	[PATCH 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node
	[PATCH 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM
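
For readers less familiar with the syscall that patch 19 extends, the sketch
below shows how the existing move_pages(2) is typically driven from user space
for a single page. It uses only the upstream interface: the MPOL_MF_SW_YOUNG
flag from this series is not used, since its value is not given here. The
helper names and SKETCH_MPOL_MF_MOVE are made up for the example, and the raw
syscall is used only to avoid depending on libnuma.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Same value as MPOL_MF_MOVE in <linux/mempolicy.h>; redefined so the sketch
 * builds without libnuma or the kernel uapi headers.
 */
#define SKETCH_MPOL_MF_MOVE (1 << 1)

/* Round an address down to the start of its page. */
static void *page_base(const void *addr)
{
	uintptr_t mask = (uintptr_t)sysconf(_SC_PAGESIZE) - 1;

	return (void *)((uintptr_t)addr & ~mask);
}

/*
 * Issue move_pages(2) for one page of the calling process (pid 0).
 * With node == NULL the kernel moves nothing and only reports where the
 * page currently resides. Returns the per-page status (a node id, or a
 * negative errno for that page), or a negative errno if the call failed.
 */
static int move_pages_one(void *addr, const int *node, int flags)
{
	void *pages[1];
	int status[1] = { -1 };
	long rc;

	assert(addr != NULL);
	pages[0] = page_base(addr);
	rc = syscall(SYS_move_pages, 0L, 1UL, pages, node, status, (long)flags);
	if (rc < 0)
		return -errno;
	return status[0];
}

/* Query which NUMA node currently backs the page containing addr. */
int page_node(void *addr)
{
	return move_pages_one(addr, NULL, 0);
}

/* Ask the kernel to migrate the page containing addr to target_node. */
int move_page_to_node(void *addr, int target_node)
{
	return move_pages_one(addr, &target_node, SKETCH_MPOL_MF_MOVE);
}
```

A hot page sitting on a PMEM node would be promoted with something like
move_page_to_node(addr, dram_node), where dram_node is whatever DRAM node the
caller has chosen. The call can fail with EPERM or ENOSYS in restricted
environments, so callers should check for negative return values.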

 arch/x86/include/asm/numa.h    |    2 
 arch/x86/include/asm/pgalloc.h |   10 
 arch/x86/include/asm/pgtable.h |    3 
 arch/x86/kernel/e820.c         |    3 
 arch/x86/kvm/Kconfig           |   11 
 arch/x86/kvm/Makefile          |    4 
 arch/x86/kvm/ept_idle.c        |  841 +++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h        |  116 ++++
 arch/x86/kvm/mmu.c             |   12 
 arch/x86/mm/numa.c             |    3 
 arch/x86/mm/numa_emulation.c   |   30 +
 arch/x86/mm/pgtable.c          |   22 
 drivers/acpi/numa.c            |    5 
 drivers/base/node.c            |   21 
 fs/proc/base.c                 |    2 
 fs/proc/internal.h             |    1 
 fs/proc/task_mmu.c             |   54 +
 include/linux/mm_types.h       |   11 
 include/linux/mmzone.h         |   38 +
 mm/mempolicy.c                 |   14 
 mm/migrate.c                   |   13 
 mm/page_alloc.c                |   77 ++
 mm/pagewalk.c                  |    1 
 mm/vmscan.c                    |   38 +
 virt/kvm/kvm_main.c            |    3 
 25 files changed, 1306 insertions(+), 29 deletions(-)

V1 patches: https://lkml.org/lkml/2018/9/2/13

Regards,
Fengguang



Thread overview: 99+ messages
2018-12-26 13:14 Fengguang Wu [this message]
2018-12-26 13:14 ` [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 01/21] e820: cheat PMEM as DRAM Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-27  3:41   ` Matthew Wilcox
2018-12-27  4:11     ` Fengguang Wu
2018-12-27  5:13       ` Dan Williams
2018-12-27  5:13         ` Dan Williams
2018-12-27 19:32         ` Yang Shi
2018-12-27 19:32           ` Yang Shi
2018-12-28  3:27           ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 02/21] acpi/numa: memorize NUMA node type from SRAT table Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 03/21] x86/numa_emulation: fix fake NUMA in uniform case Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 04/21] x86/numa_emulation: pass numa node type to fake nodes Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 05/21] mmzone: new pgdat flags for DRAM and PMEM Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 06/21] x86,numa: update numa node type Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 07/21] mm: export node type {pmem|dram} under /sys/bus/node Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 08/21] mm: introduce and export pgdat peer_node Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-27 20:07   ` Christopher Lameter
2018-12-27 20:07     ` Christopher Lameter
2018-12-28  2:31     ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 09/21] mm: avoid duplicate peer target node Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2019-01-01  9:14   ` Aneesh Kumar K.V
2019-01-01  9:14     ` Aneesh Kumar K.V
2019-01-07  9:57     ` Fengguang Wu
2019-01-07 14:09       ` Aneesh Kumar K.V
2018-12-26 13:14 ` [RFC][PATCH v2 11/21] kvm: allocate page table pages from DRAM Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2019-01-01  9:23   ` Aneesh Kumar K.V
2019-01-01  9:23     ` Aneesh Kumar K.V
2019-01-02  0:59     ` Yuan Yao
2019-01-02 16:47   ` Dave Hansen
2019-01-07 10:21     ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 12/21] x86/pgtable: " Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:14 ` [RFC][PATCH v2 13/21] x86/pgtable: dont check PMD accessed bit Fengguang Wu
2018-12-26 13:14   ` Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 14/21] kvm: register in mm_struct Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2019-02-02  6:57   ` Peter Xu
2019-02-02 10:50     ` Fengguang Wu
2019-02-04 10:46     ` Paolo Bonzini
2018-12-26 13:15 ` [RFC][PATCH v2 15/21] ept-idle: EPT walk for virtual machine Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 16/21] mm-idle: mm_walk for normal task Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 17/21] proc: introduce /proc/PID/idle_pages Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 18/21] kvm-ept-idle: enable module Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 19/21] mm/migrate.c: add move_pages(MPOL_MF_SW_YOUNG) flag Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 20/21] mm/vmscan.c: migrate anon DRAM pages to PMEM node Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2018-12-26 13:15 ` [RFC][PATCH v2 21/21] mm/vmscan.c: shrink anon list if can migrate to PMEM Fengguang Wu
2018-12-26 13:15   ` Fengguang Wu
2018-12-27 20:31 ` [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration Michal Hocko
2018-12-28  5:08   ` Fengguang Wu
2018-12-28  8:41     ` Michal Hocko
2018-12-28  9:42       ` Fengguang Wu
2018-12-28 12:15         ` Michal Hocko
2018-12-28 13:15           ` Fengguang Wu
2018-12-28 13:15             ` Fengguang Wu
2018-12-28 19:46             ` Michal Hocko
2018-12-28 13:31           ` Fengguang Wu
2018-12-28 18:28             ` Yang Shi
2018-12-28 18:28               ` Yang Shi
2018-12-28 19:52             ` Michal Hocko
2019-01-02 12:21               ` Jonathan Cameron
2019-01-02 12:21                 ` Jonathan Cameron
2019-01-08 14:52                 ` Michal Hocko
2019-01-10 15:53                   ` Jerome Glisse
2019-01-10 15:53                     ` Jerome Glisse
2019-01-10 16:42                     ` Michal Hocko
2019-01-10 17:42                       ` Jerome Glisse
2019-01-10 17:42                         ` Jerome Glisse
2019-01-10 18:26                   ` Jonathan Cameron
2019-01-10 18:26                     ` Jonathan Cameron
2019-01-28 17:42                 ` Jonathan Cameron
2019-01-28 17:42                   ` Jonathan Cameron
2019-01-29  2:00                   ` Fengguang Wu
2019-01-03 10:57               ` Mel Gorman
2019-01-10 16:25               ` Jerome Glisse
2019-01-10 16:25                 ` Jerome Glisse
2019-01-10 16:50                 ` Michal Hocko
2019-01-10 18:02                   ` Jerome Glisse
2019-01-10 18:02                     ` Jerome Glisse
2019-01-02 18:12       ` Dave Hansen
2019-01-08 14:53         ` Michal Hocko
