linux-kernel.vger.kernel.org archive mirror
* RFC: Memory Tiering Kernel Interfaces (v3)
@ 2022-05-26 21:22 Wei Xu
  2022-05-27  2:58 ` Ying Huang
                   ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Wei Xu @ 2022-05-26 21:22 UTC (permalink / raw)
  To: Huang Ying, Andrew Morton, Greg Thelen, Yang Shi,
	Aneesh Kumar K.V, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Linux MM, Jagdish Gediya, Baolin Wang, David Rientjes

Changes since v2
================
* Updated the design and examples to use "rank" instead of device ID
  to determine the order between memory tiers for better flexibility.

Overview
========

The current kernel has basic memory tiering support: inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve
performance.

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed.  The current implementation puts all nodes with CPU into
the top tier, and builds the tier hierarchy tier-by-tier by
establishing the per-node demotion targets based on the distances
between nodes.

The current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  turns a CPU-less memory node into a node with CPUs (or vice
  versa), the memory tier hierarchy gets changed, even though no
  memory node is added or removed.  This can make the tier
  hierarchy unstable and make it difficult to support tier-based
  memory accounting.

* A higher tier node can only be demoted to selected nodes on the
  next lower tier as defined by the demotion path, not any other
  node from any lower tier.  This strict, hard-coded demotion order
  does not work in all use cases (e.g. some use cases may want to
  allow cross-socket demotion to another node in the same demotion
  tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to
  override the system-wide, per-node demotion order from the
  userspace.  This demotion order is also inconsistent with the page
  allocation fallback order when all the nodes in a higher tier are
  out of space: The page allocation can fall back to any node from
  any lower tier, whereas the demotion order doesn't allow that.

* There are no interfaces for userspace to learn about the memory
  tier hierarchy in order to optimize its memory allocations.

I'd like to propose revised memory tier kernel interfaces based on
the discussions in the threads:

- https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
- https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
- https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
- https://lore.kernel.org/linux-mm/d6314cfe1c7898a6680bed1e7cc93b0ab93e3155.camel@intel.com/T/


High-level Design Ideas
=======================

* Define memory tiers explicitly, not implicitly.

* Memory tiers are defined based on the hardware capabilities of
  memory nodes, not the relative distances between nodes.

* The tier assignment of each node is independent of every other
  node's.  Moving a node from one tier to another doesn't affect the
  tier assignment of any other node.

* The node-tier association is stable. A node can be reassigned to a
  different tier only under the specific conditions that don't block
  future tier-based memory cgroup accounting.

* A node can demote its pages to any nodes of any lower tiers. The
  demotion target node selection follows the allocation fallback order
  of the source node, which is built based on node distances.  The
  demotion targets are also restricted to only the nodes from the tiers
  lower than the source node.  We no longer need to maintain a separate
  per-node demotion order (node_demotion[]).


Sysfs Interfaces
================

* /sys/devices/system/memtier/

  This is the directory containing the information about memory tiers.

  Each memory tier has its own subdirectory.

  The order of memory tiers is determined by their rank values, not by
  their memtier device names.

  - /sys/devices/system/memtier/possible

    Format: ordered list of "memtier(rank)"
    Example: 0(64), 1(128), 2(192)

    Read-only.  When read, list all available memory tiers and their
    associated ranks, ordered by the rank values (from the highest
    tier to the lowest tier).

* /sys/devices/system/memtier/memtierN/

  This is the directory containing the information about a particular
  memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).

  The memtier device ID number itself is just an identifier and has no
  special meaning, i.e. memtier device ID numbers do not determine the
  order of memory tiers.

  - /sys/devices/system/memtier/memtierN/rank

    Format: int
    Example: 100

    Read-only.  When read, list the "rank" value associated with memtierN.

    "Rank" is an opaque value. Its absolute value doesn't have any
    special meaning. But the rank values of different memtiers can be
    compared with each other to determine the memory tier order.
    For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
    their rank values are 10, 20, 15, then the memory tier order is:
    memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
    and memtier1 is the lowest tier.

    The rank value of each memtier should be unique.

  - /sys/devices/system/memtier/memtierN/nodelist

    Format: node_list
    Example: 1-2

    Read-only.  When read, list the memory nodes in the specified tier.

    If a memory tier has no memory nodes, the kernel can hide the sysfs
    directory of this memory tier, though the tier itself can still be
    visible from /sys/devices/system/memtier/possible.

* /sys/devices/system/node/nodeN/memtier

  where N = 0, 1, ...

  Format: int or empty
  Example: 1

  When read, list the device ID of the memory tier that the node belongs
  to.  Its value is empty for a CPU-only NUMA node.

  When written, the kernel moves the node into the specified memory
  tier if the move is allowed.  The tier assignment of all other
  nodes is not affected.

  Initially, we can make this interface read-only.


Kernel Representation
=====================

* All memory tiering code is guarded by CONFIG_TIERED_MEMORY.

* #define MAX_MEMORY_TIERS  3

  Support 3 memory tiers for now.  This can be a kconfig option.

* #define MEMORY_DEFAULT_TIER_DEVICE 1

  The default tier device that a memory node is assigned to.

* struct memtier_dev {
      nodemask_t nodelist;
      int rank;
      int tier;
  } memtier_devices[MAX_MEMORY_TIERS]

  Store memory tiers by device IDs.

* struct memtier_dev *memory_tier(int tier)

  Returns the memtier device for a given memory tier.

* int node_tier_dev_map[MAX_NUMNODES]

  Map a node to its tier device ID.

  For each CPU-only node c, node_tier_dev_map[c] = -1.


Memory Tier Initialization
==========================

By default, all memory nodes are assigned to the default tier
(MEMORY_DEFAULT_TIER_DEVICE).  The default tier device has a rank value
in the middle of the possible rank value range (e.g. 127 if the range
is [0..255]).

A device driver can move its memory nodes up or down from the default
tier.  For example, a PMEM driver can move its memory nodes below the
default tier, whereas a GPU driver can move its memory nodes above the
default tier.

The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.


Memory Tier Reassignment
========================

After a memory node is hot-removed, it can be hot-added back to a
different memory tier.  This is useful for supporting dynamically
provisioned CXL.mem NUMA nodes, which may connect to different
memory devices across hot-plug events.  Such tier changes should
be compatible with tier-based memory accounting.

Userspace may also reassign an existing online memory node to a
different tier.  However, this should only be allowed when no pages
are allocated from the memory node or when there are no non-root
memory cgroups (e.g. during system boot).  This restriction is
important for keeping the memory tier hierarchy stable enough for
tier-based memory cgroup accounting.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.


Memory Allocation for Demotion
==============================

To allocate a new page as the demotion target for a page, the kernel
calls the allocation function (__alloc_pages_nodemask) with the
source page node as the preferred node and the union of all lower
tier nodes as the allowed nodemask.  The actual target node selection
then follows the allocation fallback order that the kernel has
already defined.

The pseudo code looks like:

    targets = NODE_MASK_NONE;
    src_nid = page_to_nid(page);
    src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
    for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
            nodes_or(targets, targets, memory_tier(i)->nodelist);
    new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);

The mempolicy of the cpuset, VMA and owner task of the source page can
be set to refine the demotion target nodemask, e.g. to prevent
demotion or select a particular allowed node as the demotion target.


Memory Allocation for Promotion
===============================

The page allocation for promotion is similar to demotion, except that
(1) the target nodemask uses the tiers above the source node, and (2)
the preferred node can be the accessing CPU node, not the source page
node.
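
Following the demotion pseudo code, a promotion sketch could look like
the following (illustrative only; accessing_cpu_nid stands for the node
of the CPU that accessed the page):

    targets = NODE_MASK_NONE;
    src_nid = page_to_nid(page);
    src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
    for (i = 0; i < src_tier; i++)
            nodes_or(targets, targets, memory_tier(i)->nodelist);
    new_page = __alloc_pages_nodemask(gfp, order, accessing_cpu_nid, targets);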


Examples
========

* Example 1:

Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

                  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |        \   /       |
       | 30    40 X 40      | 30
       |        /   \       |
  Node 2 (PMEM)  ----  Node 3 (PMEM)
                  40

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier1/nodelist:0-1
/sys/devices/system/memtier/memtier2/nodelist:2-3

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:1
/sys/devices/system/node/node2/memtier:2
/sys/devices/system/node/node3/memtier:2

Demotion fallback order:
node 0: 2, 3
node 1: 3, 2
node 2: empty
node 3: empty

To prevent cross-socket demotion and memory access, the user can
restrict the allowed nodes, e.g. by setting cpuset.mems=0,2.


* Example 2:

Node 0 & 1 are DRAM nodes.
Node 2 is a PMEM node and closer to node 0.

                  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |            /
       | 30       / 40
       |        /
  Node 2 (PMEM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier1/nodelist:0-1
/sys/devices/system/memtier/memtier2/nodelist:2

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:1
/sys/devices/system/node/node2/memtier:2

Demotion fallback order:
node 0: 2
node 1: 2
node 2: empty


* Example 3:

Node 0 & 1 are DRAM nodes with CPUs, Node 2 is a memory-only DRAM node.

All nodes are in the same tier.

                  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
         \                 /
          \ 30            / 30
           \             /
             Node 2 (DRAM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   30
   2  30   30   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier1/rank:128

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier1/nodelist:0-2

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:1
/sys/devices/system/node/node2/memtier:1

Demotion fallback order:
node 0: empty
node 1: empty
node 2: empty


* Example 4:

Node 0 is a DRAM node with CPU.
Node 1 is a PMEM node.
Node 2 is a GPU node.

                  50
  Node 0 (DRAM)  ----  Node 2 (GPU)
         \                 /
          \ 30            / 60
           \             /
             Node 1 (PMEM)

node distances:
node   0    1    2
   0  10   30   50
   1  30   10   60
   2  50   60   10

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier0/rank:64
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier0/nodelist:2
/sys/devices/system/memtier/memtier1/nodelist:0
/sys/devices/system/memtier/memtier2/nodelist:1

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:2
/sys/devices/system/node/node2/memtier:0

Demotion fallback order:
node 0: 1
node 1: empty
node 2: 0, 1


* Example 5:

Node 0 is a DRAM node with CPU.
Node 1 is a GPU node.
Node 2 is a PMEM node.
Node 3 is a large, slow DRAM node without CPU.

                    100
     Node 0 (DRAM)  ----  Node 1 (GPU)
    /     |               /    |
   /40    |30        120 /     | 110
  |       |             /      |
  |  Node 2 (PMEM) ----       /
  |        \                 /
   \     80 \               /
    ------- Node 3 (Slow DRAM)

node distances:
node    0    1    2    3
   0   10  100   30   40
   1  100   10  120  110
   2   30  120   10   80
   3   40  110   80   10

MAX_MEMORY_TIERS=4 (memtier3 is a memory tier added later).

$ cat /sys/devices/system/memtier/possible
0(64), 1(128), 3(160), 2(192)

$ grep '' /sys/devices/system/memtier/memtier*/rank
/sys/devices/system/memtier/memtier0/rank:64
/sys/devices/system/memtier/memtier1/rank:128
/sys/devices/system/memtier/memtier2/rank:192
/sys/devices/system/memtier/memtier3/rank:160

$ grep '' /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier0/nodelist:1
/sys/devices/system/memtier/memtier1/nodelist:0
/sys/devices/system/memtier/memtier2/nodelist:2
/sys/devices/system/memtier/memtier3/nodelist:3

$ grep '' /sys/devices/system/node/node*/memtier
/sys/devices/system/node/node0/memtier:1
/sys/devices/system/node/node1/memtier:0
/sys/devices/system/node/node2/memtier:2
/sys/devices/system/node/node3/memtier:3

Demotion fallback order:
node 0: 2, 3
node 1: 0, 3, 2
node 2: empty
node 3: 2


* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-26 21:22 RFC: Memory Tiering Kernel Interfaces (v3) Wei Xu
@ 2022-05-27  2:58 ` Ying Huang
  2022-05-27 14:05   ` Hesham Almatary
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-05-27 13:40 ` RFC: Memory Tiering Kernel Interfaces (v3) Aneesh Kumar K V
  2 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-05-27  2:58 UTC (permalink / raw)
  To: Wei Xu, Andrew Morton, Greg Thelen, Yang Shi, Aneesh Kumar K.V,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Linux MM, Jagdish Gediya, Baolin Wang, David Rientjes

On Thu, 2022-05-26 at 14:22 -0700, Wei Xu wrote:
> Changes since v2
> ================
> * Updated the design and examples to use "rank" instead of device ID
>   to determine the order between memory tiers for better flexibility.
> 
> Overview
> ========
> 
> The current kernel has the basic memory tiering support: Inactive
> pages on a higher tier NUMA node can be migrated (demoted) to a lower
> tier NUMA node to make room for new allocations on the higher tier
> NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
> migrated (promoted) to a higher tier NUMA node to improve the
> performance.
> 
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created during
> the kernel initialization and updated when a NUMA node is hot-added or
> hot-removed.  The current implementation puts all nodes with CPU into
> the top tier, and builds the tier hierarchy tier-by-tier by
> establishing the per-node demotion targets based on the distances
> between nodes.
> 
> This current memory tier kernel interface needs to be improved for
> several important use cases:
> 
> * The current tier initialization code always initializes
>   each memory-only NUMA node into a lower tier.  But a memory-only
>   NUMA node may have a high performance memory device (e.g. a DRAM
>   device attached via CXL.mem or a DRAM-backed memory-only node on
>   a virtual machine) and should be put into a higher tier.
> 
> * The current tier hierarchy always puts CPU nodes into the top
>   tier. But on a system with HBM (e.g. GPU memory) devices, these
>   memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
>   with CPUs are better to be placed into the next lower tier.
> 
> * Also because the current tier hierarchy always puts CPU nodes
>   into the top tier, when a CPU is hot-added (or hot-removed) and
>   triggers a memory node from CPU-less into a CPU node (or vice
>   versa), the memory tier hierarchy gets changed, even though no
>   memory node is added or removed.  This can make the tier
>   hierarchy unstable and make it difficult to support tier-based
>   memory accounting.
> 
> * A higher tier node can only be demoted to selected nodes on the
>   next lower tier as defined by the demotion path, not any other
>   node from any lower tier.  This strict, hard-coded demotion order
>   does not work in all use cases (e.g. some use cases may want to
>   allow cross-socket demotion to another node in the same demotion
>   tier as a fallback when the preferred demotion node is out of
>   space), and has resulted in the feature request for an interface to
>   override the system-wide, per-node demotion order from the
>   userspace.  This demotion order is also inconsistent with the page
>   allocation fallback order when all the nodes in a higher tier are
>   out of space: The page allocation can fall back to any node from
>   any lower tier, whereas the demotion order doesn't allow that.
> 
> * There are no interfaces for the userspace to learn about the memory
>   tier hierarchy in order to optimize its memory allocations.
> 
> I'd like to propose revised memory tier kernel interfaces based on
> the discussions in the threads:
> 
> - https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> - https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> - https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> - https://lore.kernel.org/linux-mm/d6314cfe1c7898a6680bed1e7cc93b0ab93e3155.camel@intel.com/T/
> 
> 
> High-level Design Ideas
> =======================
> 
> * Define memory tiers explicitly, not implicitly.
> 
> * Memory tiers are defined based on hardware capabilities of memory
>   nodes, not their relative node distances between each other.
> 
> * The tier assignment of each node is independent from each other.
>   Moving a node from one tier to another tier doesn't affect the tier
>   assignment of any other node.
> 
> * The node-tier association is stable. A node can be reassigned to a
>   different tier only under the specific conditions that don't block
>   future tier-based memory cgroup accounting.
> 
> * A node can demote its pages to any nodes of any lower tiers. The
>   demotion target node selection follows the allocation fallback order
>   of the source node, which is built based on node distances.  The
>   demotion targets are also restricted to only the nodes from the tiers
>   lower than the source node.  We no longer need to maintain a separate
>   per-node demotion order (node_demotion[]).
> 
> 
> Sysfs Interfaces
> ================
> 
> * /sys/devices/system/memtier/
> 
>   This is the directory containing the information about memory tiers.
> 
>   Each memory tier has its own subdirectory.
> 
>   The order of memory tiers is determined by their rank values, not by
>   their memtier device names.
> 
>   - /sys/devices/system/memtier/possible
> 
>     Format: ordered list of "memtier(rank)"
>     Example: 0(64), 1(128), 2(192)
> 
>     Read-only.  When read, list all available memory tiers and their
>     associated ranks, ordered by the rank values (from the highest
>      tier to the lowest tier).

I like the idea of the "possible" file.  And I think we can show the
default tier too.  That is, if "1(128)" is the default tier (the tier
with DRAM), then the list can be,

"
0/64 [1/128] 2/192
"

To make it easier to parse in shell, I would prefer something
like,

"
0	64
1	128	default
2	192
"

But a one-line format is OK for me too.

> 
> * /sys/devices/system/memtier/memtierN/
> 
>   This is the directory containing the information about a particular
>   memory tier, memtierN, where N is the memtier device ID (e.g. 0, 1).
> 
>   The memtier device ID number itself is just an identifier and has no
>   special meaning, i.e. memtier device ID numbers do not determine the
>   order of memory tiers.
> 
>   - /sys/devices/system/memtier/memtierN/rank
> 
>     Format: int
>     Example: 100
> 
>     Read-only.  When read, list the "rank" value associated with memtierN.
> 
>     "Rank" is an opaque value. Its absolute value doesn't have any
>     special meaning. But the rank values of different memtiers can be
>     compared with each other to determine the memory tier order.
>     For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
>     their rank values are 10, 20, 15, then the memory tier order is:
>     memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
>     and memtier1 is the lowest tier.
> 
>     The rank value of each memtier should be unique.
> 
>   - /sys/devices/system/memtier/memtierN/nodelist
> 
>     Format: node_list
>     Example: 1-2
> 
>     Read-only.  When read, list the memory nodes in the specified tier.
> 
>     If a memory tier has no memory nodes, the kernel can hide the sysfs
>     directory of this memory tier, though the tier itself can still be
>     visible from /sys/devices/system/memtier/possible.
> 
> * /sys/devices/system/node/nodeN/memtier
> 
>   where N = 0, 1, ...
> 
>   Format: int or empty
>   Example: 1
> 
>   When read, list the device ID of the memory tier that the node belongs
>   to.  Its value is empty for a CPU-only NUMA node.
> 
>   When written, the kernel moves the node into the specified memory
>   tier if the move is allowed.  The tier assignment of all other nodes
>   are not affected.
> 
>   Initially, we can make this interface read-only.
> 
> 
> Kernel Representation
> =====================
> 
> * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> 
> * #define MAX_MEMORY_TIERS  3
> 
>   Support 3 memory tiers for now.  This can be a kconfig option.
> 
> * #define MEMORY_DEFAULT_TIER_DEVICE 1
> 
>   The default tier device that a memory node is assigned to.
> 
> * struct memtier_dev {
>       nodemask_t nodelist;
>       int rank;
>       int tier;
>   } memtier_devices[MAX_MEMORY_TIERS]
> 
>   Store memory tiers by device IDs.
> 
> * struct memtier_dev *memory_tier(int tier)
> 
>   Returns the memtier device for a given memory tier.
> 
> * int node_tier_dev_map[MAX_NUMNODES]
> 
>   Map a node to its tier device ID..
> 
>   For each CPU-only node c, node_tier_dev_map[c] = -1.
> 
> 
> Memory Tier Initialization
> ==========================
> 
> By default, all memory nodes are assigned to the default tier
> (MEMORY_DEFAULT_TIER_DEVICE).  The default tier device has a rank value
> in the middle of the possible rank value range (e.g. 127 if the range
> is [0..255]).
> 
> A device driver can move up or down its memory nodes from the default
> tier.  For example, PMEM can move down its memory nodes below the
> default tier, whereas GPU can move up its memory nodes above the
> default tier.
> 
> The kernel initialization code makes the decision on which exact tier
> a memory node should be assigned to based on the requests from the
> device drivers as well as the memory device hardware information
> provided by the firmware.
> 
> 
> Memory Tier Reassignment
> ========================
> 
> After a memory node is hot-removed, it can be hot-added back to a
> different memory tier.  This is useful for supporting dynamically
> provisioned CXL.mem NUMA nodes, which may connect to different
> memory devices across hot-plug events.  Such tier changes should
> be compatible with tier-based memory accounting.
> 
> The userspace may also reassign an existing online memory node to a
> different tier.  However, this should only be allowed when no pages
> are allocated from the memory node or when there are no non-root
> memory cgroups (e.g. during the system boot).  This restriction is
> important for keeping memory tier hierarchy stable enough for
> tier-based memory cgroup accounting.

One way to do this is to hot-remove all memory of a node, change its
memtier, and then hot-add its memory.

Best Regards,
Huang, Ying

> /sys/devices/system/memtier/memtier1/nodelist:0
> /sys/devices/system/memtier/memtier2/nodelist:2
> /sys/devices/system/memtier/memtier3/nodelist:3
> 
> $ grep '' /sys/devices/system/node/node*/memtier
> /sys/devices/system/node/node0/memtier:1
> /sys/devices/system/node/node1/memtier:0
> /sys/devices/system/node/node2/memtier:2
> /sys/devices/system/node/node3/memtier:3
> 
> Demotion fallback order:
> node 0: 2, 3
> node 1: 0, 3, 2
> node 2: empty
> node 3: 2
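One consistent reading of the fallback orders listed in these examples is: a node may demote to any node whose memory tier has a larger rank value, with the candidates ordered by NUMA distance. The following userspace C sketch is illustrative only (not kernel code); the array contents are taken from Example 5 above:

```c
#define NR_NODES 4

/*
 * Tier rank for each node, from Example 5:
 *   node 0 -> memtier1 (rank 128), node 1 -> memtier0 (rank 64),
 *   node 2 -> memtier2 (rank 192), node 3 -> memtier3 (rank 160).
 */
static const int node_rank[NR_NODES] = { 128, 64, 192, 160 };

/* node distance matrix from Example 5 */
static const int distance[NR_NODES][NR_NODES] = {
	{  10, 100,  30,  40 },
	{ 100,  10, 120, 110 },
	{  30, 120,  10,  80 },
	{  40, 110,  80,  10 },
};

/*
 * Fill order[] with the demotion fallback nodes for @node: every node
 * whose tier has a larger rank value (i.e. sits in a lower tier),
 * sorted by increasing NUMA distance.  Returns the number of targets.
 */
static int demotion_order(int node, int order[NR_NODES])
{
	int i, j, t, n = 0;

	for (i = 0; i < NR_NODES; i++)
		if (node_rank[i] > node_rank[node])
			order[n++] = i;

	/* insertion sort by distance from @node */
	for (i = 1; i < n; i++) {
		t = order[i];
		for (j = i - 1;
		     j >= 0 && distance[node][order[j]] > distance[node][t];
		     j--)
			order[j + 1] = order[j];
		order[j + 1] = t;
	}
	return n;
}
```

Under this reading, node 0 gets {2, 3}, node 1 gets {0, 3, 2}, node 2 gets nothing, and node 3 gets {2}, matching the fallback orders listed for Example 5 (and also Example 4).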



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion
  2022-05-26 21:22 RFC: Memory Tiering Kernel Interfaces (v3) Wei Xu
  2022-05-27  2:58 ` Ying Huang
@ 2022-05-27 12:25 ` Aneesh Kumar K.V
  2022-05-27 12:25   ` [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
                     ` (6 more replies)
  2022-05-27 13:40 ` RFC: Memory Tiering Kernel Interfaces (v3) Aneesh Kumar K V
  2 siblings, 7 replies; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K.V

The current kernel has the basic memory tiering support: Inactive
pages on a higher tier NUMA node can be migrated (demoted) to a lower
tier NUMA node to make room for new allocations on the higher tier
NUMA node.  Frequently accessed pages on a lower tier NUMA node can be
migrated (promoted) to a higher tier NUMA node to improve the
performance.

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created during
the kernel initialization and updated when a NUMA node is hot-added or
hot-removed.  The current implementation puts all nodes with CPU into
the top tier, and builds the tier hierarchy tier-by-tier by establishing
the per-node demotion targets based on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM (e.g. GPU memory) devices, these
  memory-only HBM NUMA nodes should be in the top tier, and DRAM nodes
  with CPUs are better placed in the next lower tier.

* Also because the current tier hierarchy always puts CPU nodes
  into the top tier, when a CPU is hot-added (or hot-removed) and
  triggers a memory node from CPU-less into a CPU node (or vice
  versa), the memory tier hierarchy gets changed, even though no
  memory node is added or removed.  This can make the tier
  hierarchy unstable and make it difficult to support tier-based
  memory accounting.

* A higher tier node can only be demoted to selected nodes on the
  next lower tier as defined by the demotion path, not any other
  node from any lower tier.  This strict, hard-coded demotion order
  does not work in all use cases (e.g. some use cases may want to
  allow cross-socket demotion to another node in the same demotion
  tier as a fallback when the preferred demotion node is out of
  space), and has resulted in the feature request for an interface to
  override the system-wide, per-node demotion order from the
  userspace.  This demotion order is also inconsistent with the page
  allocation fallback order when all the nodes in a higher tier are
  out of space: The page allocation can fall back to any node from
  any lower tier, whereas the demotion order doesn't allow that.

* There are no interfaces for the userspace to learn about the memory
  tier hierarchy in order to optimize its memory allocations.

This patch series makes the creation of memory tiers explicit and
puts it under the control of userspace or device drivers.

Memory Tier Initialization
==========================

By default, all memory nodes are assigned to the default tier (1).
The default tier device has a rank value of 200.

A device driver can move its memory nodes up or down from the
default tier.  For example, the PMEM driver can move its memory
nodes below the default tier, whereas a GPU driver can move its
memory nodes above the default tier.

The kernel initialization code makes the decision on which exact tier
a memory node should be assigned to based on the requests from the
device drivers as well as the memory device hardware information
provided by the firmware.

Hot-adding/removing CPUs doesn't affect memory tier hierarchy.

Memory Allocation for Demotion
==============================

This patch series keeps the demotion target page allocation logic
the same: the demotion page allocation picks the NUMA node in the
next lower tier that is closest to the node allocating pages.

This will later be improved to use the same fallback-list based
page allocation strategy.
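As a rough sketch of the selection rule described above (illustrative userspace code, not the kernel implementation; the topology is Example 1 from the node_demotion[] comment later in this series, with nodes 0-1 in tier 1 and nodes 2-3 in tier 2):

```c
#define NR_NODES     4
#define NR_TIERS     3
#define NUMA_NO_NODE (-1)

/* Example 1 topology: nodes 0-1 are CPU + DRAM (tier 1),
 * nodes 2-3 are PMEM (tier 2); tier 0 is empty. */
static const int node_tier[NR_NODES] = { 1, 1, 2, 2 };

static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 40, 30 },
	{ 30, 40, 10, 40 },
	{ 40, 30, 40, 10 },
};

/*
 * Pick the demotion target for @node: the closest node in the next
 * populated lower tier, or NUMA_NO_NODE if no lower tier has memory.
 */
static int pick_demotion_node(int node)
{
	int target = NUMA_NO_NODE;
	int tier, i;

	for (tier = node_tier[node] + 1; tier < NR_TIERS; tier++) {
		for (i = 0; i < NR_NODES; i++) {
			if (node_tier[i] != tier)
				continue;
			if (target == NUMA_NO_NODE ||
			    distance[node][i] < distance[node][target])
				target = i;
		}
		/* stop at the first populated lower tier */
		if (target != NUMA_NO_NODE)
			break;
	}
	return target;
}
```

With this topology, node 0 demotes to node 2, node 1 to node 3, and the PMEM nodes have no demotion target, matching the preferred targets shown in the patch's node_demotion[] examples.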

Sysfs Interface:
----------------
Listing current list of memory tiers and rank details:

:/sys/devices/system/memtier$ ls
default_rank  max_tier  memtier1  power  uevent
:/sys/devices/system/memtier$ cat default_rank 
200
:/sys/devices/system/memtier$ cat max_tier 
3
:/sys/devices/system/memtier$ 

Per node memory tier details:

For a CPU-only NUMA node:

:/sys/devices/system/node# cat node0/memtier 
:/sys/devices/system/node# echo 1 > node0/memtier 
:/sys/devices/system/node# cat node0/memtier 
:/sys/devices/system/node# 

For a NUMA node with memory:
:/sys/devices/system/node# cat node1/memtier 
1
:/sys/devices/system/node# ls ../memtier/
default_rank  max_tier  memtier1  power  uevent
:/sys/devices/system/node# echo 2 > node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# ls ../memtier/
default_rank  max_tier  memtier1  memtier2  power  uevent
:/sys/devices/system/node# cat node1/memtier 
2
:/sys/devices/system/node# 
:/sys/devices/system/node# cat ../memtier/memtier2/rank 
300
:/sys/devices/system/node# 
:/sys/devices/system/node# cat ../memtier/memtier1/rank 
200
:/sys/devices/system/node#

Removing a NUMA node from demotion:
:/sys/devices/system/node# cat node1/memtier 
2
:/sys/devices/system/node# echo none > node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# cat node1/memtier 
:/sys/devices/system/node# 
:/sys/devices/system/node# ls ../memtier/
default_rank  max_tier  memtier1  power  uevent
:/sys/devices/system/node# 

The above also results in the removal of memtier2, which was created in the earlier step.


Changelog
----------

v4:
Add support for explicit memory tiers and ranks.

v3:
- Modify patch 1 subject to make it more specific
- Remove /sys/kernel/mm/numa/demotion_targets interface, use
  /sys/devices/system/node/demotion_targets instead and make
  it writable to override node_states[N_DEMOTION_TARGETS].
- Add support to view per node demotion targets via sysfs

v2:
In v1, only the 1st patch of this series was sent; it was
implemented to avoid some of the limitations on demotion
target sharing. However, for certain NUMA topologies, the
demotion targets found by that patch were not optimal, so the
1st patch in this series has been modified according to
suggestions from Huang and Baolin. Examples comparing the
demotion lists of the existing and changed implementations can
be found in the commit message of the 1st patch.


Aneesh Kumar K.V (2):
  mm/demotion: Add support to associate rank with memory tier
  mm/demotion: Add support for removing node from demotion memory tiers

Jagdish Gediya (5):
  mm/demotion: Add support for explicit memory tiers
  mm/demotion: Expose per node memory tier to sysfs
  mm/demotion: Build demotion targets based on explicit memory tiers
  mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  mm/demotion: Demote pages according to allocation fallback order

 drivers/base/node.c     |  43 +++
 drivers/dax/kmem.c      |   4 +
 include/linux/migrate.h |  39 ++-
 mm/Kconfig              |  11 +
 mm/migrate.c            | 756 ++++++++++++++++++++++++++--------------
 mm/vmscan.c             |  38 +-
 mm/vmstat.c             |   5 -
 7 files changed, 590 insertions(+), 306 deletions(-)

-- 
2.36.1



* [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
@ 2022-05-27 12:25   ` Aneesh Kumar K.V
  2022-06-02  6:07     ` Ying Huang
  2022-06-08  7:16     ` Ying Huang
  2022-05-27 12:25   ` [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
                     ` (5 subsequent siblings)
  6 siblings, 2 replies; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya@linux.ibm.com>

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better placed in the
next lower tier.

With the current kernel, a higher tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion
path, not to any other node from any lower tier.  This strict,
hard-coded demotion order does not work in all use cases (e.g. some
use cases may want to allow cross-socket demotion to another node in
the same demotion tier as a fallback when the preferred demotion
node is out of space).  This demotion order is also inconsistent
with the page allocation fallback order when all the nodes in a
higher tier are out of space: the page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't
allow that.

The current kernel also doesn't provide any interfaces for the
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch adds the below sysfs interface, which is read-only and
can be used to read the nodes available in a specific tier:

/sys/devices/system/memtier/memtierN/nodelist

Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
lowest tier.  The absolute value of a tier id number has no specific
meaning; what matters is the relative order of the tier id numbers.

All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
The default number of memory tiers is MAX_MEMORY_TIERS (3).  All
nodes are by default assigned to DEFAULT_MEMORY_TIER (1).

The default memory tier can be read from:
/sys/devices/system/memtier/default_tier

The max memory tier can be read from:
/sys/devices/system/memtier/max_tiers

This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/migrate.h |  38 ++++++++----
 mm/Kconfig              |  11 ++++
 mm/migrate.c            | 134 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 170 insertions(+), 13 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 90e75d5a54d6..0ec653623565 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -47,17 +47,8 @@ void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
 int folio_migrate_mapping(struct address_space *mapping,
 		struct folio *newfolio, struct folio *folio, int extra_count);
 
-extern bool numa_demotion_enabled;
-extern void migrate_on_reclaim_init(void);
-#ifdef CONFIG_HOTPLUG_CPU
-extern void set_migration_target_nodes(void);
-#else
-static inline void set_migration_target_nodes(void) {}
-#endif
 #else
 
-static inline void set_migration_target_nodes(void) {}
-
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
@@ -82,7 +73,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return -ENOSYS;
 }
 
-#define numa_demotion_enabled	false
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
@@ -172,15 +162,37 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
-int next_demotion_node(int node);
+#endif /* CONFIG_MIGRATION */
+
+#ifdef CONFIG_TIERED_MEMORY
+
+extern bool numa_demotion_enabled;
+#define DEFAULT_MEMORY_TIER	1
+
+enum memory_tier_type {
+	MEMORY_TIER_HBM_GPU,
+	MEMORY_TIER_DRAM,
+	MEMORY_TIER_PMEM,
+	MAX_MEMORY_TIERS
+};
 
-#else /* CONFIG_MIGRATION disabled: */
+int next_demotion_node(int node);
 
+extern void migrate_on_reclaim_init(void);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void set_migration_target_nodes(void);
+#else
+static inline void set_migration_target_nodes(void) {}
+#endif
+#else
+#define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
 
-#endif /* CONFIG_MIGRATION */
+static inline void set_migration_target_nodes(void) {}
+static inline void migrate_on_reclaim_init(void) {}
+#endif	/* CONFIG_TIERED_MEMORY */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..7bfbddef46ed 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -258,6 +258,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
 	bool
 
+config TIERED_MEMORY
+	bool "Support for explicit memory tiers"
+	def_bool y
+	depends on MIGRATION && NUMA
+	help
+	  Support to split nodes into memory tiers explicitly and
+	  to demote pages on reclaim to lower tiers. This option
+	  also exposes sysfs interface to read nodes available in
+	  specific tier and to move specific node among different
+	  possible tiers.
+
 config HUGETLB_PAGE_SIZE_VARIABLE
 	def_bool n
 	help
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c31ee1e1c9b..f28ee93fb017 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2118,6 +2118,113 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_TIERED_MEMORY
+
+struct memory_tier {
+	struct device dev;
+	nodemask_t nodelist;
+};
+
+#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
+
+static struct bus_type memory_tier_subsys = {
+	.name = "memtier",
+	.dev_name = "memtier",
+};
+
+static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
+
+static ssize_t nodelist_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	int tier = dev->id;
+
+	return sysfs_emit(buf, "%*pbl\n",
+			  nodemask_pr_args(&memory_tiers[tier]->nodelist));
+
+}
+static DEVICE_ATTR_RO(nodelist);
+
+static struct attribute *memory_tier_dev_attrs[] = {
+	&dev_attr_nodelist.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_dev_group = {
+	.attrs = memory_tier_dev_attrs,
+};
+
+static const struct attribute_group *memory_tier_dev_groups[] = {
+	&memory_tier_dev_group,
+	NULL
+};
+
+static void memory_tier_device_release(struct device *dev)
+{
+	struct memory_tier *tier = to_memory_tier(dev);
+
+	kfree(tier);
+}
+
+static int register_memory_tier(int tier)
+{
+	int error;
+
+	memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memory_tiers[tier])
+		return -ENOMEM;
+
+	memory_tiers[tier]->dev.id = tier;
+	memory_tiers[tier]->dev.bus = &memory_tier_subsys;
+	memory_tiers[tier]->dev.release = memory_tier_device_release;
+	memory_tiers[tier]->dev.groups = memory_tier_dev_groups;
+	error = device_register(&memory_tiers[tier]->dev);
+
+	if (error) {
+		put_device(&memory_tiers[tier]->dev);
+		memory_tiers[tier] = NULL;
+	}
+
+	return error;
+}
+
+static void unregister_memory_tier(int tier)
+{
+	device_unregister(&memory_tiers[tier]->dev);
+	memory_tiers[tier] = NULL;
+}
+
+static ssize_t
+max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
+}
+
+static DEVICE_ATTR_RO(max_tiers);
+
+static ssize_t
+default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER);
+}
+
+static DEVICE_ATTR_RO(default_tier);
+
+static struct attribute *memory_tier_attrs[] = {
+	&dev_attr_max_tiers.attr,
+	&dev_attr_default_tier.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+	.attrs = memory_tier_attrs,
+};
+
+static const struct attribute_group *memory_tier_attr_groups[] = {
+	&memory_tier_attr_group,
+	NULL,
+};
+
 /*
  * node_demotion[] example:
  *
@@ -2569,3 +2676,30 @@ static int __init numa_init_sysfs(void)
 }
 subsys_initcall(numa_init_sysfs);
 #endif
+
+static int __init memory_tier_init(void)
+{
+	int ret;
+
+	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
+	if (ret)
+		panic("%s() failed to register subsystem: %d\n", __func__, ret);
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs.
+	 */
+	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
+	if (ret)
+		panic("%s() failed to register memory tier: %d\n", __func__, ret);
+
+	/*
+	 * CPU-only nodes are not part of memory tiers.
+	 */
+	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
+
+#endif	/* CONFIG_TIERED_MEMORY */
-- 
2.36.1



* [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-05-27 12:25   ` [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-05-27 12:25   ` Aneesh Kumar K.V
       [not found]     ` <20220527151531.00002a0c@Huawei.com>
  2022-06-08  7:18     ` Ying Huang
  2022-05-27 12:25   ` [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
                     ` (4 subsequent siblings)
  6 siblings, 2 replies; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya@linux.ibm.com>

Add support to read/write the memory tier index for a NUMA node.

/sys/devices/system/node/nodeN/memtier

where N = node id

When read, it lists the memory tier that the node belongs to.

When written, the kernel moves the node into the specified
memory tier; the tier assignments of all other nodes are not
affected.

If the memory tier does not exist, writing to the above file
creates the tier and assigns the NUMA node to that tier.

A mutex, memory_tier_lock, is introduced to protect memory tier
related changes, as they can happen from sysfs as well as on
hotplug events.

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/base/node.c     |  35 ++++++++++++++
 include/linux/migrate.h |   4 +-
 mm/migrate.c            | 103 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 141 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ec8bb24a5a22..cf4a58446d8c 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
+#include <linux/migrate.h>
 
 static struct bus_type node_subsys = {
 	.name = "node",
@@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
 }
 static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
 
+#ifdef CONFIG_TIERED_MEMORY
+static ssize_t memtier_show(struct device *dev,
+			    struct device_attribute *attr,
+			    char *buf)
+{
+	int node = dev->id;
+
+	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
+}
+
+static ssize_t memtier_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	unsigned long tier;
+	int node = dev->id;
+
+	int ret = kstrtoul(buf, 10, &tier);
+	if (ret)
+		return ret;
+
+	ret = node_reset_memory_tier(node, tier);
+	if (ret)
+		return ret;
+
+	return count;
+}
+
+static DEVICE_ATTR_RW(memtier);
+#endif
+
 static struct attribute *node_dev_attrs[] = {
 	&dev_attr_meminfo.attr,
 	&dev_attr_numastat.attr,
 	&dev_attr_distance.attr,
 	&dev_attr_vmstat.attr,
+#ifdef CONFIG_TIERED_MEMORY
+	&dev_attr_memtier.attr,
+#endif
 	NULL
 };
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0ec653623565..d37d1d5dee82 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -177,13 +177,15 @@ enum memory_tier_type {
 };
 
 int next_demotion_node(int node);
-
 extern void migrate_on_reclaim_init(void);
 #ifdef CONFIG_HOTPLUG_CPU
 extern void set_migration_target_nodes(void);
 #else
 static inline void set_migration_target_nodes(void) {}
 #endif
+int node_get_memory_tier(int node);
+int node_set_memory_tier(int node, int tier);
+int node_reset_memory_tier(int node, int tier);
 #else
 #define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
diff --git a/mm/migrate.c b/mm/migrate.c
index f28ee93fb017..304559ba3372 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
 	.dev_name = "memtier",
 };
 
+DEFINE_MUTEX(memory_tier_lock);
 static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
 
 static ssize_t nodelist_show(struct device *dev,
@@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 	NULL,
 };
 
+static int __node_get_memory_tier(int node)
+{
+	int tier;
+
+	for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
+		if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
+			return tier;
+	}
+
+	return -1;
+}
+
+int node_get_memory_tier(int node)
+{
+	int tier;
+
+	/*
+	 * Make sure memory tier is not unregistered
+	 * while it is being read.
+	 */
+	mutex_lock(&memory_tier_lock);
+
+	tier = __node_get_memory_tier(node);
+
+	mutex_unlock(&memory_tier_lock);
+
+	return tier;
+}
+
+int __node_set_memory_tier(int node, int tier)
+{
+	int ret = 0;
+	/*
+	 * As register_memory_tier() for new tier can fail,
+	 * try it before modifying existing tier. register
+	 * tier makes tier visible in sysfs.
+	 */
+	if (!memory_tiers[tier]) {
+		ret = register_memory_tier(tier);
+		if (ret) {
+			goto out;
+		}
+	}
+
+	node_set(node, memory_tiers[tier]->nodelist);
+
+out:
+	return ret;
+}
+
+int node_reset_memory_tier(int node, int tier)
+{
+	int current_tier, ret = 0;
+
+	mutex_lock(&memory_tier_lock);
+
+	current_tier = __node_get_memory_tier(node);
+	if (current_tier == tier)
+		goto out;
+
+	if (current_tier != -1)
+		node_clear(node, memory_tiers[current_tier]->nodelist);
+
+	ret = __node_set_memory_tier(node, tier);
+
+	if (!ret) {
+		if (nodes_empty(memory_tiers[current_tier]->nodelist))
+			unregister_memory_tier(current_tier);
+	} else {
+		/* reset it back to older tier */
+		ret = __node_set_memory_tier(node, current_tier);
+	}
+out:
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+
+int node_set_memory_tier(int node, int tier)
+{
+	int current_tier, ret = 0;
+
+	if (tier >= MAX_MEMORY_TIERS)
+		return -EINVAL;
+
+	mutex_lock(&memory_tier_lock);
+	current_tier = __node_get_memory_tier(node);
+	/*
+	 * If the node is already part of a tier, proceed with the
+	 * current tier value, because we might want to establish
+	 * new migration paths now. The node might be added to a tier
+	 * before it was made part of N_MEMORY, hence establish_migration_targets
+	 * will have skipped this node.
+	 */
+	if (current_tier != -1)
+		tier = current_tier;
+	ret = __node_set_memory_tier(node, tier);
+	mutex_unlock(&memory_tier_lock);
+
+	return ret;
+}
+
 /*
  * node_demotion[] example:
  *
-- 
2.36.1



* [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
  2022-05-27 12:25   ` [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
  2022-05-27 12:25   ` [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
@ 2022-05-27 12:25   ` Aneesh Kumar K.V
  2022-05-30  3:35     ` [mm/demotion] 8ebccd60c2: BUG:sleeping_function_called_from_invalid_context_at_mm/compaction.c kernel test robot
  2022-05-27 12:25   ` [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
                     ` (3 subsequent siblings)
  6 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya@linux.ibm.com>

This patch switches the demotion target building logic to use memory
tiers instead of NUMA distance. All N_MEMORY NUMA nodes will be placed
in the default tier 1 and additional memory tiers will be added by
drivers like dax kmem.

This patch builds the demotion target for a NUMA node by looking at all
memory tiers below the tier to which the NUMA node belongs. The closest node
in the immediately following memory tier is used as a demotion target.

Since we are now only building demotion targets for N_MEMORY NUMA
nodes, the CPU hotplug calls are removed in this patch.

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/migrate.h |   8 -
 mm/migrate.c            | 460 +++++++++++++++-------------------------
 mm/vmstat.c             |   5 -
 3 files changed, 172 insertions(+), 301 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index d37d1d5dee82..cbef71a499c1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -177,12 +177,6 @@ enum memory_tier_type {
 };
 
 int next_demotion_node(int node);
-extern void migrate_on_reclaim_init(void);
-#ifdef CONFIG_HOTPLUG_CPU
-extern void set_migration_target_nodes(void);
-#else
-static inline void set_migration_target_nodes(void) {}
-#endif
 int node_get_memory_tier(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
@@ -193,8 +187,6 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 
-static inline void set_migration_target_nodes(void) {}
-static inline void migrate_on_reclaim_init(void) {}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 304559ba3372..d819a64db5b1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2125,6 +2125,10 @@ struct memory_tier {
 	nodemask_t nodelist;
 };
 
+struct demotion_nodes {
+	nodemask_t preferred;
+};
+
 #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
 
 static struct bus_type memory_tier_subsys = {
@@ -2132,9 +2136,73 @@ static struct bus_type memory_tier_subsys = {
 	.dev_name = "memtier",
 };
 
+static void establish_migration_targets(void);
+
 DEFINE_MUTEX(memory_tier_lock);
 static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
 
+/*
+ * node_demotion[] examples:
+ *
+ * Example 1:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 & 3 are PMEM nodes.
+ *
+ * node distances:
+ * node   0    1    2    3
+ *    0  10   20   30   40
+ *    1  20   10   40   30
+ *    2  30   40   10   40
+ *    3  40   30   40   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-1
+ * memory_tiers[2] = 2-3
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 3
+ * node_demotion[2].preferred = <empty>
+ * node_demotion[3].preferred = <empty>
+ *
+ * Example 2:
+ *
+ * Node 0 & 1 are CPU + DRAM nodes, node 2 is memory-only DRAM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   30
+ *    2  30   30   10
+ *
+ * memory_tiers[0] = <empty>
+ * memory_tiers[1] = 0-2
+ * memory_tiers[2] = <empty>
+ *
+ * node_demotion[0].preferred = <empty>
+ * node_demotion[1].preferred = <empty>
+ * node_demotion[2].preferred = <empty>
+ *
+ * Example 3:
+ *
+ * Node 0 is a CPU + DRAM node, node 1 is an HBM node, node 2 is a PMEM node.
+ *
+ * node distances:
+ * node   0    1    2
+ *    0  10   20   30
+ *    1  20   10   40
+ *    2  30   40   10
+ *
+ * memory_tiers[0] = 1
+ * memory_tiers[1] = 0
+ * memory_tiers[2] = 2
+ *
+ * node_demotion[0].preferred = 2
+ * node_demotion[1].preferred = 0
+ * node_demotion[2].preferred = <empty>
+ *
+ */
+static struct demotion_nodes *node_demotion __read_mostly;
+
 static ssize_t nodelist_show(struct device *dev,
 			     struct device_attribute *attr, char *buf)
 {
@@ -2238,6 +2306,28 @@ static int __node_get_memory_tier(int node)
 	return -1;
 }
 
+static void node_remove_from_memory_tier(int node)
+{
+	int tier;
+
+	mutex_lock(&memory_tier_lock);
+
+	tier = __node_get_memory_tier(node);
+
+	/*
+	 * Remove node from tier, if tier becomes
+	 * empty then unregister it to make it invisible
+	 * in sysfs.
+	 */
+	node_clear(node, memory_tiers[tier]->nodelist);
+	if (nodes_empty(memory_tiers[tier]->nodelist))
+		unregister_memory_tier(tier);
+
+	establish_migration_targets();
+
+	mutex_unlock(&memory_tier_lock);
+}
+
 int node_get_memory_tier(int node)
 {
 	int tier;
@@ -2271,6 +2361,7 @@ int __node_set_memory_tier(int node, int tier)
 	}
 
 	node_set(node, memory_tiers[tier]->nodelist);
+	establish_migration_targets();
 
 out:
 	return ret;
@@ -2328,75 +2419,6 @@ int node_set_memory_tier(int node, int tier)
 	return ret;
 }
 
-/*
- * node_demotion[] example:
- *
- * Consider a system with two sockets.  Each socket has
- * three classes of memory attached: fast, medium and slow.
- * Each memory class is placed in its own NUMA node.  The
- * CPUs are placed in the node with the "fast" memory.  The
- * 6 NUMA nodes (0-5) might be split among the sockets like
- * this:
- *
- *	Socket A: 0, 1, 2
- *	Socket B: 3, 4, 5
- *
- * When Node 0 fills up, its memory should be migrated to
- * Node 1.  When Node 1 fills up, it should be migrated to
- * Node 2.  The migration path start on the nodes with the
- * processors (since allocations default to this node) and
- * fast memory, progress through medium and end with the
- * slow memory:
- *
- *	0 -> 1 -> 2 -> stop
- *	3 -> 4 -> 5 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *
- *	{  nr=1, nodes[0]=1 }, // Node 0 migrates to 1
- *	{  nr=1, nodes[0]=2 }, // Node 1 migrates to 2
- *	{  nr=0, nodes[0]=-1 }, // Node 2 does not migrate
- *	{  nr=1, nodes[0]=4 }, // Node 3 migrates to 4
- *	{  nr=1, nodes[0]=5 }, // Node 4 migrates to 5
- *	{  nr=0, nodes[0]=-1 }, // Node 5 does not migrate
- *
- * Moreover some systems may have multiple slow memory nodes.
- * Suppose a system has one socket with 3 memory nodes, node 0
- * is fast memory type, and node 1/2 both are slow memory
- * type, and the distance between fast memory node and slow
- * memory node is same. So the migration path should be:
- *
- *	0 -> 1/2 -> stop
- *
- * This is represented in the node_demotion[] like this:
- *	{ nr=2, {nodes[0]=1, nodes[1]=2} }, // Node 0 migrates to node 1 and node 2
- *	{ nr=0, nodes[0]=-1, }, // Node 1 dose not migrate
- *	{ nr=0, nodes[0]=-1, }, // Node 2 does not migrate
- */
-
-/*
- * Writes to this array occur without locking.  Cycles are
- * not allowed: Node X demotes to Y which demotes to X...
- *
- * If multiple reads are performed, a single rcu_read_lock()
- * must be held over all reads to ensure that no cycles are
- * observed.
- */
-#define DEFAULT_DEMOTION_TARGET_NODES 15
-
-#if MAX_NUMNODES < DEFAULT_DEMOTION_TARGET_NODES
-#define DEMOTION_TARGET_NODES	(MAX_NUMNODES - 1)
-#else
-#define DEMOTION_TARGET_NODES	DEFAULT_DEMOTION_TARGET_NODES
-#endif
-
-struct demotion_nodes {
-	unsigned short nr;
-	short nodes[DEMOTION_TARGET_NODES];
-};
-
-static struct demotion_nodes *node_demotion __read_mostly;
-
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -2409,8 +2431,7 @@ static struct demotion_nodes *node_demotion __read_mostly;
 int next_demotion_node(int node)
 {
 	struct demotion_nodes *nd;
-	unsigned short target_nr, index;
-	int target;
+	int target, nnodes, i;
 
 	if (!node_demotion)
 		return NUMA_NO_NODE;
@@ -2419,61 +2440,46 @@ int next_demotion_node(int node)
 
 	/*
 	 * node_demotion[] is updated without excluding this
-	 * function from running.  RCU doesn't provide any
-	 * compiler barriers, so the READ_ONCE() is required
-	 * to avoid compiler reordering or read merging.
+	 * function from running.
 	 *
 	 * Make sure to use RCU over entire code blocks if
 	 * node_demotion[] reads need to be consistent.
 	 */
 	rcu_read_lock();
-	target_nr = READ_ONCE(nd->nr);
 
-	switch (target_nr) {
-	case 0:
-		target = NUMA_NO_NODE;
-		goto out;
-	case 1:
-		index = 0;
-		break;
-	default:
-		/*
-		 * If there are multiple target nodes, just select one
-		 * target node randomly.
-		 *
-		 * In addition, we can also use round-robin to select
-		 * target node, but we should introduce another variable
-		 * for node_demotion[] to record last selected target node,
-		 * that may cause cache ping-pong due to the changing of
-		 * last target node. Or introducing per-cpu data to avoid
-		 * caching issue, which seems more complicated. So selecting
-		 * target node randomly seems better until now.
-		 */
-		index = get_random_int() % target_nr;
-		break;
-	}
+	nnodes = nodes_weight(nd->preferred);
+	if (!nnodes) {
+		rcu_read_unlock();
+		return NUMA_NO_NODE;
+	}
 
-	target = READ_ONCE(nd->nodes[index]);
+	/*
+	 * If there are multiple target nodes, just select one
+	 * target node randomly.
+	 *
+	 * We could also use round-robin to select the target node,
+	 * but that would require another variable in node_demotion[]
+	 * to record the last selected target node, which may cause
+	 * cache ping-pong as it changes. Per-cpu data could avoid
+	 * that caching issue but seems more complicated. So selecting
+	 * the target node randomly seems better for now.
+	 */
+	nnodes = get_random_int() % nnodes;
+	target = first_node(nd->preferred);
+	for (i = 0; i < nnodes; i++)
+		target = next_node(target, nd->preferred);
 
-out:
 	rcu_read_unlock();
+
 	return target;
 }
 
-#if defined(CONFIG_HOTPLUG_CPU)
 /* Disable reclaim-based migration. */
 static void __disable_all_migrate_targets(void)
 {
-	int node, i;
+	int node;
 
-	if (!node_demotion)
-		return;
-
-	for_each_online_node(node) {
-		node_demotion[node].nr = 0;
-		for (i = 0; i < DEMOTION_TARGET_NODES; i++)
-			node_demotion[node].nodes[i] = NUMA_NO_NODE;
-	}
+	for_each_node_mask(node, node_states[N_MEMORY])
+		node_demotion[node].preferred = NODE_MASK_NONE;
 }
 
 static void disable_all_migrate_targets(void)
@@ -2485,173 +2491,70 @@ static void disable_all_migrate_targets(void)
 	 * Readers will see either a combination of before+disable
 	 * state or disable+after.  They will never see before and
 	 * after state together.
-	 *
-	 * The before+after state together might have cycles and
-	 * could cause readers to do things like loop until this
-	 * function finishes.  This ensures they can only see a
-	 * single "bad" read and would, for instance, only loop
-	 * once.
 	 */
 	synchronize_rcu();
 }
 
 /*
- * Find an automatic demotion target for 'node'.
- * Failing here is OK.  It might just indicate
- * being at the end of a chain.
- */
-static int establish_migrate_target(int node, nodemask_t *used,
-				    int best_distance)
+ * Find an automatic demotion target for all memory
+ * nodes. Failing here is OK.  It might just indicate
+ * being at the end of a chain.
+ */
+static void establish_migration_targets(void)
 {
-	int migration_target, index, val;
 	struct demotion_nodes *nd;
+	int tier, target = NUMA_NO_NODE, node;
+	int distance, best_distance;
+	nodemask_t used;
 
 	if (!node_demotion)
-		return NUMA_NO_NODE;
-
-	nd = &node_demotion[node];
-
-	migration_target = find_next_best_node(node, used);
-	if (migration_target == NUMA_NO_NODE)
-		return NUMA_NO_NODE;
-
-	/*
-	 * If the node has been set a migration target node before,
-	 * which means it's the best distance between them. Still
-	 * check if this node can be demoted to other target nodes
-	 * if they have a same best distance.
-	 */
-	if (best_distance != -1) {
-		val = node_distance(node, migration_target);
-		if (val > best_distance)
-			goto out_clear;
-	}
-
-	index = nd->nr;
-	if (WARN_ONCE(index >= DEMOTION_TARGET_NODES,
-		      "Exceeds maximum demotion target nodes\n"))
-		goto out_clear;
-
-	nd->nodes[index] = migration_target;
-	nd->nr++;
+		return;
 
-	return migration_target;
-out_clear:
-	node_clear(migration_target, *used);
-	return NUMA_NO_NODE;
-}
+	disable_all_migrate_targets();
 
-/*
- * When memory fills up on a node, memory contents can be
- * automatically migrated to another node instead of
- * discarded at reclaim.
- *
- * Establish a "migration path" which will start at nodes
- * with CPUs and will follow the priorities used to build the
- * page allocator zonelists.
- *
- * The difference here is that cycles must be avoided.  If
- * node0 migrates to node1, then neither node1, nor anything
- * node1 migrates to can migrate to node0. Also one node can
- * be migrated to multiple nodes if the target nodes all have
- * a same best-distance against the source node.
- *
- * This function can run simultaneously with readers of
- * node_demotion[].  However, it can not run simultaneously
- * with itself.  Exclusion is provided by memory hotplug events
- * being single-threaded.
- */
-static void __set_migration_target_nodes(void)
-{
-	nodemask_t next_pass	= NODE_MASK_NONE;
-	nodemask_t this_pass	= NODE_MASK_NONE;
-	nodemask_t used_targets = NODE_MASK_NONE;
-	int node, best_distance;
+	for_each_node_mask(node, node_states[N_MEMORY]) {
+		best_distance = -1;
+		nd = &node_demotion[node];
 
-	/*
-	 * Avoid any oddities like cycles that could occur
-	 * from changes in the topology.  This will leave
-	 * a momentary gap when migration is disabled.
-	 */
-	disable_all_migrate_targets();
+		tier = __node_get_memory_tier(node);
+		/*
+		 * Find the next tier to demote to.
+		 */
+		while (++tier < MAX_MEMORY_TIERS) {
+			if (memory_tiers[tier])
+				break;
+		}
 
-	/*
-	 * Allocations go close to CPUs, first.  Assume that
-	 * the migration path starts at the nodes with CPUs.
-	 */
-	next_pass = node_states[N_CPU];
-again:
-	this_pass = next_pass;
-	next_pass = NODE_MASK_NONE;
-	/*
-	 * To avoid cycles in the migration "graph", ensure
-	 * that migration sources are not future targets by
-	 * setting them in 'used_targets'.  Do this only
-	 * once per pass so that multiple source nodes can
-	 * share a target node.
-	 *
-	 * 'used_targets' will become unavailable in future
-	 * passes.  This limits some opportunities for
-	 * multiple source nodes to share a destination.
-	 */
-	nodes_or(used_targets, used_targets, this_pass);
+		if (tier >= MAX_MEMORY_TIERS)
+			continue;
 
-	for_each_node_mask(node, this_pass) {
-		best_distance = -1;
+		nodes_andnot(used, node_states[N_MEMORY], memory_tiers[tier]->nodelist);
 
 		/*
-		 * Try to set up the migration path for the node, and the target
-		 * migration nodes can be multiple, so doing a loop to find all
-		 * the target nodes if they all have a best node distance.
+		 * Find all the nodes in the next tier's node list that are at
+		 * the same best distance and add them to the preferred mask.
+		 * We randomly select between nodes in the preferred mask when
+		 * allocating pages during demotion.
 		 */
 		do {
-			int target_node =
-				establish_migrate_target(node, &used_targets,
-							 best_distance);
-
-			if (target_node == NUMA_NO_NODE)
+			target = find_next_best_node(node, &used);
+			if (target == NUMA_NO_NODE)
 				break;
 
-			if (best_distance == -1)
-				best_distance = node_distance(node, target_node);
-
-			/*
-			 * Visit targets from this pass in the next pass.
-			 * Eventually, every node will have been part of
-			 * a pass, and will become set in 'used_targets'.
-			 */
-			node_set(target_node, next_pass);
+			distance = node_distance(node, target);
+			if (distance == best_distance || best_distance == -1) {
+				best_distance = distance;
+				node_set(target, nd->preferred);
+			} else {
+				break;
+			}
 		} while (1);
 	}
-	/*
-	 * 'next_pass' contains nodes which became migration
-	 * targets in this pass.  Make additional passes until
-	 * no more migrations targets are available.
-	 */
-	if (!nodes_empty(next_pass))
-		goto again;
 }
 
 /*
- * For callers that do not hold get_online_mems() already.
- */
-void set_migration_target_nodes(void)
-{
-	get_online_mems();
-	__set_migration_target_nodes();
-	put_online_mems();
-}
-
-/*
- * This leaves migrate-on-reclaim transiently disabled between
- * the MEM_GOING_OFFLINE and MEM_OFFLINE events.  This runs
- * whether reclaim-based migration is enabled or not, which
- * ensures that the user can turn reclaim-based migration at
- * any time without needing to recalculate migration targets.
- *
- * These callbacks already hold get_online_mems().  That is why
- * __set_migration_target_nodes() can be used as opposed to
- * set_migration_target_nodes().
+ * This runs whether reclaim-based migration is enabled or not,
+ * which ensures that the user can turn reclaim-based migration
+ * at any time without needing to recalculate migration targets.
  */
 static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 						 unsigned long action, void *_arg)
@@ -2660,64 +2563,44 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 
 	/*
 	 * Only update the node migration order when a node is
-	 * changing status, like online->offline.  This avoids
-	 * the overhead of synchronize_rcu() in most cases.
+	 * changing status, like online->offline.
 	 */
 	if (arg->status_change_nid < 0)
 		return notifier_from_errno(0);
 
 	switch (action) {
-	case MEM_GOING_OFFLINE:
-		/*
-		 * Make sure there are not transient states where
-		 * an offline node is a migration target.  This
-		 * will leave migration disabled until the offline
-		 * completes and the MEM_OFFLINE case below runs.
-		 */
-		disable_all_migrate_targets();
-		break;
 	case MEM_OFFLINE:
-	case MEM_ONLINE:
 		/*
-		 * Recalculate the target nodes once the node
-		 * reaches its final state (online or offline).
+		 * The node may be moving out of N_MEMORY. Keep it in its
+		 * memory tier so that when its memory comes back online,
+		 * it lands in the right memory tier. We still need to
+		 * rebuild the demotion order.
 		 */
-		__set_migration_target_nodes();
+		mutex_lock(&memory_tier_lock);
+		establish_migration_targets();
+		mutex_unlock(&memory_tier_lock);
 		break;
-	case MEM_CANCEL_OFFLINE:
+	case MEM_ONLINE:
 		/*
-		 * MEM_GOING_OFFLINE disabled all the migration
-		 * targets.  Reenable them.
+		 * We ignore the error here; if the node already has a tier
+		 * registered, we will continue to use it for the new memory
+		 * we are adding here.
 		 */
-		__set_migration_target_nodes();
-		break;
-	case MEM_GOING_ONLINE:
-	case MEM_CANCEL_ONLINE:
+		node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
 		break;
 	}
 
 	return notifier_from_errno(0);
 }
 
-void __init migrate_on_reclaim_init(void)
+static void __init migrate_on_reclaim_init(void)
 {
-	node_demotion = kmalloc_array(nr_node_ids,
-				      sizeof(struct demotion_nodes),
-				      GFP_KERNEL);
+	node_demotion = kcalloc(MAX_NUMNODES, sizeof(struct demotion_nodes),
+				GFP_KERNEL);
 	WARN_ON(!node_demotion);
 
 	hotplug_memory_notifier(migrate_on_reclaim_callback, 100);
-	/*
-	 * At this point, all numa nodes with memory/CPus have their state
-	 * properly set, so we can build the demotion order now.
-	 * Let us hold the cpu_hotplug lock just, as we could possibily have
-	 * CPU hotplug events during boot.
-	 */
-	cpus_read_lock();
-	set_migration_target_nodes();
-	cpus_read_unlock();
 }
-#endif /* CONFIG_HOTPLUG_CPU */
 
 bool numa_demotion_enabled = false;
 
@@ -2800,6 +2683,7 @@ static int __init memory_tier_init(void)
 	 * CPU-only nodes are not part of memory tiers.
 	 */
 	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
+	migrate_on_reclaim_init();
 
 	return 0;
 }
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b75b1a64b54c..7815d21345a4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -2053,7 +2053,6 @@ static int vmstat_cpu_online(unsigned int cpu)
 
 	if (!node_state(cpu_to_node(cpu), N_CPU)) {
 		node_set_state(cpu_to_node(cpu), N_CPU);
-		set_migration_target_nodes();
 	}
 
 	return 0;
@@ -2078,7 +2077,6 @@ static int vmstat_cpu_dead(unsigned int cpu)
 		return 0;
 
 	node_clear_state(node, N_CPU);
-	set_migration_target_nodes();
 
 	return 0;
 }
@@ -2111,9 +2109,6 @@ void __init init_mm_internals(void)
 
 	start_shepherd_timer();
 #endif
-#if defined(CONFIG_MIGRATION) && defined(CONFIG_HOTPLUG_CPU)
-	migrate_on_reclaim_init();
-#endif
 #ifdef CONFIG_PROC_FS
 	proc_create_seq("buddyinfo", 0444, NULL, &fragmentation_op);
 	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                     ` (2 preceding siblings ...)
  2022-05-27 12:25   ` [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-05-27 12:25   ` Aneesh Kumar K.V
  2022-06-01  6:29     ` Bharata B Rao
  2022-05-27 12:25   ` [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier Aneesh Kumar K.V
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya@linux.ibm.com>

By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which is
memory tier 1, the tier designated for nodes with DRAM, so it is not
the right tier for dax devices.

Set the dax kmem device node's tier to MEMORY_TIER_PMEM. In the
future, support should be added to distinguish dax devices that should
not be MEMORY_TIER_PMEM and to set the right memory tier for them.

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/dax/kmem.c | 4 ++++
 mm/migrate.c       | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index a37622060fff..991782aa2448 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -11,6 +11,7 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mman.h>
+#include <linux/migrate.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	return 0;
 
 err_request_mem:
diff --git a/mm/migrate.c b/mm/migrate.c
index d819a64db5b1..59d8558dd2ee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2418,6 +2418,8 @@ int node_set_memory_tier(int node, int tier)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(node_set_memory_tier);
+
 
 /**
  * next_demotion_node() - Get the next node in the demotion path
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                     ` (3 preceding siblings ...)
  2022-05-27 12:25   ` [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
@ 2022-05-27 12:25   ` Aneesh Kumar K.V
       [not found]     ` <20220527154557.00002c56@Huawei.com>
  2022-06-02  6:41     ` Ying Huang
  2022-05-27 12:25   ` [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
  2022-05-27 12:25   ` [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
  6 siblings, 2 replies; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K.V

The rank approach allows us to keep memory tier device IDs stable even if there
is a need to change the tier ordering among different memory tiers. e.g. DRAM
nodes with CPUs will always be on memtier1, no matter how many tiers are higher
or lower than these nodes. A new memory tier can be inserted into the tier
hierarchy for a new set of nodes without affecting the node assignment of any
existing memtier, provided that there is enough gap in the rank values for the
new memtier.

The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
Its value relative to other memtiers decides the level of this memtier in the tier
hierarchy.

For now, this patch supports hardcoded rank values: 100, 200, and 300
for memory tiers 0, 1, and 2 respectively.

Below is the sysfs interface to read the rank value of a memory tier:
/sys/devices/system/memtier/memtierN/rank

This interface is read-only for now. Write support can be added when
there is a need for more memory tiers (> 3) with flexible ordering
among them; rank can be used there, since rank (and not the memory
tier device ID) now decides the memory tiering order.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/base/node.c     |   5 +-
 drivers/dax/kmem.c      |   2 +-
 include/linux/migrate.h |  17 ++--
 mm/migrate.c            | 218 ++++++++++++++++++++++++----------------
 4 files changed, 144 insertions(+), 98 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index cf4a58446d8c..892f7c23c94e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev,
 			    char *buf)
 {
 	int node = dev->id;
+	int tier_index = node_get_memory_tier_id(node);
 
-	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
+	if (tier_index != -1)
+		return sysfs_emit(buf, "%d\n", tier_index);
+	return 0;
 }
 
 static ssize_t memtier_store(struct device *dev,
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 991782aa2448..79953426ddaf 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	dev_set_drvdata(dev, data);
 
 #ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+	node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM);
 #endif
 	return 0;
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index cbef71a499c1..fd09fd009a69 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -167,18 +167,19 @@ void migrate_vma_finalize(struct migrate_vma *migrate);
 #ifdef CONFIG_TIERED_MEMORY
 
 extern bool numa_demotion_enabled;
-#define DEFAULT_MEMORY_TIER	1
-
 enum memory_tier_type {
-	MEMORY_TIER_HBM_GPU,
-	MEMORY_TIER_DRAM,
-	MEMORY_TIER_PMEM,
-	MAX_MEMORY_TIERS
+	MEMORY_RANK_HBM_GPU,
+	MEMORY_RANK_DRAM,
+	DEFAULT_MEMORY_RANK = MEMORY_RANK_DRAM,
+	MEMORY_RANK_PMEM
 };
 
+#define DEFAULT_MEMORY_TIER 1
+#define MAX_MEMORY_TIERS  3
+
 int next_demotion_node(int node);
-int node_get_memory_tier(int node);
-int node_set_memory_tier(int node, int tier);
+int node_get_memory_tier_id(int node);
+int node_set_memory_tier_rank(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
 #else
 #define numa_demotion_enabled	false
diff --git a/mm/migrate.c b/mm/migrate.c
index 59d8558dd2ee..f013d14f77ed 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2121,8 +2121,10 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 #ifdef CONFIG_TIERED_MEMORY
 
 struct memory_tier {
+	struct list_head list;
 	struct device dev;
 	nodemask_t nodelist;
+	int rank;
 };
 
 struct demotion_nodes {
@@ -2139,7 +2141,7 @@ static struct bus_type memory_tier_subsys = {
 static void establish_migration_targets(void);
 
 DEFINE_MUTEX(memory_tier_lock);
-static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
+static LIST_HEAD(memory_tiers);
 
 /*
  * node_demotion[] examples:
@@ -2206,16 +2208,25 @@ static struct demotion_nodes *node_demotion __read_mostly;
 static ssize_t nodelist_show(struct device *dev,
 			     struct device_attribute *attr, char *buf)
 {
-	int tier = dev->id;
+	struct memory_tier *memtier = to_memory_tier(dev);
 
 	return sysfs_emit(buf, "%*pbl\n",
-			  nodemask_pr_args(&memory_tiers[tier]->nodelist));
-
+			  nodemask_pr_args(&memtier->nodelist));
 }
 static DEVICE_ATTR_RO(nodelist);
 
+static ssize_t rank_show(struct device *dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct memory_tier *memtier = to_memory_tier(dev);
+
+	return sysfs_emit(buf, "%d\n", memtier->rank);
+}
+static DEVICE_ATTR_RO(rank);
+
 static struct attribute *memory_tier_dev_attrs[] = {
 	&dev_attr_nodelist.attr,
+	&dev_attr_rank.attr,
 	NULL
 };
 
@@ -2235,53 +2246,79 @@ static void memory_tier_device_release(struct device *dev)
 	kfree(tier);
 }
 
-static int register_memory_tier(int tier)
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->rank > memtier->rank) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
 {
 	int error;
+	struct memory_tier *memtier;
 
-	memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
-	if (!memory_tiers[tier])
-		return -ENOMEM;
+	if (tier >= MAX_MEMORY_TIERS)
+		return NULL;
 
-	memory_tiers[tier]->dev.id = tier;
-	memory_tiers[tier]->dev.bus = &memory_tier_subsys;
-	memory_tiers[tier]->dev.release = memory_tier_device_release;
-	memory_tiers[tier]->dev.groups = memory_tier_dev_groups;
-	error = device_register(&memory_tiers[tier]->dev);
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return NULL;
 
+	memtier->dev.id = tier;
+	/*
+	 * For now we only support hardcoded rank values
+	 * (100, 200, 300) with no special meaning.
+	 */
+	memtier->rank = 100 + 100 * tier;
+	memtier->dev.bus = &memory_tier_subsys;
+	memtier->dev.release = memory_tier_device_release;
+	memtier->dev.groups = memory_tier_dev_groups;
+
+	insert_memory_tier(memtier);
+
+	error = device_register(&memtier->dev);
 	if (error) {
-		put_device(&memory_tiers[tier]->dev);
-		memory_tiers[tier] = NULL;
+		list_del(&memtier->list);
+		put_device(&memtier->dev);
+		return NULL;
 	}
-
-	return error;
+	return memtier;
 }
 
-static void unregister_memory_tier(int tier)
+static void unregister_memory_tier(struct memory_tier *memtier)
 {
-	device_unregister(&memory_tiers[tier]->dev);
-	memory_tiers[tier] = NULL;
+	list_del(&memtier->list);
+	device_unregister(&memtier->dev);
 }
 
 static ssize_t
-max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf)
+max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
 	return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
 }
 
-static DEVICE_ATTR_RO(max_tiers);
+static DEVICE_ATTR_RO(max_tier);
 
 static ssize_t
-default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+default_rank_show(struct device *dev, struct device_attribute *attr, char *buf)
 {
-	return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER);
+	return sysfs_emit(buf, "%d\n",  100 + 100 * DEFAULT_MEMORY_TIER);
 }
 
-static DEVICE_ATTR_RO(default_tier);
+static DEVICE_ATTR_RO(default_rank);
 
 static struct attribute *memoty_tier_attrs[] = {
-	&dev_attr_max_tiers.attr,
-	&dev_attr_default_tier.attr,
+	&dev_attr_max_tier.attr,
+	&dev_attr_default_rank.attr,
 	NULL
 };
 
@@ -2294,52 +2331,61 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 	NULL,
 };
 
-static int __node_get_memory_tier(int node)
+static struct memory_tier *__node_get_memory_tier(int node)
 {
-	int tier;
+	struct memory_tier *memtier;
 
-	for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
-		if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
-			return tier;
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (node_isset(node, memtier->nodelist))
+			return memtier;
 	}
+	return NULL;
+}
 
-	return -1;
+static struct memory_tier *__get_memory_tier_from_id(int id)
+{
+	struct memory_tier *memtier;
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		if (memtier->dev.id == id)
+			return memtier;
+	}
+	return NULL;
 }
 
+
 static void node_remove_from_memory_tier(int node)
 {
-	int tier;
+	struct memory_tier *memtier;
 
 	mutex_lock(&memory_tier_lock);
 
-	tier = __node_get_memory_tier(node);
-
+	memtier = __node_get_memory_tier(node);
 	/*
 	 * Remove node from tier, if tier becomes
 	 * empty then unregister it to make it invisible
 	 * in sysfs.
 	 */
-	node_clear(node, memory_tiers[tier]->nodelist);
-	if (nodes_empty(memory_tiers[tier]->nodelist))
-		unregister_memory_tier(tier);
+	node_clear(node, memtier->nodelist);
+	if (nodes_empty(memtier->nodelist))
+		unregister_memory_tier(memtier);
 
 	establish_migration_targets();
-
 	mutex_unlock(&memory_tier_lock);
 }
 
-int node_get_memory_tier(int node)
+int node_get_memory_tier_id(int node)
 {
-	int tier;
-
+	int tier = -1;
+	struct memory_tier *memtier;
 	/*
 	 * Make sure memory tier is not unregistered
 	 * while it is being read.
 	 */
 	mutex_lock(&memory_tier_lock);
-
-	tier = __node_get_memory_tier(node);
-
+	memtier = __node_get_memory_tier(node);
+	if (memtier)
+		tier = memtier->dev.id;
 	mutex_unlock(&memory_tier_lock);
 
 	return tier;
@@ -2348,46 +2394,43 @@ int node_get_memory_tier(int node)
 int __node_set_memory_tier(int node, int tier)
 {
 	int ret = 0;
-	/*
-	 * As register_memory_tier() for new tier can fail,
-	 * try it before modifying existing tier. register
-	 * tier makes tier visible in sysfs.
-	 */
-	if (!memory_tiers[tier]) {
-		ret = register_memory_tier(tier);
-		if (ret) {
+	struct memory_tier *memtier;
+
+	memtier = __get_memory_tier_from_id(tier);
+	if (!memtier) {
+		memtier = register_memory_tier(tier);
+		if (!memtier) {
+			ret = -EINVAL;
 			goto out;
 		}
 	}
-
-	node_set(node, memory_tiers[tier]->nodelist);
+	node_set(node, memtier->nodelist);
 	establish_migration_targets();
-
 out:
 	return ret;
 }
 
 int node_reset_memory_tier(int node, int tier)
 {
-	int current_tier, ret = 0;
+	struct memory_tier *current_tier;
+	int ret = 0;
 
 	mutex_lock(&memory_tier_lock);
 
 	current_tier = __node_get_memory_tier(node);
-	if (current_tier == tier)
+	if (!current_tier || current_tier->dev.id == tier)
 		goto out;
 
-	if (current_tier != -1 )
-		node_clear(node, memory_tiers[current_tier]->nodelist);
+	node_clear(node, current_tier->nodelist);
 
 	ret = __node_set_memory_tier(node, tier);
 
 	if (!ret) {
-		if (nodes_empty(memory_tiers[current_tier]->nodelist))
+		if (nodes_empty(current_tier->nodelist))
 			unregister_memory_tier(current_tier);
 	} else {
 		/* reset it back to older tier */
-		ret = __node_set_memory_tier(node, current_tier);
+		node_set(node, current_tier->nodelist);
 	}
 out:
 	mutex_unlock(&memory_tier_lock);
@@ -2395,15 +2438,13 @@ int node_reset_memory_tier(int node, int tier)
 	return ret;
 }
 
-int node_set_memory_tier(int node, int tier)
+int node_set_memory_tier_rank(int node, int rank)
 {
-	int current_tier, ret = 0;
-
-	if (tier >= MAX_MEMORY_TIERS)
-		return -EINVAL;
+	struct memory_tier *memtier;
+	int ret = 0;
 
 	mutex_lock(&memory_tier_lock);
-	current_tier = __node_get_memory_tier(node);
+	memtier = __node_get_memory_tier(node);
 	/*
 	 * if node is already part of the tier proceed with the
 	 * current tier value, because we might want to establish
@@ -2411,15 +2452,17 @@ int node_set_memory_tier(int node, int tier)
 	 * before it was made part of N_MEMORY, hence establish_migration_targets
 	 * will have skipped this node.
 	 */
-	if (current_tier != -1)
-		tier = current_tier;
-	ret = __node_set_memory_tier(node, tier);
+	if (memtier)
+		establish_migration_targets();
+	else {
+		/* For now the rank value and the tier value are the same. */
+		ret = __node_set_memory_tier(node, rank);
+	}
 	mutex_unlock(&memory_tier_lock);
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(node_set_memory_tier);
-
+EXPORT_SYMBOL_GPL(node_set_memory_tier_rank);
 
 /**
  * next_demotion_node() - Get the next node in the demotion path
@@ -2504,6 +2547,8 @@ static void disable_all_migrate_targets(void)
 */
 static void establish_migration_targets(void)
 {
+	struct list_head *ent;
+	struct memory_tier *memtier;
 	struct demotion_nodes *nd;
 	int tier, target = NUMA_NO_NODE, node;
 	int distance, best_distance;
@@ -2518,19 +2563,15 @@ static void establish_migration_targets(void)
 		best_distance = -1;
 		nd = &node_demotion[node];
 
-		tier = __node_get_memory_tier(node);
+		memtier = __node_get_memory_tier(node);
+		if (!memtier || list_is_last(&memtier->list, &memory_tiers))
+			continue;
 		/*
-		 * Find next tier to demote.
+		 * Get the next memtier to find the demotion node list.
 		 */
-		while (++tier < MAX_MEMORY_TIERS) {
-			if (memory_tiers[tier])
-				break;
-		}
+		memtier = list_next_entry(memtier, list);
 
-		if (tier >= MAX_MEMORY_TIERS)
-			continue;
-
-		nodes_andnot(used, node_states[N_MEMORY], memory_tiers[tier]->nodelist);
+		nodes_andnot(used, node_states[N_MEMORY], memtier->nodelist);
 
 		/*
 		 * Find all the nodes in the memory tier node list of same best distance.
@@ -2588,7 +2629,7 @@ static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
 		 * registered, we will continue to use that for the new memory
 		 * we are adding here.
 		 */
-		node_set_memory_tier(arg->status_change_nid, DEFAULT_MEMORY_TIER);
+		node_set_memory_tier_rank(arg->status_change_nid, DEFAULT_MEMORY_RANK);
 		break;
 	}
 
@@ -2668,6 +2709,7 @@ subsys_initcall(numa_init_sysfs);
 static int __init memory_tier_init(void)
 {
 	int ret;
+	struct memory_tier *memtier;
 
 	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
 	if (ret)
@@ -2677,14 +2719,14 @@ static int __init memory_tier_init(void)
 	 * Register only the default memory tier to hide all empty
 	 * memory tiers from sysfs.
 	 */
-	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
-	if (ret)
+	memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+	if (!memtier)
 		panic("%s() failed to register memory tier: %d\n", __func__, ret);
 
 	/*
 	 * CPU-only nodes are not part of memory tiers.
 	 */
-	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
+	memtier->nodelist = node_states[N_MEMORY];
 	migrate_on_reclaim_init();
 
 	return 0;
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                     ` (4 preceding siblings ...)
  2022-05-27 12:25   ` [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier Aneesh Kumar K.V
@ 2022-05-27 12:25   ` Aneesh Kumar K.V
  2022-06-02  6:43     ` Ying Huang
  2022-05-27 12:25   ` [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
  6 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K.V

This patch adds the special string "none" as a supported memtier value
that can be used to remove a specific node from being used as a demotion target.

For ex:
:/sys/devices/system/node/node1# cat memtier
1
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
1-3
:/sys/devices/system/node/node1# echo none > memtier
:/sys/devices/system/node/node1#
:/sys/devices/system/node/node1# cat memtier
:/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
2-3
:/sys/devices/system/node/node1#

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 drivers/base/node.c     |  7 ++++++-
 include/linux/migrate.h |  1 +
 mm/migrate.c            | 15 +++++++++++++--
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 892f7c23c94e..5311cf1db500 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -578,10 +578,15 @@ static ssize_t memtier_store(struct device *dev,
 			     struct device_attribute *attr,
 			     const char *buf, size_t count)
 {
+	int ret;
 	unsigned long tier;
 	int node = dev->id;
 
-	int ret = kstrtoul(buf, 10, &tier);
+	if (!strncmp(buf, "none", strlen("none"))) {
+		node_remove_from_memory_tier(node);
+		return count;
+	}
+	ret = kstrtoul(buf, 10, &tier);
 	if (ret)
 		return ret;
 
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index fd09fd009a69..77c581f47953 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -178,6 +178,7 @@ enum memory_tier_type {
 #define MAX_MEMORY_TIERS  3
 
 int next_demotion_node(int node);
+void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier_rank(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
diff --git a/mm/migrate.c b/mm/migrate.c
index f013d14f77ed..114c7428b9f3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2354,7 +2354,7 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 }
 
 
-static void node_remove_from_memory_tier(int node)
+void node_remove_from_memory_tier(int node)
 {
 	struct memory_tier *memtier;
 
@@ -2418,7 +2418,18 @@ int node_reset_memory_tier(int node, int tier)
 	mutex_lock(&memory_tier_lock);
 
 	current_tier = __node_get_memory_tier(node);
-	if (!current_tier || current_tier->dev.id == tier)
+	if (!current_tier) {
+		/*
+		 * If an N_MEMORY node doesn't have a tier index, then
+		 * we removed it from demotion earlier and we are trying
+		 * to add it back. Just add the node to the requested tier.
+		 */
+		if (node_state(node, N_MEMORY))
+			ret = __node_set_memory_tier(node, tier);
+		goto out;
+	}
+
+	if (current_tier->dev.id == tier)
 		goto out;
 
 	node_clear(node, current_tier->nodelist);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
                     ` (5 preceding siblings ...)
  2022-05-27 12:25   ` [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
@ 2022-05-27 12:25   ` Aneesh Kumar K.V
  2022-06-02  7:35     ` Ying Huang
  6 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-05-27 12:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes, Aneesh Kumar K . V

From: Jagdish Gediya <jvgediya@linux.ibm.com>

Currently, a higher tier node can only be demoted to selected
nodes on the next lower tier as defined by the demotion path,
not any other node from any lower tier.  This strict, hard-coded
demotion order does not work in all use cases (e.g. some use cases
may want to allow cross-socket demotion to another node in the same
demotion tier as a fallback when the preferred demotion node is out
of space). This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from any
lower tier, whereas the demotion order doesn't allow that currently.

This patch adds support for getting the mask of all allowed demotion
targets for a node. demote_page_list() is also modified to utilize this
allowed node mask by filling it in the migration_target_control structure
before passing it to migrate_pages().

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/migrate.h |  5 ++++
 mm/migrate.c            | 52 +++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c             | 38 ++++++++++++++----------------
 3 files changed, 71 insertions(+), 24 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 77c581f47953..1f3cbd5185ca 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -182,6 +182,7 @@ void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier_rank(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
+void node_get_allowed_targets(int node, nodemask_t *targets);
 #else
 #define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
@@ -189,6 +190,10 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 
+static inline void node_get_allowed_targets(int node, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 114c7428b9f3..84fac477538c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2129,6 +2129,7 @@ struct memory_tier {
 
 struct demotion_nodes {
 	nodemask_t preferred;
+	nodemask_t allowed;
 };
 
 #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
@@ -2475,6 +2476,25 @@ int node_set_memory_tier_rank(int node, int rank)
 }
 EXPORT_SYMBOL_GPL(node_set_memory_tier_rank);
 
+void node_get_allowed_targets(int node, nodemask_t *targets)
+{
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * If any node is moving to a lower tier, then the modifications
+	 * in node_demotion[] are still valid for this node. If any
+	 * node is moving to a higher tier, then the moving node may be
+	 * used once more for demotion, which should be OK, so RCU
+	 * should be enough here.
+	 */
+	rcu_read_lock();
+
+	*targets = node_demotion[node].allowed;
+
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -2534,8 +2554,10 @@ static void __disable_all_migrate_targets(void)
 {
 	int node;
 
-	for_each_node_mask(node, node_states[N_MEMORY])
+	for_each_node_mask(node, node_states[N_MEMORY]) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		node_demotion[node].allowed = NODE_MASK_NONE;
+	}
 }
 
 static void disable_all_migrate_targets(void)
@@ -2558,12 +2580,11 @@ static void disable_all_migrate_targets(void)
 */
 static void establish_migration_targets(void)
 {
-	struct list_head *ent;
 	struct memory_tier *memtier;
 	struct demotion_nodes *nd;
-	int tier, target = NUMA_NO_NODE, node;
+	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
+	nodemask_t used, allowed = NODE_MASK_NONE;
 
 	if (!node_demotion)
 		return;
@@ -2603,6 +2624,29 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the allowed mask for each node, collecting the node mask
+	 * from all memory tiers below it. This allows us to fall back demotion
+	 * page allocation to a set of nodes that is closer to the above
+	 * selected preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(allowed, allowed, memtier->nodelist);
+	/*
+	 * Remove nodes not yet in N_MEMORY.
+	 */
+	nodes_and(allowed, node_states[N_MEMORY], allowed);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the allowed nodes.
+		 * This removes all nodes in the current and higher
+		 * memory tiers from the allowed mask.
+		 */
+		nodes_andnot(allowed, allowed, memtier->nodelist);
+		for_each_node_mask(node, memtier->nodelist)
+			node_demotion[node].allowed = allowed;
+	}
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1678802e03e7..feb994589481 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1454,23 +1454,6 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(&folio->page, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
-{
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
-
-	return alloc_migration_target(page, (unsigned long)&mtc);
-}
-
 /*
  * Take pages on @demote_list and attempt to demote them to
  * another node.  Pages which are not demoted are left on
@@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1488,10 +1484,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat->node_id, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
-	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+	migrate_pages(demote_pages, alloc_migration_target, NULL,
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-26 21:22 RFC: Memory Tiering Kernel Interfaces (v3) Wei Xu
  2022-05-27  2:58 ` Ying Huang
  2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
@ 2022-05-27 13:40 ` Aneesh Kumar K V
  2022-05-27 16:30   ` Wei Xu
  2 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-05-27 13:40 UTC (permalink / raw)
  To: Wei Xu, Huang Ying, Andrew Morton, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Linux MM, Jagdish Gediya, Baolin Wang, David Rientjes

On 5/27/22 2:52 AM, Wei Xu wrote:

>    The order of memory tiers is determined by their rank values, not by
>    their memtier device names.
> 
>    - /sys/devices/system/memtier/possible
> 
>      Format: ordered list of "memtier(rank)"
>      Example: 0(64), 1(128), 2(192)
> 
>      Read-only.  When read, list all available memory tiers and their
>      associated ranks, ordered by the rank values (from the highest
>       tier to the lowest tier).
> 

Did we discuss the need for this? I haven't done this in the patch 
series I sent across. We do have 
/sys/devices/system/memtier/default_rank which should allow the user to 
identify the default rank to which memory would get added via hotplug if 
the NUMA node is not part of any memory tier.


-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-27  2:58 ` Ying Huang
@ 2022-05-27 14:05   ` Hesham Almatary
  2022-05-27 16:25     ` Wei Xu
  0 siblings, 1 reply; 66+ messages in thread
From: Hesham Almatary @ 2022-05-27 14:05 UTC (permalink / raw)
  To: Ying Huang
  Cc: Wei Xu, Andrew Morton, Greg Thelen, Yang Shi, Aneesh Kumar K.V,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Dave Hansen, Jonathan Cameron,
	Alistair Popple, Dan Williams, Feng Tang, Linux MM,
	Jagdish Gediya, Baolin Wang, David Rientjes, linuxarm

Hello Wei and Ying,

Please find my comments below based on a discussion with Jonathan.

On Fri, 27 May 2022 10:58:39 +0800
Ying Huang <ying.huang@intel.com> wrote:

> On Thu, 2022-05-26 at 14:22 -0700, Wei Xu wrote:
> > Changes since v2
> > ================
> > * Updated the design and examples to use "rank" instead of device ID
> >   to determine the order between memory tiers for better
> > flexibility.
> > 
> > Overview
> > ========
> > 
> > The current kernel has the basic memory tiering support: Inactive
> > pages on a higher tier NUMA node can be migrated (demoted) to a
> > lower tier NUMA node to make room for new allocations on the higher
> > tier NUMA node.  Frequently accessed pages on a lower tier NUMA
> > node can be migrated (promoted) to a higher tier NUMA node to
> > improve the performance.
> > 
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based on
> > the distances between nodes.
> > 
> > This current memory tier kernel interface needs to be improved for
> > several important use cases:
> > 
> > * The current tier initialization code always initializes
> >   each memory-only NUMA node into a lower tier.  But a memory-only
> >   NUMA node may have a high performance memory device (e.g. a DRAM
> >   device attached via CXL.mem or a DRAM-backed memory-only node on
> >   a virtual machine) and should be put into a higher tier.
> > 
> > * The current tier hierarchy always puts CPU nodes into the top
> >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> >   memory-only HBM NUMA nodes should be in the top tier, and DRAM
> > nodes with CPUs are better to be placed into the next lower tier.
> > 
> > * Also because the current tier hierarchy always puts CPU nodes
> >   into the top tier, when a CPU is hot-added (or hot-removed) and
> >   triggers a memory node from CPU-less into a CPU node (or vice
> >   versa), the memory tier hierarchy gets changed, even though no
> >   memory node is added or removed.  This can make the tier
> >   hierarchy unstable and make it difficult to support tier-based
> >   memory accounting.
> > 
> > * A higher tier node can only be demoted to selected nodes on the
> >   next lower tier as defined by the demotion path, not any other
> >   node from any lower tier.  This strict, hard-coded demotion order
> >   does not work in all use cases (e.g. some use cases may want to
> >   allow cross-socket demotion to another node in the same demotion
> >   tier as a fallback when the preferred demotion node is out of
> >   space), and has resulted in the feature request for an interface
> > to override the system-wide, per-node demotion order from the
> >   userspace.  This demotion order is also inconsistent with the page
> >   allocation fallback order when all the nodes in a higher tier are
> >   out of space: The page allocation can fall back to any node from
> >   any lower tier, whereas the demotion order doesn't allow that.
> > 
> > * There are no interfaces for the userspace to learn about the
> > memory tier hierarchy in order to optimize its memory allocations.
> > 
> > I'd like to propose revised memory tier kernel interfaces based on
> > the discussions in the threads:
> > 
> > -
> > https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > -
> > https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > -
> > https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > -
> > https://lore.kernel.org/linux-mm/d6314cfe1c7898a6680bed1e7cc93b0ab93e3155.camel@intel.com/T/
> > 
> > 
> > High-level Design Ideas
> > =======================
> > 
> > * Define memory tiers explicitly, not implicitly.
> > 
> > * Memory tiers are defined based on hardware capabilities of memory
> >   nodes, not their relative node distances between each other.
> > 
> > * The tier assignment of each node is independent from each other.
> >   Moving a node from one tier to another tier doesn't affect the
> > tier assignment of any other node.
> > 
> > * The node-tier association is stable. A node can be reassigned to a
> >   different tier only under the specific conditions that don't block
> >   future tier-based memory cgroup accounting.
> > 
> > * A node can demote its pages to any nodes of any lower tiers. The
> >   demotion target node selection follows the allocation fallback
> > order of the source node, which is built based on node distances.
> > The demotion targets are also restricted to only the nodes from the
> > tiers lower than the source node.  We no longer need to maintain a
> > separate per-node demotion order (node_demotion[]).
> > 
> > 
> > Sysfs Interfaces
> > ================
> > 
> > * /sys/devices/system/memtier/
> > 
> >   This is the directory containing the information about memory
> > tiers.
> > 
> >   Each memory tier has its own subdirectory.
> > 
> >   The order of memory tiers is determined by their rank values, not
> > by their memtier device names.
> > 
> >   - /sys/devices/system/memtier/possible
> > 
> >     Format: ordered list of "memtier(rank)"
> >     Example: 0(64), 1(128), 2(192)
> > 
> >     Read-only.  When read, list all available memory tiers and their
> >     associated ranks, ordered by the rank values (from the highest
> >      tier to the lowest tier).
> 
> I like the idea of "possible" file.  And I think we can show default
> tier too.  That is, if "1(128)" is the default tier (tier with DRAM),
> then the list can be,
> 
> "
> 0/64 [1/128] 2/192
> "
> 
> To make it more easier to be parsed by shell, I will prefer something
> like,
> 
> "
> 0	64
> 1	128	default
> 2	192
> "
> 
> But one line format is OK for me too.
> 
I wonder if there's a good argument to have this "possible" file at all?
My thinking is that, 1) all the details can be scripted at
user-level by reading memtierN/nodeN, offloading some work from the
kernel side, and 2) the format/numbers are confusing anyway; it could
get tricky when/if tier device IDs are similar to ranks.

The other thing is whether we should have a file called "default"
containing the default tier value for the user to read?

> > 
> > * /sys/devices/system/memtier/memtierN/
> > 
> >   This is the directory containing the information about a
> > particular memory tier, memtierN, where N is the memtier device ID
> > (e.g. 0, 1).
> > 
> >   The memtier device ID number itself is just an identifier and has
> > no special meaning, i.e. memtier device ID numbers do not determine
> > the order of memory tiers.
> > 
> >   - /sys/devices/system/memtier/memtierN/rank
> > 
> >     Format: int
> >     Example: 100
> > 
> >     Read-only.  When read, list the "rank" value associated with
> > memtierN.
> > 
> >     "Rank" is an opaque value. Its absolute value doesn't have any
> >     special meaning. But the rank values of different memtiers can
> > be compared with each other to determine the memory tier order.
> >     For example, if we have 3 memtiers: memtier0, memtier1,
> > memiter2, and their rank values are 10, 20, 15, then the memory
> > tier order is: memtier0 -> memtier2 -> memtier1, where memtier0 is
> > the highest tier and memtier1 is the lowest tier.
> > 
> >     The rank value of each memtier should be unique.
> > 
> >   - /sys/devices/system/memtier/memtierN/nodelist
> > 
> >     Format: node_list
> >     Example: 1-2
> > 
> >     Read-only.  When read, list the memory nodes in the specified
> > tier.
> > 
> >     If a memory tier has no memory nodes, the kernel can hide the
> > sysfs directory of this memory tier, though the tier itself can
> > still be visible from /sys/devices/system/memtier/possible.
> > 
Is there a good reason why the kernel needs to hide this directory?

> > * /sys/devices/system/node/nodeN/memtier
> > 
> >   where N = 0, 1, ...
> > 
> >   Format: int or empty
> >   Example: 1
> > 
> >   When read, list the device ID of the memory tier that the node
> > belongs to.  Its value is empty for a CPU-only NUMA node.
> > 
> >   When written, the kernel moves the node into the specified memory
> >   tier if the move is allowed.  The tier assignment of all other
> > nodes are not affected.
> > 
Who decides if the move is allowed or not? Might need to explicitly
mention that?

> >   Initially, we can make this interface read-only.
> > 
> > 
> > Kernel Representation
> > =====================
> > 
> > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > 
> > * #define MAX_MEMORY_TIERS  3
> > 
> >   Support 3 memory tiers for now.  This can be a kconfig option.
> > 
> > * #define MEMORY_DEFAULT_TIER_DEVICE 1
> > 
> >   The default tier device that a memory node is assigned to.
> > 
> > * struct memtier_dev {
> >       nodemask_t nodelist;
> >       int rank;
> >       int tier;
> >   } memtier_devices[MAX_MEMORY_TIERS]
> > 
> >   Store memory tiers by device IDs.
> > 
> > * struct memtier_dev *memory_tier(int tier)
> > 
> >   Returns the memtier device for a given memory tier.
> > 
Might need to define the case where there's no memory tier device for a
specific tier number. For example, we can return NULL or an error code
when an invalid tier number is passed (e.g., -1 for CPU-only nodes).

> > * int node_tier_dev_map[MAX_NUMNODES]
> > 
> >   Map a node to its tier device ID.
> > 
> >   For each CPU-only node c, node_tier_dev_map[c] = -1.
> > 
> > 
> > Memory Tier Initialization
> > ==========================
> > 
> > By default, all memory nodes are assigned to the default tier
> > (MEMORY_DEFAULT_TIER_DEVICE).  The default tier device has a rank
> > value in the middle of the possible rank value range (e.g. 127 if
> > the range is [0..255]).
> > 
> > A device driver can move up or down its memory nodes from the
> > default tier.  For example, PMEM can move down its memory nodes
> > below the default tier, whereas GPU can move up its memory nodes
> > above the default tier.
> > 
Is "up/down" here still relative after the rank addition?

> > The kernel initialization code makes the decision on which exact
> > tier a memory node should be assigned to based on the requests from
> > the device drivers as well as the memory device hardware information
> > provided by the firmware.
> > 
> > 
> > Memory Tier Reassignment
> > ========================
> > 
> > After a memory node is hot-removed, it can be hot-added back to a
> > different memory tier.  This is useful for supporting dynamically
> > provisioned CXL.mem NUMA nodes, which may connect to different
> > memory devices across hot-plug events.  Such tier changes should
> > be compatible with tier-based memory accounting.
> > 
> > The userspace may also reassign an existing online memory node to a
> > different tier.  However, this should only be allowed when no pages
> > are allocated from the memory node or when there are no non-root
> > memory cgroups (e.g. during the system boot).  This restriction is
> > important for keeping memory tier hierarchy stable enough for
> > tier-based memory cgroup accounting.
> 
> One way to do this is hot-remove all memory of a node, change its
> memtier, then hot-add its memory.
> 
> Best Regards,
> Huang, Ying
> 
> > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > 
> > 
> > Memory Allocation for Demotion
> > ==============================
> > 
> > To allocate a new page as the demotion target for a page, the kernel
> > calls the allocation function (__alloc_pages_nodemask) with the
> > source page node as the preferred node and the union of all lower
> > tier nodes as the allowed nodemask.  The actual target node
> > selection then follows the allocation fallback order that the
> > kernel has already defined.
> > 
> > The pseudo code looks like:
> > 
> >     targets = NODE_MASK_NONE;
> >     src_nid = page_to_nid(page);
> >     src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
> >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> >             nodes_or(targets, targets, memory_tier(i)->nodelist);
> >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > 
> > The memopolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion target nodemask, e.g. to prevent
> > demotion or select a particular allowed node as the demotion target.
> > 
> > 
> > Memory Allocation for Promotion
> > ===============================
> > 
> > The page allocation for promotion is similar to demotion, except
> > that (1) the target nodemask uses the promotion tiers, (2) the
> > preferred node can be the accessing CPU node, not the source page
> > node.
> > 
> > 
> > Examples
> > ========
> > 
> > * Example 1:
> > 
> > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > 
> >                   20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >        |        \   /       |
> >        | 30    40 X 40      | 30
> >        |        /   \       |
> >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> >                   40
> > 
> > node distances:
> > node   0    1    2    3
> >    0  10   20   30   40
> >    1  20   10   40   30
> >    2  30   40   10   40
> >    3  40   30   40   10
> > 
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > /sys/devices/system/memtier/memtier2/nodelist:2-3
> > 
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:1
> > /sys/devices/system/node/node2/memtier:2
> > /sys/devices/system/node/node3/memtier:2
> > 
> > Demotion fallback order:
> > node 0: 2, 3
> > node 1: 3, 2
> > node 2: empty
> > node 3: empty
> > 
> > To prevent cross-socket demotion and memory access, the user can set
> > mempolicy, e.g. cpuset.mems=0,2.
> > 
> > 
> > * Example 2:
> > 
> > Node 0 & 1 are DRAM nodes.
> > Node 2 is a PMEM node and closer to node 0.
> > 
> >                   20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >        |            /
> >        | 30       / 40
> >        |        /
> >   Node 2 (PMEM)
> > 
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   40
> >    2  30   40   10
> > 
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > /sys/devices/system/memtier/memtier2/nodelist:2
> > 
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:1
> > /sys/devices/system/node/node2/memtier:2
> > 
> > Demotion fallback order:
> > node 0: 2
> > node 1: 2
> > node 2: empty
> > 
> > 
> > * Example 3:
> > 
> > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > 
np: PMEM instead of memory-only DRAM?

> > All nodes are in the same tier.
> > 
> >                   20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >          \                 /
> >           \ 30            / 30
> >            \             /
> >              Node 2 (PMEM)
> > 
> > node distances:
> > node   0    1    2
> >    0  10   20   30
> >    1  20   10   30
> >    2  30   30   10
> > 
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier1/rank:128
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier1/nodelist:0-2
> > 
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:1
> > /sys/devices/system/node/node2/memtier:1
> > 
> > Demotion fallback order:
> > node 0: empty
> > node 1: empty
> > node 2: empty
> > 
> > 
> > * Example 4:
> > 
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a PMEM node.
> > Node 2 is a GPU node.
> > 
> >                   50
> >   Node 0 (DRAM)  ----  Node 2 (GPU)
> >          \                 /
> >           \ 30            / 60
> >            \             /
> >              Node 1 (PMEM)
> > 
> > node distances:
> > node   0    1    2
> >    0  10   30   50
> >    1  30   10   60
> >    2  50   60   10
> > 
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 2(192)
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier0/rank:64
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier0/nodelist:2
> > /sys/devices/system/memtier/memtier1/nodelist:0
> > /sys/devices/system/memtier/memtier2/nodelist:1
> > 
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:2
> > /sys/devices/system/node/node2/memtier:0
> > 
> > Demotion fallback order:
> > node 0: 1
> > node 1: empty
> > node 2: 0, 1
> > 
> > 
> > * Example 5:
> > 
> > Node 0 is a DRAM node with CPU.
> > Node 1 is a GPU node.
> > Node 2 is a PMEM node.
> > Node 3 is a large, slow DRAM node without CPU.
> > 
> >                     100
> >      Node 0 (DRAM)  ----  Node 1 (GPU)
> >     /     |               /    |
> >    /40    |30        120 /     | 110
> >   |       |             /      |
> >   |  Node 2 (PMEM) ----       /
> >   |        \                 /
> >    \     80 \               /
> >     ------- Node 3 (Slow DRAM)
> > 
> > node distances:
> > node    0    1    2    3
> >    0   10  100   30   40
> >    1  100   10  120  110
> >    2   30  120   10   80
> >    3   40  110   80   10
> > 
> > MAX_MEMORY_TIERS=4 (memtier3 is a memory tier added later).
> > 
> > $ cat /sys/devices/system/memtier/possible
> > 0(64), 1(128), 3(160), 2(192)
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > /sys/devices/system/memtier/memtier0/rank:64
> > /sys/devices/system/memtier/memtier1/rank:128
> > /sys/devices/system/memtier/memtier2/rank:192
> > /sys/devices/system/memtier/memtier3/rank:160
> > 
> > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > /sys/devices/system/memtier/memtier0/nodelist:1
> > /sys/devices/system/memtier/memtier1/nodelist:0
> > /sys/devices/system/memtier/memtier2/nodelist:2
> > /sys/devices/system/memtier/memtier3/nodelist:3
> > 
> > $ grep '' /sys/devices/system/node/node*/memtier
> > /sys/devices/system/node/node0/memtier:1
> > /sys/devices/system/node/node1/memtier:0
> > /sys/devices/system/node/node2/memtier:2
> > /sys/devices/system/node/node3/memtier:3
> > 
> > Demotion fallback order:
> > node 0: 2, 3
> > node 1: 0, 3, 2
> > node 2: empty
> > node 3: 2
> 
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier
       [not found]     ` <20220527154557.00002c56@Huawei.com>
@ 2022-05-27 15:45       ` Aneesh Kumar K V
  2022-05-30 12:36         ` Jonathan Cameron
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-05-27 15:45 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On 5/27/22 8:15 PM, Jonathan Cameron wrote:
> On Fri, 27 May 2022 17:55:26 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> 
>> The rank approach allows us to keep memory tier device IDs stable even if there
>> is a need to change the tier ordering among different memory tiers. e.g. DRAM
>> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
>> or lower than these nodes. A new memory tier can be inserted into the tier
>> hierarchy for a new set of nodes without affecting the node assignment of any
>> existing memtier, provided that there is enough gap in the rank values for the
>> new memtier.
>>
>> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
>> Its value relative to other memtiers decides the level of this memtier in the tier
>> hierarchy.
>>
>> For now, this patch supports hardcoded rank values of 100, 200, and 300 for
>> memory tiers 0, 1, and 2 respectively.
>>
>> Below is the sysfs interface to read the rank values of memory tier,
>> /sys/devices/system/memtier/memtierN/rank
>>
>> This interface is read-only for now. Write support can be added when there
>> is a need for more than 3 memory tiers with a flexible ordering requirement
>> among them; rank can be used for that because rank, not the memory tier
>> device IDs, now decides the memory tiering order.
>>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> I'd squash a lot of this with the original patch introducing tiers. As things
> stand we have 2 tricky to follow patches covering the same code rather than
> one that would be simpler.
> 

Sure. Will do that in the next update.

> Jonathan
> 
>> ---
>>   drivers/base/node.c     |   5 +-
>>   drivers/dax/kmem.c      |   2 +-
>>   include/linux/migrate.h |  17 ++--
>>   mm/migrate.c            | 218 ++++++++++++++++++++++++----------------
>>   4 files changed, 144 insertions(+), 98 deletions(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index cf4a58446d8c..892f7c23c94e 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev,
>>   			    char *buf)
>>   {
>>   	int node = dev->id;
>> +	int tier_index = node_get_memory_tier_id(node);
>>   
>> -	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
>> +	if (tier_index != -1)
>> +		return sysfs_emit(buf, "%d\n", tier_index);
> I think failure to get a tier is an error. So if it happens, return an error code.
> Also prefered to handle errors out of line as more idiomatic so reviewers
> read it quicker.
> 
> 	if (tier_index == -1)
> 		return -EINVAL;
> 
> 	return sysfs_emit()...
> 
>> +	return 0;
>>   }
>>   


That was needed to handle NUMA nodes that are not part of any memory 
tier, such as a CPU-only NUMA node or a NUMA node that doesn't want to 
participate in memory demotion.



>>   static ssize_t memtier_store(struct device *dev,
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 991782aa2448..79953426ddaf 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   	dev_set_drvdata(dev, data);
>>   


...

>>   
>> -static DEVICE_ATTR_RO(default_tier);
>> +static DEVICE_ATTR_RO(default_rank);
>>   
>>   static struct attribute *memoty_tier_attrs[] = {
>> -	&dev_attr_max_tiers.attr,
>> -	&dev_attr_default_tier.attr,
>> +	&dev_attr_max_tier.attr,
>> +	&dev_attr_default_rank.attr,
> 
> hmm. Not sure why rename to tier rather than tiers.
> 
> Also, I think the default should be tier, not rank.  If someone later
> wants to change the rank of tier1 that's up to them, but any new hotplugged
> memory should still end up in there by default.
> 

Didn't we say the tier index/device ID is a meaningless entity that 
controls just the naming? I.e., for memtier128, 128 doesn't mean 
anything; instead it is the rank value associated with memtier128 that 
controls the demotion order. If so, what we want to report to userspace 
is the maximum tier index userspace can expect and the default rank 
value to which hotplugged memory will be added.

But yes, tier index 1 and default rank 200 are reserved and created by 
default.


....

>>   	/*
>>   	 * if node is already part of the tier proceed with the
>>   	 * current tier value, because we might want to establish
>> @@ -2411,15 +2452,17 @@ int node_set_memory_tier(int node, int tier)
>>   	 * before it was made part of N_MEMORY, hence establish_migration_targets
>>   	 * will have skipped this node.
>>   	 */
>> -	if (current_tier != -1)
>> -		tier = current_tier;
>> -	ret = __node_set_memory_tier(node, tier);
>> +	if (memtier)
>> +		establish_migration_targets();
>> +	else {
>> +		/* For now rank value and tier value is same. */
> 
> We should avoid baking that in...


Making it dynamic adds a lot of complexity, such as an IDA allocation 
for the tier index, etc. I didn't want to go there unless we are sure we 
need a dynamic number of tiers.

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-27 14:05   ` Hesham Almatary
@ 2022-05-27 16:25     ` Wei Xu
  0 siblings, 0 replies; 66+ messages in thread
From: Wei Xu @ 2022-05-27 16:25 UTC (permalink / raw)
  To: Hesham Almatary
  Cc: Ying Huang, Andrew Morton, Greg Thelen, Yang Shi,
	Aneesh Kumar K.V, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Linux MM, Jagdish Gediya, Baolin Wang, David Rientjes, linuxarm

On Fri, May 27, 2022 at 7:05 AM Hesham Almatary
<hesham.almatary@huawei.com> wrote:
>
> Hello Wei and Ying,
>
> Please find my comments below based on a discussion with Jonathan.
>
> On Fri, 27 May 2022 10:58:39 +0800
> Ying Huang <ying.huang@intel.com> wrote:
>
> > On Thu, 2022-05-26 at 14:22 -0700, Wei Xu wrote:
> > > Changes since v2
> > > ================
> > > * Updated the design and examples to use "rank" instead of device ID
> > >   to determine the order between memory tiers for better
> > > flexibility.
> > >
> > > Overview
> > > ========
> > >
> > > The current kernel has the basic memory tiering support: Inactive
> > > pages on a higher tier NUMA node can be migrated (demoted) to a
> > > lower tier NUMA node to make room for new allocations on the higher
> > > tier NUMA node.  Frequently accessed pages on a lower tier NUMA
> > > node can be migrated (promoted) to a higher tier NUMA node to
> > > improve the performance.
> > >
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created
> > > during the kernel initialization and updated when a NUMA node is
> > > hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based on
> > > the distances between nodes.
> > >
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases:
> > >
> > > * The current tier initialization code always initializes
> > >   each memory-only NUMA node into a lower tier.  But a memory-only
> > >   NUMA node may have a high performance memory device (e.g. a DRAM
> > >   device attached via CXL.mem or a DRAM-backed memory-only node on
> > >   a virtual machine) and should be put into a higher tier.
> > >
> > > * The current tier hierarchy always puts CPU nodes into the top
> > >   tier. But on a system with HBM (e.g. GPU memory) devices, these
> > >   memory-only HBM NUMA nodes should be in the top tier, and DRAM
> > > nodes with CPUs are better to be placed into the next lower tier.
> > >
> > > * Also because the current tier hierarchy always puts CPU nodes
> > >   into the top tier, when a CPU is hot-added (or hot-removed) and
> > >   triggers a memory node from CPU-less into a CPU node (or vice
> > >   versa), the memory tier hierarchy gets changed, even though no
> > >   memory node is added or removed.  This can make the tier
> > >   hierarchy unstable and make it difficult to support tier-based
> > >   memory accounting.
> > >
> > > * A higher tier node can only be demoted to selected nodes on the
> > >   next lower tier as defined by the demotion path, not any other
> > >   node from any lower tier.  This strict, hard-coded demotion order
> > >   does not work in all use cases (e.g. some use cases may want to
> > >   allow cross-socket demotion to another node in the same demotion
> > >   tier as a fallback when the preferred demotion node is out of
> > >   space), and has resulted in the feature request for an interface
> > > to override the system-wide, per-node demotion order from the
> > >   userspace.  This demotion order is also inconsistent with the page
> > >   allocation fallback order when all the nodes in a higher tier are
> > >   out of space: The page allocation can fall back to any node from
> > >   any lower tier, whereas the demotion order doesn't allow that.
> > >
> > > * There are no interfaces for the userspace to learn about the
> > > memory tier hierarchy in order to optimize its memory allocations.
> > >
> > > I'd like to propose revised memory tier kernel interfaces based on
> > > the discussions in the threads:
> > >
> > > -
> > > https://lore.kernel.org/lkml/20220425201728.5kzm4seu7rep7ndr@offworld/T/
> > > -
> > > https://lore.kernel.org/linux-mm/20220426114300.00003ad8@Huawei.com/t/
> > > -
> > > https://lore.kernel.org/linux-mm/867bc216386eb6cbf54648f23e5825830f5b922e.camel@intel.com/T/
> > > -
> > > https://lore.kernel.org/linux-mm/d6314cfe1c7898a6680bed1e7cc93b0ab93e3155.camel@intel.com/T/
> > >
> > >
> > > High-level Design Ideas
> > > =======================
> > >
> > > * Define memory tiers explicitly, not implicitly.
> > >
> > > * Memory tiers are defined based on hardware capabilities of memory
> > >   nodes, not their relative node distances between each other.
> > >
> > > * The tier assignment of each node is independent from each other.
> > >   Moving a node from one tier to another tier doesn't affect the
> > > tier assignment of any other node.
> > >
> > > * The node-tier association is stable. A node can be reassigned to a
> > >   different tier only under the specific conditions that don't block
> > >   future tier-based memory cgroup accounting.
> > >
> > > * A node can demote its pages to any nodes of any lower tiers. The
> > >   demotion target node selection follows the allocation fallback
> > > order of the source node, which is built based on node distances.
> > > The demotion targets are also restricted to only the nodes from the
> > > tiers lower than the source node.  We no longer need to maintain a
> > > separate per-node demotion order (node_demotion[]).
> > >
> > >
> > > Sysfs Interfaces
> > > ================
> > >
> > > * /sys/devices/system/memtier/
> > >
> > >   This is the directory containing the information about memory
> > > tiers.
> > >
> > >   Each memory tier has its own subdirectory.
> > >
> > >   The order of memory tiers is determined by their rank values, not
> > > by their memtier device names.
> > >
> > >   - /sys/devices/system/memtier/possible
> > >
> > >     Format: ordered list of "memtier(rank)"
> > >     Example: 0(64), 1(128), 2(192)
> > >
> > >     Read-only.  When read, list all available memory tiers and their
> > >     associated ranks, ordered by the rank values (from the highest
> > >      tier to the lowest tier).
> >
> > I like the idea of "possible" file.  And I think we can show default
> > tier too.  That is, if "1(128)" is the default tier (tier with DRAM),
> > then the list can be,
> >
> > "
> > 0/64 [1/128] 2/192
> > "
> >
> > To make it more easier to be parsed by shell, I will prefer something
> > like,
> >
> > "
> > 0     64
> > 1     128     default
> > 2     192
> > "
> >
> > But one line format is OK for me too.
> >
> I wonder if there's a good argument to have this "possible" file at all?
> My thinking is that, 1) all the details can be scripted at
> user-level by reading memtierN/nodeN, offloading some work from the
> kernel side, and 2) the format/numbers are confusing anyway; it could
> get tricky when/if tier device IDs are similar to ranks.

If we don't hide memtiers that have no nodes, we don't need this
"possible" file. I am fine either way.  Given that there should not be
too many tiers, it doesn't add much value to hide the empty tiers.  We
can go without this "possible" file.

> The other thing is whether we should have a file called "default"
> containing the default tier value for the user to read?

Sure, we can have a default_tier or default_rank file for this.

> > >
> > > * /sys/devices/system/memtier/memtierN/
> > >
> > >   This is the directory containing the information about a
> > > particular memory tier, memtierN, where N is the memtier device ID
> > > (e.g. 0, 1).
> > >
> > >   The memtier device ID number itself is just an identifier and has
> > > no special meaning, i.e. memtier device ID numbers do not determine
> > > the order of memory tiers.
> > >
> > >   - /sys/devices/system/memtier/memtierN/rank
> > >
> > >     Format: int
> > >     Example: 100
> > >
> > >     Read-only.  When read, list the "rank" value associated with
> > > memtierN.
> > >
> > >     "Rank" is an opaque value. Its absolute value doesn't have any
> > >     special meaning. But the rank values of different memtiers can
> > > be compared with each other to determine the memory tier order.
> > >     For example, if we have 3 memtiers: memtier0, memtier1,
> > > memiter2, and their rank values are 10, 20, 15, then the memory
> > > tier order is: memtier0 -> memtier2 -> memtier1, where memtier0 is
> > > the highest tier and memtier1 is the lowest tier.
> > >
> > >     The rank value of each memtier should be unique.
> > >
> > >   - /sys/devices/system/memtier/memtierN/nodelist
> > >
> > >     Format: node_list
> > >     Example: 1-2
> > >
> > >     Read-only.  When read, list the memory nodes in the specified
> > > tier.
> > >
> > >     If a memory tier has no memory nodes, the kernel can hide the
> > > sysfs directory of this memory tier, though the tier itself can
> > > still be visible from /sys/devices/system/memtier/possible.
> > >
> Is there a good reason why the kernel needs to hide this directory?

It is just to reduce the clutter of empty tiers.  Given that there
should not be too many tiers, we can revert this and always show all
tiers.

> > > * /sys/devices/system/node/nodeN/memtier
> > >
> > >   where N = 0, 1, ...
> > >
> > >   Format: int or empty
> > >   Example: 1
> > >
> > >   When read, list the device ID of the memory tier that the node
> > > belongs to.  Its value is empty for a CPU-only NUMA node.
> > >
> > >   When written, the kernel moves the node into the specified memory
> > >   tier if the move is allowed.  The tier assignment of all other
> > > nodes are not affected.
> > >
> Who decides if the move is allowed or not? Might need to explicitly
> mention that?

The "Memory Tier Reassignment" section discusses the conditions under which the move is allowed.

> > >   Initially, we can make this interface read-only.
> > >
> > >
> > > Kernel Representation
> > > =====================
> > >
> > > * All memory tiering code is guarded by CONFIG_TIERED_MEMORY.
> > >
> > > * #define MAX_MEMORY_TIERS  3
> > >
> > >   Support 3 memory tiers for now.  This can be a kconfig option.
> > >
> > > * #define MEMORY_DEFAULT_TIER_DEVICE 1
> > >
> > >   The default tier device that a memory node is assigned to.
> > >
> > > * struct memtier_dev {
> > >       nodemask_t nodelist;
> > >       int rank;
> > >       int tier;
> > >   } memtier_devices[MAX_MEMORY_TIERS]
> > >
> > >   Store memory tiers by device IDs.
> > >
> > > * struct memtier_dev *memory_tier(int tier)
> > >
> > >   Returns the memtier device for a given memory tier.
> > >
> Might need to define the case where there's no memory tier device for a
> specific tier number. For example, we can return NULL or an error code
> when an invalid tier number is passed (e.g., -1 for CPU-only nodes).

Sure.

> > > * int node_tier_dev_map[MAX_NUMNODES]
> > >
> > >   Map a node to its tier device ID.
> > >
> > >   For each CPU-only node c, node_tier_dev_map[c] = -1.
> > >
> > >
> > > Memory Tier Initialization
> > > ==========================
> > >
> > > By default, all memory nodes are assigned to the default tier
> > > (MEMORY_DEFAULT_TIER_DEVICE).  The default tier device has a rank
> > > value in the middle of the possible rank value range (e.g. 127 if
> > > the range is [0..255]).
> > >
> > > A device driver can move up or down its memory nodes from the
> > > default tier.  For example, PMEM can move down its memory nodes
> > > below the default tier, whereas GPU can move up its memory nodes
> > > above the default tier.
> > >
> Is "up/down" here still relative after the rank addition?

Good point. I think we should reverse the definition of rank: a higher
rank value means a higher tier, to avoid this kind of confusion.

> > > The kernel initialization code makes the decision on which exact
> > > tier a memory node should be assigned to based on the requests from
> > > the device drivers as well as the memory device hardware information
> > > provided by the firmware.
> > >
> > >
> > > Memory Tier Reassignment
> > > ========================
> > >
> > > After a memory node is hot-removed, it can be hot-added back to a
> > > different memory tier.  This is useful for supporting dynamically
> > > provisioned CXL.mem NUMA nodes, which may connect to different
> > > memory devices across hot-plug events.  Such tier changes should
> > > be compatible with tier-based memory accounting.
> > >
> > > The userspace may also reassign an existing online memory node to a
> > > different tier.  However, this should only be allowed when no pages
> > > are allocated from the memory node or when there are no non-root
> > > memory cgroups (e.g. during the system boot).  This restriction is
> > > important for keeping memory tier hierarchy stable enough for
> > > tier-based memory cgroup accounting.
> >
> > One way to do this is hot-remove all memory of a node, change its
> > memtier, then hot-add its memory.
> >
> > Best Regards,
> > Huang, Ying
> >
> > > Hot-adding/removing CPUs doesn't affect memory tier hierarchy.
> > >
> > >
> > > Memory Allocation for Demotion
> > > ==============================
> > >
> > > To allocate a new page as the demotion target for a page, the kernel
> > > calls the allocation function (__alloc_pages_nodemask) with the
> > > source page node as the preferred node and the union of all lower
> > > tier nodes as the allowed nodemask.  The actual target node
> > > selection then follows the allocation fallback order that the
> > > kernel has already defined.
> > >
> > > The pseudo code looks like:
> > >
> > >     targets = NODE_MASK_NONE;
> > >     src_nid = page_to_nid(page);
> > >     src_tier = memtier_devices[node_tier_dev_map[src_nid]].tier;
> > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > >             nodes_or(targets, targets, memory_tier(i)->nodelist);
> > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > >
> > > The mempolicy of cpuset, vma and owner task of the source page can
> > > be set to refine the demotion target nodemask, e.g. to prevent
> > > demotion or select a particular allowed node as the demotion target.
> > >
> > >
> > > Memory Allocation for Promotion
> > > ===============================
> > >
> > > The page allocation for promotion is similar to demotion, except
> > > that (1) the target nodemask uses the promotion tiers, (2) the
> > > preferred node can be the accessing CPU node, not the source page
> > > node.
> > >
> > >
> > > Examples
> > > ========
> > >
> > > * Example 1:
> > >
> > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > >
> > >                   20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >        |        \   /       |
> > >        | 30    40 X 40      | 30
> > >        |        /   \       |
> > >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> > >                   40
> > >
> > > node distances:
> > > node   0    1    2    3
> > >    0  10   20   30   40
> > >    1  20   10   40   30
> > >    2  30   40   10   40
> > >    3  40   30   40   10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > > /sys/devices/system/memtier/memtier2/nodelist:2-3
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:1
> > > /sys/devices/system/node/node2/memtier:2
> > > /sys/devices/system/node/node3/memtier:2
> > >
> > > Demotion fallback order:
> > > node 0: 2, 3
> > > node 1: 3, 2
> > > node 2: empty
> > > node 3: empty
> > >
> > > To prevent cross-socket demotion and memory access, the user can set
> > > mempolicy, e.g. cpuset.mems=0,2.
> > >
> > >
> > > * Example 2:
> > >
> > > Node 0 & 1 are DRAM nodes.
> > > Node 2 is a PMEM node and closer to node 0.
> > >
> > >                   20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >        |            /
> > >        | 30       / 40
> > >        |        /
> > >   Node 2 (PMEM)
> > >
> > > node distances:
> > > node   0    1    2
> > >    0  10   20   30
> > >    1  20   10   40
> > >    2  30   40   10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier1/nodelist:0-1
> > > /sys/devices/system/memtier/memtier2/nodelist:2
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:1
> > > /sys/devices/system/node/node2/memtier:2
> > >
> > > Demotion fallback order:
> > > node 0: 2
> > > node 1: 2
> > > node 2: empty
> > >
> > >
> > > * Example 3:
> > >
> > > Node 0 & 1 are DRAM nodes, Node 2 is a memory-only DRAM node.
> > >
> np: PMEM instead of memory-only DRAM?
>
> > > All nodes are in the same tier.
> > >
> > >                   20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >          \                 /
> > >           \ 30            / 30
> > >            \             /
> > >              Node 2 (PMEM)
> > >
> > > node distances:
> > > node   0    1    2
> > >    0  10   20   30
> > >    1  20   10   30
> > >    2  30   30   10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier1/rank:128
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier1/nodelist:0-2
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:1
> > > /sys/devices/system/node/node2/memtier:1
> > >
> > > Demotion fallback order:
> > > node 0: empty
> > > node 1: empty
> > > node 2: empty
> > >
> > >
> > > * Example 4:
> > >
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a PMEM node.
> > > Node 2 is a GPU node.
> > >
> > >                   50
> > >   Node 0 (DRAM)  ----  Node 2 (GPU)
> > >          \                 /
> > >           \ 30            / 60
> > >            \             /
> > >              Node 1 (PMEM)
> > >
> > > node distances:
> > > node   0    1    2
> > >    0  10   30   50
> > >    1  30   10   60
> > >    2  50   60   10
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier0/rank:64
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier0/nodelist:2
> > > /sys/devices/system/memtier/memtier1/nodelist:0
> > > /sys/devices/system/memtier/memtier2/nodelist:1
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:2
> > > /sys/devices/system/node/node2/memtier:0
> > >
> > > Demotion fallback order:
> > > node 0: 1
> > > node 1: empty
> > > node 2: 0, 1
> > >
> > >
> > > * Example 5:
> > >
> > > Node 0 is a DRAM node with CPU.
> > > Node 1 is a GPU node.
> > > Node 2 is a PMEM node.
> > > Node 3 is a large, slow DRAM node without CPU.
> > >
> > >                     100
> > >      Node 0 (DRAM)  ----  Node 1 (GPU)
> > >     /     |               /    |
> > >    /40    |30        120 /     | 110
> > >   |       |             /      |
> > >   |  Node 2 (PMEM) ----       /
> > >   |        \                 /
> > >    \     80 \               /
> > >     ------- Node 3 (Slow DRAM)
> > >
> > > node distances:
> > > node    0    1    2    3
> > >    0   10  100   30   40
> > >    1  100   10  120  110
> > >    2   30  120   10   80
> > >    3   40  110   80   10
> > >
> > > MAX_MEMORY_TIERS=4 (memtier3 is a memory tier added later).
> > >
> > > $ cat /sys/devices/system/memtier/possible
> > > 0(64), 1(128), 3(160), 2(192)
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/rank
> > > /sys/devices/system/memtier/memtier0/rank:64
> > > /sys/devices/system/memtier/memtier1/rank:128
> > > /sys/devices/system/memtier/memtier2/rank:192
> > > /sys/devices/system/memtier/memtier3/rank:160
> > >
> > > $ grep '' /sys/devices/system/memtier/memtier*/nodelist
> > > /sys/devices/system/memtier/memtier0/nodelist:1
> > > /sys/devices/system/memtier/memtier1/nodelist:0
> > > /sys/devices/system/memtier/memtier2/nodelist:2
> > > /sys/devices/system/memtier/memtier3/nodelist:3
> > >
> > > $ grep '' /sys/devices/system/node/node*/memtier
> > > /sys/devices/system/node/node0/memtier:1
> > > /sys/devices/system/node/node1/memtier:0
> > > /sys/devices/system/node/node2/memtier:2
> > > /sys/devices/system/node/node3/memtier:3
> > >
> > > Demotion fallback order:
> > > node 0: 2, 3
> > > node 1: 0, 3, 2
> > > node 2: empty
> > > node 3: 2
> >
> >
>
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-27 13:40 ` RFC: Memory Tiering Kernel Interfaces (v3) Aneesh Kumar K V
@ 2022-05-27 16:30   ` Wei Xu
  2022-05-29  4:31     ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Wei Xu @ 2022-05-27 16:30 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Huang Ying, Andrew Morton, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Linux MM, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, May 27, 2022 at 6:41 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 5/27/22 2:52 AM, Wei Xu wrote:
>
> >    The order of memory tiers is determined by their rank values, not by
> >    their memtier device names.
> >
> >    - /sys/devices/system/memtier/possible
> >
> >      Format: ordered list of "memtier(rank)"
> >      Example: 0(64), 1(128), 2(192)
> >
> >      Read-only.  When read, list all available memory tiers and their
> >      associated ranks, ordered by the rank values (from the highest
> >       tier to the lowest tier).
> >
>
> Did we discuss the need for this? I haven't done this in the patch
> series I sent across.

The "possible" file is only needed if we decide to hide the
directories of memtiers that have no nodes.  We can remove this
interface and always show all memtier directories to keep things
simpler.

> We do have
> /sys/devices/system/memtier/default_rank which should allow user to
> identify the default rank to which memory would get added via hotplug if
> the NUMA node is not part of any memory tier.

Sounds good to me to have it.

>
> -aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-27 16:30   ` Wei Xu
@ 2022-05-29  4:31     ` Ying Huang
  2022-05-30 12:50       ` Jonathan Cameron
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-05-29  4:31 UTC (permalink / raw)
  To: Wei Xu, Aneesh Kumar K V
  Cc: Andrew Morton, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Linux MM, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-05-27 at 09:30 -0700, Wei Xu wrote:
> On Fri, May 27, 2022 at 6:41 AM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
> > 
> > On 5/27/22 2:52 AM, Wei Xu wrote:
> > 
> > >    The order of memory tiers is determined by their rank values, not by
> > >    their memtier device names.
> > > 
> > >    - /sys/devices/system/memtier/possible
> > > 
> > >      Format: ordered list of "memtier(rank)"
> > >      Example: 0(64), 1(128), 2(192)
> > > 
> > >      Read-only.  When read, list all available memory tiers and their
> > >      associated ranks, ordered by the rank values (from the highest
> > >       tier to the lowest tier).
> > > 
> > 
> > Did we discuss the need for this? I haven't done this in the patch
> > series I sent across.
> 
> The "possible" file is only needed if we decide to hide the
> directories of memtiers that have no nodes.  We can remove this
> interface and always show all memtier directories to keep things
> simpler.

When discussed offline, Tim Chen pointed out that with the proposed
interface, it's inconvenient to know the position of a given memory tier
among all memory tiers.  We must sort the "rank" of all memory tiers to
know that.  The "possible" file can be used for that.  Although the
"possible" file can be generated with a shell script, it's more
convenient to show it directly.

Another way to address the issue is to add memtierN/pos for each memory
tier as suggested by Tim.  It's read-only and will show the position of
"memtierN" among all memory tiers.  It's even better to show the
position relative to the default memory tier (DRAM with CPU).  That is,
the position of the DRAM memory tier is 0.

Unlike the memory tier device ID or rank, the position is relative and
dynamic.
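The relative-position idea can be sketched as follows (a hypothetical
illustration, not kernel code: tier IDs and ranks are taken from
Example 5 in this thread, where memtier1 holds the DRAM-with-CPU node):

```python
def tier_positions(ranks, default_tier):
    """Position of each memory tier relative to the default (DRAM) tier.

    ranks: dict mapping memtier ID -> rank; a lower rank means a
    higher tier.  Positions are dynamic: they shift whenever a tier
    is added or removed, unlike the stable device ID or rank.
    """
    # Order tiers from highest (lowest rank) to lowest (highest rank).
    ordered = sorted(ranks, key=lambda t: ranks[t])
    base = ordered.index(default_tier)
    # The default tier gets position 0; higher tiers are negative.
    return {t: i - base for i, t in enumerate(ordered)}

# Ranks from Example 5: memtier0=64 (GPU), memtier1=128 (DRAM w/ CPU),
# memtier3=160 (slow DRAM), memtier2=192 (PMEM).
pos = tier_positions({0: 64, 1: 128, 2: 192, 3: 160}, default_tier=1)
# pos == {0: -1, 1: 0, 3: 1, 2: 2}
```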

Best Regards,
Huang, Ying



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [mm/demotion]  8ebccd60c2: BUG:sleeping_function_called_from_invalid_context_at_mm/compaction.c
  2022-05-27 12:25   ` [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
@ 2022-05-30  3:35     ` kernel test robot
  0 siblings, 0 replies; 66+ messages in thread
From: kernel test robot @ 2022-05-30  3:35 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: 0day robot, Aneesh Kumar K.V, LKML, linux-mm, lkp, akpm,
	Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

[-- Attachment #1: Type: text/plain, Size: 5344 bytes --]



Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 8ebccd60c2db6beefef2f39b05a95024be0c39eb ("[RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers")
url: https://github.com/intel-lab-lkp/linux/commits/Aneesh-Kumar-K-V/mm-demotion-Add-support-for-explicit-memory-tiers/20220527-212536
base: https://git.kernel.org/cgit/linux/kernel/git/gregkh/driver-core.git b232b02bf3c205b13a26dcec08e53baddd8e59ed
patch link: https://lore.kernel.org/linux-mm/20220527122528.129445-4-aneesh.kumar@linux.ibm.com

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>


[    2.576581][    T1] debug_vm_pgtable: [debug_vm_pgtable         ]: Validating architecture page table helpers
[    2.584367][    T1] BUG: sleeping function called from invalid context at mm/compaction.c:540
[    2.585275][    T1] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0
[    2.586166][    T1] preempt_count: 1, expected: 0
[    2.586668][    T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc5-00059-g8ebccd60c2db #1
[    2.587562][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[    2.588577][    T1] Call Trace:
[    2.588948][    T1]  <TASK>
[    2.589284][    T1]  dump_stack_lvl+0x34/0x44
[    2.589765][    T1]  __might_resched+0x134/0x149
[    2.590253][    T1]  isolate_freepages_block+0xe6/0x2d3
[    2.590794][    T1]  isolate_freepages_range+0xc5/0x118
[    2.591342][    T1]  alloc_contig_range+0x2dd/0x350
[    2.591858][    T1]  ? alloc_contig_pages+0x170/0x194
[    2.592384][    T1]  alloc_contig_pages+0x170/0x194
[    2.592896][    T1]  init_args+0x3d0/0x44e
[    2.593345][    T1]  ? init_args+0x44e/0x44e
[    2.593816][    T1]  debug_vm_pgtable+0x46/0x809
[    2.594312][    T1]  ? alloc_inode+0x37/0x8e
[    2.594774][    T1]  ? init_args+0x44e/0x44e
[    2.595235][    T1]  do_one_initcall+0x83/0x187
[    2.595729][    T1]  do_initcalls+0xc6/0xdf
[    2.596190][    T1]  kernel_init_freeable+0x10d/0x13c
[    2.596721][    T1]  ? rest_init+0xcd/0xcd
[    2.597170][    T1]  kernel_init+0x16/0x11a
[    2.597636][    T1]  ret_from_fork+0x22/0x30
[    2.598097][    T1]  </TASK>
[    2.626547][    T1] ------------[ cut here ]------------
[    2.627157][    T1] initcall debug_vm_pgtable+0x0/0x809 returned with preemption imbalance
[    2.628019][    T1] WARNING: CPU: 0 PID: 1 at init/main.c:1311 do_one_initcall+0x140/0x187
[    2.628863][    T1] Modules linked in:
[    2.629280][    T1] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W         5.18.0-rc5-00059-g8ebccd60c2db #1
[    2.630295][    T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[    2.631306][    T1] RIP: 0010:do_one_initcall+0x140/0x187
[    2.631867][    T1] Code: 00 00 48 c7 c6 ca b6 2c 82 48 89 e7 e8 80 ca 44 00 fb 80 3c 24 00 74 14 48 89 e2 48 89 ee 48 c7 c7 df b6 2c 82 e8 b3 d6 a2 00 <0f> 0b 48 8b 44 24 40 65 48 2b 04 25 28 00 00 00 74 05 e8 d8 cd a4
[    2.633713][    T1] RSP: 0000:ffffc90000013ea8 EFLAGS: 00010286
[    2.634312][    T1] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000003
[    2.635123][    T1] RDX: 0000000000000216 RSI: 0000000000000001 RDI: 0000000000000001
[    2.635932][    T1] RBP: ffffffff82f3b694 R08: 0000000000000000 R09: 0000000000000019
[    2.636735][    T1] R10: 0000000000000000 R11: 0000000074696e69 R12: 0000000000000000
[    2.637538][    T1] R13: ffff88810cba0000 R14: 0000000000000000 R15: 0000000000000000
[    2.638353][    T1] FS:  0000000000000000(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
[    2.639253][    T1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.639901][    T1] CR2: ffff88843ffff000 CR3: 0000000002612000 CR4: 00000000000406f0
[    2.640711][    T1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    2.641526][    T1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    2.642341][    T1] Call Trace:
[    2.642707][    T1]  <TASK>
[    2.643051][    T1]  do_initcalls+0xc6/0xdf
[    2.643512][    T1]  kernel_init_freeable+0x10d/0x13c
[    2.644045][    T1]  ? rest_init+0xcd/0xcd
[    2.644498][    T1]  kernel_init+0x16/0x11a
[    2.644956][    T1]  ret_from_fork+0x22/0x30
[    2.645417][    T1]  </TASK>
[    2.645764][    T1] ---[ end trace 0000000000000000 ]---



To reproduce:

        # build kernel
	cd linux
	cp config-5.18.0-rc5-00059-g8ebccd60c2db .config
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
	make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
	cd <mod-install-dir>
	find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.



-- 
0-DAY CI Kernel Test Service
https://01.org/lkp



[-- Attachment #2: config-5.18.0-rc5-00059-g8ebccd60c2db --]
[-- Type: text/plain, Size: 123273 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 5.18.0-rc5 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc-11 (Debian 11.3.0-1) 11.3.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=110300
CONFIG_CLANG_VERSION=0
CONFIG_AS_IS_GNU=y
CONFIG_AS_VERSION=23800
CONFIG_LD_IS_BFD=y
CONFIG_LD_VERSION=23800
CONFIG_LLD_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_GOTO_OUTPUT=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_CC_HAS_NO_PROFILE_FN_ATTR=y
CONFIG_PAHOLE_VERSION=123
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
# CONFIG_WERROR is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
CONFIG_CLOCKSOURCE_WATCHDOG_MAX_SKEW_US=100
# end of Timers subsystem

CONFIG_BPF=y
CONFIG_HAVE_EBPF_JIT=y
CONFIG_ARCH_WANT_DEFAULT_BPF_JIT=y

#
# BPF subsystem
#
CONFIG_BPF_SYSCALL=y
# CONFIG_BPF_JIT is not set
CONFIG_BPF_UNPRIV_DEFAULT_OFF=y
# CONFIG_BPF_PRELOAD is not set
# end of BPF subsystem

CONFIG_PREEMPT_VOLUNTARY_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
# CONFIG_PREEMPT_DYNAMIC is not set
# CONFIG_SCHED_CORE is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=20
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
# CONFIG_PRINTK_INDEX is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_CC_IMPLICIT_FALLTHROUGH="-Wimplicit-fallthrough=5"
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_CGROUP_PERF is not set
CONFIG_CGROUP_BPF=y
# CONFIG_CGROUP_MISC is not set
CONFIG_CGROUP_DEBUG=y
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_NAMESPACES is not set
CONFIG_CHECKPOINT_RESTORE=y
# CONFIG_SCHED_AUTOGROUP is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
# CONFIG_RD_BZIP2 is not set
# CONFIG_RD_LZMA is not set
# CONFIG_RD_XZ is not set
# CONFIG_RD_LZO is not set
# CONFIG_RD_LZ4 is not set
CONFIG_RD_ZSTD=y
# CONFIG_BOOT_CONFIG is not set
# CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_LD_ORPHAN_WARN=y
CONFIG_SYSCTL=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_IO_URING=y
CONFIG_ADVISE_SYSCALLS=y
CONFIG_HAVE_ARCH_USERFAULTFD_WP=y
CONFIG_HAVE_ARCH_USERFAULTFD_MINOR=y
CONFIG_MEMBARRIER=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_USERFAULTFD=y
CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE=y
CONFIG_KCMP=y
CONFIG_RSEQ=y
# CONFIG_DEBUG_RSEQ is not set
CONFIG_EMBEDDED=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_GUEST_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
# end of Kernel Performance Events And Counters

CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
CONFIG_COMPAT_BRK=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
# CONFIG_SHUFFLE_PAGE_ALLOCATOR is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SYSTEM_DATA_VERIFICATION=y
# CONFIG_PROFILING is not set
CONFIG_TRACEPOINTS=y
# end of General setup

CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_NR_GPIO=1024
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=5
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y

#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_FEATURE_NAMES=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
CONFIG_RETPOLINE=y
CONFIG_CC_HAS_SLS=y
# CONFIG_SLS is not set
# CONFIG_X86_CPU_RESCTRL is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
CONFIG_IOSF_MBI=y
# CONFIG_IOSF_MBI_DEBUG is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
# CONFIG_SCHED_OMIT_FRAME_POINTER is not set
CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
# CONFIG_PARAVIRT_DEBUG is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
CONFIG_X86_HV_CALLBACK_VECTOR=y
# CONFIG_XEN is not set
CONFIG_KVM_GUEST=y
CONFIG_ARCH_CPUIDLE_HALTPOLL=y
# CONFIG_PVH is not set
# CONFIG_PARAVIRT_TIME_ACCOUNTING is not set
CONFIG_PARAVIRT_CLOCK=y
# CONFIG_JAILHOUSE_GUEST is not set
# CONFIG_ACRN_GUEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_IA32_FEAT_CTL=y
CONFIG_X86_VMX_FEATURE_NAMES=y
CONFIG_PROCESSOR_SELECT=y
CONFIG_CPU_SUP_INTEL=y
# CONFIG_CPU_SUP_AMD is not set
# CONFIG_CPU_SUP_HYGON is not set
# CONFIG_CPU_SUP_CENTAUR is not set
# CONFIG_CPU_SUP_ZHAOXIN is not set
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS_RANGE_BEGIN=2
CONFIG_NR_CPUS_RANGE_END=512
CONFIG_NR_CPUS_DEFAULT=64
CONFIG_NR_CPUS=512
CONFIG_SCHED_CLUSTER=y
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCELOG_LEGACY=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_THRESHOLD=y
CONFIG_X86_MCE_INJECT=m

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
# CONFIG_PERF_EVENTS_INTEL_RAPL is not set
# CONFIG_PERF_EVENTS_INTEL_CSTATE is not set
# end of Performance monitoring

CONFIG_X86_VSYSCALL_EMULATION=y
CONFIG_X86_IOPL_IOPERM=y
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
# CONFIG_MICROCODE_AMD is not set
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
CONFIG_X86_5LEVEL=y
CONFIG_X86_DIRECT_GBPAGES=y
# CONFIG_X86_CPA_STATISTICS is not set
CONFIG_NUMA=y
# CONFIG_AMD_NUMA is not set
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
# CONFIG_ARCH_MEMORY_PROBE is not set
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_X86_PMEM_LEGACY_DEVICE=y
CONFIG_X86_PMEM_LEGACY=m
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=0
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_X86_UMIP=y
CONFIG_CC_HAS_IBT=y
# CONFIG_X86_KERNEL_IBT is not set
# CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is not set
CONFIG_X86_INTEL_TSX_MODE_OFF=y
# CONFIG_X86_INTEL_TSX_MODE_ON is not set
# CONFIG_X86_INTEL_TSX_MODE_AUTO is not set
# CONFIG_X86_SGX is not set
CONFIG_EFI=y
CONFIG_EFI_STUB=y
CONFIG_EFI_MIXED=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_KEXEC_FILE=y
CONFIG_ARCH_HAS_KEXEC_PURGATORY=y
# CONFIG_KEXEC_SIG is not set
# CONFIG_CRASH_DUMP is not set
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_DYNAMIC_MEMORY_LAYOUT=y
CONFIG_HOTPLUG_CPU=y
CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
CONFIG_COMPAT_VDSO=y
CONFIG_LEGACY_VSYSCALL_EMULATE=y
# CONFIG_LEGACY_VSYSCALL_XONLY is not set
# CONFIG_LEGACY_VSYSCALL_NONE is not set
# CONFIG_CMDLINE_BOOL is not set
# CONFIG_MODIFY_LDT_SYSCALL is not set
# CONFIG_STRICT_SIGALTSTACK_SIZE is not set
CONFIG_HAVE_LIVEPATCH=y
# end of Processor type and features

CONFIG_ARCH_HAS_ADD_PAGES=y
CONFIG_ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_SUSPEND_SKIP_SYNC=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_HIBERNATION_SNAPSHOT_DEV=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
# CONFIG_PM_ADVANCED_DEBUG is not set
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
# CONFIG_DPM_WATCHDOG is not set
# CONFIG_PM_TRACE_RTC is not set
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
# CONFIG_ENERGY_MODEL is not set
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SPCR_TABLE=y
# CONFIG_ACPI_FPDT is not set
CONFIG_ACPI_LPIT=y
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_REV_OVERRIDE_POSSIBLE is not set
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_TAD is not set
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
# CONFIG_ACPI_IPMI is not set
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_CUSTOM_DSDT_FILE=""
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_TABLE_UPGRADE is not set
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_HOTPLUG_MEMORY is not set
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
CONFIG_ACPI_NFIT=m
# CONFIG_NFIT_SECURITY_DEBUG is not set
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_HMAT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
CONFIG_ACPI_APEI_EINJ=m
CONFIG_ACPI_APEI_ERST_DEBUG=y
# CONFIG_ACPI_DPTF is not set
# CONFIG_ACPI_CONFIGFS is not set
# CONFIG_ACPI_PFRUT is not set
CONFIG_ACPI_PCC=y
# CONFIG_PMIC_OPREGION is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_PRMT=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
CONFIG_CPU_FREQ_GOV_SCHEDUTIL=y

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
# CONFIG_X86_PCC_CPUFREQ is not set
# CONFIG_X86_AMD_PSTATE is not set
# CONFIG_X86_ACPI_CPUFREQ is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# end of CPU Frequency scaling

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_CPU_IDLE_GOV_TEO is not set
# CONFIG_CPU_IDLE_GOV_HALTPOLL is not set
CONFIG_HALTPOLL_CPUIDLE=y
# end of CPU Idle

# CONFIG_INTEL_IDLE is not set
# end of Power management and ACPI options

#
# Bus options (PCI etc.)
#
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_MMCONF_FAM10H=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
# CONFIG_ISA_BUS is not set
CONFIG_ISA_DMA_API=y
# end of Bus options (PCI etc.)

#
# Binary Emulations
#
CONFIG_IA32_EMULATION=y
# CONFIG_X86_X32_ABI is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
# end of Binary Emulations

CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_PFNCACHE=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_DIRTY_RING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_HAVE_KVM_NO_POLL=y
CONFIG_KVM_XFER_TO_GUEST_WORK=y
CONFIG_HAVE_KVM_PM_NOTIFIER=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
# CONFIG_KVM_WERROR is not set
CONFIG_KVM_INTEL=m
# CONFIG_KVM_AMD is not set
# CONFIG_KVM_XEN is not set
CONFIG_AS_AVX512=y
CONFIG_AS_SHA1_NI=y
CONFIG_AS_SHA256_NI=y
CONFIG_AS_TPAUSE=y

#
# General architecture-dependent options
#
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
CONFIG_HOTPLUG_SMT=y
CONFIG_GENERIC_ENTRY=y
CONFIG_KPROBES=y
# CONFIG_JUMP_LABEL is not set
# CONFIG_STATIC_CALL_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_KRETPROBE_ON_RETHOOK=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_ARCH_CORRECT_STACKTRACE_ON_KRETPROBE=y
CONFIG_HAVE_FUNCTION_ERROR_INJECTION=y
CONFIG_HAVE_NMI=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_HAS_SET_DIRECT_MAP=y
CONFIG_HAVE_ARCH_THREAD_STRUCT_WHITELIST=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_ARCH_WANTS_NO_INSTR=y
CONFIG_HAVE_ASM_MODVERSIONS=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_RSEQ=y
CONFIG_HAVE_FUNCTION_ARG_ACCESS_API=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
CONFIG_MMU_GATHER_TABLE_FREE=y
CONFIG_MMU_GATHER_RCU_TABLE_FREE=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y
# CONFIG_SECCOMP_CACHE_DEBUG is not set
CONFIG_HAVE_ARCH_STACKLEAK=y
CONFIG_HAVE_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG=y
CONFIG_ARCH_SUPPORTS_LTO_CLANG_THIN=y
CONFIG_LTO_NONE=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_MOVE_PUD=y
CONFIG_HAVE_MOVE_PMD=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_PAGE_SIZE_LESS_THAN_64KB=y
CONFIG_PAGE_SIZE_LESS_THAN_256KB=y
CONFIG_HAVE_STACK_VALIDATION=y
CONFIG_HAVE_RELIABLE_STACKTRACE=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
CONFIG_COMPAT_32BIT_TIME=y
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
CONFIG_HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET=y
CONFIG_RANDOMIZE_KSTACK_OFFSET=y
# CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS=y
CONFIG_ARCH_USE_MEMREMAP_PROT=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
CONFIG_HAVE_STATIC_CALL=y
CONFIG_HAVE_STATIC_CALL_INLINE=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_ARCH_WANT_LD_ORPHAN_WARN=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_PAGE_TABLE_CHECK=y
CONFIG_ARCH_HAS_ELFCORE_COMPAT=y
CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH=y
CONFIG_DYNAMIC_SIGFRAME=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# end of GCOV-based kernel profiling

CONFIG_HAVE_GCC_PLUGINS=y
# CONFIG_GCC_PLUGINS is not set
# end of General architecture-dependent options

CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_FORCE_LOAD=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
CONFIG_MODULE_COMPRESS_NONE=y
# CONFIG_MODULE_COMPRESS_GZIP is not set
# CONFIG_MODULE_COMPRESS_XZ is not set
# CONFIG_MODULE_COMPRESS_ZSTD is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
CONFIG_MODPROBE_PATH="/sbin/modprobe"
# CONFIG_TRIM_UNUSED_KSYMS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLOCK_LEGACY_AUTOLOAD=y
CONFIG_BLK_DEV_BSG_COMMON=y
CONFIG_BLK_DEV_BSGLIB=y
# CONFIG_BLK_DEV_INTEGRITY is not set
# CONFIG_BLK_DEV_ZONED is not set
# CONFIG_BLK_DEV_THROTTLING is not set
# CONFIG_BLK_WBT is not set
# CONFIG_BLK_CGROUP_IOLATENCY is not set
# CONFIG_BLK_CGROUP_IOCOST is not set
# CONFIG_BLK_CGROUP_IOPRIO is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set
# CONFIG_BLK_INLINE_ENCRYPTION is not set

#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_EFI_PARTITION=y
# end of Partition Types

CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_PM=y
CONFIG_BLOCK_HOLDER_DEPRECATED=y
CONFIG_BLK_MQ_STACKING=y

#
# IO Schedulers
#
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
# end of IO Schedulers

CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_ASN1=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_BINFMT_SCRIPT=y
# CONFIG_BINFMT_MISC is not set
CONFIG_COREDUMP=y
# end of Executable file formats

#
# Memory Management options
#
CONFIG_SPARSEMEM=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_FAST_GUP=y
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_ISOLATION=y
CONFIG_HAVE_BOOTMEM_INFO_NODE=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG=y
# CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_MHP_MEMMAP_ON_MEMORY=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
# CONFIG_PAGE_REPORTING is not set
CONFIG_MIGRATION=y
CONFIG_DEVICE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_TIERED_MEMORY=y
CONFIG_CONTIG_ALLOC=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
# CONFIG_KSM is not set
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_HWPOISON_INJECT=m
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_THP_SWAP=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_CMA=y
# CONFIG_CMA_DEBUG is not set
# CONFIG_CMA_DEBUGFS is not set
# CONFIG_CMA_SYSFS is not set
CONFIG_CMA_AREAS=7
# CONFIG_MEM_SOFT_DIRTY is not set
# CONFIG_ZSWAP is not set
# CONFIG_ZPOOL is not set
# CONFIG_ZSMALLOC is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
# CONFIG_DEFERRED_STRUCT_PAGE_INIT is not set
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CURRENT_STACK_POINTER=y
CONFIG_ARCH_HAS_FILTER_PGPROT=y
CONFIG_ARCH_HAS_PTE_DEVMAP=y
CONFIG_ARCH_HAS_ZONE_DMA_SET=y
CONFIG_ZONE_DMA=y
CONFIG_ZONE_DMA32=y
CONFIG_ZONE_DEVICE=y
# CONFIG_DEVICE_PRIVATE is not set
# CONFIG_PERCPU_STATS is not set
# CONFIG_GUP_TEST is not set
# CONFIG_READ_ONLY_THP_FOR_FS is not set
CONFIG_ARCH_HAS_PTE_SPECIAL=y
# CONFIG_ANON_VMA_NAME is not set

#
# Data Access Monitoring
#
# CONFIG_DAMON is not set
# end of Data Access Monitoring
# end of Memory Management options

CONFIG_NET=y
CONFIG_COMPAT_NETLINK_MESSAGES=y
CONFIG_SKB_EXTENSIONS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
# CONFIG_NET_KEY is not set
# CONFIG_XDP_SOCKETS is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
CONFIG_NET_IP_TUNNEL=m
# CONFIG_IP_MROUTE is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_NET_IPVTI is not set
CONFIG_NET_UDP_TUNNEL=m
CONFIG_NET_FOU=m
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_INET_UDP_DIAG is not set
# CONFIG_INET_RAW_DIAG is not set
# CONFIG_INET_DIAG_DESTROY is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_MPTCP is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
# CONFIG_NETFILTER is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
CONFIG_NET_L3_MASTER_DEV=y
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
CONFIG_PCPU_DEV_REFCNT=y
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_SOCK_RX_QUEUE_MAPPING=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_STREAM_PARSER=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_DROP_MONITOR is not set
# end of Network testing
# end of Networking options

# CONFIG_HAMRADIO is not set
CONFIG_CAN=m
CONFIG_CAN_RAW=m
CONFIG_CAN_BCM=m
CONFIG_CAN_GW=m
# CONFIG_CAN_J1939 is not set
# CONFIG_CAN_ISOTP is not set

#
# CAN Device Drivers
#
CONFIG_CAN_VCAN=m
# CONFIG_CAN_VXCAN is not set
# CONFIG_CAN_SLCAN is not set
CONFIG_CAN_DEV=m
CONFIG_CAN_CALC_BITTIMING=y
# CONFIG_CAN_KVASER_PCIEFD is not set
# CONFIG_CAN_C_CAN is not set
# CONFIG_CAN_CC770 is not set
# CONFIG_CAN_IFI_CANFD is not set
# CONFIG_CAN_M_CAN is not set
# CONFIG_CAN_PEAK_PCIEFD is not set
# CONFIG_CAN_SJA1000 is not set
# CONFIG_CAN_SOFTING is not set

#
# CAN USB interfaces
#
# CONFIG_CAN_8DEV_USB is not set
# CONFIG_CAN_EMS_USB is not set
# CONFIG_CAN_ESD_USB2 is not set
# CONFIG_CAN_ETAS_ES58X is not set
# CONFIG_CAN_GS_USB is not set
# CONFIG_CAN_KVASER_USB is not set
# CONFIG_CAN_MCBA_USB is not set
# CONFIG_CAN_PEAK_USB is not set
# CONFIG_CAN_UCAN is not set
# end of CAN USB interfaces

# CONFIG_CAN_DEBUG_DEVICES is not set
# end of CAN Device Drivers

# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
CONFIG_STREAM_PARSER=y
# CONFIG_MCTP is not set
CONFIG_WIRELESS=y
CONFIG_WEXT_CORE=y
CONFIG_WEXT_PROC=y
CONFIG_CFG80211=m
# CONFIG_NL80211_TESTMODE is not set
# CONFIG_CFG80211_DEVELOPER_WARNINGS is not set
# CONFIG_CFG80211_CERTIFICATION_ONUS is not set
CONFIG_CFG80211_REQUIRE_SIGNED_REGDB=y
CONFIG_CFG80211_USE_KERNEL_REGDB_KEYS=y
CONFIG_CFG80211_DEFAULT_PS=y
# CONFIG_CFG80211_DEBUGFS is not set
CONFIG_CFG80211_CRDA_SUPPORT=y
CONFIG_CFG80211_WEXT=y
CONFIG_MAC80211=m
CONFIG_MAC80211_HAS_RC=y
CONFIG_MAC80211_RC_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel_ht"
CONFIG_MAC80211_MESH=y
CONFIG_MAC80211_LEDS=y
CONFIG_MAC80211_DEBUGFS=y
# CONFIG_MAC80211_MESSAGE_TRACING is not set
# CONFIG_MAC80211_DEBUG_MENU is not set
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_RFKILL is not set
CONFIG_NET_9P=y
CONFIG_NET_9P_FD=y
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
CONFIG_LWTUNNEL=y
CONFIG_LWTUNNEL_BPF=y
CONFIG_DST_CACHE=y
CONFIG_GRO_CELLS=y
CONFIG_NET_SELFTESTS=y
CONFIG_NET_SOCK_MSG=y
CONFIG_NET_DEVLINK=y
CONFIG_PAGE_POOL=y
# CONFIG_PAGE_POOL_STATS is not set
# CONFIG_FAILOVER is not set
CONFIG_ETHTOOL_NETLINK=y

#
# Device Drivers
#
CONFIG_HAVE_EISA=y
# CONFIG_EISA is not set
CONFIG_HAVE_PCI=y
CONFIG_PCI=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCIEAER_INJECT is not set
# CONFIG_PCIE_ECRC is not set
CONFIG_PCIEASPM=y
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
CONFIG_PCI_QUIRKS=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
# CONFIG_PCI_PF_STUB is not set
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_IOV=y
# CONFIG_PCI_PRI is not set
# CONFIG_PCI_PASID is not set
# CONFIG_PCI_P2PDMA is not set
CONFIG_PCI_LABEL=y
# CONFIG_PCIE_BUS_TUNE_OFF is not set
CONFIG_PCIE_BUS_DEFAULT=y
# CONFIG_PCIE_BUS_SAFE is not set
# CONFIG_PCIE_BUS_PERFORMANCE is not set
# CONFIG_PCIE_BUS_PEER2PEER is not set
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
# CONFIG_HOTPLUG_PCI is not set

#
# PCI controller drivers
#
# CONFIG_VMD is not set

#
# DesignWare PCI Core Support
#
# CONFIG_PCIE_DW_PLAT_HOST is not set
# CONFIG_PCI_MESON is not set
# end of DesignWare PCI Core Support

#
# Mobiveil PCIe Core Support
#
# end of Mobiveil PCIe Core Support

#
# Cadence PCIe controllers support
#
# end of Cadence PCIe controllers support
# end of PCI controller drivers

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set
# end of PCI Endpoint

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
# end of PCI switch controller drivers

# CONFIG_CXL_BUS is not set
# CONFIG_PCCARD is not set
# CONFIG_RAPIDIO is not set

#
# Generic Driver Options
#
CONFIG_AUXILIARY_BUS=y
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_DEVTMPFS=y
# CONFIG_DEVTMPFS_MOUNT is not set
# CONFIG_DEVTMPFS_SAFE is not set
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set

#
# Firmware loader
#
CONFIG_FW_LOADER=y
CONFIG_FW_LOADER_PAGED_BUF=y
CONFIG_FW_LOADER_SYSFS=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y
# CONFIG_FW_LOADER_COMPRESS is not set
CONFIG_FW_CACHE=y
# CONFIG_FW_UPLOAD is not set
# end of Firmware loader

CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_GENERIC_CPU_VULNERABILITIES=y
CONFIG_REGMAP=y
CONFIG_REGMAP_MMIO=y
CONFIG_DMA_SHARED_BUFFER=y
# CONFIG_DMA_FENCE_TRACE is not set
# end of Generic Driver Options

#
# Bus devices
#
# CONFIG_MHI_BUS is not set
# end of Bus devices

CONFIG_CONNECTOR=m

#
# Firmware Drivers
#

#
# ARM System Control and Management Interface Protocol
#
# end of ARM System Control and Management Interface Protocol

# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_FW_CFG_SYSFS is not set
CONFIG_SYSFB=y
# CONFIG_SYSFB_SIMPLEFB is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
# CONFIG_EFI_VARS is not set
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
CONFIG_EFI_GENERIC_STUB_INITRD_CMDLINE_LOADER=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
# CONFIG_EFI_RCI2_TABLE is not set
# CONFIG_EFI_DISABLE_PCI_DMA is not set
# end of EFI (Extensible Firmware Interface) Support

CONFIG_UEFI_CPER=y
CONFIG_UEFI_CPER_X86=y
CONFIG_EFI_EARLYCON=y
# CONFIG_EFI_CUSTOM_SSDT_OVERLAYS is not set

#
# Tegra firmware driver
#
# end of Tegra firmware driver
# end of Firmware Drivers

# CONFIG_GNSS is not set
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
CONFIG_CDROM=m
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_DRBD is not set
CONFIG_BLK_DEV_NBD=m
# CONFIG_BLK_DEV_SX8 is not set
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=65536
# CONFIG_CDROM_PKTCDVD is not set
CONFIG_ATA_OVER_ETH=y
# CONFIG_BLK_DEV_RBD is not set

#
# NVME Support
#
CONFIG_NVME_CORE=m
CONFIG_BLK_DEV_NVME=m
CONFIG_NVME_MULTIPATH=y
# CONFIG_NVME_VERBOSE_ERRORS is not set
# CONFIG_NVME_HWMON is not set
CONFIG_NVME_FABRICS=m
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TCP is not set
CONFIG_NVME_TARGET=m
# CONFIG_NVME_TARGET_PASSTHRU is not set
CONFIG_NVME_TARGET_LOOP=m
# CONFIG_NVME_TARGET_FC is not set
# CONFIG_NVME_TARGET_TCP is not set
# end of NVME Support

#
# Misc devices
#
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_SRAM is not set
# CONFIG_DW_XDATA_PCIE is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_XILINX_SDFEC is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
CONFIG_EEPROM_93CX6=y
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_EEPROM_EE1004 is not set
# end of EEPROM support

# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# end of Texas Instruments shared transport line discipline

# CONFIG_SENSORS_LIS3_I2C is not set
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_BCM_VK is not set
# CONFIG_MISC_ALCOR_PCI is not set
# CONFIG_MISC_RTSX_PCI is not set
# CONFIG_MISC_RTSX_USB is not set
# CONFIG_HABANA_AI is not set
# CONFIG_UACCE is not set
# CONFIG_PVPANIC is not set
# end of Misc devices

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=m
CONFIG_SCSI_COMMON=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_NETLINK=y
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=m
CONFIG_BLK_DEV_SR=m
CONFIG_CHR_DEV_SG=m
CONFIG_BLK_DEV_BSG=y
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=m
CONFIG_SCSI_FC_ATTRS=m
CONFIG_SCSI_ISCSI_ATTRS=m
CONFIG_SCSI_SAS_ATTRS=m
CONFIG_SCSI_SAS_LIBSAS=m
CONFIG_SCSI_SAS_ATA=y
CONFIG_SCSI_SAS_HOST_SMP=y
CONFIG_SCSI_SRP_ATTRS=m
# end of SCSI Transports

CONFIG_SCSI_LOWLEVEL=y
CONFIG_ISCSI_TCP=m
CONFIG_ISCSI_BOOT_SYSFS=m
CONFIG_SCSI_CXGB3_ISCSI=m
CONFIG_SCSI_CXGB4_ISCSI=m
CONFIG_SCSI_BNX2_ISCSI=m
CONFIG_SCSI_BNX2X_FCOE=m
CONFIG_BE2ISCSI=m
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
CONFIG_SCSI_HPSA=m
CONFIG_SCSI_3W_9XXX=m
CONFIG_SCSI_3W_SAS=m
# CONFIG_SCSI_ACARD is not set
CONFIG_SCSI_AACRAID=m
# CONFIG_SCSI_AIC7XXX is not set
CONFIG_SCSI_AIC79XX=m
CONFIG_AIC79XX_CMDS_PER_DEVICE=4
CONFIG_AIC79XX_RESET_DELAY_MS=15000
# CONFIG_AIC79XX_BUILD_FIRMWARE is not set
# CONFIG_AIC79XX_DEBUG_ENABLE is not set
CONFIG_AIC79XX_DEBUG_MASK=0
# CONFIG_AIC79XX_REG_PRETTY_PRINT is not set
# CONFIG_SCSI_AIC94XX is not set
CONFIG_SCSI_MVSAS=m
# CONFIG_SCSI_MVSAS_DEBUG is not set
CONFIG_SCSI_MVSAS_TASKLET=y
CONFIG_SCSI_MVUMI=m
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
CONFIG_SCSI_ARCMSR=m
# CONFIG_SCSI_ESAS2R is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
CONFIG_MEGARAID_SAS=m
CONFIG_SCSI_MPT3SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT3SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS=m
# CONFIG_SCSI_MPI3MR is not set
# CONFIG_SCSI_SMARTPQI is not set
CONFIG_SCSI_UFSHCD=m
CONFIG_SCSI_UFSHCD_PCI=m
# CONFIG_SCSI_UFS_DWC_TC_PCI is not set
# CONFIG_SCSI_UFSHCD_PLATFORM is not set
# CONFIG_SCSI_UFS_BSG is not set
# CONFIG_SCSI_UFS_HPB is not set
# CONFIG_SCSI_UFS_FAULT_INJECTION is not set
# CONFIG_SCSI_UFS_HWMON is not set
CONFIG_SCSI_HPTIOP=m
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_MYRB is not set
# CONFIG_SCSI_MYRS is not set
CONFIG_VMWARE_PVSCSI=m
CONFIG_LIBFC=m
CONFIG_LIBFCOE=m
CONFIG_FCOE=m
CONFIG_FCOE_FNIC=m
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_FDOMAIN_PCI is not set
CONFIG_SCSI_ISCI=m
# CONFIG_SCSI_IPS is not set
CONFIG_SCSI_INITIO=m
# CONFIG_SCSI_INIA100 is not set
CONFIG_SCSI_STEX=m
# CONFIG_SCSI_SYM53C8XX_2 is not set
CONFIG_SCSI_IPR=m
CONFIG_SCSI_IPR_TRACE=y
CONFIG_SCSI_IPR_DUMP=y
# CONFIG_SCSI_QLOGIC_1280 is not set
CONFIG_SCSI_QLA_FC=m
CONFIG_SCSI_QLA_ISCSI=m
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
CONFIG_SCSI_DEBUG=m
CONFIG_SCSI_PMCRAID=m
CONFIG_SCSI_PM8001=m
# CONFIG_SCSI_BFA_FC is not set
CONFIG_SCSI_CHELSIO_FCOE=m
CONFIG_SCSI_DH=y
CONFIG_SCSI_DH_RDAC=y
CONFIG_SCSI_DH_HP_SW=y
CONFIG_SCSI_DH_EMC=y
CONFIG_SCSI_DH_ALUA=y
# end of SCSI device support

CONFIG_ATA=y
CONFIG_SATA_HOST=y
CONFIG_PATA_TIMINGS=y
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_FORCE=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
CONFIG_SATA_MOBILE_LPM_POLICY=0
CONFIG_SATA_AHCI_PLATFORM=y
# CONFIG_SATA_INIC162X is not set
CONFIG_SATA_ACARD_AHCI=m
CONFIG_SATA_SIL24=m
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
CONFIG_PDC_ADMA=m
CONFIG_SATA_QSTOR=m
CONFIG_SATA_SX4=m
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
CONFIG_SATA_MV=m
CONFIG_SATA_NV=m
CONFIG_SATA_PROMISE=m
CONFIG_SATA_SIL=m
CONFIG_SATA_SIS=m
CONFIG_SATA_SVW=m
CONFIG_SATA_ULI=m
CONFIG_SATA_VIA=m
CONFIG_SATA_VITESSE=m

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=m
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PLATFORM is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
# CONFIG_PATA_ACPI is not set
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_MD_RAID10=m
CONFIG_MD_RAID456=m
CONFIG_MD_MULTIPATH=m
CONFIG_MD_FAULTY=m
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=m
# CONFIG_DM_DEBUG is not set
CONFIG_DM_BUFIO=m
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
CONFIG_DM_BIO_PRISON=m
CONFIG_DM_PERSISTENT_DATA=m
# CONFIG_DM_UNSTRIPED is not set
CONFIG_DM_CRYPT=m
CONFIG_DM_SNAPSHOT=m
CONFIG_DM_THIN_PROVISIONING=m
CONFIG_DM_CACHE=m
CONFIG_DM_CACHE_SMQ=m
# CONFIG_DM_WRITECACHE is not set
# CONFIG_DM_EBS is not set
# CONFIG_DM_ERA is not set
# CONFIG_DM_CLONE is not set
CONFIG_DM_MIRROR=m
CONFIG_DM_LOG_USERSPACE=m
CONFIG_DM_RAID=m
CONFIG_DM_ZERO=m
CONFIG_DM_MULTIPATH=m
CONFIG_DM_MULTIPATH_QL=m
CONFIG_DM_MULTIPATH_ST=m
# CONFIG_DM_MULTIPATH_HST is not set
# CONFIG_DM_MULTIPATH_IOA is not set
CONFIG_DM_DELAY=m
# CONFIG_DM_DUST is not set
# CONFIG_DM_UEVENT is not set
CONFIG_DM_FLAKEY=m
CONFIG_DM_VERITY=m
# CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG is not set
# CONFIG_DM_VERITY_FEC is not set
CONFIG_DM_SWITCH=m
CONFIG_DM_LOG_WRITES=m
# CONFIG_DM_INTEGRITY is not set
# CONFIG_DM_AUDIT is not set
# CONFIG_TARGET_CORE is not set
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# end of IEEE 1394 (FireWire) support

# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
CONFIG_MII=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
CONFIG_DUMMY=y
# CONFIG_WIREGUARD is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_IPVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_GENEVE is not set
# CONFIG_BAREUDP is not set
# CONFIG_GTP is not set
# CONFIG_AMT is not set
CONFIG_MACSEC=y
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
CONFIG_TUN=m
# CONFIG_TUN_VNET_CROSS_LE is not set
CONFIG_VETH=y
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set
CONFIG_ETHERNET=y
CONFIG_MDIO=y
# CONFIG_NET_VENDOR_3COM is not set
CONFIG_NET_VENDOR_ADAPTEC=y
CONFIG_ADAPTEC_STARFIRE=y
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
# CONFIG_NET_VENDOR_AMD is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ASIX=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
# CONFIG_CX_ECAT is not set
CONFIG_NET_VENDOR_BROADCOM=y
CONFIG_B44=y
CONFIG_B44_PCI_AUTOSELECT=y
CONFIG_B44_PCICORE_AUTOSELECT=y
CONFIG_B44_PCI=y
# CONFIG_BCMGENET is not set
CONFIG_BNX2=y
CONFIG_CNIC=y
# CONFIG_TIGON3 is not set
CONFIG_BNX2X=y
CONFIG_BNX2X_SRIOV=y
# CONFIG_SYSTEMPORT is not set
# CONFIG_BNXT is not set
CONFIG_NET_VENDOR_CADENCE=y
# CONFIG_MACB is not set
# CONFIG_NET_VENDOR_CAVIUM is not set
CONFIG_NET_VENDOR_CHELSIO=y
CONFIG_CHELSIO_T1=y
# CONFIG_CHELSIO_T1_1G is not set
CONFIG_CHELSIO_T3=y
CONFIG_CHELSIO_T4=y
# CONFIG_CHELSIO_T4VF is not set
CONFIG_CHELSIO_LIB=m
CONFIG_CHELSIO_INLINE_CRYPTO=y
CONFIG_NET_VENDOR_CISCO=y
CONFIG_ENIC=y
CONFIG_NET_VENDOR_CORTINA=y
CONFIG_NET_VENDOR_DAVICOM=y
CONFIG_DNET=y
CONFIG_NET_VENDOR_DEC=y
# CONFIG_NET_TULIP is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
CONFIG_SUNDANCE=y
CONFIG_SUNDANCE_MMIO=y
CONFIG_NET_VENDOR_EMULEX=y
CONFIG_BE2NET=y
# CONFIG_BE2NET_HWMON is not set
CONFIG_BE2NET_BE2=y
CONFIG_BE2NET_BE3=y
CONFIG_BE2NET_LANCER=y
CONFIG_BE2NET_SKYHAWK=y
CONFIG_NET_VENDOR_ENGLEDER=y
# CONFIG_TSNEP is not set
# CONFIG_NET_VENDOR_EZCHIP is not set
CONFIG_NET_VENDOR_FUNGIBLE=y
# CONFIG_FUN_ETH is not set
CONFIG_NET_VENDOR_GOOGLE=y
# CONFIG_GVE is not set
CONFIG_NET_VENDOR_HUAWEI=y
# CONFIG_HINIC is not set
CONFIG_NET_VENDOR_I825XX=y
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=y
CONFIG_E1000=y
CONFIG_E1000E=y
# CONFIG_E1000E_HWTS is not set
CONFIG_IGB=y
CONFIG_IGB_HWMON=y
CONFIG_IGBVF=y
# CONFIG_IXGB is not set
CONFIG_IXGBE=y
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBEVF=m
CONFIG_I40E=y
# CONFIG_I40EVF is not set
# CONFIG_ICE is not set
# CONFIG_FM10K is not set
CONFIG_IGC=y
CONFIG_JME=y
CONFIG_NET_VENDOR_LITEX=y
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX5_CORE is not set
# CONFIG_MLXSW_CORE is not set
# CONFIG_MLXFW is not set
CONFIG_NET_VENDOR_MICREL=y
CONFIG_KS8851_MLL=y
CONFIG_KSZ884X_PCI=y
CONFIG_NET_VENDOR_MICROCHIP=y
# CONFIG_LAN743X is not set
CONFIG_NET_VENDOR_MICROSEMI=y
CONFIG_NET_VENDOR_MICROSOFT=y
CONFIG_NET_VENDOR_MYRI=y
CONFIG_MYRI10GE=y
CONFIG_FEALNX=y
CONFIG_NET_VENDOR_NI=y
# CONFIG_NI_XGE_MANAGEMENT_ENET is not set
CONFIG_NET_VENDOR_NATSEMI=y
CONFIG_NATSEMI=y
CONFIG_NS83820=y
CONFIG_NET_VENDOR_NETERION=y
CONFIG_S2IO=y
CONFIG_VXGE=y
# CONFIG_VXGE_DEBUG_TRACE_ALL is not set
# CONFIG_NET_VENDOR_NETRONOME is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_NE2K_PCI is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
CONFIG_ETHOC=y
CONFIG_NET_VENDOR_PACKET_ENGINES=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_PENSANDO=y
# CONFIG_IONIC is not set
CONFIG_NET_VENDOR_QLOGIC=y
CONFIG_QLA3XXX=y
CONFIG_QLCNIC=y
CONFIG_QLCNIC_SRIOV=y
CONFIG_QLCNIC_HWMON=y
CONFIG_NETXEN_NIC=y
# CONFIG_QED is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
# CONFIG_RMNET is not set
CONFIG_NET_VENDOR_RDC=y
CONFIG_R6040=y
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
CONFIG_R8169=y
# CONFIG_NET_VENDOR_RENESAS is not set
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
CONFIG_NET_VENDOR_SEEQ=y
CONFIG_NET_VENDOR_SILAN=y
CONFIG_SC92031=y
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
CONFIG_SIS190=y
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
CONFIG_NET_VENDOR_SMSC=y
CONFIG_EPIC100=y
# CONFIG_SMSC911X is not set
CONFIG_SMSC9420=y
CONFIG_NET_VENDOR_SOCIONEXT=y
CONFIG_NET_VENDOR_STMICRO=y
CONFIG_STMMAC_ETH=y
# CONFIG_STMMAC_SELFTESTS is not set
CONFIG_STMMAC_PLATFORM=y
# CONFIG_DWMAC_GENERIC is not set
CONFIG_DWMAC_INTEL=y
# CONFIG_DWMAC_LOONGSON is not set
# CONFIG_STMMAC_PCI is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
CONFIG_NIU=y
# CONFIG_NET_VENDOR_SYNOPSYS is not set
CONFIG_NET_VENDOR_TEHUTI=y
CONFIG_TEHUTI=y
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_PHY_SEL is not set
CONFIG_TLAN=y
CONFIG_NET_VENDOR_VERTEXCOM=y
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XILINX=y
# CONFIG_XILINX_EMACLITE is not set
# CONFIG_XILINX_AXI_EMAC is not set
# CONFIG_XILINX_LL_TEMAC is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLINK=y
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
# CONFIG_LED_TRIGGER_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_SFP is not set

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_ADIN_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
CONFIG_AX88796B_PHY=y
CONFIG_BROADCOM_PHY=y
# CONFIG_BCM54140_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM84881_PHY is not set
# CONFIG_BCM87XX_PHY is not set
CONFIG_BCM_NET_PHYLIB=y
CONFIG_CICADA_PHY=y
# CONFIG_CORTINA_PHY is not set
CONFIG_DAVICOM_PHY=y
CONFIG_ICPLUS_PHY=y
CONFIG_LXT_PHY=y
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
CONFIG_MARVELL_PHY=y
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MARVELL_88X2222_PHY is not set
# CONFIG_MAXLINEAR_GPHY is not set
# CONFIG_MEDIATEK_GE_PHY is not set
CONFIG_MICREL_PHY=y
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROCHIP_T1_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_MOTORCOMM_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_NXP_C45_TJA11XX_PHY is not set
# CONFIG_NXP_TJA11XX_PHY is not set
CONFIG_QSEMI_PHY=y
CONFIG_REALTEK_PHY=y
# CONFIG_RENESAS_PHY is not set
# CONFIG_ROCKCHIP_PHY is not set
CONFIG_SMSC_PHY=y
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_DP83822_PHY is not set
# CONFIG_DP83TC811_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
# CONFIG_DP83869_PHY is not set
CONFIG_VITESSE_PHY=y
# CONFIG_XILINX_GMII2RGMII is not set
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
CONFIG_FWNODE_MDIO=y
CONFIG_ACPI_MDIO=y
CONFIG_MDIO_DEVRES=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_BCM_UNIMAC is not set
# CONFIG_MDIO_MVUSB is not set
# CONFIG_MDIO_MSCC_MIIM is not set
# CONFIG_MDIO_THUNDER is not set

#
# MDIO Multiplexers
#

#
# PCS device drivers
#
CONFIG_PCS_XPCS=y
# end of PCS device drivers

CONFIG_PPP=y
CONFIG_PPP_BSDCOMP=y
CONFIG_PPP_DEFLATE=y
# CONFIG_PPP_FILTER is not set
CONFIG_PPP_MPPE=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPPOE=y
CONFIG_PPP_ASYNC=y
CONFIG_PPP_SYNC_TTY=y
CONFIG_SLIP=y
CONFIG_SLHC=y
# CONFIG_SLIP_COMPRESSED is not set
CONFIG_SLIP_SMART=y
CONFIG_SLIP_MODE_SLIP6=y
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
CONFIG_USB_RTL8152=y
# CONFIG_USB_LAN78XX is not set
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=y
CONFIG_USB_NET_AX88179_178A=y
CONFIG_USB_NET_CDCETHER=m
# CONFIG_USB_NET_CDC_EEM is not set
# CONFIG_USB_NET_CDC_NCM is not set
# CONFIG_USB_NET_HUAWEI_CDC_NCM is not set
# CONFIG_USB_NET_CDC_MBIM is not set
# CONFIG_USB_NET_DM9601 is not set
# CONFIG_USB_NET_SR9700 is not set
# CONFIG_USB_NET_SR9800 is not set
# CONFIG_USB_NET_SMSC75XX is not set
# CONFIG_USB_NET_SMSC95XX is not set
# CONFIG_USB_NET_GL620A is not set
# CONFIG_USB_NET_NET1080 is not set
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
CONFIG_USB_NET_RNDIS_HOST=m
# CONFIG_USB_NET_CDC_SUBSET is not set
# CONFIG_USB_NET_ZAURUS is not set
CONFIG_USB_NET_CX82310_ETH=y
CONFIG_USB_NET_KALMIA=y
# CONFIG_USB_NET_QMI_WWAN is not set
# CONFIG_USB_NET_INT51X1 is not set
# CONFIG_USB_IPHETH is not set
# CONFIG_USB_SIERRA_NET is not set
CONFIG_USB_VL600=m
# CONFIG_USB_NET_CH9200 is not set
# CONFIG_USB_NET_AQC111 is not set
# CONFIG_USB_RTL8153_ECM is not set
CONFIG_WLAN=y
# CONFIG_WLAN_VENDOR_ADMTEK is not set
# CONFIG_WLAN_VENDOR_ATH is not set
# CONFIG_WLAN_VENDOR_ATMEL is not set
# CONFIG_WLAN_VENDOR_BROADCOM is not set
# CONFIG_WLAN_VENDOR_CISCO is not set
# CONFIG_WLAN_VENDOR_INTEL is not set
# CONFIG_WLAN_VENDOR_INTERSIL is not set
# CONFIG_WLAN_VENDOR_MARVELL is not set
# CONFIG_WLAN_VENDOR_MEDIATEK is not set
CONFIG_WLAN_VENDOR_MICROCHIP=y
# CONFIG_WLAN_VENDOR_RALINK is not set
# CONFIG_WLAN_VENDOR_REALTEK is not set
# CONFIG_WLAN_VENDOR_RSI is not set
# CONFIG_WLAN_VENDOR_ST is not set
# CONFIG_WLAN_VENDOR_TI is not set
# CONFIG_WLAN_VENDOR_ZYDAS is not set
CONFIG_WLAN_VENDOR_QUANTENNA=y
# CONFIG_QTNFMAC_PCIE is not set
CONFIG_MAC80211_HWSIM=m
CONFIG_USB_NET_RNDIS_WLAN=m
# CONFIG_VIRT_WIFI is not set
# CONFIG_WAN is not set

#
# Wireless WAN
#
# CONFIG_WWAN is not set
# end of Wireless WAN

# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
CONFIG_NETDEVSIM=m
# CONFIG_NET_FAILOVER is not set
# CONFIG_ISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_LEDS is not set
# CONFIG_INPUT_FF_MEMLESS is not set
CONFIG_INPUT_SPARSEKMAP=y
CONFIG_INPUT_MATRIXKMAP=y
CONFIG_INPUT_VIVALDIFMAP=y

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ADP5588=y
CONFIG_KEYBOARD_ADP5589=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1050 is not set
CONFIG_KEYBOARD_QT1070=y
CONFIG_KEYBOARD_QT2160=y
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
CONFIG_KEYBOARD_LKKBD=y
CONFIG_KEYBOARD_TCA6416=y
# CONFIG_KEYBOARD_TCA8418 is not set
CONFIG_KEYBOARD_LM8323=y
# CONFIG_KEYBOARD_LM8333 is not set
CONFIG_KEYBOARD_MAX7359=y
CONFIG_KEYBOARD_MCS=y
CONFIG_KEYBOARD_MPR121=y
CONFIG_KEYBOARD_NEWTON=y
CONFIG_KEYBOARD_OPENCORES=y
# CONFIG_KEYBOARD_SAMSUNG is not set
CONFIG_KEYBOARD_STOWAWAY=y
CONFIG_KEYBOARD_SUNKBD=y
# CONFIG_KEYBOARD_TM2_TOUCHKEY is not set
CONFIG_KEYBOARD_XTKBD=y
# CONFIG_KEYBOARD_CYPRESS_SF is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
# CONFIG_MOUSE_PS2_BYD is not set
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_SENTELIC is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
# CONFIG_MOUSE_PS2_VMMOUSE is not set
CONFIG_MOUSE_PS2_SMBUS=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=y
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_DA7280_HAPTICS is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_IQS269A is not set
# CONFIG_INPUT_IQS626A is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set
# end of Hardware I/O ports
# end of Input device support

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_LDISC_AUTOLOAD=y

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_16550A_VARIANTS is not set
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
CONFIG_SERIAL_8250_NR_UARTS=16
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
CONFIG_SERIAL_8250_DWLIB=y
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
# CONFIG_SERIAL_8250_MID is not set
CONFIG_SERIAL_8250_PERICOM=y

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_LANTIQ is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_FSL_LINFLEXUART is not set
# CONFIG_SERIAL_SPRD is not set
# end of Serial drivers

# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_N_GSM is not set
CONFIG_NOZOMI=y
# CONFIG_NULL_TTY is not set
# CONFIG_SERIAL_DEV_BUS is not set
# CONFIG_TTY_PRINTK is not set
# CONFIG_VIRTIO_CONSOLE is not set
CONFIG_IPMI_HANDLER=m
CONFIG_IPMI_DMI_DECODE=y
CONFIG_IPMI_PLAT_DATA=y
# CONFIG_IPMI_PANIC_EVENT is not set
# CONFIG_IPMI_DEVICE_INTERFACE is not set
CONFIG_IPMI_SI=m
# CONFIG_IPMI_SSIF is not set
CONFIG_IPMI_WATCHDOG=m
CONFIG_IPMI_POWEROFF=m
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
CONFIG_HW_RANDOM_INTEL=y
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_BA431 is not set
CONFIG_HW_RANDOM_VIA=y
# CONFIG_HW_RANDOM_XIPHERA is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
CONFIG_DEVMEM=y
CONFIG_NVRAM=y
CONFIG_DEVPORT=y
# CONFIG_HPET is not set
CONFIG_HANGCHECK_TIMER=y
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
# CONFIG_XILLYBUS is not set
# CONFIG_XILLYUSB is not set
# CONFIG_RANDOM_TRUST_CPU is not set
# CONFIG_RANDOM_TRUST_BOOTLOADER is not set
# end of Character devices

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_AMD_MP2 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_NVIDIA_GPU is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_CP2615 is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_VIRTIO is not set
# end of I2C Hardware Bus support

# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# end of I2C support

# CONFIG_I3C is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y
CONFIG_PTP_1588_CLOCK_OPTIONAL=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
CONFIG_PTP_1588_CLOCK_KVM=y
# CONFIG_PTP_1588_CLOCK_IDT82P33 is not set
# CONFIG_PTP_1588_CLOCK_IDTCM is not set
# CONFIG_PTP_1588_CLOCK_VMW is not set
# end of PTP clock support

# CONFIG_PINCTRL is not set
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_POWER_SUPPLY_HWMON=y
# CONFIG_PDA_POWER is not set
# CONFIG_IP5XXX_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_CHARGER_ADP5061 is not set
# CONFIG_BATTERY_CW2015 is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SAMSUNG_SDI is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_LTC4162L is not set
# CONFIG_CHARGER_MAX77976 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
# CONFIG_BATTERY_GOLDFISH is not set
# CONFIG_BATTERY_RT5033 is not set
# CONFIG_CHARGER_BD99954 is not set
# CONFIG_BATTERY_UG3105 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM1177 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_AHT10 is not set
# CONFIG_SENSORS_AQUACOMPUTER_D5NEXT is not set
# CONFIG_SENSORS_AS370 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_AXI_FAN_CONTROL is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ASPEED is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_CORSAIR_CPRO is not set
# CONFIG_SENSORS_CORSAIR_PSU is not set
# CONFIG_SENSORS_DRIVETEMP is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_IBMAEM is not set
# CONFIG_SENSORS_IBMPEX is not set
# CONFIG_SENSORS_I5500 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2947_I2C is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX127 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX31730 is not set
# CONFIG_SENSORS_MAX6620 is not set
# CONFIG_SENSORS_MAX6621 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_TPS23861 is not set
# CONFIG_SENSORS_MR75203 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NPCM7XX is not set
# CONFIG_SENSORS_NZXT_KRAKEN2 is not set
# CONFIG_SENSORS_NZXT_SMART2 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SBTSI is not set
# CONFIG_SENSORS_SBRMI is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHT4x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SY7636A is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA238 is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_TMP464 is not set
# CONFIG_SENSORS_TMP513 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83773G is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
# CONFIG_SENSORS_ASUS_EC is not set
CONFIG_THERMAL=y
# CONFIG_THERMAL_NETLINK is not set
# CONFIG_THERMAL_STATISTICS is not set
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_DEVFREQ_THERMAL is not set
# CONFIG_THERMAL_EMULATION is not set

#
# Intel thermal drivers
#
CONFIG_INTEL_POWERCLAMP=m
CONFIG_X86_THERMAL_VECTOR=y
CONFIG_X86_PKG_TEMP_THERMAL=m
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# end of ACPI INT340X thermal drivers

CONFIG_INTEL_PCH_THERMAL=m
# CONFIG_INTEL_TCC_COOLING is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_INTEL_HFI_THERMAL is not set
# end of Intel thermal drivers

# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y
CONFIG_SSB=y
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_AS3711 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_MADERA is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_MFD_MP2629 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_IQS62X is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6360 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RT4831 is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SIMPLE_MFD_I2C is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
CONFIG_MFD_SYSCON=y
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TQMX86 is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_MFD_ATC260X_I2C is not set
# end of Multifunction device drivers

# CONFIG_REGULATOR is not set
# CONFIG_RC_CORE is not set

#
# CEC support
#
# CONFIG_MEDIA_CEC_SUPPORT is not set
# end of CEC support

# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
# CONFIG_AGP is not set
# CONFIG_VGA_SWITCHEROO is not set
# CONFIG_DRM is not set
# CONFIG_DRM_DEBUG_MODESET_LOCK is not set

#
# ARM devices
#
# end of ARM devices

CONFIG_DRM_PANEL_ORIENTATION_QUIRKS=y

#
# Frame buffer Devices
#
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_MODE_HELPERS is not set
# CONFIG_FB_TILEBLITTING is not set

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SM712 is not set
# end of Frame buffer Devices

#
# Backlight & LCD device support
#
CONFIG_LCD_CLASS_DEVICE=y
# CONFIG_LCD_PLATFORM is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_QCOM_WLED is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set
# CONFIG_BACKLIGHT_ARCXCNN is not set
# end of Backlight & LCD device support

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_LEGACY_ACCELERATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FRAMEBUFFER_CONSOLE_DEFERRED_TAKEOVER is not set
# end of Console display driver support

# CONFIG_LOGO is not set
# end of Graphics support

CONFIG_SOUND=y
CONFIG_SND=y
CONFIG_SND_PCM=y
CONFIG_SND_HWDEP=y
CONFIG_SND_RAWMIDI=y
CONFIG_SND_JACK=y
CONFIG_SND_JACK_INPUT_DEV=y
# CONFIG_SND_OSSEMUL is not set
# CONFIG_SND_PCM_TIMER is not set
# CONFIG_SND_HRTIMER is not set
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_MAX_CARDS=32
CONFIG_SND_SUPPORT_OLD_API=y
# CONFIG_SND_PROC_FS is not set
CONFIG_SND_VERBOSE_PRINTK=y
CONFIG_SND_DEBUG=y
CONFIG_SND_DEBUG_VERBOSE=y
# CONFIG_SND_CTL_VALIDATION is not set
# CONFIG_SND_JACK_INJECTION_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_CTL_LED=y
# CONFIG_SND_SEQUENCER is not set
CONFIG_SND_MPU401_UART=y
CONFIG_SND_AC97_CODEC=y
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_ALOOP is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
# CONFIG_SND_AC97_POWER_SAVE is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ASIHPI is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CTXFI is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
CONFIG_SND_INTEL8X0M=y
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_LOLA is not set
# CONFIG_SND_LX6464ES is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SE6X is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
CONFIG_SND_VIA82XX=y
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set

#
# HD-Audio
#
CONFIG_SND_HDA=y
CONFIG_SND_HDA_GENERIC_LEDS=y
CONFIG_SND_HDA_INTEL=y
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_INPUT_BEEP_MODE=1
CONFIG_SND_HDA_PATCH_LOADER=y
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_HDMI=y
CONFIG_SND_HDA_CODEC_CIRRUS=y
# CONFIG_SND_HDA_CODEC_CS8409 is not set
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CA0110=y
CONFIG_SND_HDA_CODEC_CA0132=y
# CONFIG_SND_HDA_CODEC_CA0132_DSP is not set
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
# CONFIG_SND_HDA_INTEL_HDMI_SILENT_STREAM is not set
# end of HD-Audio

CONFIG_SND_HDA_CORE=y
CONFIG_SND_HDA_PREALLOC_SIZE=0
CONFIG_SND_INTEL_NHLT=y
CONFIG_SND_INTEL_DSP_CONFIG=y
CONFIG_SND_INTEL_SOUNDWIRE_ACPI=y
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_UA101 is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_USB_6FIRE is not set
# CONFIG_SND_USB_HIFACE is not set
# CONFIG_SND_BCD2000 is not set
# CONFIG_SND_USB_POD is not set
# CONFIG_SND_USB_PODHD is not set
# CONFIG_SND_USB_TONEPORT is not set
# CONFIG_SND_USB_VARIAX is not set
# CONFIG_SND_SOC is not set
CONFIG_SND_X86=y
CONFIG_AC97_BUS=y

#
# HID support
#
CONFIG_HID=y
# CONFIG_HID_BATTERY_STRENGTH is not set
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
# CONFIG_HID_A4TECH is not set
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
# CONFIG_HID_APPLE is not set
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_ASUS is not set
# CONFIG_HID_AUREAL is not set
# CONFIG_HID_BELKIN is not set
# CONFIG_HID_BETOP_FF is not set
# CONFIG_HID_BIGBEN_FF is not set
# CONFIG_HID_CHERRY is not set
# CONFIG_HID_CHICONY is not set
# CONFIG_HID_CORSAIR is not set
# CONFIG_HID_COUGAR is not set
# CONFIG_HID_MACALLY is not set
# CONFIG_HID_PRODIKEYS is not set
# CONFIG_HID_CMEDIA is not set
# CONFIG_HID_CREATIVE_SB0540 is not set
# CONFIG_HID_CYPRESS is not set
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELAN is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
# CONFIG_HID_EZKEY is not set
# CONFIG_HID_FT260 is not set
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_GLORIOUS is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_VIVALDI is not set
# CONFIG_HID_GT683R is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_VIEWSONIC is not set
# CONFIG_HID_XIAOMI is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
# CONFIG_HID_ITE is not set
# CONFIG_HID_JABRA is not set
# CONFIG_HID_TWINHAN is not set
# CONFIG_HID_KENSINGTON is not set
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LED is not set
# CONFIG_HID_LENOVO is not set
# CONFIG_HID_LETSKETCH is not set
# CONFIG_HID_LOGITECH is not set
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MALTRON is not set
# CONFIG_HID_MAYFLASH is not set
# CONFIG_HID_REDRAGON is not set
# CONFIG_HID_MICROSOFT is not set
# CONFIG_HID_MONTEREY is not set
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NINTENDO is not set
# CONFIG_HID_NTI is not set
# CONFIG_HID_NTRIG is not set
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_RAZER is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SEMITEK is not set
# CONFIG_HID_SIGMAMICRO is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEAM is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_U2FZERO is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set
# end of Special HID drivers

#
# USB HID support
#
CONFIG_USB_HID=y
# CONFIG_HID_PID is not set
CONFIG_USB_HIDDEV=y
# end of USB HID support

#
# I2C HID support
#
# CONFIG_I2C_HID_ACPI is not set
# end of I2C HID support

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
# end of Intel ISH HID support

#
# AMD SFH HID Support
#
# CONFIG_AMD_SFH_HID is not set
# end of AMD SFH HID Support
# end of HID support

CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
# CONFIG_USB_LED_TRIG is not set
# CONFIG_USB_ULPI_BUS is not set
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_FEW_INIT_RETRIES is not set
CONFIG_USB_DYNAMIC_MINORS=y
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_PRODUCTLIST is not set
# CONFIG_USB_OTG_DISABLE_EXTERNAL_HUB is not set
# CONFIG_USB_LEDS_TRIGGER_USBPORT is not set
CONFIG_USB_AUTOSUSPEND_DELAY=2
# CONFIG_USB_MON is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
# CONFIG_USB_XHCI_DBGCAP is not set
CONFIG_USB_XHCI_PCI=y
# CONFIG_USB_XHCI_PCI_RENESAS is not set
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_FSL is not set
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_SSB is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
CONFIG_USB_ACM=y
# CONFIG_USB_PRINTER is not set
CONFIG_USB_WDM=y
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
CONFIG_USB_STORAGE_REALTEK=y
CONFIG_REALTEK_AUTOPM=y
CONFIG_USB_STORAGE_DATAFAB=y
CONFIG_USB_STORAGE_FREECOM=y
CONFIG_USB_STORAGE_ISD200=y
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
CONFIG_USB_STORAGE_JUMPSHOT=y
CONFIG_USB_STORAGE_ALAUDA=y
CONFIG_USB_STORAGE_ONETOUCH=y
CONFIG_USB_STORAGE_KARMA=y
CONFIG_USB_STORAGE_CYPRESS_ATACB=y
CONFIG_USB_STORAGE_ENE_UB6250=y
# CONFIG_USB_UAS is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_USB_CDNS_SUPPORT is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_APPLE_MFI_FASTCHARGE is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_ISP1301 is not set
# end of USB Physical Layer drivers

# CONFIG_USB_GADGET is not set
# CONFIG_TYPEC is not set
# CONFIG_USB_ROLE_SWITCH is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
# CONFIG_LEDS_CLASS_FLASH is not set
# CONFIG_LEDS_CLASS_MULTICOLOR is not set
# CONFIG_LEDS_BRIGHTNESS_HW_CHANGED is not set

#
# LED drivers
#
# CONFIG_LEDS_APU is not set
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3532 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_TLC591XX is not set
# CONFIG_LEDS_LM355x is not set

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
# CONFIG_LEDS_BLINKM is not set
# CONFIG_LEDS_MLXCPLD is not set
# CONFIG_LEDS_MLXREG is not set
# CONFIG_LEDS_USER is not set
# CONFIG_LEDS_NIC78BX is not set
# CONFIG_LEDS_TI_LMU_COMMON is not set

#
# Flash and Torch LED drivers
#

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_DISK is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_ACTIVITY is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_LEDS_TRIGGER_PANIC is not set
# CONFIG_LEDS_TRIGGER_NETDEV is not set
# CONFIG_LEDS_TRIGGER_PATTERN is not set
CONFIG_LEDS_TRIGGER_AUDIO=y
# CONFIG_LEDS_TRIGGER_TTY is not set

#
# Simple LED drivers
#
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="n"
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABEOZ9 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF85363 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3028 is not set
# CONFIG_RTC_DRV_RV3032 is not set
# CONFIG_RTC_DRV_RV8803 is not set
# CONFIG_RTC_DRV_SD3078 is not set

#
# SPI RTC drivers
#
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set
# CONFIG_RTC_DRV_RX6110 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_GOLDFISH is not set
# CONFIG_DMADEVICES is not set

#
# DMABUF options
#
CONFIG_SYNC_FILE=y
CONFIG_SW_SYNC=y
# CONFIG_UDMABUF is not set
# CONFIG_DMABUF_MOVE_NOTIFY is not set
# CONFIG_DMABUF_DEBUG is not set
# CONFIG_DMABUF_SELFTESTS is not set
# CONFIG_DMABUF_HEAPS is not set
# CONFIG_DMABUF_SYSFS_STATS is not set
# end of DMABUF options

# CONFIG_AUXDISPLAY is not set
CONFIG_UIO=y
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_DMEM_GENIRQ is not set
# CONFIG_UIO_AEC is not set
# CONFIG_UIO_SERCOS3 is not set
# CONFIG_UIO_PCI_GENERIC is not set
# CONFIG_UIO_NETX is not set
# CONFIG_UIO_PRUSS is not set
# CONFIG_UIO_MF624 is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=m
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO_MENU=y
# CONFIG_VIRTIO_PCI is not set
# CONFIG_VIRTIO_MMIO is not set
# CONFIG_VDPA is not set
CONFIG_VHOST_IOTLB=m
CONFIG_VHOST=m
CONFIG_VHOST_MENU=y
CONFIG_VHOST_NET=m
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV is not set
# end of Microsoft Hyper-V guest support

# CONFIG_GREYBUS is not set
# CONFIG_COMEDI is not set
# CONFIG_STAGING is not set
# CONFIG_X86_PLATFORM_DEVICES is not set
CONFIG_PMC_ATOM=y
# CONFIG_CHROME_PLATFORMS is not set
# CONFIG_MELLANOX_PLATFORM is not set
CONFIG_SURFACE_PLATFORMS=y
# CONFIG_SURFACE_3_POWER_OPREGION is not set
# CONFIG_SURFACE_GPE is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
CONFIG_HAVE_CLK=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y
# CONFIG_COMMON_CLK_MAX9485 is not set
# CONFIG_COMMON_CLK_SI5341 is not set
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_SI544 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_XILINX_VCU is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# end of Clock Source drivers

CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_IOVA=y
CONFIG_IOASID=y
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
# end of Generic IOMMU Pagetable Support

# CONFIG_IOMMU_DEBUGFS is not set
# CONFIG_IOMMU_DEFAULT_DMA_STRICT is not set
CONFIG_IOMMU_DEFAULT_DMA_LAZY=y
# CONFIG_IOMMU_DEFAULT_PASSTHROUGH is not set
CONFIG_IOMMU_DMA=y
# CONFIG_AMD_IOMMU is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON=y
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set
# end of Remoteproc drivers

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set
# CONFIG_RPMSG_VIRTIO is not set
# end of Rpmsg drivers

# CONFIG_SOUNDWIRE is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#
# end of Amlogic SoC drivers

#
# Broadcom SoC drivers
#
# end of Broadcom SoC drivers

#
# NXP/Freescale QorIQ SoC drivers
#
# end of NXP/Freescale QorIQ SoC drivers

#
# i.MX SoC drivers
#
# end of i.MX SoC drivers

#
# Enable LiteX SoC Builder specific drivers
#
# end of Enable LiteX SoC Builder specific drivers

#
# Qualcomm SoC drivers
#
# end of Qualcomm SoC drivers

# CONFIG_SOC_TI is not set

#
# Xilinx SoC drivers
#
# end of Xilinx SoC drivers
# end of SOC (System On Chip) specific Drivers

CONFIG_PM_DEVFREQ=y

#
# DEVFREQ Governors
#
CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND=m
# CONFIG_DEVFREQ_GOV_PERFORMANCE is not set
# CONFIG_DEVFREQ_GOV_POWERSAVE is not set
# CONFIG_DEVFREQ_GOV_USERSPACE is not set
# CONFIG_DEVFREQ_GOV_PASSIVE is not set

#
# DEVFREQ Drivers
#
# CONFIG_PM_DEVFREQ_EVENT is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
# end of IRQ chip support

# CONFIG_IPACK_BUS is not set
CONFIG_RESET_CONTROLLER=y
# CONFIG_RESET_TI_SYSCON is not set

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
# CONFIG_USB_LGM_PHY is not set
# CONFIG_PHY_CAN_TRANSCEIVER is not set

#
# PHY drivers for Broadcom platforms
#
# CONFIG_BCM_KONA_USB2_PHY is not set
# end of PHY drivers for Broadcom platforms

# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_PHY_INTEL_LGM_EMMC is not set
# end of PHY Subsystem

# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
# end of Performance monitor support

CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
# CONFIG_USB4 is not set

#
# Android
#
# CONFIG_ANDROID is not set
# end of Android

CONFIG_LIBNVDIMM=m
CONFIG_BLK_DEV_PMEM=m
CONFIG_ND_CLAIM=y
CONFIG_ND_BTT=m
CONFIG_BTT=y
CONFIG_ND_PFN=m
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_DAX=y
CONFIG_DEV_DAX=m
CONFIG_DEV_DAX_PMEM=m
CONFIG_DEV_DAX_KMEM=m
CONFIG_NVMEM=y
CONFIG_NVMEM_SYSFS=y
# CONFIG_NVMEM_RMEM is not set

#
# HW tracing support
#
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# end of HW tracing support

# CONFIG_FPGA is not set
CONFIG_PM_OPP=y
# CONFIG_UNISYS_VISORBUS is not set
# CONFIG_SIOX is not set
# CONFIG_SLIMBUS is not set
# CONFIG_INTERCONNECT is not set
# CONFIG_COUNTER is not set
# CONFIG_MOST is not set
# CONFIG_PECI is not set
# end of Device Drivers

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_VALIDATE_FS_PARSER is not set
CONFIG_FS_IOMAP=y
# CONFIG_EXT2_FS is not set
# CONFIG_EXT3_FS is not set
CONFIG_EXT4_FS=y
# CONFIG_EXT4_USE_FOR_EXT2 is not set
# CONFIG_EXT4_FS_POSIX_ACL is not set
# CONFIG_EXT4_FS_SECURITY is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_XFS_FS=m
CONFIG_XFS_SUPPORT_V4=y
# CONFIG_XFS_QUOTA is not set
# CONFIG_XFS_POSIX_ACL is not set
CONFIG_XFS_RT=y
CONFIG_XFS_ONLINE_SCRUB=y
# CONFIG_XFS_ONLINE_REPAIR is not set
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y
# CONFIG_GFS2_FS is not set
CONFIG_OCFS2_FS=m
CONFIG_OCFS2_FS_O2CB=m
CONFIG_OCFS2_FS_STATS=y
CONFIG_OCFS2_DEBUG_MASKLOG=y
# CONFIG_OCFS2_DEBUG_FS is not set
CONFIG_BTRFS_FS=m
CONFIG_BTRFS_FS_POSIX_ACL=y
CONFIG_BTRFS_FS_CHECK_INTEGRITY=y
# CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
# CONFIG_BTRFS_DEBUG is not set
# CONFIG_BTRFS_ASSERT is not set
# CONFIG_BTRFS_FS_REF_VERIFY is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_FS_ENCRYPTION=y
CONFIG_FS_ENCRYPTION_ALGS=y
# CONFIG_FS_VERITY is not set
CONFIG_FSNOTIFY=y
# CONFIG_DNOTIFY is not set
CONFIG_INOTIFY_USER=y
# CONFIG_FANOTIFY is not set
CONFIG_QUOTA=y
# CONFIG_QUOTA_NETLINK_INTERFACE is not set
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=m
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
# CONFIG_AUTOFS4_FS is not set
# CONFIG_AUTOFS_FS is not set
CONFIG_FUSE_FS=m
# CONFIG_CUSE is not set
# CONFIG_VIRTIO_FS is not set
# CONFIG_OVERLAY_FS is not set

#
# Caches
#
# CONFIG_FSCACHE is not set
# end of Caches

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set
# end of CD-ROM/DVD Filesystems

#
# DOS/FAT/EXFAT/NT Filesystems
#
CONFIG_FAT_FS=y
# CONFIG_MSDOS_FS is not set
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_FAT_DEFAULT_UTF8 is not set
# CONFIG_EXFAT_FS is not set
# CONFIG_NTFS_FS is not set
# CONFIG_NTFS3_FS is not set
# end of DOS/FAT/EXFAT/NT Filesystems

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_PROC_CHILDREN=y
CONFIG_PROC_PID_ARCH_STATUS=y
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_TMPFS_XATTR is not set
# CONFIG_TMPFS_INODE64 is not set
# CONFIG_HUGETLBFS is not set
CONFIG_MEMFD_CREATE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
CONFIG_EFIVAR_FS=y
# end of Pseudo filesystems

CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=y
CONFIG_CRAMFS_BLOCKDEV=y
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_DEFAULT_KMSG_BYTES=10240
CONFIG_PSTORE_DEFLATE_COMPRESS=y
# CONFIG_PSTORE_LZO_COMPRESS is not set
# CONFIG_PSTORE_LZ4_COMPRESS is not set
# CONFIG_PSTORE_LZ4HC_COMPRESS is not set
# CONFIG_PSTORE_842_COMPRESS is not set
# CONFIG_PSTORE_ZSTD_COMPRESS is not set
CONFIG_PSTORE_COMPRESS=y
CONFIG_PSTORE_DEFLATE_COMPRESS_DEFAULT=y
CONFIG_PSTORE_COMPRESS_DEFAULT="deflate"
CONFIG_PSTORE_CONSOLE=y
CONFIG_PSTORE_PMSG=y
# CONFIG_PSTORE_FTRACE is not set
CONFIG_PSTORE_RAM=m
# CONFIG_PSTORE_BLK is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_EROFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V2=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_SWAP is not set
CONFIG_NFS_V4_1=y
CONFIG_NFS_V4_2=y
CONFIG_PNFS_FILE_LAYOUT=y
CONFIG_PNFS_BLOCK=m
CONFIG_PNFS_FLEXFILE_LAYOUT=y
CONFIG_NFS_V4_1_IMPLEMENTATION_ID_DOMAIN="kernel.org"
# CONFIG_NFS_V4_1_MIGRATION is not set
CONFIG_ROOT_NFS=y
# CONFIG_NFS_USE_LEGACY_DNS is not set
CONFIG_NFS_USE_KERNEL_DNS=y
CONFIG_NFS_DISABLE_UDP_SUPPORT=y
# CONFIG_NFS_V4_2_READ_PLUS is not set
CONFIG_NFSD=m
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
# CONFIG_NFSD_BLOCKLAYOUT is not set
# CONFIG_NFSD_SCSILAYOUT is not set
# CONFIG_NFSD_FLEXFILELAYOUT is not set
# CONFIG_NFSD_V4_2_INTER_SSC is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_NFS_V4_2_SSC_HELPER=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
CONFIG_SUNRPC_BACKCHANNEL=y
CONFIG_RPCSEC_GSS_KRB5=y
# CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES is not set
# CONFIG_SUNRPC_DEBUG is not set
# CONFIG_CEPH_FS is not set
CONFIG_CIFS=y
CONFIG_CIFS_STATS2=y
CONFIG_CIFS_ALLOW_INSECURE_LEGACY=y
# CONFIG_CIFS_UPCALL is not set
CONFIG_CIFS_XATTR=y
CONFIG_CIFS_POSIX=y
CONFIG_CIFS_DEBUG=y
CONFIG_CIFS_DEBUG2=y
# CONFIG_CIFS_DEBUG_DUMP_KEYS is not set
# CONFIG_CIFS_DFS_UPCALL is not set
# CONFIG_CIFS_SWN_UPCALL is not set
# CONFIG_CIFS_ROOT is not set
# CONFIG_SMB_SERVER is not set
CONFIG_SMBFS_COMMON=y
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
CONFIG_NLS_CODEPAGE_936=y
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
CONFIG_NLS_UTF8=y
# CONFIG_DLM is not set
# CONFIG_UNICODE is not set
CONFIG_IO_WQ=y
# end of File systems

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_REQUEST_CACHE is not set
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITYFS is not set
CONFIG_PAGE_TABLE_ISOLATION=y
# CONFIG_INTEL_TXT is not set
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
# CONFIG_HARDENED_USERCOPY is not set
CONFIG_FORTIFY_SOURCE=y
# CONFIG_STATIC_USERMODEHELPER is not set
# CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT is not set
CONFIG_DEFAULT_SECURITY_DAC=y
CONFIG_LSM="landlock,lockdown,yama,loadpin,safesetid,integrity,bpf"

#
# Kernel hardening options
#

#
# Memory initialization
#
CONFIG_INIT_STACK_NONE=y
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
# CONFIG_ZERO_CALL_USED_REGS is not set
# end of Memory initialization
# end of Kernel hardening options
# end of Security options

CONFIG_XOR_BLOCKS=m
CONFIG_ASYNC_CORE=m
CONFIG_ASYNC_MEMCPY=m
CONFIG_ASYNC_XOR=m
CONFIG_ASYNC_PQ=m
CONFIG_ASYNC_RAID6_RECOV=m
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_SKCIPHER=y
CONFIG_CRYPTO_SKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_AKCIPHER=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=y
# CONFIG_CRYPTO_TEST is not set

#
# Public-key cryptography
#
CONFIG_CRYPTO_RSA=y
# CONFIG_CRYPTO_DH is not set
# CONFIG_CRYPTO_ECDH is not set
# CONFIG_CRYPTO_ECDSA is not set
# CONFIG_CRYPTO_ECRDSA is not set
# CONFIG_CRYPTO_SM2 is not set
# CONFIG_CRYPTO_CURVE25519 is not set
# CONFIG_CRYPTO_CURVE25519_X86 is not set

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=y
CONFIG_CRYPTO_GCM=y
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
# CONFIG_CRYPTO_AEGIS128 is not set
# CONFIG_CRYPTO_AEGIS128_AESNI_SSE2 is not set
CONFIG_CRYPTO_SEQIV=y
# CONFIG_CRYPTO_ECHAINIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CFB is not set
CONFIG_CRYPTO_CTR=y
CONFIG_CRYPTO_CTS=y
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_OFB is not set
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# CONFIG_CRYPTO_KEYWRAP is not set
# CONFIG_CRYPTO_NHPOLY1305_SSE2 is not set
# CONFIG_CRYPTO_NHPOLY1305_AVX2 is not set
# CONFIG_CRYPTO_ADIANTUM is not set
CONFIG_CRYPTO_ESSIV=m

#
# Hash modes
#
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
# CONFIG_CRYPTO_CRC32C_INTEL is not set
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
CONFIG_CRYPTO_XXHASH=m
CONFIG_CRYPTO_BLAKE2B=m
# CONFIG_CRYPTO_BLAKE2S is not set
# CONFIG_CRYPTO_BLAKE2S_X86 is not set
CONFIG_CRYPTO_CRCT10DIF=m
# CONFIG_CRYPTO_CRCT10DIF_PCLMUL is not set
CONFIG_CRYPTO_GHASH=y
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
CONFIG_CRYPTO_MD4=y
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_RMD160 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_SHA512=y
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_SM3 is not set
# CONFIG_CRYPTO_SM3_AVX_X86_64 is not set
# CONFIG_CRYPTO_STREEBOG is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
CONFIG_CRYPTO_DES=y
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_SM4 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_SM4_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=y
# CONFIG_CRYPTO_LZO is not set
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set
# CONFIG_CRYPTO_ZSTD is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_HASH_INFO=y
# CONFIG_CRYPTO_HW is not set
CONFIG_ASYMMETRIC_KEY_TYPE=y
CONFIG_ASYMMETRIC_PUBLIC_KEY_SUBTYPE=y
CONFIG_X509_CERTIFICATE_PARSER=y
# CONFIG_PKCS8_PRIVATE_KEY_PARSER is not set
CONFIG_PKCS7_MESSAGE_PARSER=y
# CONFIG_PKCS7_TEST_KEY is not set
# CONFIG_SIGNED_PE_FILE_VERIFICATION is not set

#
# Certificates for signature checking
#
CONFIG_SYSTEM_TRUSTED_KEYRING=y
CONFIG_SYSTEM_TRUSTED_KEYS=""
# CONFIG_SYSTEM_EXTRA_CERTIFICATE is not set
# CONFIG_SECONDARY_TRUSTED_KEYRING is not set
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
# end of Certificates for signature checking

CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=m
CONFIG_RAID6_PQ_BENCHMARK=y
# CONFIG_PACKING is not set
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
# CONFIG_CORDIC is not set
# CONFIG_PRIME_NUMBERS is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_ARCH_USE_SYM_ANNOTATIONS=y

#
# Crypto library routines
#
CONFIG_CRYPTO_LIB_AES=y
CONFIG_CRYPTO_LIB_ARC4=y
CONFIG_CRYPTO_LIB_BLAKE2S_GENERIC=y
# CONFIG_CRYPTO_LIB_CHACHA is not set
# CONFIG_CRYPTO_LIB_CURVE25519 is not set
CONFIG_CRYPTO_LIB_DES=y
CONFIG_CRYPTO_LIB_POLY1305_RSIZE=11
# CONFIG_CRYPTO_LIB_POLY1305 is not set
# CONFIG_CRYPTO_LIB_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_LIB_SHA256=y
# end of Crypto library routines

CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=m
# CONFIG_CRC64_ROCKSOFT is not set
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC64 is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
CONFIG_XXHASH=y
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_ZSTD_COMPRESS=m
CONFIG_ZSTD_DECOMPRESS=y
# CONFIG_XZ_DEC is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_ZSTD=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_REED_SOLOMON=m
CONFIG_REED_SOLOMON_ENC8=y
CONFIG_REED_SOLOMON_DEC8=y
CONFIG_BTREE=y
CONFIG_INTERVAL_TREE=y
CONFIG_XARRAY_MULTI=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
CONFIG_DMA_OPS=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_SWIOTLB=y
CONFIG_DMA_CMA=y
# CONFIG_DMA_PERNUMA_CMA is not set

#
# Default contiguous memory area size:
#
CONFIG_CMA_SIZE_MBYTES=200
CONFIG_CMA_SIZE_SEL_MBYTES=y
# CONFIG_CMA_SIZE_SEL_PERCENTAGE is not set
# CONFIG_CMA_SIZE_SEL_MIN is not set
# CONFIG_CMA_SIZE_SEL_MAX is not set
CONFIG_CMA_ALIGNMENT=8
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_DMA_MAP_BENCHMARK is not set
CONFIG_SGL_ALLOC=y
CONFIG_CHECK_SIGNATURE=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
CONFIG_CLZ_TAB=y
CONFIG_IRQ_POLL=y
CONFIG_MPILIB=y
CONFIG_OID_REGISTRY=y
CONFIG_UCS2_STRING=y
CONFIG_HAVE_GENERIC_VDSO=y
CONFIG_GENERIC_GETTIMEOFDAY=y
CONFIG_GENERIC_VDSO_TIME_NS=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_MEMREGION=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_ARCH_HAS_COPY_MC=y
CONFIG_ARCH_STACKWALK=y
CONFIG_SBITMAP=y
# end of Library routines

#
# Kernel hacking
#

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_PRINTK_CALLER=y
# CONFIG_STACKTRACE_BUILD_ID is not set
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_CONSOLE_LOGLEVEL_QUIET=4
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_DYNAMIC_DEBUG is not set
# CONFIG_DYNAMIC_DEBUG_CORE is not set
CONFIG_SYMBOLIC_ERRNAME=y
CONFIG_DEBUG_BUGVERBOSE=y
# end of printk and dmesg options

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_MISC=y

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO_NONE=y
# CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT is not set
# CONFIG_DEBUG_INFO_DWARF4 is not set
# CONFIG_DEBUG_INFO_DWARF5 is not set
CONFIG_FRAME_WARN=2048
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
# CONFIG_HEADERS_INSTALL is not set
CONFIG_DEBUG_SECTION_MISMATCH=y
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
# CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B is not set
CONFIG_STACK_VALIDATION=y
# CONFIG_VMLINUX_MAP is not set
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
# end of Compile-time checks and compiler options

#
# Generic Kernel Debugging Instruments
#
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_MAGIC_SYSRQ_SERIAL_SEQUENCE=""
CONFIG_DEBUG_FS=y
CONFIG_DEBUG_FS_ALLOW_ALL=y
# CONFIG_DEBUG_FS_DISALLOW_MOUNT is not set
# CONFIG_DEBUG_FS_ALLOW_NONE is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set
CONFIG_HAVE_ARCH_KCSAN=y
CONFIG_HAVE_KCSAN_COMPILER=y
# CONFIG_KCSAN is not set
# end of Generic Kernel Debugging Instruments

#
# Networking Debugging
#
# CONFIG_NET_DEV_REFCNT_TRACKER is not set
# CONFIG_NET_NS_REFCNT_TRACKER is not set
# end of Networking Debugging

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_PAGE_OWNER is not set
# CONFIG_PAGE_TABLE_CHECK is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_ARCH_HAS_DEBUG_WX=y
# CONFIG_DEBUG_WX is not set
CONFIG_GENERIC_PTDUMP=y
# CONFIG_PTDUMP_DEBUGFS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_SCHED_STACK_END_CHECK is not set
CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE=y
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_DEBUG_VM_PGTABLE=y
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
CONFIG_MEMORY_NOTIFIER_ERROR_INJECT=m
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP=y
# CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is not set
CONFIG_HAVE_ARCH_KASAN=y
CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
CONFIG_CC_HAS_KASAN_GENERIC=y
CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
# CONFIG_KASAN is not set
CONFIG_HAVE_ARCH_KFENCE=y
# CONFIG_KFENCE is not set
# end of Memory Debugging

# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Oops, Lockups and Hangs
#
CONFIG_PANIC_ON_OOPS=y
CONFIG_PANIC_ON_OOPS_VALUE=1
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_HARDLOCKUP_DETECTOR is not set
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=480
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
CONFIG_WQ_WATCHDOG=y
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

#
# Scheduler Debugging
#
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
# CONFIG_SCHEDSTATS is not set
# end of Scheduler Debugging

# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

# CONFIG_DEBUG_IRQFLAGS is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set

#
# Debug kernel data structures
#
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PLIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# end of Debug kernel data structures

# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_LATENCYTOP is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_RETHOOK=y
CONFIG_RETHOOK=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_OBJTOOL_MCOUNT=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_HAVE_BUILDTIME_MCOUNT_SORT=y
CONFIG_BUILDTIME_MCOUNT_SORT=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
# CONFIG_BOOTTIME_TRACING is not set
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS=y
CONFIG_DYNAMIC_FTRACE_WITH_ARGS=y
# CONFIG_FPROBE is not set
# CONFIG_FUNCTION_PROFILER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_SCHED_TRACER is not set
# CONFIG_HWLAT_TRACER is not set
# CONFIG_OSNOISE_TRACER is not set
# CONFIG_TIMERLAT_TRACER is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_FTRACE_SYSCALLS is not set
# CONFIG_TRACER_SNAPSHOT is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_KPROBE_EVENTS=y
# CONFIG_KPROBE_EVENTS_ON_NOTRACE is not set
CONFIG_UPROBE_EVENTS=y
CONFIG_BPF_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
# CONFIG_BPF_KPROBE_OVERRIDE is not set
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_MCOUNT_USE_CC=y
CONFIG_TRACING_MAP=y
CONFIG_SYNTH_EVENTS=y
CONFIG_HIST_TRIGGERS=y
# CONFIG_TRACE_EVENT_INJECT is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
# CONFIG_FTRACE_RECORD_RECURSION is not set
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_FTRACE_SORT_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
# CONFIG_SYNTH_EVENT_GEN_TEST is not set
# CONFIG_KPROBE_EVENT_GEN_TEST is not set
# CONFIG_HIST_TRIGGERS_DEBUG is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT=y
CONFIG_HAVE_SAMPLE_FTRACE_DIRECT_MULTI=y
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
# CONFIG_STRICT_DEVMEM is not set

#
# x86 Debugging
#
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_EARLY_PRINTK_USB=y
CONFIG_X86_VERBOSE_BOOTUP=y
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
CONFIG_EARLY_PRINTK_USB_XDBC=y
# CONFIG_EFI_PGT_DUMP is not set
# CONFIG_DEBUG_TLBFLUSH is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_DEBUG_ENTRY is not set
# CONFIG_DEBUG_NMI_SELFTEST is not set
# CONFIG_X86_DEBUG_FPU is not set
# CONFIG_PUNIT_ATOM_DEBUG is not set
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set
# CONFIG_UNWINDER_GUESS is not set
# end of x86 Debugging

#
# Kernel Testing and Coverage
#
# CONFIG_KUNIT is not set
CONFIG_NOTIFIER_ERROR_INJECTION=m
CONFIG_PM_NOTIFIER_ERROR_INJECT=m
# CONFIG_NETDEV_NOTIFIER_ERROR_INJECT is not set
CONFIG_FUNCTION_ERROR_INJECTION=y
CONFIG_FAULT_INJECTION=y
# CONFIG_FAILSLAB is not set
# CONFIG_FAIL_PAGE_ALLOC is not set
# CONFIG_FAULT_INJECTION_USERCOPY is not set
CONFIG_FAIL_MAKE_REQUEST=y
# CONFIG_FAIL_IO_TIMEOUT is not set
# CONFIG_FAIL_FUTEX is not set
CONFIG_FAULT_INJECTION_DEBUG_FS=y
# CONFIG_FAIL_FUNCTION is not set
CONFIG_ARCH_HAS_KCOV=y
CONFIG_CC_HAS_SANCOV_TRACE_PC=y
# CONFIG_KCOV is not set
CONFIG_RUNTIME_TESTING_MENU=y
# CONFIG_LKDTM is not set
# CONFIG_TEST_MIN_HEAP is not set
# CONFIG_TEST_DIV64 is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_TEST_REF_TRACKER is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_REED_SOLOMON_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_ASYNC_RAID6_TEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_STRING_SELFTEST is not set
# CONFIG_TEST_STRING_HELPERS is not set
# CONFIG_TEST_STRSCPY is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_SCANF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_XARRAY is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_SIPHASH is not set
# CONFIG_TEST_IDA is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_BITOPS is not set
# CONFIG_TEST_VMALLOC is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_TEST_BPF is not set
# CONFIG_TEST_BLACKHOLE_DEV is not set
# CONFIG_FIND_BIT_BENCHMARK is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_TEST_KMOD is not set
# CONFIG_TEST_MEMCAT_P is not set
# CONFIG_TEST_MEMINIT is not set
# CONFIG_TEST_FREE_PAGES is not set
# CONFIG_TEST_FPU is not set
# CONFIG_TEST_CLOCKSOURCE_WATCHDOG is not set
CONFIG_ARCH_USE_MEMTEST=y
CONFIG_MEMTEST=y
# end of Kernel Testing and Coverage
# end of Kernel hacking

[-- Attachment #3: job-script --]
[-- Type: text/plain, Size: 4769 bytes --]

#!/bin/sh

export_top_env()
{
	export suite='boot'
	export testcase='boot'
	export category='functional'
	export timeout='10m'
	export job_origin='boot.yaml'
	export queue_cmdline_keys='branch
commit'
	export queue='bisect'
	export testbox='vm-snb-111'
	export tbox_group='vm-snb'
	export branch='linux-devel/devel-hourly-20220528-004237'
	export commit='8ebccd60c2db6beefef2f39b05a95024be0c39eb'
	export kconfig='x86_64-kexec'
	export nr_vm=160
	export submit_id='62917158f998e8d4579602ae'
	export job_file='/lkp/jobs/scheduled/vm-snb-111/boot-1-debian-10.4-x86_64-20200603.cgz-8ebccd60c2db6beefef2f39b05a95024be0c39eb-20220528-119895-1g3k9uq-1.yaml'
	export id='11310633211e37d0ea6683b97cb7e62ee2ac1467'
	export queuer_version='/zday/lkp'
	export model='qemu-system-x86_64 -enable-kvm -cpu SandyBridge'
	export nr_cpu=2
	export memory='16G'
	export need_kconfig=\{\"KVM_GUEST\"\=\>\"y\"\}
	export ssh_base_port=23032
	export kernel_cmdline='vmalloc=256M initramfs_async=0 page_owner=on'
	export rootfs='debian-10.4-x86_64-20200603.cgz'
	export compiler='gcc-11'
	export enqueue_time='2022-05-28 08:48:24 +0800'
	export _id='629174a6f998e8d4579602af'
	export _rt='/result/boot/1/vm-snb/debian-10.4-x86_64-20200603.cgz/x86_64-kexec/gcc-11/8ebccd60c2db6beefef2f39b05a95024be0c39eb'
	export user='lkp'
	export LKP_SERVER='internal-lkp-server'
	export result_root='/result/boot/1/vm-snb/debian-10.4-x86_64-20200603.cgz/x86_64-kexec/gcc-11/8ebccd60c2db6beefef2f39b05a95024be0c39eb/3'
	export scheduler_version='/lkp/lkp/.src-20220525-200837'
	export arch='x86_64'
	export max_uptime=600
	export initrd='/osimage/debian/debian-10.4-x86_64-20200603.cgz'
	export bootloader_append='root=/dev/ram0
RESULT_ROOT=/result/boot/1/vm-snb/debian-10.4-x86_64-20200603.cgz/x86_64-kexec/gcc-11/8ebccd60c2db6beefef2f39b05a95024be0c39eb/3
BOOT_IMAGE=/pkg/linux/x86_64-kexec/gcc-11/8ebccd60c2db6beefef2f39b05a95024be0c39eb/vmlinuz-5.18.0-rc5-00059-g8ebccd60c2db
branch=linux-devel/devel-hourly-20220528-004237
job=/lkp/jobs/scheduled/vm-snb-111/boot-1-debian-10.4-x86_64-20200603.cgz-8ebccd60c2db6beefef2f39b05a95024be0c39eb-20220528-119895-1g3k9uq-1.yaml
user=lkp
ARCH=x86_64
kconfig=x86_64-kexec
commit=8ebccd60c2db6beefef2f39b05a95024be0c39eb
vmalloc=256M initramfs_async=0 page_owner=on
max_uptime=600
LKP_SERVER=internal-lkp-server
selinux=0
debug
apic=debug
sysrq_always_enabled
rcupdate.rcu_cpu_stall_timeout=100
net.ifnames=0
printk.devkmsg=on
panic=-1
softlockup_panic=1
nmi_watchdog=panic
oops=panic
load_ramdisk=2
prompt_ramdisk=0
drbd.minor_count=8
systemd.log_level=err
ignore_loglevel
console=tty0
earlyprintk=ttyS0,115200
console=ttyS0,115200
vga=normal
rw'
	export modules_initrd='/pkg/linux/x86_64-kexec/gcc-11/8ebccd60c2db6beefef2f39b05a95024be0c39eb/modules.cgz'
	export bm_initrd='/osimage/deps/debian-10.4-x86_64-20200603.cgz/run-ipconfig_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/lkp_20220105.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/rsync-rootfs_20200608.cgz'
	export lkp_initrd='/osimage/user/lkp/lkp-x86_64.cgz'
	export site='inn'
	export LKP_CGI_PORT=80
	export LKP_CIFS_PORT=139
	export schedule_notify_address=
	export kernel='/pkg/linux/x86_64-kexec/gcc-11/8ebccd60c2db6beefef2f39b05a95024be0c39eb/vmlinuz-5.18.0-rc5-00059-g8ebccd60c2db'
	export dequeue_time='2022-05-28 09:03:30 +0800'
	export job_initrd='/lkp/jobs/scheduled/vm-snb-111/boot-1-debian-10.4-x86_64-20200603.cgz-8ebccd60c2db6beefef2f39b05a95024be0c39eb-20220528-119895-1g3k9uq-1.cgz'

	[ -n "$LKP_SRC" ] ||
	export LKP_SRC=/lkp/${user:-lkp}/src
}

run_job()
{
	echo $$ > $TMP/run-job.pid

	. $LKP_SRC/lib/http.sh
	. $LKP_SRC/lib/job.sh
	. $LKP_SRC/lib/env.sh

	export_top_env

	run_monitor $LKP_SRC/monitors/one-shot/wrapper boot-slabinfo
	run_monitor $LKP_SRC/monitors/one-shot/wrapper boot-meminfo
	run_monitor $LKP_SRC/monitors/one-shot/wrapper memmap
	run_monitor $LKP_SRC/monitors/no-stdout/wrapper boot-time
	run_monitor $LKP_SRC/monitors/wrapper kmsg
	run_monitor $LKP_SRC/monitors/wrapper heartbeat
	run_monitor $LKP_SRC/monitors/wrapper meminfo
	run_monitor $LKP_SRC/monitors/wrapper oom-killer
	run_monitor $LKP_SRC/monitors/plain/watchdog

	run_test $LKP_SRC/tests/wrapper sleep 1
}

extract_stats()
{
	export stats_part_begin=
	export stats_part_end=

	$LKP_SRC/stats/wrapper boot-slabinfo
	$LKP_SRC/stats/wrapper boot-meminfo
	$LKP_SRC/stats/wrapper memmap
	$LKP_SRC/stats/wrapper boot-memory
	$LKP_SRC/stats/wrapper boot-time
	$LKP_SRC/stats/wrapper kernel-size
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper sleep
	$LKP_SRC/stats/wrapper meminfo

	$LKP_SRC/stats/wrapper time sleep.time
	$LKP_SRC/stats/wrapper dmesg
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper last_state
	$LKP_SRC/stats/wrapper stderr
	$LKP_SRC/stats/wrapper time
}

"$@"

[-- Attachment #4: dmesg.xz --]
[-- Type: application/x-xz, Size: 13876 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier
  2022-05-27 15:45       ` Aneesh Kumar K V
@ 2022-05-30 12:36         ` Jonathan Cameron
  0 siblings, 0 replies; 66+ messages in thread
From: Jonathan Cameron @ 2022-05-30 12:36 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On Fri, 27 May 2022 21:15:09 +0530
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:

> On 5/27/22 8:15 PM, Jonathan Cameron wrote:
> > On Fri, 27 May 2022 17:55:26 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> >   
> >> The rank approach allows us to keep memory tier device IDs stable even if there
> >> is a need to change the tier ordering among different memory tiers. e.g. DRAM
> >> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
> >> or lower than these nodes. A new memory tier can be inserted into the tier
> >> hierarchy for a new set of nodes without affecting the node assignment of any
> >> existing memtier, provided that there is enough gap in the rank values for the
> >> new memtier.
> >>
> >> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
> >> Its value relative to other memtiers decides the level of this memtier in the tier
> >> hierarchy.
> >>
> >> For now, this patch supports hardcoded rank values of 100, 200, and 300 for
> >> memory tiers 0, 1, and 2 respectively.
> >>
> >> Below is the sysfs interface to read the rank values of memory tier,
> >> /sys/devices/system/memtier/memtierN/rank
> >>
> >> This interface is read-only for now.  Write support can be added when there
> >> is a need for more memory tiers (> 3) with flexible ordering requirements
> >> among them; rank can be used for that, since it is now the rank, not the
> >> memory tier device id, that decides the memory tiering order.
> >>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>  
> > 
> > I'd squash a lot of this with the original patch introducing tiers. As things
> > stand we have 2 tricky to follow patches covering the same code rather than
> > one that would be simpler.
> >   
> 
> Sure. Will do that in the next update.
> 
> > Jonathan
> >   
> >> ---
> >>   drivers/base/node.c     |   5 +-
> >>   drivers/dax/kmem.c      |   2 +-
> >>   include/linux/migrate.h |  17 ++--
> >>   mm/migrate.c            | 218 ++++++++++++++++++++++++----------------
> >>   4 files changed, 144 insertions(+), 98 deletions(-)
> >>
> >> diff --git a/drivers/base/node.c b/drivers/base/node.c
> >> index cf4a58446d8c..892f7c23c94e 100644
> >> --- a/drivers/base/node.c
> >> +++ b/drivers/base/node.c
> >> @@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev,
> >>   			    char *buf)
> >>   {
> >>   	int node = dev->id;
> >> +	int tier_index = node_get_memory_tier_id(node);
> >>   
> >> -	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> >> +	if (tier_index != -1)
> >> +		return sysfs_emit(buf, "%d\n", tier_index);  
> > I think failure to get a tier is an error. So if it happens, return an error code.
> > Also prefered to handle errors out of line as more idiomatic so reviewers
> > read it quicker.
> > 
> > 	if (tier_index == -1)
> > 		return -EINVAL;
> > 
> > 	return sysfs_emit()...
> >   
> >> +	return 0;
> >>   }
> >>     
> 
> 
> That was needed to handle NUMA nodes that are not part of any memory 
> tiers, such as a CPU-only NUMA node or a NUMA node that doesn't want to 
> participate in memory demotion.
> 
> 
> 
> >>   static ssize_t memtier_store(struct device *dev,
> >> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> >> index 991782aa2448..79953426ddaf 100644
> >> --- a/drivers/dax/kmem.c
> >> +++ b/drivers/dax/kmem.c
> >> @@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >>   	dev_set_drvdata(dev, data);
> >>     
> 
> 
> ...
> 
> >>   
> >> -static DEVICE_ATTR_RO(default_tier);
> >> +static DEVICE_ATTR_RO(default_rank);
> >>   
> >>   static struct attribute *memoty_tier_attrs[] = {
> >> -	&dev_attr_max_tiers.attr,
> >> -	&dev_attr_default_tier.attr,
> >> +	&dev_attr_max_tier.attr,
> >> +	&dev_attr_default_rank.attr,  
> > 
> > hmm. Not sure why rename to tier rather than tiers.
> > 
> > Also, I think the default should be a tier, not a rank.  If someone later
> > wants to change the rank of tier1, that's up to them, but any newly hotplugged
> > memory should still end up in there by default.
> >   
> 
> Didn't we say the tier index/device id is a meaningless entity that 
> controls just the naming? i.e., for memtier128, 128 doesn't mean anything.

> Instead, it is the rank value associated with memtier128 that controls the 
> demotion order? If so, what we want to report to userspace is the maximum 
> tier index userspace can expect, and the default rank value at which 
> hotplugged memory will be added.

Sort of.  I think we want default to refer to a particular tier, probably
at all times, thus allowing the administrator to potentially change what the
rank of that default group is for everything currently in it and anything
added later.  So I would keep the default as pointing to a particular
tier. This also reflects the earlier discussion about having multiple tiers
with the same rank. I would allow that, as it makes for a cleaner interface
if we make rank writeable in the future. If that happens, what does
default rank mean? Which of the tiers is used?

For other cases, rank is the value that matters for ordering but the particular
tier is what a driver etc uses.

The reason being to allow an admin to change the rank of (for example)
all GPU memory, such that it affects the GPU memory already present and
any added in the future (rather than a new tier being created with whatever
the GPU driver thinks the rank should be).  The way I think about this
means that default should be the same - tied to a particular tier, not
a particular rank.  If software wants the current rank of the default tier
then it can go look it up in the tier. 
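The default-follows-the-tier semantics argued for here can be sketched with a small shell simulation. Everything below is hypothetical and runs in a temp directory (no real /sys/devices/system/memtier is touched); the memtier names and rank values are illustrative:

```shell
#!/bin/sh
# Simulate the proposed layout: "default" is a symlink to a tier
# directory, and each tier carries a mutable "rank" file.
set -e
d=$(mktemp -d)
mkdir "$d/memtier0" "$d/memtier1" "$d/memtier2"
echo 100 > "$d/memtier0/rank"
echo 200 > "$d/memtier1/rank"
echo 300 > "$d/memtier2/rank"
ln -s memtier1 "$d/default"        # default is tied to a tier, not a rank

# An admin retunes the rank of the default tier ...
echo 250 > "$d/default/rank"

# ... and "default" still names the same tier; only its rank moved.
result="$(readlink "$d/default") $(cat "$d/default/rank")"
echo "default -> $result"
rm -r "$d"
```

Under this model, hotplugged memory keeps landing in memtier1 no matter how its rank is tuned, which matches the argument above; software wanting the current rank of the default tier reads it from the tier itself.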

> 
> But yes, tier index 1 and default rank 200 are reserved and created by 
> default.
> 
> 
> ....
> 
> >>   	/*
> >>   	 * if node is already part of the tier proceed with the
> >>   	 * current tier value, because we might want to establish
> >> @@ -2411,15 +2452,17 @@ int node_set_memory_tier(int node, int tier)
> >>   	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
> >>   	 * will have skipped this node.
> >>   	 */
> >> -	if (current_tier != -1)
> >> -		tier = current_tier;
> >> -	ret = __node_set_memory_tier(node, tier);
> >> +	if (memtier)
> >> +		establish_migration_targets();
> >> +	else {
> >> +		/* For now rank value and tier value is same. */  
> > 
> > We should avoid baking that in...  
> 
> 
> Making it dynamic adds a lot of complexity, such as an IDA allocation for the 
> tier index, etc. I didn't want to go there unless we are sure we need a 
> dynamic number of tiers.

Agreed it's more complex (though not very).  I'm just suggesting dropping
the comment.

If it were me, I'd make tier0 the default with the mid rank. Then tier1
as slower and tier2 as faster.  Hopefully that would avoid any userspace
code making assumptions about the ordering.

Jonathan




> 
> -aneesh
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-29  4:31     ` Ying Huang
@ 2022-05-30 12:50       ` Jonathan Cameron
  2022-05-31  1:57         ` Ying Huang
  2022-06-07 19:25         ` Tim Chen
  0 siblings, 2 replies; 66+ messages in thread
From: Jonathan Cameron @ 2022-05-30 12:50 UTC (permalink / raw)
  To: Ying Huang
  Cc: Wei Xu, Aneesh Kumar K V, Andrew Morton, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Linux MM,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Sun, 29 May 2022 12:31:30 +0800
Ying Huang <ying.huang@intel.com> wrote:

> On Fri, 2022-05-27 at 09:30 -0700, Wei Xu wrote:
> > On Fri, May 27, 2022 at 6:41 AM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:  
> > > 
> > > On 5/27/22 2:52 AM, Wei Xu wrote:
> > >   
> > > >    The order of memory tiers is determined by their rank values, not by
> > > >    their memtier device names.
> > > > 
> > > >    - /sys/devices/system/memtier/possible
> > > > 
> > > >      Format: ordered list of "memtier(rank)"
> > > >      Example: 0(64), 1(128), 2(192)
> > > > 
> > > >      Read-only.  When read, list all available memory tiers and their
> > > >      associated ranks, ordered by the rank values (from the highest
> > > >       tier to the lowest tier).
> > > >   
> > > 
> > > Did we discuss the need for this? I haven't done this in the patch
> > > series I sent across.  
> > 
> > The "possible" file is only needed if we decide to hide the
> > directories of memtiers that have no nodes.  We can remove this
> > interface and always show all memtier directories to keep things
> > simpler.  
> 
> When discussed offline, Tim Chen pointed out that with the proposed
> interface, it's inconvenient to know the position of a given memory tier
> among all memory tiers.  We must sort the "rank" of all memory tiers to
> know that.  The "possible" file can be used for that.  Although the
> "possible" file can be generated with a shell script, it's more
> convenient to show it directly.
> 
> Another way to address the issue is to add memtierN/pos for each memory
> tier, as suggested by Tim.  It's read-only and will show the position of
> "memtierN" among all memory tiers.  It's even better to show the relative
> position to the default memory tier (DRAM with CPU). That is, the
> position of the DRAM memory tier is 0.
> 
> Unlike memory tier device ID or rank, the position is relative and
> dynamic.

Hi,

I'm unconvinced.  This is better done with a shell script than
by adding ABI we'll have to live with forever.

I'm no good at shell scripting, but this does the job:
grep "" tier*/rank | sort -n -k 2 -t : 

tier2/rank:50
tier0/rank:100
tier1/rank:200
tier3/rank:240

I'm sure someone more knowledgeable will do it in a simpler fashion still.
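Fleshing that one-liner out slightly, the ordered "memtier(rank)" listing proposed for the "possible" file can indeed be produced entirely in userspace. The sketch below simulates the proposed sysfs layout in a temp directory (the tier IDs and rank values are made up for illustration):

```shell
#!/bin/sh
# Build a "possible"-style listing from per-tier rank files, sorted by
# rank, without any new kernel ABI.
set -e
d=$(mktemp -d)
for t in 0 1 2 3; do mkdir "$d/memtier$t"; done
echo 100 > "$d/memtier0/rank"
echo 200 > "$d/memtier1/rank"
echo 50  > "$d/memtier2/rank"
echo 240 > "$d/memtier3/rank"

# grep "" prints "path:rank"; sort numerically on the rank field, then
# rewrite each line as "tierid(rank)" and join with ", ".
ordered=$(grep "" "$d"/memtier*/rank | sort -t : -k 2 -n |
	sed 's|.*/memtier\([0-9]*\)/rank:\(.*\)|\1(\2)|' |
	awk 'NR > 1 { printf ", " } { printf "%s", $0 } END { print "" }')
echo "$ordered"
rm -r "$d"
```

With the ranks above this prints "2(50), 0(100), 1(200), 3(240)", the same ordering as the grep|sort one-liner.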

Jonathan

> 
> Best Regards,
> Huang, Ying
> 
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-30 12:50       ` Jonathan Cameron
@ 2022-05-31  1:57         ` Ying Huang
  2022-06-07 19:25         ` Tim Chen
  1 sibling, 0 replies; 66+ messages in thread
From: Ying Huang @ 2022-05-31  1:57 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Wei Xu, Aneesh Kumar K V, Andrew Morton, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Linux MM,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Mon, 2022-05-30 at 13:50 +0100, Jonathan Cameron wrote:
> On Sun, 29 May 2022 12:31:30 +0800
> Ying Huang <ying.huang@intel.com> wrote:
> 
> > On Fri, 2022-05-27 at 09:30 -0700, Wei Xu wrote:
> > > On Fri, May 27, 2022 at 6:41 AM Aneesh Kumar K V
> > > <aneesh.kumar@linux.ibm.com> wrote:  
> > > > 
> > > > On 5/27/22 2:52 AM, Wei Xu wrote:
> > > >   
> > > > >    The order of memory tiers is determined by their rank values, not by
> > > > >    their memtier device names.
> > > > > 
> > > > >    - /sys/devices/system/memtier/possible
> > > > > 
> > > > >      Format: ordered list of "memtier(rank)"
> > > > >      Example: 0(64), 1(128), 2(192)
> > > > > 
> > > > >      Read-only.  When read, list all available memory tiers and their
> > > > >      associated ranks, ordered by the rank values (from the highest
> > > > >       tier to the lowest tier).
> > > > >   
> > > > 
> > > > Did we discuss the need for this? I haven't done this in the patch
> > > > series I sent across.  
> > > 
> > > The "possible" file is only needed if we decide to hide the
> > > directories of memtiers that have no nodes.  We can remove this
> > > interface and always show all memtier directories to keep things
> > > simpler.  
> > 
> > When discussed offline, Tim Chen pointed out that with the proposed
> > interface, it's inconvenient to know the position of a given memory tier
> > among all memory tiers.  We must sort the "rank" of all memory tiers to
> > know that.  The "possible" file can be used for that.  Although the
> > "possible" file can be generated with a shell script, it's more
> > convenient to show it directly.
> > 
> > Another way to address the issue is to add memtierN/pos for each memory
> > tier, as suggested by Tim.  It's read-only and will show the position of
> > "memtierN" among all memory tiers.  It's even better to show the relative
> > position to the default memory tier (DRAM with CPU). That is, the
> > position of the DRAM memory tier is 0.
> > 
> > Unlike memory tier device ID or rank, the position is relative and
> > dynamic.
> 
> Hi,
> 
> I'm unconvinced.  This is better done with a shell script than
> by adding ABI we'll have to live with forever.
> 
> I'm no good at shell scripting, but this does the job:
> grep "" tier*/rank | sort -n -k 2 -t : 
> 
> tier2/rank:50
> tier0/rank:100
> tier1/rank:200
> tier3/rank:240
> 
> I'm sure someone more knowledgeable will do it in a simpler fashion still.

I am OK to leave this to be added later if we find that it's useful.

Best Regards,
Huang, Ying

> Jonathan
> 
> > 
> > Best Regards,
> > Huang, Ying
> > 
> > 
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-05-27 12:25   ` [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
@ 2022-06-01  6:29     ` Bharata B Rao
  2022-06-01 13:49       ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Bharata B Rao @ 2022-06-01  6:29 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 
> By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which is
> memory tier 1 and is designated for nodes with DRAM, so it is not the
> right tier for dax devices.
> 
> Set the dax kmem device node's tier to MEMORY_TIER_PMEM. In the future,
> support should be added to distinguish the dax devices which should
> not be MEMORY_TIER_PMEM and to set the right memory tier for them.
> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/dax/kmem.c | 4 ++++
>  mm/migrate.c       | 2 ++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index a37622060fff..991782aa2448 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -11,6 +11,7 @@
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  #include <linux/mman.h>
> +#include <linux/migrate.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  
>  	dev_set_drvdata(dev, data);
>  
> +#ifdef CONFIG_TIERED_MEMORY
> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif

I was experimenting with this patchset and found this behaviour.
Here's what I did:

Boot a KVM guest with a vNVDIMM device, which ends up with the
device_dax driver by default.

Use it as RAM by binding it to the dax kmem driver. It now appears as
RAM with a new NUMA node that is put into memtier1 (the existing tier
where DRAM already exists).

I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
that expected to happen automatically when a node with a dax kmem
device comes up?

Regards,
Bharata.
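For anyone reproducing this, the conventional way to hand a device-dax instance to the kmem driver is the unbind/new_id dance below. The device name (dax0.0) is illustrative, and the commands are only printed rather than executed, since they need a vNVDIMM-backed guest and root privileges:

```shell
#!/bin/sh
# Print (not execute) the rebinding steps described above.
steps=$(cat <<'EOF'
# 1. Release the device from device_dax:
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
# 2. Hand it to the kmem driver, which onlines it as a new NUMA node:
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
# 3. See which tier the new node landed in (proposed interface):
grep . /sys/devices/system/memtier/memtier*/nodelist
EOF
)
echo "$steps"
```

Step 3 assumes the sysfs interface proposed in this series is present.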

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-01  6:29     ` Bharata B Rao
@ 2022-06-01 13:49       ` Aneesh Kumar K V
  2022-06-02  6:36         ` Bharata B Rao
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-01 13:49 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/1/22 11:59 AM, Bharata B Rao wrote:
> On 5/27/2022 5:55 PM, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>
>> By default, all nodes are assigned to DEFAULT_MEMORY_TIER, which is
>> memory tier 1 and is designated for nodes with DRAM, so it is not the
>> right tier for dax devices.
>>
>> Set the dax kmem device node's tier to MEMORY_TIER_PMEM. In the future,
>> support should be added to distinguish the dax devices which should
>> not be MEMORY_TIER_PMEM and to set the right memory tier for them.
>>
>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   drivers/dax/kmem.c | 4 ++++
>>   mm/migrate.c       | 2 ++
>>   2 files changed, 6 insertions(+)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index a37622060fff..991782aa2448 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -11,6 +11,7 @@
>>   #include <linux/fs.h>
>>   #include <linux/mm.h>
>>   #include <linux/mman.h>
>> +#include <linux/migrate.h>
>>   #include "dax-private.h"
>>   #include "bus.h"
>>   
>> @@ -147,6 +148,9 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   
>>   	dev_set_drvdata(dev, data);
>>   
>> +#ifdef CONFIG_TIERED_MEMORY
>> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
> 
> I was experimenting with this patchset and found this behaviour.
> Here's what I did:
> 
> Boot a KVM guest with a vNVDIMM device, which ends up with the
> device_dax driver by default.
> 
> Use it as RAM by binding it to the dax kmem driver. It now appears as
> RAM with a new NUMA node that is put into memtier1 (the existing tier
> where DRAM already exists).
> 

That should have placed it in memtier2.

> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
> that expected to happen automatically when a node with a dax kmem
> device comes up?
> 

This can happen if we have added the same NUMA node to memtier1 before 
the dax kmem driver initialized the pmem memory. Can you check, before the 
above node_set_memory_tier_rank() call, whether the specific NUMA node is 
already part of any memory tier?
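A quick userspace version of that check (which tier, if any, a node currently belongs to) could look like the following. The layout is simulated in a temp directory and the tier membership is made up; it also assumes the proposed memtierN/nodelist files:

```shell
#!/bin/sh
# Find which memory tier a NUMA node belongs to via memtierN/nodelist.
# A real tool would expand ranges such as "0-3" in nodelist; this
# sketch only word-matches single node IDs.
set -e
d=$(mktemp -d)
mkdir "$d/memtier1" "$d/memtier2"
echo "0-1" > "$d/memtier1/nodelist"   # DRAM nodes
echo "2"   > "$d/memtier2/nodelist"   # dax kmem node
node=2
tier=$(grep -lw "$node" "$d"/memtier*/nodelist |
	sed 's|.*/\(memtier[0-9]*\)/.*|\1|')
echo "node $node is in ${tier:-no tier}"
rm -r "$d"
```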

Thank you for testing the patchset.
-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-05-27 12:25   ` [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
@ 2022-06-02  6:07     ` Ying Huang
  2022-06-06  2:49       ` Ying Huang
  2022-06-08  7:16     ` Ying Huang
  1 sibling, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-02  6:07 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
> 
> This current memory tier kernel interface needs to be improved for
> several important use cases.
> 
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
> 
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better to be placed into the
> next lower tier.
> 
> With the current kernel, a higher tier node can only be demoted to
> selected nodes on the next lower tier as defined by the demotion path,
> not to any other node from any lower tier.  This strict, hard-coded
> demotion order does not work in all use cases (e.g. some use cases may
> want to allow cross-socket demotion to another node in the same
> demotion tier as a fallback when the preferred demotion node is out of
> space).  This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: The page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.
> 
> The current kernel also doesn't provide any interfaces for
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
> 
> This patch series addresses the above by defining memory tiers explicitly.
> 
> This patch adds the below sysfs interface, which is read-only and
> can be used to read the nodes available in a specific tier:
> 
> /sys/devices/system/memtier/memtierN/nodelist
> 
> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> lowest tier. The absolute value of a tier id number has no specific
> meaning; what matters is the relative order of the tier id numbers.
> 
> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> The default number of memory tiers is MAX_MEMORY_TIERS (3). All
> nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
> 
> The default memory tier can be read from
> /sys/devices/system/memtier/default_tier
> 
> The max memory tier can be read from
> /sys/devices/system/memtier/max_tiers
> 
> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> 
> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

IMHO, we should change the kernel-internal implementation first, then
implement the kernel/user space interface.  That is, make memory tiers
explicit inside the kernel, then expose them to user space.

Best Regards,
Huang, Ying


[snip]
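[Editor's sketch] The semantics the quoted patch describes (MAX_MEMORY_TIERS tiers, every node defaulting to DEFAULT_MEMORY_TIER, and a read-only per-tier nodelist) can be modeled in a few lines. The class and method names below are illustrative only, not kernel code:

```python
MAX_MEMORY_TIERS = 3      # what /sys/devices/system/memtier/max_tiers reports
DEFAULT_MEMORY_TIER = 1   # what /sys/devices/system/memtier/default_tier reports

class MemoryTiers:
    def __init__(self):
        # tier id -> set of NUMA node ids
        self.tiers = {t: set() for t in range(MAX_MEMORY_TIERS)}

    def add_node(self, node, tier=DEFAULT_MEMORY_TIER):
        if not 0 <= tier < MAX_MEMORY_TIERS:
            raise ValueError("invalid tier")
        # A node belongs to exactly one tier at a time.
        for nodes in self.tiers.values():
            nodes.discard(node)
        self.tiers[tier].add(node)

    def nodelist(self, tier):
        # Roughly what memtierN/nodelist would show, as a sorted list.
        return sorted(self.tiers[tier])

mt = MemoryTiers()
for node in (0, 1, 2):
    mt.add_node(node)      # hot-added nodes land in the default tier
mt.add_node(2, tier=2)     # e.g. a PMEM node moved to a lower tier
```

This also illustrates why the tier id's absolute value carries no meaning: only the relative order of the tier ids matters.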


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-01 13:49       ` Aneesh Kumar K V
@ 2022-06-02  6:36         ` Bharata B Rao
  2022-06-03  9:04           ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Bharata B Rao @ 2022-06-02  6:36 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>> I was experimenting with this patchset and found this behaviour.
>> Here's what I did:
>>
>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>> driver by default.
>>
>> Use it as RAM by binding it to dax kmem driver. It now appears as
>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>> where DRAM already exists)
>>
> 
> That should have placed it in memtier2.
> 
>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>> that expected to happen automatically when a node with dax kmem
>> device comes up?
>>
> 
> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?

When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
is already part of memtier1 whose nodelist shows 0-1.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier
  2022-05-27 12:25   ` [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier Aneesh Kumar K.V
       [not found]     ` <20220527154557.00002c56@Huawei.com>
@ 2022-06-02  6:41     ` Ying Huang
  1 sibling, 0 replies; 66+ messages in thread
From: Ying Huang @ 2022-06-02  6:41 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> The rank approach allows us to keep memory tier device IDs stable even if there
> is a need to change the tier ordering among different memory tiers. e.g. DRAM
> nodes with CPUs will always be on memtier1, no matter how many tiers are higher
> or lower than these nodes. A new memory tier can be inserted into the tier
> hierarchy for a new set of nodes without affecting the node assignment of any
> existing memtier, provided that there is enough gap in the rank values for the
> new memtier.
> 
> The absolute value of "rank" of a memtier doesn't necessarily carry any meaning.
> Its value relative to other memtiers decides the level of this memtier in the tier
> hierarchy.
> 
> For now, this patch supports hardcoded rank values of 100, 200, and 300
> for memory tiers 0, 1, and 2 respectively.
> 
> Below is the sysfs interface to read the rank value of a memory tier:
> /sys/devices/system/memtier/memtierN/rank
> 
> This interface is read-only for now. Write support can be added when
> there is a need for more memory tiers (> 3) with a flexible ordering
> requirement among them; rank can be utilized there, since rank (and not
> the memory tier device id) now decides the memory tiering order.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/base/node.c     |   5 +-
>  drivers/dax/kmem.c      |   2 +-
>  include/linux/migrate.h |  17 ++--
>  mm/migrate.c            | 218 ++++++++++++++++++++++++----------------
>  4 files changed, 144 insertions(+), 98 deletions(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cf4a58446d8c..892f7c23c94e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -567,8 +567,11 @@ static ssize_t memtier_show(struct device *dev,
>  			    char *buf)
>  {
>  	int node = dev->id;
> +	int tier_index = node_get_memory_tier_id(node);
>  
> -	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> +	if (tier_index != -1)
> +		return sysfs_emit(buf, "%d\n", tier_index);
> +	return 0;
>  }
>  
>  static ssize_t memtier_store(struct device *dev,
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 991782aa2448..79953426ddaf 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -149,7 +149,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  	dev_set_drvdata(dev, data);
>  
>  #ifdef CONFIG_TIERED_MEMORY
> -	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +	node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM);

I think that we can work with memory tier ID inside kernel?

Best Regards,
Huang, Ying


[snip]
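[Editor's sketch] The rank mechanism in the quoted patch can be modeled outside the kernel as follows. The dictionary and helper are illustrative, using the hardcoded ranks 100/200/300 from the patch and the convention stated earlier in the series that tier 0 is the highest tier (so a lower rank sorts higher here):

```python
# Stable device IDs, with ordering decided by rank rather than by ID.
tiers = {"memtier0": 100, "memtier1": 200, "memtier2": 300}

def tier_order(tiers):
    """Tiers from highest to lowest, i.e. the demotion direction."""
    return sorted(tiers, key=tiers.__getitem__)

# A new tier can slot between memtier1 and memtier2 via the gap in rank
# values; no existing memtier device needs to be renumbered.
tiers["memtier3"] = 250
```

This is the property the commit message highlights: DRAM nodes stay on memtier1 regardless of how many tiers are later inserted above or below them.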


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers
  2022-05-27 12:25   ` [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
@ 2022-06-02  6:43     ` Ying Huang
  0 siblings, 0 replies; 66+ messages in thread
From: Ying Huang @ 2022-06-02  6:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> This patch adds the special string "none" as a supported memtier value
> that we can use to remove a specific node from being used as a demotion target.
> 
> For ex:
> :/sys/devices/system/node/node1# cat memtier
> 1
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 1-3
> :/sys/devices/system/node/node1# echo none > memtier
> :/sys/devices/system/node/node1#
> :/sys/devices/system/node/node1# cat memtier
> :/sys/devices/system/node/node1# cat ../../memtier/memtier1/nodelist
> 2-3
> :/sys/devices/system/node/node1#
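[Editor's sketch] The "none" semantics in the transcript above amount to clearing the node's tier assignment. A minimal model (function names are illustrative, not the kernel implementation):

```python
def write_memtier(node_tier, node, value):
    """Model of `echo <value> > node/nodeN/memtier` with "none" support:
    "none" detaches the node from its tier, so it is no longer used as a
    demotion target; a tier number moves the node into that tier."""
    if value == "none":
        node_tier.pop(node, None)
    else:
        node_tier[node] = int(value)

def nodelist(node_tier, tier):
    # Roughly what memtierN/nodelist would show after the write.
    return sorted(n for n, t in node_tier.items() if t == tier)

nt = {1: 1, 2: 1, 3: 1}       # nodes 1-3 in memtier1, as in the transcript
write_memtier(nt, 1, "none")  # node1 drops out; memtier1 keeps 2-3
```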

Why do you need this?  Do you have some real users?

Best Regards,
Huang, Ying


[snip]



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-05-27 12:25   ` [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
@ 2022-06-02  7:35     ` Ying Huang
  2022-06-03 15:09       ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-02  7:35 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 
> Currently, a higher tier node can only be demoted to selected
> nodes on the next lower tier as defined by the demotion path,
> not to any other node from any lower tier.  This strict, hard-coded
> demotion order does not work in all use cases (e.g. some use cases
> may want to allow cross-socket demotion to another node in the same
> demotion tier as a fallback when the preferred demotion node is out
> of space). This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: The page allocation can fall back to any node from any
> lower tier, whereas the demotion order doesn't allow that currently.
> 
> This patch adds support to get the mask of all allowed demotion
> targets for a node. The demote_page_list() function is also modified
> to utilize this allowed node mask by filling it in the
> migration_target_control structure before passing it to migrate_pages().
> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/migrate.h |  5 ++++
>  mm/migrate.c            | 52 +++++++++++++++++++++++++++++++++++++----
>  mm/vmscan.c             | 38 ++++++++++++++----------------
>  3 files changed, 71 insertions(+), 24 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 77c581f47953..1f3cbd5185ca 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -182,6 +182,7 @@ void node_remove_from_memory_tier(int node);
>  int node_get_memory_tier_id(int node);
>  int node_set_memory_tier_rank(int node, int tier);
>  int node_reset_memory_tier(int node, int tier);
> +void node_get_allowed_targets(int node, nodemask_t *targets);
>  #else
>  #define numa_demotion_enabled	false
>  static inline int next_demotion_node(int node)
> @@ -189,6 +190,10 @@ static inline int next_demotion_node(int node)
>  	return NUMA_NO_NODE;
>  }
>  
> +static inline void node_get_allowed_targets(int node, nodemask_t *targets)
> +{
> +	*targets = NODE_MASK_NONE;
> +}
>  #endif	/* CONFIG_TIERED_MEMORY */
>  
>  #endif /* _LINUX_MIGRATE_H */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 114c7428b9f3..84fac477538c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2129,6 +2129,7 @@ struct memory_tier {
>  
>  struct demotion_nodes {
>  	nodemask_t preferred;
> +	nodemask_t allowed;
>  };
>  
>  #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> @@ -2475,6 +2476,25 @@ int node_set_memory_tier_rank(int node, int rank)
>  }
>  EXPORT_SYMBOL_GPL(node_set_memory_tier_rank);
>  
> +void node_get_allowed_targets(int node, nodemask_t *targets)
> +{
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * If any node is moving to lower tiers then modifications
> +	 * in node_demotion[] are still valid for this node, if any
> +	 * node is moving to higher tier then moving node may be
> +	 * used once for demotion which should be ok so rcu should
> +	 * be enough here.
> +	 */
> +	rcu_read_lock();
> +
> +	*targets = node_demotion[node].allowed;
> +
> +	rcu_read_unlock();
> +}
> +
>  /**
>   * next_demotion_node() - Get the next node in the demotion path
>   * @node: The starting node to lookup the next node
> @@ -2534,8 +2554,10 @@ static void __disable_all_migrate_targets(void)
>  {
>  	int node;
>  
> -	for_each_node_mask(node, node_states[N_MEMORY])
> +	for_each_node_mask(node, node_states[N_MEMORY]) {
>  		node_demotion[node].preferred = NODE_MASK_NONE;
> +		node_demotion[node].allowed = NODE_MASK_NONE;
> +	}
>  }
>  
>  static void disable_all_migrate_targets(void)
> @@ -2558,12 +2580,11 @@ static void disable_all_migrate_targets(void)
>  */
>  static void establish_migration_targets(void)
>  {
> -	struct list_head *ent;
>  	struct memory_tier *memtier;
>  	struct demotion_nodes *nd;
> -	int tier, target = NUMA_NO_NODE, node;
> +	int target = NUMA_NO_NODE, node;
>  	int distance, best_distance;
> -	nodemask_t used;
> +	nodemask_t used, allowed = NODE_MASK_NONE;
>  
>  	if (!node_demotion)
>  		return;
> @@ -2603,6 +2624,29 @@ static void establish_migration_targets(void)
>  			}
>  		} while (1);
>  	}
> +	/*
> +	 * Now build the allowed mask for each node, collecting the node
> +	 * mask from all memory tiers below it. This allows us to fall back
> +	 * the demotion page allocation to a set of nodes that is close to
> +	 * the above selected preferred node.
> +	 */
> +	list_for_each_entry(memtier, &memory_tiers, list)
> +		nodes_or(allowed, allowed, memtier->nodelist);
> +	/*
> +	 * Removes nodes not yet in N_MEMORY.
> +	 */
> +	nodes_and(allowed, node_states[N_MEMORY], allowed);
> +
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		/*
> +		 * Keep removing the current tier from the allowed nodes.
> +		 * This will remove all nodes in the current and above
> +		 * memory tiers from the allowed mask.
> +		 */
> +		nodes_andnot(allowed, allowed, memtier->nodelist);
> +		for_each_node_mask(node, memtier->nodelist)
> +			node_demotion[node].allowed = allowed;
> +	}
>  }
>  
>  /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1678802e03e7..feb994589481 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1454,23 +1454,6 @@ static void folio_check_dirty_writeback(struct folio *folio,
>  		mapping->a_ops->is_dirty_writeback(&folio->page, dirty, writeback);
>  }
>  
> -static struct page *alloc_demote_page(struct page *page, unsigned long node)
> -{
> -	struct migration_target_control mtc = {
> -		/*
> -		 * Allocate from 'node', or fail quickly and quietly.
> -		 * When this happens, 'page' will likely just be discarded
> -		 * instead of migrated.
> -		 */
> -		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> -			    __GFP_THISNODE  | __GFP_NOWARN |
> -			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> -		.nid = node
> -	};
> -
> -	return alloc_migration_target(page, (unsigned long)&mtc);
> -}
> -
>  /*
>   * Take pages on @demote_list and attempt to demote them to
>   * another node.  Pages which are not demoted are left on
> @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>  {
>  	int target_nid = next_demotion_node(pgdat->node_id);
>  	unsigned int nr_succeeded;
> +	nodemask_t allowed_mask;
> +
> +	struct migration_target_control mtc = {
> +		/*
> +		 * Allocate from 'node', or fail quickly and quietly.
> +		 * When this happens, 'page' will likely just be discarded
> +		 * instead of migrated.
> +		 */
> +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> +			__GFP_NOMEMALLOC | GFP_NOWAIT,
> +		.nid = target_nid,
> +		.nmask = &allowed_mask
> +	};

IMHO, we should try to allocate from preferred node firstly (which will
kick kswapd of the preferred node if necessary).  If failed, we will
fallback to all allowed node.

As we discussed as follows,

https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/

That is, something like below,

static struct page *alloc_demote_page(struct page *page, unsigned long node)
{
	struct page *target;
	nodemask_t allowed_mask;
	struct migration_target_control mtc = {
		/*
		 * Allocate from 'node', or fail quickly and quietly.
		 * When this happens, 'page' will likely just be discarded
		 * instead of migrated.
		 */
		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
			    __GFP_THISNODE  | __GFP_NOWARN |
			    __GFP_NOMEMALLOC | GFP_NOWAIT,
		.nid = node
	};

	/* Try the preferred node first. */
	target = alloc_migration_target(page, (unsigned long)&mtc);
	if (target)
		return target;

	/*
	 * Fall back to any node in the allowed mask (allowed_mask would
	 * need to be populated, e.g. via node_get_allowed_targets()).
	 */
	mtc.gfp_mask &= ~__GFP_THISNODE;
	mtc.nmask = &allowed_mask;

	return alloc_migration_target(page, (unsigned long)&mtc);
}

Best Regards,
Huang, Ying

>  	if (list_empty(demote_pages))
>  		return 0;
> @@ -1488,10 +1484,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>  	if (target_nid == NUMA_NO_NODE)
>  		return 0;
>  
> +	node_get_allowed_targets(pgdat->node_id, &allowed_mask);
> +
>  	/* Demotion ignores all cpuset and mempolicy settings */
> -	migrate_pages(demote_pages, alloc_demote_page, NULL,
> -			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> -			    &nr_succeeded);
> +	migrate_pages(demote_pages, alloc_migration_target, NULL,
> +		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> +		      &nr_succeeded);
>  
>  	if (current_is_kswapd())
>  		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);
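[Editor's sketch] The allowed-mask construction in the quoted establish_migration_targets() hunk can be modeled in a few lines of Python; build_allowed_targets() below is an illustrative stand-in for the kernel code, not the implementation itself:

```python
def build_allowed_targets(tier_nodelists, n_memory):
    """Model of the quoted hunk: a node may demote to any node in any
    strictly lower tier.  tier_nodelists is ordered top tier first."""
    allowed = set()
    for nodes in tier_nodelists:     # union of the nodes in all tiers
        allowed |= nodes
    allowed &= n_memory              # drop nodes not yet in N_MEMORY

    node_allowed = {}
    for nodes in tier_nodelists:
        # Removing the current tier leaves only nodes in lower tiers,
        # since higher tiers were removed on earlier iterations.
        allowed -= nodes
        for node in nodes:
            node_allowed[node] = set(allowed)
    return node_allowed

# Two DRAM nodes in the top tier, two PMEM nodes in the tier below.
masks = build_allowed_targets([{0, 1}, {2, 3}], n_memory={0, 1, 2, 3})
```

The result matches the intent described in the commit message: top-tier nodes may fall back to any lower-tier node, while bottom-tier nodes have no demotion targets at all.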



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
       [not found]     ` <20220527151531.00002a0c@Huawei.com>
@ 2022-06-03  8:40       ` Aneesh Kumar K V
  2022-06-06 14:59         ` Jonathan Cameron
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-03  8:40 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On 5/27/22 7:45 PM, Jonathan Cameron wrote:
> On Fri, 27 May 2022 17:55:23 +0530
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> 
>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>
>> Add support to read/write the memory tier index for a NUMA node.
>>
>> /sys/devices/system/node/nodeN/memtier
>>
>> where N = node id
>>
>> When read, it lists the memory tier that the node belongs to.
>>
>> When written, the kernel moves the node into the specified
>> memory tier, the tier assignment of all other nodes are not
>> affected.
>>
>> If the memory tier does not exist, writing to the above file
>> create the tier and assign the NUMA node to that tier.
> creates
> 
> There was some discussion in v2 of Wei Xu's RFC that what matters
> for creation is the rank, not the tier number.
> 
> My suggestion is to move to an explicit creation file such as
> memtier/create_tier_from_rank,
> to which writing the rank results in a new tier
> with the next device ID and the requested rank.

I think the below workflow is much simpler.

:/sys/devices/system# cat memtier/memtier1/nodelist
1-3
:/sys/devices/system# cat node/node1/memtier
1
:/sys/devices/system# ls memtier/memtier*
nodelist  power  rank  subsystem  uevent
/sys/devices/system# ls memtier/
default_rank  max_tier  memtier1  power  uevent
:/sys/devices/system# echo 2 > node/node1/memtier
:/sys/devices/system#

:/sys/devices/system# ls memtier/
default_rank  max_tier  memtier1  memtier2  power  uevent
:/sys/devices/system# cat memtier/memtier1/nodelist
2-3
:/sys/devices/system# cat memtier/memtier2/nodelist
1
:/sys/devices/system#

I.e., to create a tier we just write the tier id/tier index to the 
node/nodeN/memtier file. That will create a new memory tier if needed 
and add the node to that specific memory tier. Since for now we have a 
1:1 mapping between tier index and rank value, we can derive the rank 
value from the memory tier index.

For dynamic memory tier support, we can assign rank values such that 
new memory tiers are always created to come last in the demotion order.
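[Editor's sketch] The workflow above can be modeled as follows; set_node_memtier() and the rank formula are illustrative assumptions, chosen only to match the hardcoded ranks 100/200/300 for tiers 0/1/2:

```python
def set_node_memtier(tiers, node, tier_id):
    """Model of `echo N > node/nodeX/memtier`: create the tier on
    demand, then move the node into it."""
    if tier_id not in tiers:
        # 1:1 mapping for now: derive the rank from the tier index
        # (tier 0 -> 100, tier 1 -> 200, tier 2 -> 300).
        tiers[tier_id] = {"rank": (tier_id + 1) * 100, "nodes": set()}
    for t in tiers.values():
        t["nodes"].discard(node)      # a node sits in exactly one tier
    tiers[tier_id]["nodes"].add(node)

# Mirrors the transcript: nodes 1-3 start in memtier1, then
# `echo 2 > node/node1/memtier` creates memtier2 and moves node1 there.
tiers = {1: {"rank": 200, "nodes": {1, 2, 3}}}
set_node_memtier(tiers, 1, 2)
```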

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-02  6:36         ` Bharata B Rao
@ 2022-06-03  9:04           ` Aneesh Kumar K V
  2022-06-06 10:11             ` Bharata B Rao
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-03  9:04 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/2/22 12:06 PM, Bharata B Rao wrote:
> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>> I was experimenting with this patchset and found this behaviour.
>>> Here's what I did:
>>>
>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>> driver by default.
>>>
>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>> where DRAM already exists)
>>>
>>
>> That should have placed it in memtier2.
>>
>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>> that expected to happen automatically when a node with dax kmem
>>> device comes up?
>>>
>>
>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
> 
> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
> is already part of memtier1 whose nodelist shows 0-1.
> 

Can you find out which code path added node1 to memtier1? Do you also 
have regular memory appearing on node1?

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-02  7:35     ` Ying Huang
@ 2022-06-03 15:09       ` Aneesh Kumar K V
  2022-06-06  0:43         ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-03 15:09 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On 6/2/22 1:05 PM, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>
>> currently, a higher tier node can only be demoted to selected
>> nodes on the next lower tier as defined by the demotion path,
>> not any other node from any lower tier.  This strict, hard-coded
>> demotion order does not work in all use cases (e.g. some use cases
>> may want to allow cross-socket demotion to another node in the same
>> demotion tier as a fallback when the preferred demotion node is out
>> of space). This demotion order is also inconsistent with the page
>> allocation fallback order when all the nodes in a higher tier are
>> out of space: The page allocation can fall back to any node from any
>> lower tier, whereas the demotion order doesn't allow that currently.
>>
>> This patch adds support to get all the allowed demotion targets mask
>> for node, also demote_page_list() function is modified to utilize this
>> allowed node mask by filling it in migration_target_control structure
>> before passing it to migrate_pages().
>

...

>>    * Take pages on @demote_list and attempt to demote them to
>>    * another node.  Pages which are not demoted are left on
>> @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>>   {
>>   	int target_nid = next_demotion_node(pgdat->node_id);
>>   	unsigned int nr_succeeded;
>> +	nodemask_t allowed_mask;
>> +
>> +	struct migration_target_control mtc = {
>> +		/*
>> +		 * Allocate from 'node', or fail quickly and quietly.
>> +		 * When this happens, 'page' will likely just be discarded
>> +		 * instead of migrated.
>> +		 */
>> +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
>> +			__GFP_NOMEMALLOC | GFP_NOWAIT,
>> +		.nid = target_nid,
>> +		.nmask = &allowed_mask
>> +	};
> 
> IMHO, we should try to allocate from preferred node firstly (which will
> kick kswapd of the preferred node if necessary).  If failed, we will
> fallback to all allowed node.
> 
> As we discussed as follows,
> 
> https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
> 
> That is, something like below,
> 
> static struct page *alloc_demote_page(struct page *page, unsigned long node)
> {
> 	struct page *target;
> 	nodemask_t allowed_mask;
> 	struct migration_target_control mtc = {
> 		/*
> 		 * Allocate from 'node', or fail quickly and quietly.
> 		 * When this happens, 'page' will likely just be discarded
> 		 * instead of migrated.
> 		 */
> 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> 			    __GFP_THISNODE  | __GFP_NOWARN |
> 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> 		.nid = node
> 	};
> 
> 	/* Try the preferred node first. */
> 	target = alloc_migration_target(page, (unsigned long)&mtc);
> 	if (target)
> 		return target;
> 
> 	/*
> 	 * Fall back to any node in the allowed mask (allowed_mask would
> 	 * need to be populated, e.g. via node_get_allowed_targets()).
> 	 */
> 	mtc.gfp_mask &= ~__GFP_THISNODE;
> 	mtc.nmask = &allowed_mask;
> 
> 	return alloc_migration_target(page, (unsigned long)&mtc);
> }

I skipped doing this in v5 because I was not sure this is really what 
we want. I guess we can do this as part of the change that is going to 
introduce the use of memory policy for the allocation?

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-03 15:09       ` Aneesh Kumar K V
@ 2022-06-06  0:43         ` Ying Huang
  2022-06-06  4:07           ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-06  0:43 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote:
> On 6/2/22 1:05 PM, Ying Huang wrote:
> > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > 
> > > currently, a higher tier node can only be demoted to selected
> > > nodes on the next lower tier as defined by the demotion path,
> > > not any other node from any lower tier.  This strict, hard-coded
> > > demotion order does not work in all use cases (e.g. some use cases
> > > may want to allow cross-socket demotion to another node in the same
> > > demotion tier as a fallback when the preferred demotion node is out
> > > of space). This demotion order is also inconsistent with the page
> > > allocation fallback order when all the nodes in a higher tier are
> > > out of space: The page allocation can fall back to any node from any
> > > lower tier, whereas the demotion order doesn't allow that currently.
> > > 
> > > This patch adds support to get all the allowed demotion targets mask
> > > for node, also demote_page_list() function is modified to utilize this
> > > allowed node mask by filling it in migration_target_control structure
> > > before passing it to migrate_pages().
> > 
> 
> ...
> 
> > >    * Take pages on @demote_list and attempt to demote them to
> > >    * another node.  Pages which are not demoted are left on
> > > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
> > >   {
> > >   	int target_nid = next_demotion_node(pgdat->node_id);
> > >   	unsigned int nr_succeeded;
> > > +	nodemask_t allowed_mask;
> > > +
> > > +	struct migration_target_control mtc = {
> > > +		/*
> > > +		 * Allocate from 'node', or fail quickly and quietly.
> > > +		 * When this happens, 'page' will likely just be discarded
> > > +		 * instead of migrated.
> > > +		 */
> > > +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> > > +			__GFP_NOMEMALLOC | GFP_NOWAIT,
> > > +		.nid = target_nid,
> > > +		.nmask = &allowed_mask
> > > +	};
> > 
> > IMHO, we should try to allocate from preferred node firstly (which will
> > kick kswapd of the preferred node if necessary).  If failed, we will
> > fallback to all allowed node.
> > 
> > As we discussed as follows,
> > 
> > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
> > 
> > That is, something like below,
> > 
> > static struct page *alloc_demote_page(struct page *page, unsigned long node)
> > {
> > 	struct page *target;
> > 	nodemask_t allowed_mask;
> > 	struct migration_target_control mtc = {
> > 		/*
> > 		 * Allocate from 'node', or fail quickly and quietly.
> > 		 * When this happens, 'page' will likely just be discarded
> > 		 * instead of migrated.
> > 		 */
> > 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> > 			    __GFP_THISNODE  | __GFP_NOWARN |
> > 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> > 		.nid = node
> > 	};
> > 
> > 	/* Try the preferred node first. */
> > 	target = alloc_migration_target(page, (unsigned long)&mtc);
> > 	if (target)
> > 		return target;
> > 
> > 	/*
> > 	 * Fall back to any node in the allowed mask (allowed_mask would
> > 	 * need to be populated, e.g. via node_get_allowed_targets()).
> > 	 */
> > 	mtc.gfp_mask &= ~__GFP_THISNODE;
> > 	mtc.nmask = &allowed_mask;
> > 
> > 	return alloc_migration_target(page, (unsigned long)&mtc);
> > }
> 
> I skipped doing this in v5 because I was not sure this is really what we 
> want.

I think so.  And this is the original behavior.  We should keep the
original behavior as much as possible, then make changes if necessary.

> I guess we can do this as part of the change that is going to 
> introduce the usage of memory policy for the allocation?

Like the memory allocation policy, the default policy should be local
preferred.  We shouldn't force users to use an explicit memory policy
for that.

And the added code isn't complex.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-02  6:07     ` Ying Huang
@ 2022-06-06  2:49       ` Ying Huang
  2022-06-06  3:56         ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-06  2:49 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > 
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based
> > on the distances between nodes.
> > 
> > This current memory tier kernel interface needs to be improved for
> > several important use cases:
> > 
> > The current tier initialization code always initializes
> > each memory-only NUMA node into a lower tier.  But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into a higher tier.
> > 
> > The current tier hierarchy always puts CPU nodes into the top
> > tier. But on a system with HBM or GPU devices, the
> > memory-only NUMA nodes mapping these devices should be in the
> > top tier, and DRAM nodes with CPUs are better to be placed into the
> > next lower tier.
> > 
> > With the current kernel, a higher tier node can only be demoted to
> > selected nodes on the next lower tier as defined by the demotion
> > path, not any other node from any lower tier.  This strict,
> > hard-coded demotion order does not work in all use cases (e.g. some
> > use cases may want to allow cross-socket demotion to another node
> > in the same demotion tier as a fallback when the preferred demotion
> > node is out of space).  This demotion order is also inconsistent
> > with the page allocation fallback order when all the nodes in a
> > higher tier are out of space: The page allocation can fall back to
> > any node from any lower tier, whereas the demotion order doesn't
> > allow that.
> > 
> > The current kernel also doesn't provide any interface for
> > userspace to learn about the memory tier hierarchy in order to
> > optimize its memory allocations.
> > 
> > This patch series addresses the above by defining memory tiers
> > explicitly.
> > 
> > This patch adds the below read-only sysfs interface, which can
> > be used to read the nodes available in a specific tier:
> > 
> > /sys/devices/system/memtier/memtierN/nodelist
> > 
> > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > lowest tier.  The absolute value of a tier id number has no specific
> > meaning; what matters is the relative order of the tier id numbers.
> > 
> > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > The default number of memory tiers is MAX_MEMORY_TIERS (3).  All
> > the nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
> > 
> > Default memory tier can be read from,
> > /sys/devices/system/memtier/default_tier
> > 
> > Max memory tier can be read from,
> > /sys/devices/system/memtier/max_tiers
> > 
> > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > 
> > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > 
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> IMHO, we should change the kernel internal implementation firstly, then
> implement the kernel/user space interface.  That is, make memory tier
> explicit inside kernel, then expose it to user space.

Why was this comment ignored for v5?  If you don't agree, please respond to me.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  2:49       ` Ying Huang
@ 2022-06-06  3:56         ` Aneesh Kumar K V
  2022-06-06  5:33           ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  3:56 UTC (permalink / raw)
  To: Ying Huang
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On 6/6/22 8:19 AM, Ying Huang wrote:
> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>
>>> In the current kernel, memory tiers are defined implicitly via a
>>> demotion path relationship between NUMA nodes, which is created
>>> during the kernel initialization and updated when a NUMA node is
>>> hot-added or hot-removed.  The current implementation puts all
>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>> tier-by-tier by establishing the per-node demotion targets based
>>> on the distances between nodes.
>>>
>>> This current memory tier kernel interface needs to be improved for
>>> several important use cases:
>>>
>>> The current tier initialization code always initializes
>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>> a virtual machine) and should be put into a higher tier.
>>>
>>> The current tier hierarchy always puts CPU nodes into the top
>>> tier. But on a system with HBM or GPU devices, the
>>> memory-only NUMA nodes mapping these devices should be in the
>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>> next lower tier.
>>>
>>> With the current kernel, a higher tier node can only be demoted to
>>> selected nodes on the next lower tier as defined by the demotion
>>> path, not any other node from any lower tier.  This strict,
>>> hard-coded demotion order does not work in all use cases (e.g. some
>>> use cases may want to allow cross-socket demotion to another node
>>> in the same demotion tier as a fallback when the preferred demotion
>>> node is out of space).  This demotion order is also inconsistent
>>> with the page allocation fallback order when all the nodes in a
>>> higher tier are out of space: The page allocation can fall back to
>>> any node from any lower tier, whereas the demotion order doesn't
>>> allow that.
>>>
>>> The current kernel also doesn't provide any interface for
>>> userspace to learn about the memory tier hierarchy in order to
>>> optimize its memory allocations.
>>>
>>> This patch series addresses the above by defining memory tiers
>>> explicitly.
>>>
>>> This patch adds the below read-only sysfs interface, which can
>>> be used to read the nodes available in a specific tier:
>>>
>>> /sys/devices/system/memtier/memtierN/nodelist
>>>
>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>> lowest tier.  The absolute value of a tier id number has no specific
>>> meaning; what matters is the relative order of the tier id numbers.
>>>
>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>> The default number of memory tiers is MAX_MEMORY_TIERS (3).  All
>>> the nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
>>>
>>> Default memory tier can be read from,
>>> /sys/devices/system/memtier/default_tier
>>>
>>> Max memory tier can be read from,
>>> /sys/devices/system/memtier/max_tiers
>>>
>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>
>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>
>> IMHO, we should change the kernel internal implementation firstly, then
>> implement the kernel/user space interface.  That is, make memory tier
>> explicit inside kernel, then expose it to user space.
> 
> Why was this comment ignored for v5?  If you don't agree, please respond to me.
> 

I am not sure what benefit such a rearrangement would bring. Right now I 
am writing the series from the point of view of introducing all the 
plumbing and then switching the existing demotion logic to use the new 
infrastructure. Redoing the code to hide all the userspace sysfs till we 
switch the demotion logic to use the new infrastructure doesn't really 
bring any additional clarity to patch review and would require me to 
redo the series with a lot of conflicts across the patches in the patchset.

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-06  0:43         ` Ying Huang
@ 2022-06-06  4:07           ` Aneesh Kumar K V
  2022-06-06  5:26             ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  4:07 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On 6/6/22 6:13 AM, Ying Huang wrote:
> On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote:
>> On 6/2/22 1:05 PM, Ying Huang wrote:
>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>
>>>> currently, a higher tier node can only be demoted to selected
>>>> nodes on the next lower tier as defined by the demotion path,
>>>> not any other node from any lower tier.  This strict, hard-coded
>>>> demotion order does not work in all use cases (e.g. some use cases
>>>> may want to allow cross-socket demotion to another node in the same
>>>> demotion tier as a fallback when the preferred demotion node is out
>>>> of space). This demotion order is also inconsistent with the page
>>>> allocation fallback order when all the nodes in a higher tier are
>>>> out of space: The page allocation can fall back to any node from any
>>>> lower tier, whereas the demotion order doesn't allow that currently.
>>>>
>>>> This patch adds support to get all the allowed demotion targets mask
>>>> for node, also demote_page_list() function is modified to utilize this
>>>> allowed node mask by filling it in migration_target_control structure
>>>> before passing it to migrate_pages().
>>>
>>
>> ...
>>
>>>>     * Take pages on @demote_list and attempt to demote them to
>>>>     * another node.  Pages which are not demoted are left on
>>>> @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>>>>    {
>>>>    	int target_nid = next_demotion_node(pgdat->node_id);
>>>>    	unsigned int nr_succeeded;
>>>> +	nodemask_t allowed_mask;
>>>> +
>>>> +	struct migration_target_control mtc = {
>>>> +		/*
>>>> +		 * Allocate from 'node', or fail quickly and quietly.
>>>> +		 * When this happens, 'page' will likely just be discarded
>>>> +		 * instead of migrated.
>>>> +		 */
>>>> +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
>>>> +			__GFP_NOMEMALLOC | GFP_NOWAIT,
>>>> +		.nid = target_nid,
>>>> +		.nmask = &allowed_mask
>>>> +	};
>>>
>>> IMHO, we should try to allocate from preferred node firstly (which will
>>> kick kswapd of the preferred node if necessary).  If failed, we will
>>> fallback to all allowed node.
>>>
>>> As we discussed as follows,
>>>
>>> https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
>>>
>>> That is, something like below,
>>>
>>> static struct page *alloc_demote_page(struct page *page, unsigned long node)
>>> {
>>> 	nodemask_t allowed_mask;
>>> 	struct migration_target_control mtc = {
>>> 		/*
>>> 		 * Allocate from 'node', or fail quickly and quietly.
>>> 		 * When this happens, 'page' will likely just be discarded
>>> 		 * instead of migrated.
>>> 		 */
>>> 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
>>> 			    __GFP_THISNODE  | __GFP_NOWARN |
>>> 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
>>> 		.nid = node
>>> 	};
>>>
>>> 	page = alloc_migration_target(page, (unsigned long)&mtc);
>>> 	if (page)
>>> 		return page;
>>>
>>> 	mtc.gfp_mask &= ~__GFP_THISNODE;
>>> 	mtc.nmask = &allowed_mask;
>>>
>>> 	return alloc_migration_target(page, (unsigned long)&mtc);
>>> }
>>
>> I skipped doing this in v5 because I was not sure this is really what we
>> want.
> 
> I think so.  And this is the original behavior.  We should keep the
> original behavior as much as possible, then make changes if necessary.
> 

That is the reason I split the new page allocation into a separate patch. 
The previous discussion on this topic didn't conclude on whether we really 
need to do the above or not:
https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/

Based on the above, I looked at avoiding the GFP_THISNODE allocation. If you 
have experiment results that suggest otherwise, can you share them? I could 
summarize that in the commit message to better describe why enforcing 
GFP_THISNODE is needed.

>> I guess we can do this as part of the change that is going to
>> introduce the usage of memory policy for the allocation?
> 
> Like the memory allocation policy, the default policy should be local
> preferred.  We shouldn't force users to use explicit memory policy for
> that.
> 
> And the added code isn't complex.
> 

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-06  4:07           ` Aneesh Kumar K V
@ 2022-06-06  5:26             ` Ying Huang
  2022-06-06  6:21               ` Aneesh Kumar K.V
  2022-06-06 17:07               ` Yang Shi
  0 siblings, 2 replies; 66+ messages in thread
From: Ying Huang @ 2022-06-06  5:26 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Mon, 2022-06-06 at 09:37 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 6:13 AM, Ying Huang wrote:
> > On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote:
> > > On 6/2/22 1:05 PM, Ying Huang wrote:
> > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > 
> > > > > currently, a higher tier node can only be demoted to selected
> > > > > nodes on the next lower tier as defined by the demotion path,
> > > > > not any other node from any lower tier.  This strict, hard-coded
> > > > > demotion order does not work in all use cases (e.g. some use cases
> > > > > may want to allow cross-socket demotion to another node in the same
> > > > > demotion tier as a fallback when the preferred demotion node is out
> > > > > of space). This demotion order is also inconsistent with the page
> > > > > allocation fallback order when all the nodes in a higher tier are
> > > > > out of space: The page allocation can fall back to any node from any
> > > > > lower tier, whereas the demotion order doesn't allow that currently.
> > > > > 
> > > > > This patch adds support to get all the allowed demotion targets mask
> > > > > for node, also demote_page_list() function is modified to utilize this
> > > > > allowed node mask by filling it in migration_target_control structure
> > > > > before passing it to migrate_pages().
> > > > 
> > > 
> > > ...
> > > 
> > > > >     * Take pages on @demote_list and attempt to demote them to
> > > > >     * another node.  Pages which are not demoted are left on
> > > > > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
> > > > >    {
> > > > >    	int target_nid = next_demotion_node(pgdat->node_id);
> > > > >    	unsigned int nr_succeeded;
> > > > > +	nodemask_t allowed_mask;
> > > > > +
> > > > > +	struct migration_target_control mtc = {
> > > > > +		/*
> > > > > +		 * Allocate from 'node', or fail quickly and quietly.
> > > > > +		 * When this happens, 'page' will likely just be discarded
> > > > > +		 * instead of migrated.
> > > > > +		 */
> > > > > +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> > > > > +			__GFP_NOMEMALLOC | GFP_NOWAIT,
> > > > > +		.nid = target_nid,
> > > > > +		.nmask = &allowed_mask
> > > > > +	};
> > > > 
> > > > IMHO, we should try to allocate from preferred node firstly (which will
> > > > kick kswapd of the preferred node if necessary).  If failed, we will
> > > > fallback to all allowed node.
> > > > 
> > > > As we discussed as follows,
> > > > 
> > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
> > > > 
> > > > That is, something like below,
> > > > 
> > > > static struct page *alloc_demote_page(struct page *page, unsigned long node)
> > > > {
> > > > 	nodemask_t allowed_mask;
> > > > 	struct migration_target_control mtc = {
> > > > 		/*
> > > > 		 * Allocate from 'node', or fail quickly and quietly.
> > > > 		 * When this happens, 'page' will likely just be discarded
> > > > 		 * instead of migrated.
> > > > 		 */
> > > > 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> > > > 			    __GFP_THISNODE  | __GFP_NOWARN |
> > > > 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> > > > 		.nid = node
> > > > 	};
> > > > 
> > > > 	page = alloc_migration_target(page, (unsigned long)&mtc);
> > > > 	if (page)
> > > > 		return page;
> > > > 
> > > > 	mtc.gfp_mask &= ~__GFP_THISNODE;
> > > > 	mtc.nmask = &allowed_mask;
> > > > 
> > > > 	return alloc_migration_target(page, (unsigned long)&mtc);
> > > > }
> > > 
> > > I skipped doing this in v5 because I was not sure this is really what we
> > > want.
> > 
> > I think so.  And this is the original behavior.  We should keep the
> > original behavior as much as possible, then make changes if necessary.
> > 
> 
> That is the reason I split the new page allocation into a separate patch. 
> The previous discussion on this topic didn't conclude on whether we really 
> need to do the above or not:
> https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/

Please check the later emails in the thread you referenced.  Both Wei and
I agree that the use case needs to be supported.  We just didn't reach
consensus about how to implement it.  If you think Wei's solution is
better (quoted below), you can try to do that too, although I think my
proposed implementation is much simpler.

"
This is true with the current allocation code. But I think we can make
some changes for demotion allocations. For example, we can add a
GFP_DEMOTE flag and update the allocation function to wake up kswapd
when this flag is set and we need to fall back to another node.
"

> Based on the above, I looked at avoiding the GFP_THISNODE allocation. If you 
> have experiment results that suggest otherwise, can you share them? I could 
> summarize that in the commit message to better describe why enforcing 
> GFP_THISNODE is needed.

Why?  GFP_THISNODE is just the first step.  We will fall back to
allocation without it if necessary.

Best Regards,
Huang, Ying

> > > I guess we can do this as part of the change that is going to
> > > introduce the usage of memory policy for the allocation?
> > 
> > Like the memory allocation policy, the default policy should be local
> > preferred.  We shouldn't force users to use explicit memory policy for
> > that.
> > 
> > And the added code isn't complex.
> > 
> 
> -aneesh



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  3:56         ` Aneesh Kumar K V
@ 2022-06-06  5:33           ` Ying Huang
  2022-06-06  6:01             ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-06  5:33 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 8:19 AM, Ying Huang wrote:
> > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > 
> > > > In the current kernel, memory tiers are defined implicitly via a
> > > > demotion path relationship between NUMA nodes, which is created
> > > > during the kernel initialization and updated when a NUMA node is
> > > > hot-added or hot-removed.  The current implementation puts all
> > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > tier-by-tier by establishing the per-node demotion targets based
> > > > on the distances between nodes.
> > > > 
> > > > This current memory tier kernel interface needs to be improved for
> > > > several important use cases:
> > > > 
> > > > The current tier initialization code always initializes
> > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > a virtual machine) and should be put into a higher tier.
> > > > 
> > > > The current tier hierarchy always puts CPU nodes into the top
> > > > tier. But on a system with HBM or GPU devices, the
> > > > memory-only NUMA nodes mapping these devices should be in the
> > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > next lower tier.
> > > > 
> > > > With the current kernel, a higher tier node can only be demoted to
> > > > selected nodes on the next lower tier as defined by the demotion
> > > > path, not any other node from any lower tier.  This strict,
> > > > hard-coded demotion order does not work in all use cases (e.g. some
> > > > use cases may want to allow cross-socket demotion to another node
> > > > in the same demotion tier as a fallback when the preferred demotion
> > > > node is out of space).  This demotion order is also inconsistent
> > > > with the page allocation fallback order when all the nodes in a
> > > > higher tier are out of space: The page allocation can fall back to
> > > > any node from any lower tier, whereas the demotion order doesn't
> > > > allow that.
> > > > 
> > > > The current kernel also doesn't provide any interface for
> > > > userspace to learn about the memory tier hierarchy in order to
> > > > optimize its memory allocations.
> > > > 
> > > > This patch series addresses the above by defining memory tiers
> > > > explicitly.
> > > > 
> > > > This patch adds the below read-only sysfs interface, which can
> > > > be used to read the nodes available in a specific tier:
> > > > 
> > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > 
> > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > lowest tier.  The absolute value of a tier id number has no specific
> > > > meaning; what matters is the relative order of the tier id numbers.
> > > > 
> > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > The default number of memory tiers is MAX_MEMORY_TIERS (3).  All
> > > > the nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
> > > > 
> > > > Default memory tier can be read from,
> > > > /sys/devices/system/memtier/default_tier
> > > > 
> > > > Max memory tier can be read from,
> > > > /sys/devices/system/memtier/max_tiers
> > > > 
> > > > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > 
> > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > 
> > > IMHO, we should change the kernel internal implementation firstly, then
> > > implement the kernel/user space interface.  That is, make memory tier
> > > explicit inside kernel, then expose it to user space.
> > 
> > Why was this comment ignored for v5?  If you don't agree, please respond to me.
> > 
> 
> I am not sure what benefit such a rearrangement would bring. Right now I 
> am writing the series from the point of view of introducing all the 
> plumbing and then switching the existing demotion logic to use the new 
> infrastructure. Redoing the code to hide all the userspace sysfs till we 
> switch the demotion logic to use the new infrastructure doesn't really 
> bring any additional clarity to patch review and would require me to 
> redo the series with a lot of conflicts across the patches in the patchset.

IMHO, we shouldn't introduce a regression even in the middle of a
patchset.  Each step should only rely on previous patches in the series
to work correctly.  In your current way of organization, after patch
[1/7], on a system with 2 memory tiers, the user space interface will
output wrong information (only 1 memory tier).  So I think the correct
way is to make it right inside the kernel first, then expose the right
information to user space.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  5:33           ` Ying Huang
@ 2022-06-06  6:01             ` Aneesh Kumar K V
  2022-06-06  6:27               ` Aneesh Kumar K.V
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  6:01 UTC (permalink / raw)
  To: Ying Huang
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On 6/6/22 11:03 AM, Ying Huang wrote:
> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>
>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>> demotion path relationship between NUMA nodes, which is created
>>>>> during the kernel initialization and updated when a NUMA node is
>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>> on the distances between nodes.
>>>>>
>>>>> This current memory tier kernel interface needs to be improved for
>>>>> several important use cases:
>>>>>
>>>>> The current tier initialization code always initializes
>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>> a virtual machine) and should be put into a higher tier.
>>>>>
>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>> tier. But on a system with HBM or GPU devices, the
>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>> next lower tier.
>>>>>
>>>>> With the current kernel, a higher tier node can only be demoted to
>>>>> selected nodes on the next lower tier as defined by the demotion
>>>>> path, not any other node from any lower tier.  This strict,
>>>>> hard-coded demotion order does not work in all use cases (e.g. some
>>>>> use cases may want to allow cross-socket demotion to another node
>>>>> in the same demotion tier as a fallback when the preferred demotion
>>>>> node is out of space).  This demotion order is also inconsistent
>>>>> with the page allocation fallback order when all the nodes in a
>>>>> higher tier are out of space: The page allocation can fall back to
>>>>> any node from any lower tier, whereas the demotion order doesn't
>>>>> allow that.
>>>>>
>>>>> The current kernel also doesn't provide any interface for
>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>> optimize its memory allocations.
>>>>>
>>>>> This patch series addresses the above by defining memory tiers
>>>>> explicitly.
>>>>>
>>>>> This patch adds the below read-only sysfs interface, which can
>>>>> be used to read the nodes available in a specific tier:
>>>>>
>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>
>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>> lowest tier.  The absolute value of a tier id number has no specific
>>>>> meaning; what matters is the relative order of the tier id numbers.
>>>>>
>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>> The default number of memory tiers is MAX_MEMORY_TIERS (3).  All
>>>>> the nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
>>>>>
>>>>> Default memory tier can be read from,
>>>>> /sys/devices/system/memtier/default_tier
>>>>>
>>>>> Max memory tier can be read from,
>>>>> /sys/devices/system/memtier/max_tiers
>>>>>
>>>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>
>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>
>>>> IMHO, we should change the kernel internal implementation firstly, then
>>>> implement the kernel/user space interface.  That is, make memory tier
>>>> explicit inside kernel, then expose it to user space.
>>>
>>> Why was this comment ignored for v5?  If you don't agree, please respond to me.
>>>
>>
>> I am not sure what benefit such a rearrangement would bring. Right now I
>> am writing the series from the point of view of introducing all the
>> plumbing and then switching the existing demotion logic to use the new
>> infrastructure. Redoing the code to hide all the userspace sysfs till we
>> switch the demotion logic to use the new infrastructure doesn't really
>> bring any additional clarity to patch review and would require me to
>> redo the series with a lot of conflicts across the patches in the patchset.
> 
> IMHO, we shouldn't introduce a regression even in the middle of a
> patchset.  Each step should only rely on previous patches in the series
> to work correctly.  In your current way of organization, after patch
> [1/7], on a system with 2 memory tiers, the user space interface will
> output wrong information (only 1 memory tier).  So I think the correct
> way is to make it right inside the kernel first, then expose the right
> information to user space.
>

The patchset doesn't add an additional tier until "mm/demotion/dax/kmem: 
Set node's memory tier to MEMORY_TIER_PMEM", i.e., no additional tiers 
are created until all the demotion logic is in place. So even if the 
system has dax/kmem, the support for adding dax/kmem as a memory tier 
comes later in the patch series.


-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-06  5:26             ` Ying Huang
@ 2022-06-06  6:21               ` Aneesh Kumar K.V
  2022-06-06  7:42                 ` Ying Huang
  2022-06-06 17:07               ` Yang Shi
  1 sibling, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-06  6:21 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

Ying Huang <ying.huang@intel.com> writes:

.....

> > > > 
>> > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
>> > > > 
>> > > > That is, something like below,
>> > > > 
>> > > > static struct page *alloc_demote_page(struct page *page, unsigned long node)
>> > > > {
>> > > > 	struct page *page;
>> > > > 	nodemask_t allowed_mask;
>> > > > 	struct migration_target_control mtc = {
>> > > > 		/*
>> > > > 		 * Allocate from 'node', or fail quickly and quietly.
>> > > > 		 * When this happens, 'page' will likely just be discarded
>> > > > 		 * instead of migrated.
>> > > > 		 */
>> > > > 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
>> > > > 			    __GFP_THISNODE  | __GFP_NOWARN |
>> > > > 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
>> > > > 		.nid = node
>> > > > 	};
>> > > > 
>> > > > 	page = alloc_migration_target(page, (unsigned long)&mtc);
>> > > > 	if (page)
>> > > > 		return page;
>> > > > 
>> > > > 	mtc.gfp_mask &= ~__GFP_THISNODE;
>> > > > 	mtc.nmask = &allowed_mask;
>> > > > 
>> > > > 	return alloc_migration_target(page, (unsigned long)&mtc);
>> > > > }
>> > > 
>> > > I skipped doing this in v5 because I was not sure this is really what we
>> > > want.
>> > 
>> > I think so.  And this is the original behavior.  We should keep the
>> > original behavior as much as possible, then make changes if necessary.
>> > 
>> 
>> That is the reason I split the new page allocation out as a separate patch.
>> The previous discussion on this topic didn't reach a conclusion on whether
>> we really need to do the above or not:
>> https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/
>
> Please check the later emails in the thread you referenced.  Both Wei and
> I agree that the use case needs to be supported.  We just didn't reach
> consensus on how to implement it.  If you think Wei's solution is
> better (referenced below), you can try to do that too.  Although I
> think my proposed implementation is much simpler.

How about the details below?

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 79bd8d26feb2..cd6e71f702ad 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
+void node_get_allowed_targets(int node, nodemask_t *targets);
 #else
 #define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
@@ -28,6 +29,10 @@ static inline int next_demotion_node(int node)
 	return NUMA_NO_NODE;
 }
 
+static inline void node_get_allowed_targets(int node, nodemask_t *targets)
+{
+	*targets = NODE_MASK_NONE;
+}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index b4e72b672d4d..592d939ec28d 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -18,6 +18,7 @@ struct memory_tier {
 
 struct demotion_nodes {
 	nodemask_t preferred;
+	nodemask_t allowed;
 };
 
 #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
@@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier)
 }
 EXPORT_SYMBOL_GPL(node_set_memory_tier);
 
+void node_get_allowed_targets(int node, nodemask_t *targets)
+{
+	/*
+	 * node_demotion[] is updated without excluding this
+	 * function from running.
+	 *
+	 * If any node is moving to a lower tier, the modifications
+	 * in node_demotion[] are still valid for this node; if any
+	 * node is moving to a higher tier, the moving node may be
+	 * used once more for demotion, which should be OK, so RCU
+	 * should be enough here.
+	 */
+	rcu_read_lock();
+
+	*targets = node_demotion[node].allowed;
+
+	rcu_read_unlock();
+}
+
 /**
  * next_demotion_node() - Get the next node in the demotion path
  * @node: The starting node to lookup the next node
@@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void)
 {
 	int node;
 
-	for_each_node_mask(node, node_states[N_MEMORY])
+	for_each_node_mask(node, node_states[N_MEMORY]) {
 		node_demotion[node].preferred = NODE_MASK_NONE;
+		node_demotion[node].allowed = NODE_MASK_NONE;
+	}
 }
 
 static void disable_all_migrate_targets(void)
@@ -465,7 +487,7 @@ static void establish_migration_targets(void)
 	struct demotion_nodes *nd;
 	int target = NUMA_NO_NODE, node;
 	int distance, best_distance;
-	nodemask_t used;
+	nodemask_t used, allowed = NODE_MASK_NONE;
 
 	if (!node_demotion)
 		return;
@@ -511,6 +533,29 @@ static void establish_migration_targets(void)
 			}
 		} while (1);
 	}
+	/*
+	 * Now build the allowed mask for each node, collecting the node mask
+	 * from all memory tiers below it. This allows us to fall back demotion
+	 * page allocation to a set of nodes that is closer to the above
+	 * selected preferred node.
+	 */
+	list_for_each_entry(memtier, &memory_tiers, list)
+		nodes_or(allowed, allowed, memtier->nodelist);
+	/*
+	 * Remove nodes not yet in N_MEMORY.
+	 */
+	nodes_and(allowed, node_states[N_MEMORY], allowed);
+
+	list_for_each_entry(memtier, &memory_tiers, list) {
+		/*
+		 * Keep removing the current tier from the allowed nodes;
+		 * this will remove all nodes in the current and above
+		 * memory tiers from the allowed mask.
+		 */
+		nodes_andnot(allowed, allowed, memtier->nodelist);
+		for_each_node_mask(node, memtier->nodelist)
+			node_demotion[node].allowed = allowed;
+	}
 }
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a8f78277f99..b0792d838efb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio,
 		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
 }
 
-static struct page *alloc_demote_page(struct page *page, unsigned long node)
+static struct page *alloc_demote_page(struct page *page, unsigned long private)
 {
-	struct migration_target_control mtc = {
-		/*
-		 * Allocate from 'node', or fail quickly and quietly.
-		 * When this happens, 'page' will likely just be discarded
-		 * instead of migrated.
-		 */
-		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
-			    __GFP_THISNODE  | __GFP_NOWARN |
-			    __GFP_NOMEMALLOC | GFP_NOWAIT,
-		.nid = node
-	};
+	struct page *target_page;
+	nodemask_t *allowed_mask;
+	struct migration_target_control *mtc;
+
+	mtc = (struct migration_target_control *)private;
+
+	allowed_mask = mtc->nmask;
+	/*
+	 * Make sure we allocate from the target node first, also triggering
+	 * reclaim of pages from the target node via kswapd if we are low on
+	 * free memory on the target node. If we don't do this and we have low
+	 * free memory on the target memtier, we would start allocating pages
+	 * from higher memory tiers without even forcing a demotion of cold
+	 * pages from the target memtier. This can result in the kernel placing
+	 * hot pages in higher memory tiers.
+	 */
+	mtc->nmask = NULL;
+	mtc->gfp_mask |= __GFP_THISNODE;
+	target_page = alloc_migration_target(page, (unsigned long)mtc);
+	if (target_page)
+		return target_page;
+
+	mtc->gfp_mask &= ~__GFP_THISNODE;
+	mtc->nmask = allowed_mask;
 
-	return alloc_migration_target(page, (unsigned long)&mtc);
+	return alloc_migration_target(page, (unsigned long)mtc);
 }
@@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
+	nodemask_t allowed_mask;
+
+	struct migration_target_control mtc = {
+		/*
+		 * Allocate from 'node', or fail quickly and quietly.
+		 * When this happens, 'page' will likely just be discarded
+		 * instead of migrated.
+		 */
+		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
+			__GFP_NOMEMALLOC | GFP_NOWAIT,
+		.nid = target_nid,
+		.nmask = &allowed_mask
+	};
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
 	if (target_nid == NUMA_NO_NODE)
 		return 0;
 
+	node_get_allowed_targets(pgdat->node_id, &allowed_mask);
+
 	/* Demotion ignores all cpuset and mempolicy settings */
 	migrate_pages(demote_pages, alloc_demote_page, NULL,
-			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
-			    &nr_succeeded);
+		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
+		      &nr_succeeded);
 
 	if (current_is_kswapd())
 		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  6:01             ` Aneesh Kumar K V
@ 2022-06-06  6:27               ` Aneesh Kumar K.V
  2022-06-06  7:53                 ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-06  6:27 UTC (permalink / raw)
  To: Ying Huang
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 6/6/22 11:03 AM, Ying Huang wrote:
>> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>
>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>> demotion path relationship between NUMA nodes, which is created
>>>>>> during the kernel initialization and updated when a NUMA node is
>>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>>> on the distances between nodes.
>>>>>>
>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>> several important use cases:
>>>>>>
>>>>>> The current tier initialization code always initializes
>>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>> a virtual machine) and should be put into a higher tier.
>>>>>>
>>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>>> tier. But on a system with HBM or GPU devices, the
>>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>>> next lower tier.
>>>>>>
>>>>>> With the current kernel, a higher tier node can only be demoted to
>>>>>> selected nodes on the next lower tier as defined by the demotion
>>>>>> path, not to any other node from any lower tier.  This strict,
>>>>>> hard-coded demotion order does not work in all use cases (e.g. some
>>>>>> use cases may want to allow cross-socket demotion to another node in
>>>>>> the same demotion tier as a fallback when the preferred demotion node
>>>>>> is out of space).  This demotion order is also inconsistent with the
>>>>>> page allocation fallback order when all the nodes in a higher tier are
>>>>>> out of space: the page allocation can fall back to any node from
>>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>>
>>>>>> The current kernel also doesn't provide any interfaces for
>>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>>> optimize its memory allocations.
>>>>>>
>>>>>> This patch series addresses the above by defining memory tiers explicitly.
>>>>>>
>>>>>> This patch adds the below sysfs interface, which is read-only and
>>>>>> can be used to read the nodes available in a specific tier.
>>>>>>
>>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>>
>>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>>> lowest tier. The absolute value of a tier id number has no specific
>>>>>> meaning.  What matters is the relative order of the tier id numbers.
>>>>>>
>>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>>> The default number of memory tiers is MAX_MEMORY_TIERS (3). All
>>>>>> nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
>>>>>>
>>>>>> Default memory tier can be read from,
>>>>>> /sys/devices/system/memtier/default_tier
>>>>>>
>>>>>> Max memory tier can be read from,
>>>>>> /sys/devices/system/memtier/max_tiers
>>>>>>
>>>>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>>
>>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>
>>>>> IMHO, we should change the kernel internal implementation first, then
>>>>> implement the kernel/user space interface.  That is, make memory tiers
>>>>> explicit inside the kernel, then expose them to user space.
>>>>
>>>> Why ignore this comment for v5?  If you don't agree, please respond to me.
>>>>
>>>
>>> I am not sure what benefit such a rearrangement would bring. Right now I
>>> am writing the series from the point of view of introducing all the
>>> plumbing and then switching the existing demotion logic to use the new
>>> infrastructure. Redoing the code to hide all the userspace sysfs bits till we
>>> switch the demotion logic to use the new infrastructure doesn't really
>>> bring any additional clarity to patch review and would require me to
>>> redo the series with a lot of conflicts across the patches in the patchset.
>> 
>> IMHO, we shouldn't introduce a regression even in the middle of a
>> patchset.  Each step should only rely on previous patches in the series
>> to work correctly.  In your current way of organization, after patch
>> [1/7], on a system with 2 memory tiers, the user space interface will
>> output wrong information (only 1 memory tier).  So I think the correct
>> way is to make it right inside the kernel first, then expose the right
>> information to user space.
>>
>
> The patchset doesn't add an additional tier until "mm/demotion/dax/kmem:
> Set node's memory tier to MEMORY_TIER_PMEM". That is, no additional
> tiers are created till all the demotion logic is in place. So even if the
> system has dax/kmem, the support for adding dax/kmem as a memory tier
> comes later in the patch series.

Let me clarify this a bit more. This patchset doesn't change the
existing kernel behavior till "mm/demotion: Build demotion targets
based on explicit memory tiers". So there is no regression till then.
It adds a parallel framework (memory tiers) alongside the existing
demotion logic.

I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
with two memory tiers (DRAM and pmem) the demotion continues to work
as expected after patch 3 ("mm/demotion: Build demotion targets based on
explicit memory tiers"). With that, there will not be any regression in
between the patch series.

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-06  6:21               ` Aneesh Kumar K.V
@ 2022-06-06  7:42                 ` Ying Huang
  2022-06-06  8:02                   ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-06  7:42 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Mon, 2022-06-06 at 11:51 +0530, Aneesh Kumar K.V wrote:
> Ying Huang <ying.huang@intel.com> writes:
> 
> .....
> 
> > > > > 
> > > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
> > > > > > 
> > > > > > That is, something like below,
> > > > > > 
> > > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node)
> > > > > > {
> > > > > > 	struct page *page;
> > > > > > 	nodemask_t allowed_mask;
> > > > > > 	struct migration_target_control mtc = {
> > > > > > 		/*
> > > > > > 		 * Allocate from 'node', or fail quickly and quietly.
> > > > > > 		 * When this happens, 'page' will likely just be discarded
> > > > > > 		 * instead of migrated.
> > > > > > 		 */
> > > > > > 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> > > > > > 			    __GFP_THISNODE  | __GFP_NOWARN |
> > > > > > 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> > > > > > 		.nid = node
> > > > > > 	};
> > > > > > 
> > > > > > 	page = alloc_migration_target(page, (unsigned long)&mtc);
> > > > > > 	if (page)
> > > > > > 		return page;
> > > > > > 
> > > > > > 	mtc.gfp_mask &= ~__GFP_THISNODE;
> > > > > > 	mtc.nmask = &allowed_mask;
> > > > > > 
> > > > > > 	return alloc_migration_target(page, (unsigned long)&mtc);
> > > > > > }
> > > > > 
> > > > > I skipped doing this in v5 because I was not sure this is really what we
> > > > > want.
> > > > 
> > > > I think so.  And this is the original behavior.  We should keep the
> > > > original behavior as much as possible, then make changes if necessary.
> > > > 
> > > 
> > > That is the reason I split the new page allocation out as a separate patch.
> > > The previous discussion on this topic didn't reach a conclusion on whether
> > > we really need to do the above or not:
> > > https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/
> > 
> > Please check the later emails in the thread you referenced.  Both Wei and
> > I agree that the use case needs to be supported.  We just didn't reach
> > consensus on how to implement it.  If you think Wei's solution is
> > better (referenced below), you can try to do that too.  Although I
> > think my proposed implementation is much simpler.
> 
> How about the details below?
> 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 79bd8d26feb2..cd6e71f702ad 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node);
>  int node_get_memory_tier_id(int node);
>  int node_set_memory_tier(int node, int tier);
>  int node_reset_memory_tier(int node, int tier);
> +void node_get_allowed_targets(int node, nodemask_t *targets);
>  #else
>  #define numa_demotion_enabled	false
>  static inline int next_demotion_node(int node)
> @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node)
>  	return NUMA_NO_NODE;
>  }
>  
> +static inline void node_get_allowed_targets(int node, nodemask_t *targets)
> +{
> +	*targets = NODE_MASK_NONE;
> +}
>  #endif	/* CONFIG_TIERED_MEMORY */
>  
>  #endif
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index b4e72b672d4d..592d939ec28d 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -18,6 +18,7 @@ struct memory_tier {
>  
>  struct demotion_nodes {
>  	nodemask_t preferred;
> +	nodemask_t allowed;
>  };
>  
>  #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier)
>  }
>  EXPORT_SYMBOL_GPL(node_set_memory_tier);
>  
> +void node_get_allowed_targets(int node, nodemask_t *targets)
> +{
> +	/*
> +	 * node_demotion[] is updated without excluding this
> +	 * function from running.
> +	 *
> +	 * If any node is moving to a lower tier, the modifications
> +	 * in node_demotion[] are still valid for this node; if any
> +	 * node is moving to a higher tier, the moving node may be
> +	 * used once more for demotion, which should be OK, so RCU
> +	 * should be enough here.
> +	 */
> +	rcu_read_lock();
> +
> +	*targets = node_demotion[node].allowed;
> +
> +	rcu_read_unlock();
> +}
> +
>  /**
>   * next_demotion_node() - Get the next node in the demotion path
>   * @node: The starting node to lookup the next node
> @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void)
>  {
>  	int node;
>  
> -	for_each_node_mask(node, node_states[N_MEMORY])
> +	for_each_node_mask(node, node_states[N_MEMORY]) {
>  		node_demotion[node].preferred = NODE_MASK_NONE;
> +		node_demotion[node].allowed = NODE_MASK_NONE;
> +	}
>  }
>  
>  static void disable_all_migrate_targets(void)
> @@ -465,7 +487,7 @@ static void establish_migration_targets(void)
>  	struct demotion_nodes *nd;
>  	int target = NUMA_NO_NODE, node;
>  	int distance, best_distance;
> -	nodemask_t used;
> +	nodemask_t used, allowed = NODE_MASK_NONE;
>  
>  	if (!node_demotion)
>  		return;
> @@ -511,6 +533,29 @@ static void establish_migration_targets(void)
>  			}
>  		} while (1);
>  	}
> +	/*
> +	 * Now build the allowed mask for each node, collecting the node mask
> +	 * from all memory tiers below it. This allows us to fall back demotion
> +	 * page allocation to a set of nodes that is closer to the above
> +	 * selected preferred node.
> +	 */
> +	list_for_each_entry(memtier, &memory_tiers, list)
> +		nodes_or(allowed, allowed, memtier->nodelist);
> +	/*
> +	 * Remove nodes not yet in N_MEMORY.
> +	 */
> +	nodes_and(allowed, node_states[N_MEMORY], allowed);
> +
> +	list_for_each_entry(memtier, &memory_tiers, list) {
> +		/*
> +		 * Keep removing the current tier from the allowed nodes;
> +		 * this will remove all nodes in the current and above
> +		 * memory tiers from the allowed mask.
> +		 */
> +		nodes_andnot(allowed, allowed, memtier->nodelist);
> +		for_each_node_mask(node, memtier->nodelist)
> +			node_demotion[node].allowed = allowed;
> +	}
>  }
>  
>  /*
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3a8f78277f99..b0792d838efb 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio,
>  		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
>  }
>  
> -static struct page *alloc_demote_page(struct page *page, unsigned long node)
> +static struct page *alloc_demote_page(struct page *page, unsigned long private)
>  {
> -	struct migration_target_control mtc = {
> -		/*
> -		 * Allocate from 'node', or fail quickly and quietly.
> -		 * When this happens, 'page' will likely just be discarded
> -		 * instead of migrated.
> -		 */
> -		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> -			    __GFP_THISNODE  | __GFP_NOWARN |
> -			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> -		.nid = node
> -	};
> +	struct page *target_page;
> +	nodemask_t *allowed_mask;
> +	struct migration_target_control *mtc;
> +
> +	mtc = (struct migration_target_control *)private;
> +
> +	allowed_mask = mtc->nmask;
> +	/*
> +	 * Make sure we allocate from the target node first, also triggering
> +	 * reclaim of pages from the target node via kswapd if we are low on
> +	 * free memory on the target node. If we don't do this and we have low
> +	 * free memory on the target memtier, we would start allocating pages
> +	 * from higher memory tiers without even forcing a demotion of cold
> +	 * pages from the target memtier. This can result in the kernel placing
> +	 * hot pages in higher memory tiers.
> +	 */
> +	mtc->nmask = NULL;
> +	mtc->gfp_mask |= __GFP_THISNODE;
> +	target_page = alloc_migration_target(page, (unsigned long)mtc);
> +	if (target_page)
> +		return target_page;
> +
> +	mtc->gfp_mask &= ~__GFP_THISNODE;
> +	mtc->nmask = allowed_mask;
>  
> -	return alloc_migration_target(page, (unsigned long)&mtc);
> +	return alloc_migration_target(page, (unsigned long)mtc);
>  }
> @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>  {
>  	int target_nid = next_demotion_node(pgdat->node_id);
>  	unsigned int nr_succeeded;
> +	nodemask_t allowed_mask;
> +
> +	struct migration_target_control mtc = {
> +		/*
> +		 * Allocate from 'node', or fail quickly and quietly.
> +		 * When this happens, 'page' will likely just be discarded
> +		 * instead of migrated.
> +		 */
> +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> +			__GFP_NOMEMALLOC | GFP_NOWAIT,
> +		.nid = target_nid,
> +		.nmask = &allowed_mask
> +	};
>  
>  	if (list_empty(demote_pages))
>  		return 0;
> @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>  	if (target_nid == NUMA_NO_NODE)
>  		return 0;
>  
> +	node_get_allowed_targets(pgdat->node_id, &allowed_mask);
> +
>  	/* Demotion ignores all cpuset and mempolicy settings */
>  	migrate_pages(demote_pages, alloc_demote_page, NULL,
> -			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> -			    &nr_succeeded);
> +		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> +		      &nr_succeeded);

Firstly, it addresses my requirement, thanks!  And I'd prefer to put the
mtc definition in alloc_demote_page(), because that keeps all the page
allocation logic in one function and thus makes the code more readable.

Best Regards,
Huang, Ying

>  	if (current_is_kswapd())
>  		__count_vm_events(PGDEMOTE_KSWAPD, nr_succeeded);



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  6:27               ` Aneesh Kumar K.V
@ 2022-06-06  7:53                 ` Ying Huang
  2022-06-06  8:01                   ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-06  7:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
> > On 6/6/22 11:03 AM, Ying Huang wrote:
> > > On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> > > > On 6/6/22 8:19 AM, Ying Huang wrote:
> > > > > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > 
> > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > demotion path relationship between NUMA nodes, which is created
> > > > > > > during the kernel initialization and updated when a NUMA node is
> > > > > > > hot-added or hot-removed.  The current implementation puts all
> > > > > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > > > > tier-by-tier by establishing the per-node demotion targets based
> > > > > > > on the distances between nodes.
> > > > > > > 
> > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > several important use cases:
> > > > > > > 
> > > > > > > The current tier initialization code always initializes
> > > > > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > a virtual machine) and should be put into a higher tier.
> > > > > > > 
> > > > > > > The current tier hierarchy always puts CPU nodes into the top
> > > > > > > tier. But on a system with HBM or GPU devices, the
> > > > > > > memory-only NUMA nodes mapping these devices should be in the
> > > > > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > > > > next lower tier.
> > > > > > > 
> > > > > > > With the current kernel, a higher tier node can only be demoted to
> > > > > > > selected nodes on the next lower tier as defined by the demotion
> > > > > > > path, not to any other node from any lower tier.  This strict,
> > > > > > > hard-coded demotion order does not work in all use cases (e.g. some
> > > > > > > use cases may want to allow cross-socket demotion to another node in
> > > > > > > the same demotion tier as a fallback when the preferred demotion node
> > > > > > > is out of space).  This demotion order is also inconsistent with the
> > > > > > > page allocation fallback order when all the nodes in a higher tier are
> > > > > > > out of space: the page allocation can fall back to any node from
> > > > > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > 
> > > > > > > The current kernel also doesn't provide any interfaces for
> > > > > > > userspace to learn about the memory tier hierarchy in order to
> > > > > > > optimize its memory allocations.
> > > > > > > 
> > > > > > > This patch series addresses the above by defining memory tiers explicitly.
> > > > > > > 
> > > > > > > This patch adds the below sysfs interface, which is read-only and
> > > > > > > can be used to read the nodes available in a specific tier.
> > > > > > > 
> > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > 
> > > > > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > > > > lowest tier. The absolute value of a tier id number has no specific
> > > > > > > meaning.  What matters is the relative order of the tier id numbers.
> > > > > > > 
> > > > > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > The default number of memory tiers is MAX_MEMORY_TIERS (3). All
> > > > > > > nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
> > > > > > > 
> > > > > > > Default memory tier can be read from,
> > > > > > > /sys/devices/system/memtier/default_tier
> > > > > > > 
> > > > > > > Max memory tier can be read from,
> > > > > > > /sys/devices/system/memtier/max_tiers
> > > > > > > 
> > > > > > > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > > > > > > 
> > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > > > > 
> > > > > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > > > 
> > > > > > IMHO, we should change the kernel internal implementation first, then
> > > > > > implement the kernel/user space interface.  That is, make memory tiers
> > > > > > explicit inside the kernel, then expose them to user space.
> > > > > 
> > > > > Why ignore this comment for v5?  If you don't agree, please respond to me.
> > > > > 
> > > > 
> > > > I am not sure what benefit such a rearrangement would bring. Right now I
> > > > am writing the series from the point of view of introducing all the
> > > > plumbing and then switching the existing demotion logic to use the new
> > > > infrastructure. Redoing the code to hide all the userspace sysfs bits till we
> > > > switch the demotion logic to use the new infrastructure doesn't really
> > > > bring any additional clarity to patch review and would require me to
> > > > redo the series with a lot of conflicts across the patches in the patchset.
> > > 
> > > IMHO, we shouldn't introduce a regression even in the middle of a
> > > patchset.  Each step should only rely on previous patches in the series
> > > to work correctly.  In your current way of organization, after patch
> > > [1/7], on a system with 2 memory tiers, the user space interface will
> > > output wrong information (only 1 memory tier).  So I think the correct
> > > way is to make it right inside the kernel first, then expose the right
> > > information to user space.
> > > 
> > 
> > The patchset doesn't add an additional tier until "mm/demotion/dax/kmem: 
> > Set node's memory tier to MEMORY_TIER_PMEM". I.e., no additional 
> > tiers are created till all the demotion logic is in place. So even if the 
> > system has dax/kmem, the support for adding dax/kmem as a memory tier 
> > comes later in the patch series.
> 
> Let me clarify this a bit more. This patchset doesn't change the
> existing kernel behavior till "mm/demotion: Build demotion targets
> based on explicit memory tiers". So there is no regression till then.
> It adds a framework (memory tiers) parallel to the existing demotion
> logic.
> 
> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> with two memory tiers (DRAM and pmem) the demotion continues to work
> as expected after patch 3 ("mm/demotion: Build demotion targets based on
> explicit memory tiers"). With that, there will not be any regression at
> any point in the patch series.
> 

Thanks!  Please do that.  And I think you can add the sysfs interface after
that patch too.  That is, in [1/7],

+struct memory_tier {
+	nodemask_t nodelist;
+};

And the struct device member can be added after the kernel has switched
to the implementation based on explicit memory tiers.

+struct memory_tier {
+	struct device dev;
+	nodemask_t nodelist;
+};

But I don't think it's a good idea to have "struct device" embedded in
"struct memory_tier".  We don't have "struct device" embedded in "struct
pglist_data" either...

Best Regards,
Huang, Ying




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  7:53                 ` Ying Huang
@ 2022-06-06  8:01                   ` Aneesh Kumar K V
  2022-06-06  8:52                     ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  8:01 UTC (permalink / raw)
  To: Ying Huang
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On 6/6/22 1:23 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 6/6/22 11:03 AM, Ying Huang wrote:
>>>> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>>>>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>>>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>>>
>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>> demotion path relationship between NUMA nodes, which is created
>>>>>>>> during the kernel initialization and updated when a NUMA node is
>>>>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>>>>> on the distances between nodes.
>>>>>>>>
>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>> several important use cases:
>>>>>>>>
>>>>>>>> The current tier initialization code always initializes
>>>>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>> a virtual machine) and should be put into a higher tier.
>>>>>>>>
>>>>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>>>>> tier. But on a system with HBM or GPU devices, the
>>>>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>>>>> next lower tier.
>>>>>>>>
>>>>>>>> With the current kernel, a higher tier node can only be demoted to selected nodes on the
>>>>>>>> next lower tier as defined by the demotion path, not any other
>>>>>>>> node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>>>>> allow cross-socket demotion to another node in the same demotion
>>>>>>>> tier as a fallback when the preferred demotion node is out of
>>>>>>>> space).  This demotion order is also inconsistent with the page
>>>>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>>>>> out of space: The page allocation can fall back to any node from
>>>>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>
>>>>>>>> The current kernel also doesn't provide any interfaces for the
>>>>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>>>>> optimize its memory allocations.
>>>>>>>>
>>>>>>>> This patch series addresses the above by defining memory tiers explicitly.
>>>>>>>>
>>>>>>>> This patch adds the below sysfs interface, which is read-only and
>>>>>>>> can be used to read the nodes available in a specific tier:
>>>>>>>>
>>>>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>>>>
>>>>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>>>>> lowest tier. The absolute value of a tier id number has no specific
>>>>>>>> meaning. What matters is the relative order of the tier id numbers.
>>>>>>>>
>>>>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>>>>> The default number of memory tiers is MAX_MEMORY_TIERS(3). All the
>>>>>>>> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
>>>>>>>>
>>>>>>>> Default memory tier can be read from,
>>>>>>>> /sys/devices/system/memtier/default_tier
>>>>>>>>
>>>>>>>> Max memory tier can be read from,
>>>>>>>> /sys/devices/system/memtier/max_tiers
>>>>>>>>
>>>>>>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>>>>
>>>>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>>
>>>>>>> IMHO, we should change the kernel internal implementation first, then
>>>>>>> implement the kernel/user space interface.  That is, make memory tiers
>>>>>>> explicit inside the kernel, then expose them to user space.
>>>>>>
>>>>>> Why was this comment ignored for v5?  If you don't agree, please respond to me.
>>>>>>
>>>>>
>>>>> I am not sure what benefit such a rearrangement would bring. Right now I
>>>>> am writing the series from the point of view of introducing all the
>>>>> plumbing and then switching the existing demotion logic to use the new
>>>>> infrastructure. Redoing the code to hide all the userspace sysfs bits till we
>>>>> switch the demotion logic to use the new infrastructure doesn't really
>>>>> bring any additional clarity to patch review and would require me to
>>>>> redo the series with a lot of conflicts across the patches in the patchset.
>>>>
>>>> IMHO, we shouldn't introduce a regression even in the middle of a
>>>> patchset.  Each step should only rely on previous patches in the series
>>>> to work correctly.  In your current way of organization, after patch
>>>> [1/7], on a system with 2 memory tiers, the user space interface will
>>>> output wrong information (only 1 memory tier).  So I think the correct
>>>> way is to make it right inside the kernel first, then expose the right
>>>> information to user space.
>>>>
>>>
>>> The patchset doesn't add an additional tier until "mm/demotion/dax/kmem:
>>> Set node's memory tier to MEMORY_TIER_PMEM". I.e., no additional
>>> tiers are created till all the demotion logic is in place. So even if the
>>> system has dax/kmem, the support for adding dax/kmem as a memory tier
>>> comes later in the patch series.
>>
>> Let me clarify this a bit more. This patchset doesn't change the
>> existing kernel behavior till "mm/demotion: Build demotion targets
>> based on explicit memory tiers". So there is no regression till then.
>> It adds a framework (memory tiers) parallel to the existing demotion
>> logic.
>>
>> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
>> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
>> with two memory tiers (DRAM and pmem) the demotion continues to work
>> as expected after patch 3 ("mm/demotion: Build demotion targets based on
>> explicit memory tiers"). With that, there will not be any regression at
>> any point in the patch series.
>>
> 
> Thanks!  Please do that.  And I think you can add the sysfs interface after
> that patch too.  That is, in [1/7],
> 

I am not sure why you insist on moving the sysfs interfaces later. They 
are introduced alongside the helpers they expose. It makes patch review 
easier to look at both the helpers and the users of those helpers 
together in a patch.

> +struct memory_tier {
> +	nodemask_t nodelist;
> +};
> 
> And struct device can be added after the kernel has switched the
> implementation based on explicit memory tiers.
> 
> +struct memory_tier {
> +	struct device dev;
> +	nodemask_t nodelist;
> +};
> 


Can you elaborate on this? Or possibly review the v5 series, indicating 
what change you are suggesting here?


> But I don't think it's a good idea to have "struct device" embedded in
> "struct memory_tier".  We don't have "struct device" embedded in "struct
> pglist_data" either...
> 

I avoided creating an array for memory_tier (memory_tier[]) so that we 
can keep it dynamic. Keeping dev embedded in struct memory_tier simplifies 
the life cycle management of that dynamic list. We free the struct 
memory_tier allocation via the device release function 
(memtier->dev.release = memory_tier_device_release).

Why do you think it is not a good idea?

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-06  7:42                 ` Ying Huang
@ 2022-06-06  8:02                   ` Aneesh Kumar K V
  2022-06-06  8:06                     ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  8:02 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On 6/6/22 1:12 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 11:51 +0530, Aneesh Kumar K.V wrote:
>> Ying Huang <ying.huang@intel.com> writes:
>>
>> .....
>>
>>>>>>
>>>>>>> https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
>>>>>>>
>>>>>>> That is, something like below,
>>>>>>>
>>>>>>> static struct page *alloc_demote_page(struct page *page, unsigned long node)
>>>>>>> {
>>>>>>> 	struct page *target_page;
>>>>>>> 	nodemask_t allowed_mask;
>>>>>>> 	struct migration_target_control mtc = {
>>>>>>> 		/*
>>>>>>> 		 * Allocate from 'node', or fail quickly and quietly.
>>>>>>> 		 * When this happens, 'page' will likely just be discarded
>>>>>>> 		 * instead of migrated.
>>>>>>> 		 */
>>>>>>> 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
>>>>>>> 			    __GFP_THISNODE  | __GFP_NOWARN |
>>>>>>> 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
>>>>>>> 		.nid = node
>>>>>>> 	};
>>>>>>>
>>>>>>> 	target_page = alloc_migration_target(page, (unsigned long)&mtc);
>>>>>>> 	if (target_page)
>>>>>>> 		return target_page;
>>>>>>>
>>>>>>> 	mtc.gfp_mask &= ~__GFP_THISNODE;
>>>>>>> 	mtc.nmask = &allowed_mask;
>>>>>>>
>>>>>>> 	return alloc_migration_target(page, (unsigned long)&mtc);
>>>>>>> }
>>>>>>
>>>>>> I skipped doing this in v5 because I was not sure this is really what we
>>>>>> want.
>>>>>
>>>>> I think so.  And this is the original behavior.  We should keep the
>>>>> original behavior as much as possible, then make changes if necessary.
>>>>>
>>>>
>>>> That is the reason I split the new page allocation into a separate patch.
>>>> The previous discussion on this topic didn't conclude whether we really
>>>> need to do the above or not:
>>>> https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/
>>>
>>> Please check the later email in the thread you referenced.  Both Wei and
>>> I agree that the use case needs to be supported.  We just didn't reach
>>> consensus about how to implement it.  If you think Wei's solution is
>>> better (referenced as below), you can try to do that too.  Although I
>>> think my proposed implementation is much simpler.
>>
>> How about the details below?
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> index 79bd8d26feb2..cd6e71f702ad 100644
>> --- a/include/linux/memory-tiers.h
>> +++ b/include/linux/memory-tiers.h
>> @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node);
>>   int node_get_memory_tier_id(int node);
>>   int node_set_memory_tier(int node, int tier);
>>   int node_reset_memory_tier(int node, int tier);
>> +void node_get_allowed_targets(int node, nodemask_t *targets);
>>   #else
>>   #define numa_demotion_enabled	false
>>   static inline int next_demotion_node(int node)
>> @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node)
>>   	return NUMA_NO_NODE;
>>   }
>>   
>> +static inline void node_get_allowed_targets(int node, nodemask_t *targets)
>> +{
>> +	*targets = NODE_MASK_NONE;
>> +}
>>   #endif	/* CONFIG_TIERED_MEMORY */
>>   
>>   #endif
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> index b4e72b672d4d..592d939ec28d 100644
>> --- a/mm/memory-tiers.c
>> +++ b/mm/memory-tiers.c
>> @@ -18,6 +18,7 @@ struct memory_tier {
>>   
>>   struct demotion_nodes {
>>   	nodemask_t preferred;
>> +	nodemask_t allowed;
>>   };
>>   
>>   #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
>> @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier)
>>   }
>>   EXPORT_SYMBOL_GPL(node_set_memory_tier);
>>   
>> +void node_get_allowed_targets(int node, nodemask_t *targets)
>> +{
>> +	/*
>> +	 * node_demotion[] is updated without excluding this
>> +	 * function from running.
>> +	 *
>> +	 * If any node is moving to lower tiers then modifications
>> +	 * in node_demotion[] are still valid for this node, if any
>> +	 * node is moving to higher tier then moving node may be
>> +	 * used once for demotion which should be ok so rcu should
>> +	 * be enough here.
>> +	 */
>> +	rcu_read_lock();
>> +
>> +	*targets = node_demotion[node].allowed;
>> +
>> +	rcu_read_unlock();
>> +}
>> +
>>   /**
>>    * next_demotion_node() - Get the next node in the demotion path
>>    * @node: The starting node to lookup the next node
>> @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void)
>>   {
>>   	int node;
>>   
>> -	for_each_node_mask(node, node_states[N_MEMORY])
>> +	for_each_node_mask(node, node_states[N_MEMORY]) {
>>   		node_demotion[node].preferred = NODE_MASK_NONE;
>> +		node_demotion[node].allowed = NODE_MASK_NONE;
>> +	}
>>   }
>>   
>>   static void disable_all_migrate_targets(void)
>> @@ -465,7 +487,7 @@ static void establish_migration_targets(void)
>>   	struct demotion_nodes *nd;
>>   	int target = NUMA_NO_NODE, node;
>>   	int distance, best_distance;
>> -	nodemask_t used;
>> +	nodemask_t used, allowed = NODE_MASK_NONE;
>>   
>>   	if (!node_demotion)
>>   		return;
>> @@ -511,6 +533,29 @@ static void establish_migration_targets(void)
>>   			}
>>   		} while (1);
>>   	}
>> +	/*
>> +	 * Now build the allowed mask for each node, collecting the node mask
>> +	 * from all memory tiers below it. This allows us to fall back demotion
>> +	 * page allocation to a set of nodes that is closer to the above
>> +	 * selected preferred node.
>> +	 */
>> +	list_for_each_entry(memtier, &memory_tiers, list)
>> +		nodes_or(allowed, allowed, memtier->nodelist);
>> +	/*
>> +	 * Removes nodes not yet in N_MEMORY.
>> +	 */
>> +	nodes_and(allowed, node_states[N_MEMORY], allowed);
>> +
>> +	list_for_each_entry(memtier, &memory_tiers, list) {
>> +		/*
>> +		 * Keep removing current tier from allowed nodes,
>> +		 * This will remove all nodes in current and above
>> +		 * memory tier from the allowed mask.
>> +		 */
>> +		nodes_andnot(allowed, allowed, memtier->nodelist);
>> +		for_each_node_mask(node, memtier->nodelist)
>> +			node_demotion[node].allowed = allowed;
>> +	}
>>   }
>>   
>>   /*
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 3a8f78277f99..b0792d838efb 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio,
>>   		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
>>   }
>>   
>> -static struct page *alloc_demote_page(struct page *page, unsigned long node)
>> +static struct page *alloc_demote_page(struct page *page, unsigned long private)
>>   {
>> -	struct migration_target_control mtc = {
>> -		/*
>> -		 * Allocate from 'node', or fail quickly and quietly.
>> -		 * When this happens, 'page' will likely just be discarded
>> -		 * instead of migrated.
>> -		 */
>> -		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
>> -			    __GFP_THISNODE  | __GFP_NOWARN |
>> -			    __GFP_NOMEMALLOC | GFP_NOWAIT,
>> -		.nid = node
>> -	};
>> +	struct page *target_page;
>> +	nodemask_t *allowed_mask;
>> +	struct migration_target_control *mtc;
>> +
>> +	mtc = (struct migration_target_control *)private;
>> +
>> +	allowed_mask = mtc->nmask;
>> +	/*
>> +	 * Make sure we allocate from the target node first, also trying to
>> +	 * reclaim pages from the target node via kswapd if we are low on
>> +	 * free memory on the target node. If we don't do this and we have
>> +	 * low free memory on the target memtier, we would start allocating
>> +	 * pages from higher memory tiers without even forcing a demotion of
>> +	 * cold pages from the target memtier. This can result in the kernel
>> +	 * placing hot pages in higher memory tiers.
>> +	 */
>> +	mtc->nmask = NULL;
>> +	mtc->gfp_mask |= __GFP_THISNODE;
>> +	target_page = alloc_migration_target(page, (unsigned long)mtc);
>> +	if (target_page)
>> +		return target_page;
>> +
>> +	mtc->gfp_mask &= ~__GFP_THISNODE;
>> +	mtc->nmask = allowed_mask;
>>   
>>   	return alloc_migration_target(page, (unsigned long)&mtc);
>>   }
>> @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>>   {
>>   	int target_nid = next_demotion_node(pgdat->node_id);
>>   	unsigned int nr_succeeded;
>> +	nodemask_t allowed_mask;
>> +
>> +	struct migration_target_control mtc = {
>> +		/*
>> +		 * Allocate from 'node', or fail quickly and quietly.
>> +		 * When this happens, 'page' will likely just be discarded
>> +		 * instead of migrated.
>> +		 */
>> +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
>> +			__GFP_NOMEMALLOC | GFP_NOWAIT,
>> +		.nid = target_nid,
>> +		.nmask = &allowed_mask
>> +	};
>>   
>>   	if (list_empty(demote_pages))
>>   		return 0;
>> @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
>>   	if (target_nid == NUMA_NO_NODE)
>>   		return 0;
>>   
>> +	node_get_allowed_targets(pgdat->node_id, &allowed_mask);
>> +
>>   	/* Demotion ignores all cpuset and mempolicy settings */
>>   	migrate_pages(demote_pages, alloc_demote_page, NULL,
>> -			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
>> -			    &nr_succeeded);
>> +		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
>> +		      &nr_succeeded);
> 
> First, it addresses my requirement, thanks!  And I'd prefer to put
> the mtc definition in alloc_demote_page(), because that keeps all the
> page allocation logic in one function.  That improves the readability
> of the code.

The challenge is in the allowed_mask computation. That is based on the 
src_node and not the target_node.

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-06  8:02                   ` Aneesh Kumar K V
@ 2022-06-06  8:06                     ` Ying Huang
  0 siblings, 0 replies; 66+ messages in thread
From: Ying Huang @ 2022-06-06  8:06 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Mon, 2022-06-06 at 13:32 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 1:12 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 11:51 +0530, Aneesh Kumar K.V wrote:
> > > Ying Huang <ying.huang@intel.com> writes:
> > > 
> > > .....
> > > 
> > > > > > > 
> > > > > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
> > > > > > > > 
> > > > > > > > That is, something like below,
> > > > > > > > 
> > > > > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node)
> > > > > > > > {
> > > > > > > > 	struct page *target_page;
> > > > > > > > 	nodemask_t allowed_mask;
> > > > > > > > 	struct migration_target_control mtc = {
> > > > > > > > 		/*
> > > > > > > > 		 * Allocate from 'node', or fail quickly and quietly.
> > > > > > > > 		 * When this happens, 'page' will likely just be discarded
> > > > > > > > 		 * instead of migrated.
> > > > > > > > 		 */
> > > > > > > > 		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> > > > > > > > 			    __GFP_THISNODE  | __GFP_NOWARN |
> > > > > > > > 			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> > > > > > > > 		.nid = node
> > > > > > > > 	};
> > > > > > > > 
> > > > > > > > 	target_page = alloc_migration_target(page, (unsigned long)&mtc);
> > > > > > > > 	if (target_page)
> > > > > > > > 		return target_page;
> > > > > > > > 
> > > > > > > > 	mtc.gfp_mask &= ~__GFP_THISNODE;
> > > > > > > > 	mtc.nmask = &allowed_mask;
> > > > > > > > 
> > > > > > > > 	return alloc_migration_target(page, (unsigned long)&mtc);
> > > > > > > > }
> > > > > > > 
> > > > > > > I skipped doing this in v5 because I was not sure this is really what we
> > > > > > > want.
> > > > > > 
> > > > > > I think so.  And this is the original behavior.  We should keep the
> > > > > > original behavior as much as possible, then make changes if necessary.
> > > > > > 
> > > > > 
> > > > > That is the reason I split the new page allocation into a separate patch.
> > > > > The previous discussion on this topic didn't conclude whether we really
> > > > > need to do the above or not:
> > > > > https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/
> > > > 
> > > > Please check the later email in the thread you referenced.  Both Wei and
> > > > I agree that the use case needs to be supported.  We just didn't reach
> > > > consensus about how to implement it.  If you think Wei's solution is
> > > > better (referenced as below), you can try to do that too.  Although I
> > > > think my proposed implementation is much simpler.
> > > 
> > > How about the below details
> > > 
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > index 79bd8d26feb2..cd6e71f702ad 100644
> > > --- a/include/linux/memory-tiers.h
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -21,6 +21,7 @@ void node_remove_from_memory_tier(int node);
> > >   int node_get_memory_tier_id(int node);
> > >   int node_set_memory_tier(int node, int tier);
> > >   int node_reset_memory_tier(int node, int tier);
> > > +void node_get_allowed_targets(int node, nodemask_t *targets);
> > >   #else
> > >   #define numa_demotion_enabled	false
> > >   static inline int next_demotion_node(int node)
> > > @@ -28,6 +29,10 @@ static inline int next_demotion_node(int node)
> > >   	return NUMA_NO_NODE;
> > >   }
> > >   
> > > +static inline void node_get_allowed_targets(int node, nodemask_t *targets)
> > > +{
> > > +	*targets = NODE_MASK_NONE;
> > > +}
> > >   #endif	/* CONFIG_TIERED_MEMORY */
> > >   
> > >   #endif
> > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > index b4e72b672d4d..592d939ec28d 100644
> > > --- a/mm/memory-tiers.c
> > > +++ b/mm/memory-tiers.c
> > > @@ -18,6 +18,7 @@ struct memory_tier {
> > >   
> > >   struct demotion_nodes {
> > >   	nodemask_t preferred;
> > > +	nodemask_t allowed;
> > >   };
> > >   
> > >   #define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> > > @@ -378,6 +379,25 @@ int node_set_memory_tier(int node, int tier)
> > >   }
> > >   EXPORT_SYMBOL_GPL(node_set_memory_tier);
> > >   
> > > +void node_get_allowed_targets(int node, nodemask_t *targets)
> > > +{
> > > +	/*
> > > +	 * node_demotion[] is updated without excluding this
> > > +	 * function from running.
> > > +	 *
> > > +	 * If any node is moving to lower tiers then modifications
> > > +	 * in node_demotion[] are still valid for this node, if any
> > > +	 * node is moving to higher tier then moving node may be
> > > +	 * used once for demotion which should be ok so rcu should
> > > +	 * be enough here.
> > > +	 */
> > > +	rcu_read_lock();
> > > +
> > > +	*targets = node_demotion[node].allowed;
> > > +
> > > +	rcu_read_unlock();
> > > +}
> > > +
> > >   /**
> > >    * next_demotion_node() - Get the next node in the demotion path
> > >    * @node: The starting node to lookup the next node
> > > @@ -437,8 +457,10 @@ static void __disable_all_migrate_targets(void)
> > >   {
> > >   	int node;
> > >   
> > > -	for_each_node_mask(node, node_states[N_MEMORY])
> > > +	for_each_node_mask(node, node_states[N_MEMORY]) {
> > >   		node_demotion[node].preferred = NODE_MASK_NONE;
> > > +		node_demotion[node].allowed = NODE_MASK_NONE;
> > > +	}
> > >   }
> > >   
> > >   static void disable_all_migrate_targets(void)
> > > @@ -465,7 +487,7 @@ static void establish_migration_targets(void)
> > >   	struct demotion_nodes *nd;
> > >   	int target = NUMA_NO_NODE, node;
> > >   	int distance, best_distance;
> > > -	nodemask_t used;
> > > +	nodemask_t used, allowed = NODE_MASK_NONE;
> > >   
> > >   	if (!node_demotion)
> > >   		return;
> > > @@ -511,6 +533,29 @@ static void establish_migration_targets(void)
> > >   			}
> > >   		} while (1);
> > >   	}
> > > +	/*
> > > +	 * Now build the allowed mask for each node, collecting the node mask
> > > +	 * from all memory tiers below it. This allows us to fall back demotion
> > > +	 * page allocation to a set of nodes that is closer to the above
> > > +	 * selected preferred node.
> > > +	 */
> > > +	list_for_each_entry(memtier, &memory_tiers, list)
> > > +		nodes_or(allowed, allowed, memtier->nodelist);
> > > +	/*
> > > +	 * Removes nodes not yet in N_MEMORY.
> > > +	 */
> > > +	nodes_and(allowed, node_states[N_MEMORY], allowed);
> > > +
> > > +	list_for_each_entry(memtier, &memory_tiers, list) {
> > > +		/*
> > > +		 * Keep removing current tier from allowed nodes,
> > > +		 * This will remove all nodes in current and above
> > > +		 * memory tier from the allowed mask.
> > > +		 */
> > > +		nodes_andnot(allowed, allowed, memtier->nodelist);
> > > +		for_each_node_mask(node, memtier->nodelist)
> > > +			node_demotion[node].allowed = allowed;
> > > +	}
> > >   }
> > >   
> > >   /*
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 3a8f78277f99..b0792d838efb 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1460,19 +1460,32 @@ static void folio_check_dirty_writeback(struct folio *folio,
> > >   		mapping->a_ops->is_dirty_writeback(folio, dirty, writeback);
> > >   }
> > >   
> > > -static struct page *alloc_demote_page(struct page *page, unsigned long node)
> > > +static struct page *alloc_demote_page(struct page *page, unsigned long private)
> > >   {
> > > -	struct migration_target_control mtc = {
> > > -		/*
> > > -		 * Allocate from 'node', or fail quickly and quietly.
> > > -		 * When this happens, 'page' will likely just be discarded
> > > -		 * instead of migrated.
> > > -		 */
> > > -		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> > > -			    __GFP_THISNODE  | __GFP_NOWARN |
> > > -			    __GFP_NOMEMALLOC | GFP_NOWAIT,
> > > -		.nid = node
> > > -	};
> > > +	struct page *target_page;
> > > +	nodemask_t *allowed_mask;
> > > +	struct migration_target_control *mtc;
> > > +
> > > +	mtc = (struct migration_target_control *)private;
> > > +
> > > +	allowed_mask = mtc->nmask;
> > > +	/*
> > > +	 * Make sure we allocate from the target node first, also trying to
> > > +	 * reclaim pages from the target node via kswapd if we are low on
> > > +	 * free memory on the target node. If we don't do this and we have
> > > +	 * low free memory on the target memtier, we would start allocating
> > > +	 * pages from higher memory tiers without even forcing a demotion of
> > > +	 * cold pages from the target memtier. This can result in the kernel
> > > +	 * placing hot pages in higher memory tiers.
> > > +	 */
> > > +	mtc->nmask = NULL;
> > > +	mtc->gfp_mask |= __GFP_THISNODE;
> > > +	target_page = alloc_migration_target(page, (unsigned long)mtc);
> > > +	if (target_page)
> > > +		return target_page;
> > > +
> > > +	mtc->gfp_mask &= ~__GFP_THISNODE;
> > > +	mtc->nmask = allowed_mask;
> > >   
> > >   	return alloc_migration_target(page, (unsigned long)&mtc);
> > >   }
> > > @@ -1487,6 +1500,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
> > >   {
> > >   	int target_nid = next_demotion_node(pgdat->node_id);
> > >   	unsigned int nr_succeeded;
> > > +	nodemask_t allowed_mask;
> > > +
> > > +	struct migration_target_control mtc = {
> > > +		/*
> > > +		 * Allocate from 'node', or fail quickly and quietly.
> > > +		 * When this happens, 'page' will likely just be discarded
> > > +		 * instead of migrated.
> > > +		 */
> > > +		.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> > > +			__GFP_NOMEMALLOC | GFP_NOWAIT,
> > > +		.nid = target_nid,
> > > +		.nmask = &allowed_mask
> > > +	};
> > >   
> > >   	if (list_empty(demote_pages))
> > >   		return 0;
> > > @@ -1494,10 +1520,12 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
> > >   	if (target_nid == NUMA_NO_NODE)
> > >   		return 0;
> > >   
> > > +	node_get_allowed_targets(pgdat->node_id, &allowed_mask);
> > > +
> > >   	/* Demotion ignores all cpuset and mempolicy settings */
> > >   	migrate_pages(demote_pages, alloc_demote_page, NULL,
> > > -			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> > > -			    &nr_succeeded);
> > > +		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
> > > +		      &nr_succeeded);
> > 
> > First, it addresses my requirement, thanks!  And I'd prefer to put
> > the mtc definition in alloc_demote_page(), because that keeps all the
> > page allocation logic in one function.  That improves the readability
> > of the code.
> 
> The challenge is in the allowed_mask computation. That is based on the 
> src_node and not the target_node.
> 

How about passing the src_node to alloc_demote_page()?

Best Regards,
Huang, Ying




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  8:01                   ` Aneesh Kumar K V
@ 2022-06-06  8:52                     ` Ying Huang
  2022-06-06  9:02                       ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-06  8:52 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On Mon, 2022-06-06 at 13:31 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 1:23 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
> > > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> > > 
> > > > On 6/6/22 11:03 AM, Ying Huang wrote:
> > > > > On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> > > > > > On 6/6/22 8:19 AM, Ying Huang wrote:
> > > > > > > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > > > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > > > 
> > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > demotion path relationship between NUMA nodes, which is created
> > > > > > > > > during the kernel initialization and updated when a NUMA node is
> > > > > > > > > hot-added or hot-removed.  The current implementation puts all
> > > > > > > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > > > > > > tier-by-tier by establishing the per-node demotion targets based
> > > > > > > > > on the distances between nodes.
> > > > > > > > > 
> > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > several important use cases:
> > > > > > > > > 
> > > > > > > > > The current tier initialization code always initializes
> > > > > > > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > a virtual machine) and should be put into a higher tier.
> > > > > > > > > 
> > > > > > > > > The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > tier. But on a system with HBM or GPU devices, the
> > > > > > > > > memory-only NUMA nodes mapping these devices should be in the
> > > > > > > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > > > > > > next lower tier.
> > > > > > > > > 
> > > > > > > > > With the current kernel, a higher tier node can only be demoted to
> > > > > > > > > selected nodes on the next lower tier as defined by the demotion
> > > > > > > > > path, not to any other node from any lower tier.  This strict,
> > > > > > > > > hard-coded demotion order does not work in all use cases (e.g. some
> > > > > > > > > use cases may want to allow cross-socket demotion to another node
> > > > > > > > > in the same demotion tier as a fallback when the preferred demotion
> > > > > > > > > node is out of space).  This demotion order is also inconsistent
> > > > > > > > > with the page allocation fallback order when all the nodes in a
> > > > > > > > > higher tier are out of space: the page allocation can fall back to
> > > > > > > > > any node from any lower tier, whereas the demotion order doesn't
> > > > > > > > > allow that.
> > > > > > > > > 
> > > > > > > > > The current kernel also doesn't provide any interfaces for
> > > > > > > > > userspace to learn about the memory tier hierarchy in order to
> > > > > > > > > optimize its memory allocations.
> > > > > > > > > 
> > > > > > > > > This patch series addresses the above by defining memory tiers explicitly.
> > > > > > > > > 
> > > > > > > > > This patch adds below sysfs interface which is read-only and
> > > > > > > > > can be used to read nodes available in specific tier.
> > > > > > > > > 
> > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > 
> > > > > > > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > > > > > > lowest tier.  The absolute value of a tier id number has no specific
> > > > > > > > > meaning; what matters is the relative order of the tier id numbers.
> > > > > > > > > 
> > > > > > > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > The default number of memory tiers is MAX_MEMORY_TIERS (3).  All
> > > > > > > > > the nodes are by default assigned to DEFAULT_MEMORY_TIER (1).
> > > > > > > > > 
> > > > > > > > > Default memory tier can be read from,
> > > > > > > > > /sys/devices/system/memtier/default_tier
> > > > > > > > > 
> > > > > > > > > Max memory tier can be read from,
> > > > > > > > > /sys/devices/system/memtier/max_tiers
> > > > > > > > > 
> > > > > > > > > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > > > > > > > > 
> > > > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > > > > > 
> > > > > > > > IMHO, we should change the kernel internal implementation first, then
> > > > > > > > implement the kernel/user space interface.  That is, make memory tiers
> > > > > > > > explicit inside the kernel, then expose them to user space.
> > > > > > > 
> > > > > > > Why ignore this comment for v5?  If you don't agree, please respond to me.
> > > > > > > 
> > > > > > 
> > > > > > I am not sure what benefit such a rearrangement would bring.  Right now
> > > > > > I am writing the series from the point of view of introducing all the
> > > > > > plumbing and then switching the existing demotion logic to use the new
> > > > > > infrastructure.  Redoing the code to hide all the userspace sysfs until
> > > > > > we switch the demotion logic to use the new infrastructure doesn't
> > > > > > really bring any additional clarity to patch review, and would require
> > > > > > me to redo the series with a lot of conflicts across the patches in
> > > > > > the patchset.
> > > > > 
> > > > > IMHO, we shouldn't introduce regressions even in the middle of a
> > > > > patchset.  Each step should only rely on previous patches in the series
> > > > > to work correctly.  In your current way of organization, after patch
> > > > > [1/7], on a system with 2 memory tiers, the user space interface will
> > > > > output wrong information (only 1 memory tier).  So I think the correct
> > > > > way is to make it right inside the kernel first, then expose the right
> > > > > information to user space.
> > > > > 
> > > > 
> > > > The patchset doesn't add additional tier until "mm/demotion/dax/kmem:
> > > > Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional
> > > > tiers done till all the demotion logic is in place. So even if the
> > > > system got dax/kmem, the support for adding dax/kmem as a memory tier
> > > > comes later in the patch series.
> > > 
> > > Let me clarify this a bit more.  This patchset doesn't change the
> > > existing kernel behavior until "mm/demotion: Build demotion targets
> > > based on explicit memory tiers", so there is no regression until then.
> > > It adds a parallel framework (memory tiers) alongside the existing
> > > demotion logic.
> > > 
> > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > explicit memory tiers"). With that, there will not be any regression in
> > > between the patch series.
> > > 
> > 
> > Thanks!  Please do that.  And I think you can add sysfs interface after
> > that patch too.  That is, in [1/7]
> > 
> 
> I am not sure why you insist on moving the sysfs interfaces later.  They 
> are introduced alongside the helpers they expose.  It makes patch review 
> easier to look at both the helpers and their users together in one patch.

Yes.  We should introduce a function and its user in one patch for
review.  But this doesn't mean that we should introduce the user space
interface as the first step.  I think the user space interface should
output correct information when we expose it.

> > +struct memory_tier {
> > +	nodemask_t nodelist;
> > +};
> > 
> > And struct device can be added after the kernel has switched the
> > implementation based on explicit memory tiers.
> > 
> > +struct memory_tier {
> > +	struct device dev;
> > +	nodemask_t nodelist;
> > +};
> > 
> 
> 
> Can you elaborate on this?  Or possibly review the v5 series, indicating 
> what change you are suggesting here?
> 
> 
> > But I don't think it's a good idea to have "struct device" embedded in
> > "struct memory_tier".  We don't have "struct device" embedded in "struct
> > pgdata_list"...
> > 
> 
> I avoided creating an array for memory_tier (memory_tier[]) so that we 
> can keep it dynamic. Keeping dev embedded in struct memory_tier simplify 
> the life cycle management of that dynamic list. We free the struct 
> memory_tier allocation via device release function (memtier->dev.release 
> = memory_tier_device_release )
> 
> Why do you think it is not a good idea?

I think that we shouldn't bind our kernel internal implementation with
user space interface too much.  Yes.  We can expose kernel internal
implementation to user space in a direct way.  I suggest you to follow
the style of "struct pglist_data" and "struct node".  If we decouple
"struct memory_tier" and "struct memory_tier_dev" (or some other name),
we can refer to "struct memory_tier" without depending on all device
core.  Memory tier should be accessible inside the kernel even without a
user interface.  And memory tier isn't a device in concept.

For life cycle management, I think that we can do that without sysfs
too.
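A minimal sketch of the decoupling suggested here, assuming a hypothetical memory_tier_dev wrapper name and with the struct device field elided so the model stays standalone:

```c
#include <assert.h>

typedef unsigned long nodemask_t;

/* Core type: usable anywhere in the kernel, no device-core dependency. */
struct memory_tier {
	nodemask_t nodelist;
};

/* Sysfs-facing wrapper (name assumed): would embed struct device and
 * point at, rather than embed, the core type. */
struct memory_tier_dev {
	/* struct device dev;  -- elided in this standalone model */
	struct memory_tier *tier;
};
```

With this split, kernel code can manipulate struct memory_tier even when no sysfs interface (or no device core) is involved, mirroring the pglist_data/node split mentioned above.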

Best Regards,
Huang, Ying




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  8:52                     ` Ying Huang
@ 2022-06-06  9:02                       ` Aneesh Kumar K V
  2022-06-08  1:24                         ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06  9:02 UTC (permalink / raw)
  To: Ying Huang
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On 6/6/22 2:22 PM, Ying Huang wrote:
....
>>>> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
>>>> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
>>>> with two memory tiers (DRAM and pmem) the demotion continues to work
>>>> as expected after patch 3 ("mm/demotion: Build demotion targets based on
>>>> explicit memory tiers"). With that, there will not be any regression in
>>>> between the patch series.
>>>>
>>>
>>> Thanks!  Please do that.  And I think you can add sysfs interface after
>>> that patch too.  That is, in [1/7]
>>>
>>
>> I am not sure why you insist on moving sysfs interfaces later. They are
>> introduced based on the helper added. It make patch review easier to
>> look at both the helpers and the user of the helper together in a patch.
> 
> Yes.  We should introduce a function and its user in one patch for
> review.  But this doesn't mean that we should introduce the user space
> interface as the first step.  I think the user space interface should
> output correct information when we expose it.
> 

If you look at this patchset, we are not exposing any wrong information.

patch 1 -> adds the ability to register memory tiers and expose details 
of registered memory tiers. At this point the patchset only supports the 
DRAM tier and hence only one tier is shown.

patch 2 -> adds the per-node memtier attribute. Only DRAM nodes show the 
details, because the patchset has not yet introduced a slower memory 
tier like PMEM.

patch 4 -> introduces demotion. Will become patch 5.

patch 5 -> adds dax kmem NUMA nodes as a slower memory tier. This 
becomes patch 4, at which point we will correctly show two memory tiers 
in the system.


>>> +struct memory_tier {
>>> +	nodemask_t nodelist;
>>> +};
>>>
>>> And struct device can be added after the kernel has switched the
>>> implementation based on explicit memory tiers.
>>>
>>> +struct memory_tier {
>>> +	struct device dev;
>>> +	nodemask_t nodelist;
>>> +};
>>>
>>
>>
>> Can you elaborate on this? or possibly review the v5 series indicating
>> what change you are suggesting here?
>>
>>
>>> But I don't think it's a good idea to have "struct device" embedded in
>>> "struct memory_tier".  We don't have "struct device" embedded in "struct
>>> pgdata_list"...
>>>
>>
>> I avoided creating an array for memory_tier (memory_tier[]) so that we
>> can keep it dynamic. Keeping dev embedded in struct memory_tier simplify
>> the life cycle management of that dynamic list. We free the struct
>> memory_tier allocation via device release function (memtier->dev.release
>> = memory_tier_device_release )
>>
>> Why do you think it is not a good idea?
> 
> I think that we shouldn't bind our kernel internal implementation with
> user space interface too much.  Yes.  We can expose kernel internal
> implementation to user space in a direct way.  I suggest you to follow
> the style of "struct pglist_data" and "struct node".  If we decouple
> "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> we can refer to "struct memory_tier" without depending on all device
> core.  Memory tier should be accessible inside the kernel even without a
> user interface.  And memory tier isn't a device in concept.
> 

Memory tiers are different from pglist_data and struct node in that we 
also allow their creation from userspace. That is, the lifetime of a 
memory tier is driven from userspace, and it is much easier to manage 
them via the sysfs file lifetime mechanism than to invent an independent 
and more complex way of doing the same.

> For life cycle management, I think that we can do that without sysfs
> too.
> 

Unless there are specific details that you think will be broken by 
embedding struct device inside struct memory_tier, IMHO I still consider 
the embedded implementation much simpler and in accordance with other 
kernel design patterns.
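A toy model of the lifecycle argument being made (illustrative names and a plain refcount; the real code relies on struct device refcounting and memtier->dev.release):

```c
#include <assert.h>
#include <stdlib.h>

/* The tier is freed from a release callback when the last reference is
 * dropped, the way an embedded struct device's .release hook works. */
struct memtier {
	int refcount;
	void (*release)(struct memtier *);
	unsigned long nodelist;
};

static int released;                       /* observable side effect for testing */

static void memtier_release(struct memtier *t)
{
	released = 1;
	free(t);
}

static struct memtier *memtier_create(unsigned long nodes)
{
	struct memtier *t = malloc(sizeof(*t));

	t->refcount = 1;
	t->release = memtier_release;
	t->nodelist = nodes;
	return t;
}

static void memtier_put(struct memtier *t)
{
	if (--t->refcount == 0)
		t->release(t);             /* last put frees the dynamic tier */
}
```

The point being argued is that embedding the device gives this release-driven teardown for free; the counter-argument is that the same pattern can be implemented without sysfs, as this standalone model shows.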

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-03  9:04           ` Aneesh Kumar K V
@ 2022-06-06 10:11             ` Bharata B Rao
  2022-06-06 10:16               ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Bharata B Rao @ 2022-06-06 10:11 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>> I was experimenting with this patchset and found this behaviour.
>>>> Here's what I did:
>>>>
>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>> driver by default.
>>>>
>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>> where DRAM already exists)
>>>>
>>>
>>> That should have placed it in memtier2.
>>>
>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>> that expected to happen automatically when a node with dax kmem
>>>> device comes up?
>>>>
>>>
>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>
>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>> is already part of memtier1 whose nodelist shows 0-1.
>>
> 
> can you find out which code path added node1 to memtier1?

 node_set_memory_tier_rank+0x63/0x80 
 migrate_on_reclaim_callback+0x40/0x4d 
 blocking_notifier_call_chain+0x68/0x90 
 memory_notify+0x1b/0x20 
 online_pages+0x257/0x2f0 
 memory_subsys_online+0x99/0x150 
 device_online+0x65/0x90 
 online_memory_block+0x1b/0x20 
 walk_memory_blocks+0x85/0xc0 
 ? generic_online_page+0x40/0x40 
 add_memory_resource+0x1fa/0x2d0 
 add_memory_driver_managed+0x80/0xc0 
 dev_dax_kmem_probe+0x1af/0x250 
 dax_bus_probe+0x6e/0xa0

After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
from dev_dax_kmem_probe() finds that the memtier is already set.

> Do you have regular memory also appearing on node1?

No, regular memory is on Node0.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-06 10:11             ` Bharata B Rao
@ 2022-06-06 10:16               ` Aneesh Kumar K V
  2022-06-06 11:54                 ` Aneesh Kumar K.V
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06 10:16 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/6/22 3:41 PM, Bharata B Rao wrote:
> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>> I was experimenting with this patchset and found this behaviour.
>>>>> Here's what I did:
>>>>>
>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>> driver by default.
>>>>>
>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>> where DRAM already exists)
>>>>>
>>>>
>>>> That should have placed it in memtier2.
>>>>
>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>> that expected to happen automatically when a node with dax kmem
>>>>> device comes up?
>>>>>
>>>>
>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>
>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>> is already part of memtier1 whose nodelist shows 0-1.
>>>
>>
>> can you find out which code path added node1 to memtier1?
> 
>   node_set_memory_tier_rank+0x63/0x80
>   migrate_on_reclaim_callback+0x40/0x4d
>   blocking_notifier_call_chain+0x68/0x90
>   memory_notify+0x1b/0x20
>   online_pages+0x257/0x2f0
>   memory_subsys_online+0x99/0x150
>   device_online+0x65/0x90
>   online_memory_block+0x1b/0x20
>   walk_memory_blocks+0x85/0xc0
>   ? generic_online_page+0x40/0x40
>   add_memory_resource+0x1fa/0x2d0
>   add_memory_driver_managed+0x80/0xc0
>   dev_dax_kmem_probe+0x1af/0x250
>   dax_bus_probe+0x6e/0xa0
> 
> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
> from dev_dax_kmem_probe() finds that the memtier is already set.
> 
>> Do you have regular memory also appearing on node1?
> 
> No, regular memory is on Node0.
> 

Thanks for the stack trace. I was setting up KVM on my laptop to test 
this. We should move node_set_mem_tier() earlier. You have automatic 
onlining on memory hotplug:

	/* online pages if requested */
	if (mhp_default_online_type != MMOP_OFFLINE)
		walk_memory_blocks(start, size, NULL, online_memory_block);


which causes memory to be onlined before we can do node_set_mem_tier(). 
That is a bug on my side. I will send you a change after testing.
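The ordering problem can be modeled in a few lines (function names and tier values here are illustrative, not the kernel's actual API):

```c
#include <assert.h>

/* Illustrative model: the memory-hotplug notifier assigns the default
 * tier as a side effect of onlining, so the driver's explicit tier
 * assignment must happen before the memory blocks are onlined. */
enum { TIER_UNSET = -1, DEFAULT_TIER = 1, PMEM_TIER = 2 };

static int node_tier = TIER_UNSET;

static void online_pages_notifier(void)
{
	if (node_tier == TIER_UNSET)      /* hotplug callback's default placement */
		node_tier = DEFAULT_TIER;
}

static void node_set_memory_tier(int tier)
{
	if (node_tier == TIER_UNSET)      /* only places a node not yet in a tier */
		node_tier = tier;
}

/* Buggy order: online first, then set -> node stuck in the default tier. */
static int probe_buggy(void)
{
	node_tier = TIER_UNSET;
	online_pages_notifier();
	node_set_memory_tier(PMEM_TIER);
	return node_tier;
}

/* Fixed order: set the tier before onlining. */
static int probe_fixed(void)
{
	node_tier = TIER_UNSET;
	node_set_memory_tier(PMEM_TIER);
	online_pages_notifier();
	return node_tier;
}
```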

-aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-06 10:16               ` Aneesh Kumar K V
@ 2022-06-06 11:54                 ` Aneesh Kumar K.V
  2022-06-06 12:09                   ` Bharata B Rao
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-06 11:54 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 6/6/22 3:41 PM, Bharata B Rao wrote:
>> On 6/3/2022 2:34 PM, Aneesh Kumar K V wrote:
>>> On 6/2/22 12:06 PM, Bharata B Rao wrote:
>>>> On 6/1/2022 7:19 PM, Aneesh Kumar K V wrote:
>>>>> On 6/1/22 11:59 AM, Bharata B Rao wrote:
>>>>>> I was experimenting with this patchset and found this behaviour.
>>>>>> Here's what I did:
>>>>>>
>>>>>> Boot a KVM guest with vNVDIMM device which ends up with device_dax
>>>>>> driver by default.
>>>>>>
>>>>>> Use it as RAM by binding it to dax kmem driver. It now appears as
>>>>>> RAM with a new NUMA node that is put to memtier1 (the existing tier
>>>>>> where DRAM already exists)
>>>>>>
>>>>>
>>>>> That should have placed it in memtier2.
>>>>>
>>>>>> I can move it to memtier2 (MEMORY_RANK_PMEM) manually, but isn't
>>>>>> that expected to happen automatically when a node with dax kmem
>>>>>> device comes up?
>>>>>>
>>>>>
>>>>> This can happen if we have added the same NUMA node to memtier1 before dax kmem driver initialized the pmem memory. Can you check before the above node_set_memory_tier_rank() whether the specific NUMA node is already part of any memory tier?
>>>>
>>>> When we reach node_set_memory_tier_rank(), node1 (that has the pmem device)
>>>> is already part of memtier1 whose nodelist shows 0-1.
>>>>
>>>
>>> can you find out which code path added node1 to memtier1?
>> 
>>   node_set_memory_tier_rank+0x63/0x80
>>   migrate_on_reclaim_callback+0x40/0x4d
>>   blocking_notifier_call_chain+0x68/0x90
>>   memory_notify+0x1b/0x20
>>   online_pages+0x257/0x2f0
>>   memory_subsys_online+0x99/0x150
>>   device_online+0x65/0x90
>>   online_memory_block+0x1b/0x20
>>   walk_memory_blocks+0x85/0xc0
>>   ? generic_online_page+0x40/0x40
>>   add_memory_resource+0x1fa/0x2d0
>>   add_memory_driver_managed+0x80/0xc0
>>   dev_dax_kmem_probe+0x1af/0x250
>>   dax_bus_probe+0x6e/0xa0
>> 
>> After this the explicit call to node_set_memory_tier_rank(numa_node, MEMORY_RANK_PMEM)
>> from dev_dax_kmem_probe() finds that the memtier is already set.
>> 
>>> Do you have regular memory also appearing on node1?
>> 
>> No, regular memory is on Node0.
>> 
>
> Thanks for the stack trace. I was getting the kvm setup on my laptop to 
> test this. We should move node_set_mem_tier() early. You had automatic 
> online on memory hotplug
>
> 	/* online pages if requested */
> 	if (mhp_default_online_type != MMOP_OFFLINE)
> 		walk_memory_blocks(start, size, NULL, online_memory_block);
>
>
> which caused memory to be onlined before we could do node_set_mem_tier. 
> That is a bug on my side. Will send you a change after testing .
>
Can you try this change?

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7a11c387fbbc..905609260dda 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		goto err_reg_mgid;
 	data->mgid = rc;
 
+	/*
+	 * This get called before the node is brought online. That
+	 * is because depending on the value of mhp_default_online_type
+	 * the kernel will online the memory along with hotplug
+	 * operation. Add the new memory tier before we try to bring
+	 * memory blocks online. Otherwise new node will get added to
+	 * the default memory tier via hotplug callbacks.
+	 */
+#ifdef CONFIG_TIERED_MEMORY
+	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+#endif
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct resource *res;
 		struct range range;
@@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 
 	dev_set_drvdata(dev, data);
 
-#ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
-#endif
 	return 0;
 
 err_request_mem:


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-06 11:54                 ` Aneesh Kumar K.V
@ 2022-06-06 12:09                   ` Bharata B Rao
  2022-06-06 13:00                     ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Bharata B Rao @ 2022-06-06 12:09 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
> Can you try this change?
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7a11c387fbbc..905609260dda 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		goto err_reg_mgid;
>  	data->mgid = rc;
>  
> +	/*
> +	 * This get called before the node is brought online. That
> +	 * is because depending on the value of mhp_default_online_type
> +	 * the kernel will online the memory along with hotplug
> +	 * operation. Add the new memory tier before we try to bring
> +	 * memory blocks online. Otherwise new node will get added to
> +	 * the default memory tier via hotplug callbacks.
> +	 */
> +#ifdef CONFIG_TIERED_MEMORY
> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> +#endif
>  	for (i = 0; i < dev_dax->nr_range; i++) {
>  		struct resource *res;
>  		struct range range;
> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  
>  	dev_set_drvdata(dev, data);
>  
> -#ifdef CONFIG_TIERED_MEMORY
> -	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
> -#endif
>  	return 0;
>  
>  err_request_mem:

Yes, this fixes the issue for me. Thanks.

Regards,
Bharata.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM
  2022-06-06 12:09                   ` Bharata B Rao
@ 2022-06-06 13:00                     ` Aneesh Kumar K V
  0 siblings, 0 replies; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06 13:00 UTC (permalink / raw)
  To: Bharata B Rao, linux-mm, akpm
  Cc: Huang Ying, Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen,
	Brice Goglin, Michal Hocko, Linux Kernel Mailing List,
	Hesham Almatary, Dave Hansen, Jonathan Cameron, Alistair Popple,
	Dan Williams, Feng Tang, Jagdish Gediya, Baolin Wang,
	David Rientjes

On 6/6/22 5:39 PM, Bharata B Rao wrote:
> On 6/6/2022 5:24 PM, Aneesh Kumar K.V wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>>
>> Can you try this change?
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7a11c387fbbc..905609260dda 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -94,6 +94,17 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   		goto err_reg_mgid;
>>   	data->mgid = rc;
>>   
>> +	/*
>> +	 * This get called before the node is brought online. That
>> +	 * is because depending on the value of mhp_default_online_type
>> +	 * the kernel will online the memory along with hotplug
>> +	 * operation. Add the new memory tier before we try to bring
>> +	 * memory blocks online. Otherwise new node will get added to
>> +	 * the default memory tier via hotplug callbacks.
>> +	 */
>> +#ifdef CONFIG_TIERED_MEMORY
>> +	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> +#endif
>>   	for (i = 0; i < dev_dax->nr_range; i++) {
>>   		struct resource *res;
>>   		struct range range;
>> @@ -148,9 +159,6 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   
>>   	dev_set_drvdata(dev, data);
>>   
>> -#ifdef CONFIG_TIERED_MEMORY
>> -	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
>> -#endif
>>   	return 0;
>>   
>>   err_request_mem:
> 
> Yes, this fixes the issue for me. Thanks.
> 

I might put the below change instead of the above one. In the end I 
guess it is better to add a NUMA node to a memory tier after the node is 
brought online rather than before, even though with the current code it 
shouldn't matter much.

modified   drivers/dax/kmem.c
@@ -147,9 +147,15 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
  	}

  	dev_set_drvdata(dev, data);
-
+	/*
+	 * node_reset_memory_tier is used here to ensure we force
+	 * update the NUMA node memory tier. Depending on the value
+	 * of mhp_default_online_type the kernel will online the memory
+	 * blocks along with hotplug operation above. This can result in dax
+	 * kmem memory NUMA node getting added to default memory tier.
+	 */
  #ifdef CONFIG_TIERED_MEMORY
-	node_set_memory_tier(numa_node, MEMORY_TIER_PMEM);
+	node_reset_memory_tier(numa_node, MEMORY_TIER_PMEM);
  #endif
  	return 0;


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-06-03  8:40       ` Aneesh Kumar K V
@ 2022-06-06 14:59         ` Jonathan Cameron
  2022-06-06 16:01           ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Jonathan Cameron @ 2022-06-06 14:59 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On Fri, 3 Jun 2022 14:10:47 +0530
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:

> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
> > On Fri, 27 May 2022 17:55:23 +0530
> > "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> >   
> >> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> >>
> >> Add support to read/write the memory tierindex for a NUMA node.
> >>
> >> /sys/devices/system/node/nodeN/memtier
> >>
> >> where N = node id
> >>
> >> When read, It list the memory tier that the node belongs to.
> >>
> >> When written, the kernel moves the node into the specified
> >> memory tier, the tier assignment of all other nodes are not
> >> affected.
> >>
> >> If the memory tier does not exist, writing to the above file
> >> create the tier and assign the NUMA node to that tier.  
> > creates
> > 
> > There was some discussion in v2 of Wei Xu's RFC that what matter
> > for creation is the rank, not the tier number.
> > 
> > My suggestion is move to an explicit creation file such as
> > memtier/create_tier_from_rank
> > to which writing the rank gives results in a new tier
> > with the next device ID and requested rank.  
> 
> I think the below workflow is much simpler.
> 
> :/sys/devices/system# cat memtier/memtier1/nodelist
> 1-3
> :/sys/devices/system# cat node/node1/memtier
> 1
> :/sys/devices/system# ls memtier/memtier*
> nodelist  power  rank  subsystem  uevent
> /sys/devices/system# ls memtier/
> default_rank  max_tier  memtier1  power  uevent
> :/sys/devices/system# echo 2 > node/node1/memtier
> :/sys/devices/system#
> 
> :/sys/devices/system# ls memtier/
> default_rank  max_tier  memtier1  memtier2  power  uevent
> :/sys/devices/system# cat memtier/memtier1/nodelist
> 2-3
> :/sys/devices/system# cat memtier/memtier2/nodelist
> 1
> :/sys/devices/system#
> 
> ie, to create a tier we just write the tier id/tier index to 
> node/nodeN/memtier file. That will create a new memory tier if needed 
> and add the node to that specific memory tier. Since for now we are 
> having 1:1 mapping between tier index to rank value, we can derive the 
> rank value from the memory tier index.
> 
> For dynamic memory tier support, we can assign a rank value such that 
> new memory tiers are always created such that it comes last in the 
> demotion order.

I'm not keen on having to pass through an intermediate state where
the rank may well be wrong, but I guess it's not that harmful even
if it feels wrong ;)

Races are potentially a bit of a pain though depending on what we
expect the usage model to be.

There are patterns (CXL regions for example) of guaranteeing the
'right' device is created by doing something like 

cat create_tier > temp.txt 
#(temp gets 2 for example on first call then
# next read of this file gets 3 etc)

cat temp.txt > create_tier
# will fail if there hasn't been a read of the same value

Assuming all software keeps to the model, then there are no
race conditions over creation.  Otherwise we have two new
devices turn up very close to each other and userspace scripting
tries to create two new tiers - if it races they may end up in
the same tier when that wasn't the intent.  Then code to set
the rank also races and we get two potentially very different
memories in a tier with a randomly selected rank.

Fun and games...  And a fine illustration why sysfs based 'device'
creation is tricky to get right (and lots of cases in the kernel
don't).
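
The read-then-write-back handshake can be simulated in userspace with plain
temp files standing in for the hypothetical create_tier sysfs attribute
(everything below is illustrative only; no such file exists in the posted
patches — kernel-side the read would hand out the id and the write would
validate it atomically):

```shell
#!/bin/sh
# Simulation of the read/write handshake: a read of create_tier hands out
# the next free tier id; a write succeeds only if it echoes back an id
# that was handed out and not yet claimed.

CTR=$(mktemp)      # next tier id to hand out
PENDING=$(mktemp)  # ids handed out but not yet written back
echo 2 > "$CTR"

read_create_tier() {                 # emulates: cat create_tier
    id=$(cat "$CTR")
    echo $((id + 1)) > "$CTR"
    echo "$id" >> "$PENDING"
    echo "$id"
}

write_create_tier() {                # emulates: echo "$1" > create_tier
    grep -qx "$1" "$PENDING" || return 1      # never handed out: reject
    grep -vx "$1" "$PENDING" > "$PENDING.tmp" || :
    mv "$PENDING.tmp" "$PENDING"
    return 0                                  # tier "$1" created
}

a=$(read_create_tier)                # process A reads: gets 2
b=$(read_create_tier)                # process B reads: gets 3
write_create_tier "$a"               # A creates tier 2
write_create_tier "$b"               # B creates tier 3 - no collision
```

Because each reader is handed a distinct id, two racing processes can no
longer land on the same tier by accident, and a stale or never-issued id
is simply rejected.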

Jonathan


> 
> -aneesh
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-06-06 14:59         ` Jonathan Cameron
@ 2022-06-06 16:01           ` Aneesh Kumar K V
  2022-06-06 16:16             ` Jonathan Cameron
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06 16:01 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On 6/6/22 8:29 PM, Jonathan Cameron wrote:
> On Fri, 3 Jun 2022 14:10:47 +0530
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
> 
>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>> On Fri, 27 May 2022 17:55:23 +0530
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>>>    
>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>
>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>
>>>> /sys/devices/system/node/nodeN/memtier
>>>>
>>>> where N = node id
>>>>
>>>> When read, It list the memory tier that the node belongs to.
>>>>
>>>> When written, the kernel moves the node into the specified
>>>> memory tier, the tier assignment of all other nodes are not
>>>> affected.
>>>>
>>>> If the memory tier does not exist, writing to the above file
>>>> create the tier and assign the NUMA node to that tier.
>>> creates
>>>
>>> There was some discussion in v2 of Wei Xu's RFC that what matter
>>> for creation is the rank, not the tier number.
>>>
>>> My suggestion is move to an explicit creation file such as
>>> memtier/create_tier_from_rank
>>> to which writing the rank gives results in a new tier
>>> with the next device ID and requested rank.
>>
>> I think the below workflow is much simpler.
>>
>> :/sys/devices/system# cat memtier/memtier1/nodelist
>> 1-3
>> :/sys/devices/system# cat node/node1/memtier
>> 1
>> :/sys/devices/system# ls memtier/memtier*
>> nodelist  power  rank  subsystem  uevent
>> /sys/devices/system# ls memtier/
>> default_rank  max_tier  memtier1  power  uevent
>> :/sys/devices/system# echo 2 > node/node1/memtier
>> :/sys/devices/system#
>>
>> :/sys/devices/system# ls memtier/
>> default_rank  max_tier  memtier1  memtier2  power  uevent
>> :/sys/devices/system# cat memtier/memtier1/nodelist
>> 2-3
>> :/sys/devices/system# cat memtier/memtier2/nodelist
>> 1
>> :/sys/devices/system#
>>
>> ie, to create a tier we just write the tier id/tier index to
>> node/nodeN/memtier file. That will create a new memory tier if needed
>> and add the node to that specific memory tier. Since for now we are
>> having 1:1 mapping between tier index to rank value, we can derive the
>> rank value from the memory tier index.
>>
>> For dynamic memory tier support, we can assign a rank value such that
>> new memory tiers are always created such that it comes last in the
>> demotion order.
> 
> I'm not keen on having to pass through an intermediate state where
> the rank may well be wrong, but I guess it's not that harmful even
> if it feels wrong ;)
> 

Any new memory tier added can be of lowest rank (rank - 0) and hence 
will appear as the highest memory tier in demotion order. User can then
assign the right rank value to the memory tier? Also the actual demotion 
target paths are built during memory block online, which in most cases 
would happen after we properly verify that the device got assigned to 
the right memory tier with correct rank value?

> Races are potentially a bit of a pain though depending on what we
> expect the usage model to be.
> 
> There are patterns (CXL regions for example) of guaranteeing the
> 'right' device is created by doing something like
> 
> cat create_tier > temp.txt
> #(temp gets 2 for example on first call then
> # next read of this file gets 3 etc)
> 
> cat temp.txt > create_tier
> # will fail if there hasn't been a read of the same value
> 
> Assuming all software keeps to the model, then there are no
> race conditions over creation.  Otherwise we have two new
> devices turn up very close to each other and userspace scripting
> tries to create two new tiers - if it races they may end up in
> the same tier when that wasn't the intent.  Then code to set
> the rank also races and we get two potentially very different
> memories in a tier with a randomly selected rank.
> 
> Fun and games...  And a fine illustration why sysfs based 'device'
> creation is tricky to get right (and lots of cases in the kernel
> don't).
> 

I would expect userspace to be careful and verify the memory tier and 
rank value before we online the memory blocks backed by the device. Even 
if we race, the result would be two devices not intended to be part of 
the same memory tier ending up in the same tier. But at that point we 
won't be building demotion targets yet. So userspace could verify this 
and move the nodes out of the memory tier. Once that is verified, the 
memory blocks can be onlined.
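
A sketch of that verify-before-online flow in shell, against the sysfs
layout proposed in this series (node/nodeN/memtier and memtier/memtierM/rank;
none of this is in mainline, and SYS is parameterized so the logic can be
exercised against a mock directory tree):

```shell
#!/bin/sh
# Verify a node's tier placement and the tier's rank before onlining the
# node's memory blocks. Paths follow the proposed (non-mainline) layout.
SYS=${SYS:-/sys/devices/system}

verify_node_tier() {    # verify_node_tier <node> <want-tier> <want-rank>
    tier=$(cat "$SYS/node/node$1/memtier" 2>/dev/null) || return 1
    [ "$tier" = "$2" ] || return 1        # landed in the wrong tier
    rank=$(cat "$SYS/memtier/memtier$tier/rank" 2>/dev/null) || return 1
    [ "$rank" = "$3" ]                    # tier carries the wrong rank
}

online_node_memory() {  # online all memory blocks of node <node>
    for blk in "$SYS/node/node$1"/memory*; do
        echo online > "$blk/state"
    done
}

# Intended usage: only online once the placement checks out, e.g.
#   verify_node_tier 1 2 100 && online_node_memory 1
```

If the check fails, userspace can rewrite node/nodeN/memtier (or the tier's
rank) and re-verify before any memory block is onlined.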

Having said that, can you outline the usage of 
memtier/create_tier_from_rank?

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-06-06 16:01           ` Aneesh Kumar K V
@ 2022-06-06 16:16             ` Jonathan Cameron
  2022-06-06 16:39               ` Aneesh Kumar K V
  0 siblings, 1 reply; 66+ messages in thread
From: Jonathan Cameron @ 2022-06-06 16:16 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On Mon, 6 Jun 2022 21:31:16 +0530
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:

> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
> > On Fri, 3 Jun 2022 14:10:47 +0530
> > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
> >   
> >> On 5/27/22 7:45 PM, Jonathan Cameron wrote:  
> >>> On Fri, 27 May 2022 17:55:23 +0530
> >>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
> >>>      
> >>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> >>>>
> >>>> Add support to read/write the memory tierindex for a NUMA node.
> >>>>
> >>>> /sys/devices/system/node/nodeN/memtier
> >>>>
> >>>> where N = node id
> >>>>
> >>>> When read, It list the memory tier that the node belongs to.
> >>>>
> >>>> When written, the kernel moves the node into the specified
> >>>> memory tier, the tier assignment of all other nodes are not
> >>>> affected.
> >>>>
> >>>> If the memory tier does not exist, writing to the above file
> >>>> create the tier and assign the NUMA node to that tier.  
> >>> creates
> >>>
> >>> There was some discussion in v2 of Wei Xu's RFC that what matter
> >>> for creation is the rank, not the tier number.
> >>>
> >>> My suggestion is move to an explicit creation file such as
> >>> memtier/create_tier_from_rank
> >>> to which writing the rank gives results in a new tier
> >>> with the next device ID and requested rank.  
> >>
> >> I think the below workflow is much simpler.
> >>
> >> :/sys/devices/system# cat memtier/memtier1/nodelist
> >> 1-3
> >> :/sys/devices/system# cat node/node1/memtier
> >> 1
> >> :/sys/devices/system# ls memtier/memtier*
> >> nodelist  power  rank  subsystem  uevent
> >> /sys/devices/system# ls memtier/
> >> default_rank  max_tier  memtier1  power  uevent
> >> :/sys/devices/system# echo 2 > node/node1/memtier
> >> :/sys/devices/system#
> >>
> >> :/sys/devices/system# ls memtier/
> >> default_rank  max_tier  memtier1  memtier2  power  uevent
> >> :/sys/devices/system# cat memtier/memtier1/nodelist
> >> 2-3
> >> :/sys/devices/system# cat memtier/memtier2/nodelist
> >> 1
> >> :/sys/devices/system#
> >>
> >> ie, to create a tier we just write the tier id/tier index to
> >> node/nodeN/memtier file. That will create a new memory tier if needed
> >> and add the node to that specific memory tier. Since for now we are
> >> having 1:1 mapping between tier index to rank value, we can derive the
> >> rank value from the memory tier index.
> >>
> >> For dynamic memory tier support, we can assign a rank value such that
> >> new memory tiers are always created such that it comes last in the
> >> demotion order.  
> > 
> > I'm not keen on having to pass through an intermediate state where
> > the rank may well be wrong, but I guess it's not that harmful even
> > if it feels wrong ;)
> >   
> 
> Any new memory tier added can be of lowest rank (rank - 0) and hence 
> will appear as the highest memory tier in demotion order. 

Depends on driver interaction - if new memory is CXL attached or
GPU attached, chances are the driver has an input on which tier
it is put in by default.

> User can then
> assign the right rank value to the memory tier? Also the actual demotion 
> target paths are built during memory block online which in most case 
> would happen after we properly verify that the device got assigned to 
> the right memory tier with correct rank value?

Agreed, though that may change the model of how memory is brought online
somewhat.

> 
> > Races are potentially a bit of a pain though depending on what we
> > expect the usage model to be.
> > 
> > There are patterns (CXL regions for example) of guaranteeing the
> > 'right' device is created by doing something like
> > 
> > cat create_tier > temp.txt
> > #(temp gets 2 for example on first call then
> > # next read of this file gets 3 etc)
> > 
> > cat temp.txt > create_tier
> > # will fail if there hasn't been a read of the same value
> > 
> > Assuming all software keeps to the model, then there are no
> > race conditions over creation.  Otherwise we have two new
> > devices turn up very close to each other and userspace scripting
> > tries to create two new tiers - if it races they may end up in
> > the same tier when that wasn't the intent.  Then code to set
> > the rank also races and we get two potentially very different
> > memories in a tier with a randomly selected rank.
> > 
> > Fun and games...  And a fine illustration why sysfs based 'device'
> > creation is tricky to get right (and lots of cases in the kernel
> > don't).
> >   
> 
> I would expect userspace to be careful and verify the memory tier and 
> rank value before we online the memory blocks backed by the device. Even 
> if we race, the result would be two device not intended to be part of 
> the same memory tier appearing at the same tier. But then we won't be 
> building demotion targets yet. So userspace could verify this, move the 
> nodes out of the memory tier. Once it is verified, memory blocks can be 
> onlined.

The race is there and not avoidable as far as I can see. Two processes A and B.

A checks for a spare tier number
B checks for a spare tier number
A tries to assign node 3 to new tier 2 (new tier created)
B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
is the same method we'd use to put it in the existing tier we can't tell this
write was meant to create a new tier).
A writes rank 100 to tier 2
A checks rank for tier 2 and finds it is 100 as expected.
B writes rank 200 to tier 2 (it could check whether it is still the default, but even that is racy)
B checks the rank for tier 2 and finds it is 200 as expected.
A onlines memory.
B onlines memory.

Both think they got what they wanted, but A definitely didn't.

One workaround is the read/write approach and create_tier.

A reads create_tier - gets 2.
B reads create_tier - gets 3.
A writes 2 to create_tier as that's what it read.
B writes 3 to create_tier as that's what it read.

Continue with the created tiers.  Obviously this can exhaust tier
numbers, but if the file is root-only, root could just create lots
anyway, so we are no worse off.
 
> 
> Having said that can you outline the usage of 
> memtier/create_tier_from_rank ?

There are corner cases to deal with...

A writes 100 to create_tier_from_rank.
A goes looking for matching tier - finds it: tier2
B writes 200 to create_tier_from_rank
B goes looking for matching tier - finds it: tier3

rest is fine as operating on different tiers.

Trickier is
A writes 100 to create_tier_from_rank  - succeed.
B writes 100 to create_tier_from_rank  - Could fail, or could just eat it?

Logically this is the same as a separate create_tier and then a write
of the rank, combined into one operation, but then you need to search
for the right one.  As such, perhaps a create_tier
that does the read/write pair as above is the best solution.
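
Modeling the create_tier_from_rank behaviour sketched above - find the tier
with the requested rank, or create one with the next free id, and simply
"eat" duplicate requests - as a toy shell script (purely illustrative; temp
files stand in for kernel state):

```shell
#!/bin/sh
# Toy model of create_tier_from_rank: writing a rank returns the existing
# tier with that rank, or creates a new tier with the next free tier id.
TIERS=$(mktemp)   # lines of "<tier-id> <rank>"
NEXT=$(mktemp)    # next free tier id (0/1 taken by boot-time tiers, say)
echo 2 > "$NEXT"

create_tier_from_rank() {
    tier=$(awk -v r="$1" '$2 == r { print $1; exit }' "$TIERS")
    if [ -z "$tier" ]; then          # no tier with this rank: create one
        tier=$(cat "$NEXT")
        echo $((tier + 1)) > "$NEXT"
        echo "$tier $1" >> "$TIERS"
    fi
    echo "$tier"                     # duplicate rank: "eat it"
}

a=$(create_tier_from_rank 100)   # A: creates tier 2
b=$(create_tier_from_rank 200)   # B: creates tier 3
c=$(create_tier_from_rank 100)   # same rank as A: returns tier 2 again
```

With these semantics the second writer of rank 100 silently gets the
existing tier back, which is the "could just eat it" option; the
alternative is to make the duplicate write fail.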

Jonathan


> 
> -aneesh


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-06-06 16:16             ` Jonathan Cameron
@ 2022-06-06 16:39               ` Aneesh Kumar K V
  2022-06-06 17:46                 ` Aneesh Kumar K.V
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-06 16:39 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

On 6/6/22 9:46 PM, Jonathan Cameron wrote:
> On Mon, 6 Jun 2022 21:31:16 +0530
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
> 
>> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
>>> On Fri, 3 Jun 2022 14:10:47 +0530
>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
>>>    
>>>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>>>> On Fri, 27 May 2022 17:55:23 +0530
>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>>>>>       
>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>
>>>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>>>
>>>>>> /sys/devices/system/node/nodeN/memtier
>>>>>>
>>>>>> where N = node id
>>>>>>
>>>>>> When read, It list the memory tier that the node belongs to.
>>>>>>
>>>>>> When written, the kernel moves the node into the specified
>>>>>> memory tier, the tier assignment of all other nodes are not
>>>>>> affected.
>>>>>>
>>>>>> If the memory tier does not exist, writing to the above file
>>>>>> create the tier and assign the NUMA node to that tier.
>>>>> creates
>>>>>
>>>>> There was some discussion in v2 of Wei Xu's RFC that what matter
>>>>> for creation is the rank, not the tier number.
>>>>>
>>>>> My suggestion is move to an explicit creation file such as
>>>>> memtier/create_tier_from_rank
>>>>> to which writing the rank gives results in a new tier
>>>>> with the next device ID and requested rank.
>>>>
>>>> I think the below workflow is much simpler.
>>>>
>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>> 1-3
>>>> :/sys/devices/system# cat node/node1/memtier
>>>> 1
>>>> :/sys/devices/system# ls memtier/memtier*
>>>> nodelist  power  rank  subsystem  uevent
>>>> /sys/devices/system# ls memtier/
>>>> default_rank  max_tier  memtier1  power  uevent
>>>> :/sys/devices/system# echo 2 > node/node1/memtier
>>>> :/sys/devices/system#
>>>>
>>>> :/sys/devices/system# ls memtier/
>>>> default_rank  max_tier  memtier1  memtier2  power  uevent
>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>> 2-3
>>>> :/sys/devices/system# cat memtier/memtier2/nodelist
>>>> 1
>>>> :/sys/devices/system#
>>>>
>>>> ie, to create a tier we just write the tier id/tier index to
>>>> node/nodeN/memtier file. That will create a new memory tier if needed
>>>> and add the node to that specific memory tier. Since for now we are
>>>> having 1:1 mapping between tier index to rank value, we can derive the
>>>> rank value from the memory tier index.
>>>>
>>>> For dynamic memory tier support, we can assign a rank value such that
>>>> new memory tiers are always created such that it comes last in the
>>>> demotion order.
>>>
>>> I'm not keen on having to pass through an intermediate state where
>>> the rank may well be wrong, but I guess it's not that harmful even
>>> if it feels wrong ;)
>>>    
>>
>> Any new memory tier added can be of lowest rank (rank - 0) and hence
>> will appear as the highest memory tier in demotion order.
> 
> Depends on driver interaction - if new memory is CXL attached or
> GPU attached, chances are the driver has an input on which tier
> it is put in by default.
> 
>> User can then
>> assign the right rank value to the memory tier? Also the actual demotion
>> target paths are built during memory block online which in most case
>> would happen after we properly verify that the device got assigned to
>> the right memory tier with correct rank value?
> 
> Agreed, though that may change the model of how memory is brought online
> somewhat.
> 
>>
>>> Races are potentially a bit of a pain though depending on what we
>>> expect the usage model to be.
>>>
>>> There are patterns (CXL regions for example) of guaranteeing the
>>> 'right' device is created by doing something like
>>>
>>> cat create_tier > temp.txt
>>> #(temp gets 2 for example on first call then
>>> # next read of this file gets 3 etc)
>>>
>>> cat temp.txt > create_tier
>>> # will fail if there hasn't been a read of the same value
>>>
>>> Assuming all software keeps to the model, then there are no
>>> race conditions over creation.  Otherwise we have two new
>>> devices turn up very close to each other and userspace scripting
>>> tries to create two new tiers - if it races they may end up in
>>> the same tier when that wasn't the intent.  Then code to set
>>> the rank also races and we get two potentially very different
>>> memories in a tier with a randomly selected rank.
>>>
>>> Fun and games...  And a fine illustration why sysfs based 'device'
>>> creation is tricky to get right (and lots of cases in the kernel
>>> don't).
>>>    
>>
>> I would expect userspace to be careful and verify the memory tier and
>> rank value before we online the memory blocks backed by the device. Even
>> if we race, the result would be two devices not intended to be part of
>> the same memory tier ending up in the same tier. But at that point we
>> won't be building demotion targets yet. So userspace could verify this
>> and move the nodes out of the memory tier. Once that is verified, the
>> memory blocks can be onlined.
> 
> The race is there and not avoidable as far as I can see. Two processes A and B.
> 
> A checks for a spare tier number
> B checks for a spare tier number
> A tries to assign node 3 to new tier 2 (new tier created)
> B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
> is the same method we'd use to put it in the existing tier we can't tell this
> write was meant to create a new tier).
> A writes rank 100 to tier 2
> A checks rank for tier 2 and finds it is 100 as expected.
> B writes rank 200 to tier 2 (it could check whether it is still the default, but even that is racy)
> B checks the rank for tier 2 and finds it is 200 as expected.
> A onlines memory.
> B onlines memory.
> 
> Both think they got what they wanted, but A definitely didn't.
> 
> One work around is the read / write approach and create_tier.
> 
> A reads create_tier - gets 2.
> B reads create_tier - gets 3.
> A writes 2 to create_tier as that's what it read.
> B writes 3 to create_tier as that's what it read.
> 
> continue with created tiers.  Obviously can exhaust tiers, but if this is
> root only, could just create lots anyway so no worse off.
>   
>>
>> Having said that can you outline the usage of
>> memtier/create_tier_from_rank ?
> 
> There are corner cases to deal with...
> 
> A writes 100 to create_tier_from_rank.
> A goes looking for matching tier - finds it: tier2
> B writes 200 to create_tier_from_rank
> B goes looking for matching tier - finds it: tier3
> 
> rest is fine as operating on different tiers.
> 
> Trickier is
> A writes 100 to create_tier_from_rank  - succeed.
> B writes 100 to create_tier_from_rank  - Could fail, or could just eat it?
> 
> Logically this is same as separate create_tier and then a write
> of rank, but in one operation, but then you need to search
> for the right one.  As such, perhaps a create_tier
> that does the read/write pair as above is the best solution.
> 

This all is good when we allow dynamic rank values. But currently we are 
restricting ourselves to three rank values, as below:

rank   memtier
300    memtier0
200    memtier1
100    memtier2
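
Under this fixed scheme the mapping is trivially derivable in either
direction (a sketch; the constants follow the table above):

```shell
#!/bin/sh
# Fixed 1:1 mapping between tier index and rank from the table above:
# memtier0 -> 300, memtier1 -> 200, memtier2 -> 100.
rank_of_tier() { echo $(( (3 - $1) * 100 )); }
tier_of_rank() { echo $(( 3 - $1 / 100 )); }
```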

Now with the above, how do we define a write to create_tier_from_rank? 
What should the behavior be if the user writes a value other than the 
rank values defined above? Also, enforcing these three rank values as 
the only supported ones implies teaching userspace about them. I am 
trying to see how to fit in create_tier_from_rank without requiring that.

Can we look at implementing create_tier_from_rank when we start 
supporting dynamic tiers/rank values?  That is, we still allow 
node/nodeN/memtier, but with dynamic tiers a race-free way to get a 
new memory tier would be: echo rank > memtier/create_tier_from_rank. 
We could also say that memtier0/1/2 are kernel-defined memory tiers, 
and that writing to memtier/create_tier_from_rank creates new memory 
tiers above memtier2 with the rank value specified?

-aneesh




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order
  2022-06-06  5:26             ` Ying Huang
  2022-06-06  6:21               ` Aneesh Kumar K.V
@ 2022-06-06 17:07               ` Yang Shi
  1 sibling, 0 replies; 66+ messages in thread
From: Yang Shi @ 2022-06-06 17:07 UTC (permalink / raw)
  To: Ying Huang
  Cc: Aneesh Kumar K V, Linux MM, Andrew Morton, Greg Thelen,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams, Feng Tang,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Sun, Jun 5, 2022 at 10:27 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Mon, 2022-06-06 at 09:37 +0530, Aneesh Kumar K V wrote:
> > On 6/6/22 6:13 AM, Ying Huang wrote:
> > > On Fri, 2022-06-03 at 20:39 +0530, Aneesh Kumar K V wrote:
> > > > On 6/2/22 1:05 PM, Ying Huang wrote:
> > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > >
> > > > > > currently, a higher tier node can only be demoted to selected
> > > > > > nodes on the next lower tier as defined by the demotion path,
> > > > > > not any other node from any lower tier.  This strict, hard-coded
> > > > > > demotion order does not work in all use cases (e.g. some use cases
> > > > > > may want to allow cross-socket demotion to another node in the same
> > > > > > demotion tier as a fallback when the preferred demotion node is out
> > > > > > of space). This demotion order is also inconsistent with the page
> > > > > > allocation fallback order when all the nodes in a higher tier are
> > > > > > out of space: The page allocation can fall back to any node from any
> > > > > > lower tier, whereas the demotion order doesn't allow that currently.
> > > > > >
> > > > > > This patch adds support to get all the allowed demotion targets mask
> > > > > > for node, also demote_page_list() function is modified to utilize this
> > > > > > allowed node mask by filling it in migration_target_control structure
> > > > > > before passing it to migrate_pages().
> > > > >
> > > >
> > > > ...
> > > >
> > > > > >     * Take pages on @demote_list and attempt to demote them to
> > > > > >     * another node.  Pages which are not demoted are left on
> > > > > > @@ -1481,6 +1464,19 @@ static unsigned int demote_page_list(struct list_head *demote_pages,
> > > > > >    {
> > > > > >       int target_nid = next_demotion_node(pgdat->node_id);
> > > > > >       unsigned int nr_succeeded;
> > > > > > +     nodemask_t allowed_mask;
> > > > > > +
> > > > > > +     struct migration_target_control mtc = {
> > > > > > +             /*
> > > > > > +              * Allocate from 'node', or fail quickly and quietly.
> > > > > > +              * When this happens, 'page' will likely just be discarded
> > > > > > +              * instead of migrated.
> > > > > > +              */
> > > > > > +             .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
> > > > > > +                     __GFP_NOMEMALLOC | GFP_NOWAIT,
> > > > > > +             .nid = target_nid,
> > > > > > +             .nmask = &allowed_mask
> > > > > > +     };
> > > > >
> > > > > IMHO, we should try to allocate from preferred node firstly (which will
> > > > > kick kswapd of the preferred node if necessary).  If failed, we will
> > > > > fallback to all allowed node.
> > > > >
> > > > > As we discussed as follows,
> > > > >
> > > > > https://lore.kernel.org/lkml/69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com/
> > > > >
> > > > > That is, something like below,
> > > > >
> > > > > static struct page *alloc_demote_page(struct page *page, unsigned long node)
> > > > > {
> > > > >         struct page *new_page;
> > > > >         nodemask_t allowed_mask;
> > > > >         struct migration_target_control mtc = {
> > > > >                 /*
> > > > >                  * Allocate from 'node', or fail quickly and quietly.
> > > > >                  * When this happens, 'page' will likely just be discarded
> > > > >                  * instead of migrated.
> > > > >                  */
> > > > >                 .gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
> > > > >                             __GFP_THISNODE  | __GFP_NOWARN |
> > > > >                             __GFP_NOMEMALLOC | GFP_NOWAIT,
> > > > >                 .nid = node
> > > > >         };
> > > > >
> > > > >         new_page = alloc_migration_target(page, (unsigned long)&mtc);
> > > > >         if (new_page)
> > > > >                 return new_page;
> > > > >
> > > > >         mtc.gfp_mask &= ~__GFP_THISNODE;
> > > > >         mtc.nmask = &allowed_mask;
> > > > >
> > > > >         return alloc_migration_target(page, (unsigned long)&mtc);
> > > > > }
> > > >
> > > > I skipped doing this in v5 because I was not sure this is really what we
> > > > want.
> > >
> > > I think so.  And this is the original behavior.  We should keep the
> > > original behavior as much as possible, then make changes if necessary.
> > >
> >
> > That is the reason I split the new page allocation as a separate patch.
> > Previous discussion on this topic didn't conclude on whether we really
> > need to do the above or not
> > https://lore.kernel.org/lkml/CAAPL-u9endrWf_aOnPENDPdvT-2-YhCAeJ7ONGckGnXErTLOfQ@mail.gmail.com/
>
> Please check the later email in the thread you referenced.  Both Wei and
> I agree that the use case needs to be supported.  We just didn't reach
> consensus about how to implement it.  If you think Wei's solution is
> better (referenced as below), you can try to do that too.  Although I
> think my proposed implementation is much simpler.
>
> "
> This is true with the current allocation code. But I think we can make
> some changes for demotion allocations. For example, we can add a
> GFP_DEMOTE flag and update the allocation function to wake up kswapd
> when this flag is set and we need to fall back to another node.
> "

Sorry for chiming in late. Yes, I also agree that trying harder on the
preferred node before falling back is a valid use case. I think the
"trying with __GFP_THISNODE then retrying w/o it" approach should be
good enough for now, since demotion is the only user and vmscan is the
only callsite. It won't be too late to modify the core page allocation
function to support this semantic when it gets broader use.

>
> > Based on the above I looked at avoiding GFP_THISNODE allocation. If you
> > have experiment results that suggest otherwise can you share? I could
> > summarize that in the commit message for better description of why
> > GFP_THISNODE enforcing is needed.
>
> Why?  GFP_THISNODE is just the first step.  We will fallback to
> allocation without it if necessary.
>
> Best Regards,
> Huang, Ying
>
> > > > I guess we can do this as part of the change that is going to
> > > > introduce the usage of memory policy for the allocation?
> > >
> > > Like the memory allocation policy, the default policy should be local
> > > preferred.  We shouldn't force users to use explicit memory policy for
> > > that.
> > >
> > > And the added code isn't complex.
> > >
> >
> > -aneesh
>
>
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-06-06 16:39               ` Aneesh Kumar K V
@ 2022-06-06 17:46                 ` Aneesh Kumar K.V
  0 siblings, 0 replies; 66+ messages in thread
From: Aneesh Kumar K.V @ 2022-06-06 17:46 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-mm, akpm, Huang Ying, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Jagdish Gediya,
	Baolin Wang, David Rientjes

Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 6/6/22 9:46 PM, Jonathan Cameron wrote:
>> On Mon, 6 Jun 2022 21:31:16 +0530
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
>> 
>>> On 6/6/22 8:29 PM, Jonathan Cameron wrote:
>>>> On Fri, 3 Jun 2022 14:10:47 +0530
>>>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote:
>>>>    
>>>>> On 5/27/22 7:45 PM, Jonathan Cameron wrote:
>>>>>> On Fri, 27 May 2022 17:55:23 +0530
>>>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:
>>>>>>       
>>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>>
>>>>>>> Add support to read/write the memory tierindex for a NUMA node.
>>>>>>>
>>>>>>> /sys/devices/system/node/nodeN/memtier
>>>>>>>
>>>>>>> where N = node id
>>>>>>>
>>>>>>> When read, It list the memory tier that the node belongs to.
>>>>>>>
>>>>>>> When written, the kernel moves the node into the specified
>>>>>>> memory tier, the tier assignment of all other nodes are not
>>>>>>> affected.
>>>>>>>
>>>>>>> If the memory tier does not exist, writing to the above file
>>>>>>> create the tier and assign the NUMA node to that tier.
>>>>>> creates
>>>>>>
>>>>>> There was some discussion in v2 of Wei Xu's RFC that what matters
>>>>>> for creation is the rank, not the tier number.
>>>>>>
>>>>>> My suggestion is to move to an explicit creation file such as
>>>>>> memtier/create_tier_from_rank
>>>>>> to which writing the rank results in a new tier
>>>>>> with the next device ID and the requested rank.
>>>>>
>>>>> I think the below workflow is much simpler.
>>>>>
>>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>>> 1-3
>>>>> :/sys/devices/system# cat node/node1/memtier
>>>>> 1
>>>>> :/sys/devices/system# ls memtier/memtier*
>>>>> nodelist  power  rank  subsystem  uevent
>>>>> /sys/devices/system# ls memtier/
>>>>> default_rank  max_tier  memtier1  power  uevent
>>>>> :/sys/devices/system# echo 2 > node/node1/memtier
>>>>> :/sys/devices/system#
>>>>>
>>>>> :/sys/devices/system# ls memtier/
>>>>> default_rank  max_tier  memtier1  memtier2  power  uevent
>>>>> :/sys/devices/system# cat memtier/memtier1/nodelist
>>>>> 2-3
>>>>> :/sys/devices/system# cat memtier/memtier2/nodelist
>>>>> 1
>>>>> :/sys/devices/system#
>>>>>
>>>>> i.e., to create a tier we just write the tier id/tier index to
>>>>> node/nodeN/memtier file. That will create a new memory tier if needed
>>>>> and add the node to that specific memory tier. Since for now we are
>>>>> having 1:1 mapping between tier index to rank value, we can derive the
>>>>> rank value from the memory tier index.
>>>>>
>>>>> For dynamic memory tier support, we can assign a rank value such that
>>>>> new memory tiers are always created so that they come last in the
>>>>> demotion order.
>>>>
>>>> I'm not keen on having to pass through an intermediate state where
>>>> the rank may well be wrong, but I guess it's not that harmful even
>>>> if it feels wrong ;)
>>>>    
>>>
>>> Any new memory tier added can be of the lowest rank (rank 0) and hence
>>> will appear as the highest memory tier in demotion order.
>> 
>> Depends on driver interaction - if new memory is CXL attached or
>> GPU attached, chances are the driver has an input on which tier
>> it is put in by default.
>> 
>>> User can then
>>> assign the right rank value to the memory tier? Also the actual demotion
>>> target paths are built during memory block online, which in most cases
>>> would happen after we properly verify that the device got assigned to
>>> the right memory tier with correct rank value?
>> 
>> Agreed, though that may change the model of how memory is brought online
>> somewhat.
>> 
>>>
>>>> Races are potentially a bit of a pain though depending on what we
>>>> expect the usage model to be.
>>>>
>>>> There are patterns (CXL regions for example) of guaranteeing the
>>>> 'right' device is created by doing something like
>>>>
>>>> cat create_tier > temp.txt
>>>> #(temp gets 2 for example on first call then
>>>> # next read of this file gets 3 etc)
>>>>
>>>> cat temp.txt > create_tier
>>>> # will fail if there hasn't been a read of the same value
>>>>
>>>> Assuming all software keeps to the model, then there are no
>>>> race conditions over creation.  Otherwise we have two new
>>>> devices turn up very close to each other and userspace scripting
>>>> tries to create two new tiers - if it races they may end up in
>>>> the same tier when that wasn't the intent.  Then code to set
>>>> the rank also races and we get two potentially very different
>>>> memories in a tier with a randomly selected rank.
>>>>
>>>> Fun and games...  And a fine illustration why sysfs based 'device'
>>>> creation is tricky to get right (and lots of cases in the kernel
>>>> don't).
>>>>    
>>>
>>> I would expect userspace to be careful and verify the memory tier and
>>> rank value before we online the memory blocks backed by the device. Even
>>> if we race, the result would be two devices not intended to be part of
>>> the same memory tier appearing in the same tier. But then we won't be
>>> building demotion targets yet. So userspace could verify this and move the
>>> nodes out of the memory tier. Once it is verified, memory blocks can be
>>> onlined.
>> 
>> The race is there and not avoidable as far as I can see. Two processes A and B.
>> 
>> A checks for a spare tier number
>> B checks for a spare tier number
>> A tries to assign node 3 to new tier 2 (new tier created)
>> B tries to assign node 4 to new tier 2 (accidentally hits existing tier - as this
>> is the same method we'd use to put it in the existing tier we can't tell this
>> write was meant to create a new tier).
>> A writes rank 100 to tier 2
>> A checks rank for tier 2 and finds it is 100 as expected.
>> B write rank 200 to tier 2 (it could check if still default but even that is racy)
>> B checks rank for tier 2 rank and finds it is 200 as expected.
>> A onlines memory.
>> B onlines memory.
>> 
>> Both think they got what they wanted, but A definitely didn't.
>> 
>> One work around is the read / write approach and create_tier.
>> 
>> A reads create_tier - gets 2.
>> B reads create_tier - gets 3.
>> A writes 2 to create_tier as that's what it read.
>> B writes 3 to create_tier as that's what it read.
>> 
>> continue with created tiers.  Obviously can exhaust tiers, but if this is
>> root only, could just create lots anyway so no worse off.
>>   
>>>
>>> Having said that can you outline the usage of
>>> memtier/create_tier_from_rank ?
>> 
>> There are corner cases to deal with...
>> 
>> A writes 100 to create_tier_from_rank.
>> A goes looking for matching tier - finds it: tier2
>> B writes 200 to create_tier_from_rank
>> B goes looking for matching tier - finds it: tier3
>> 
>> rest is fine as operating on different tiers.
>> 
>> Trickier is
>> A writes 100 to create_tier_from_rank  - succeed.
>> B writes 100 to create_tier_from_rank  - Could fail, or could just eat it?
>> 
>> Logically this is same as separate create_tier and then a write
>> of rank, but in one operation, but then you need to search
>> for the right one.  As such, perhaps a create_tier
>> that does the read/write pair as above is the best solution.
>> 
>
> This all is good when we allow dynamic rank values. But currently we are 
> restricting ourselves to three rank values as below:
>
> rank   memtier
> 300    memtier0
> 200    memtier1
> 100    memtier2
>
> Now with the above, how do we define a write to create_tier_from_rank?
> What should be the behavior if the user writes a value other than the 
> above defined rank values? Also, enforcing the above three rank values as 
> supported implies teaching userspace about them. I am trying to see how
> to fit create_tier_from_rank without requiring the above.
>
> Can we look at implementing create_tier_from_rank when we start 
> supporting dynamic tiers/rank values? i.e.,
>
> we still allow node/nodeN/memtier. But with dynamic tiers a race-free
> way to get a new memory tier would be echo rank > 
> memtier/create_tier_from_rank. We could also say memtier0/1/2 are 
> kernel-defined memory tiers. Writing to memtier/create_tier_from_rank 
> will create new memory tiers above memtier2 with the rank value specified?
>

To keep it compatible we could do this, i.e., we just allow creation of
one additional memory tier (memtier3) via the above interface.


:/sys/devices/system/memtier# ls -al
total 0
drwxr-xr-x  4 root root    0 Jun  6 17:39 .
drwxr-xr-x 10 root root    0 Jun  6 17:39 ..
--w-------  1 root root 4096 Jun  6 17:40 create_tier_from_rank
-r--r--r--  1 root root 4096 Jun  6 17:40 default_tier
-r--r--r--  1 root root 4096 Jun  6 17:40 max_tier
drwxr-xr-x  3 root root    0 Jun  6 17:39 memtier1
drwxr-xr-x  2 root root    0 Jun  6 17:40 power
-rw-r--r--  1 root root 4096 Jun  6 17:39 uevent
:/sys/devices/system/memtier# echo 20 > create_tier_from_rank 
:/sys/devices/system/memtier# ls
create_tier_from_rank  default_tier  max_tier  memtier1  memtier3  power  uevent
:/sys/devices/system/memtier# cat memtier3/rank 
20
:/sys/devices/system/memtier# echo 20 > create_tier_from_rank 
bash: echo: write error: No space left on device
:/sys/devices/system/memtier# 

Is this good?

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 0468af60d427..a4150120ba24 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -13,7 +13,7 @@
 #define MEMORY_RANK_PMEM	100
 
 #define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
-#define MAX_MEMORY_TIERS  3
+#define MAX_MEMORY_TIERS  4
 
 extern bool numa_demotion_enabled;
 extern nodemask_t promotion_mask;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index c6eb223a219f..7fdee0c4c4ea 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -169,7 +169,8 @@ static void insert_memory_tier(struct memory_tier *memtier)
 	list_add_tail(&memtier->list, &memory_tiers);
 }
 
-static struct memory_tier *register_memory_tier(unsigned int tier)
+static struct memory_tier *register_memory_tier(unsigned int tier,
+						unsigned int rank)
 {
 	int error;
 	struct memory_tier *memtier;
@@ -182,7 +183,7 @@ static struct memory_tier *register_memory_tier(unsigned int tier)
 		return NULL;
 
 	memtier->dev.id = tier;
-	memtier->rank = get_rank_from_tier(tier);
+	memtier->rank = rank;
 	memtier->dev.bus = &memory_tier_subsys;
 	memtier->dev.release = memory_tier_device_release;
 	memtier->dev.groups = memory_tier_dev_groups;
@@ -218,9 +219,53 @@ default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
 }
 static DEVICE_ATTR_RO(default_tier);
 
+
+static struct memory_tier *__get_memory_tier_from_id(int id);
+static ssize_t create_tier_from_rank_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	int ret, rank;
+	struct memory_tier *memtier;
+
+	ret = kstrtouint(buf, 10, &rank);
+	if (ret)
+		return ret;
+
+	if (rank == MEMORY_RANK_HBM_GPU ||
+	    rank == MEMORY_RANK_DRAM ||
+	    rank == MEMORY_RANK_PMEM)
+		return -EINVAL;
+
+	mutex_lock(&memory_tier_lock);
+	/*
+	 * For now we only support creation of one additional tier via
+	 * this interface.
+	 */
+	memtier = __get_memory_tier_from_id(3);
+	if (!memtier) {
+		memtier = register_memory_tier(3, rank);
+		if (!memtier) {
+			ret = -EINVAL;
+			goto out;
+		}
+	} else {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	ret = count;
+out:
+	mutex_unlock(&memory_tier_lock);
+	return ret;
+}
+static DEVICE_ATTR_WO(create_tier_from_rank);
+
+
 static struct attribute *memory_tier_attrs[] = {
 	&dev_attr_max_tier.attr,
 	&dev_attr_default_tier.attr,
+	&dev_attr_create_tier_from_rank.attr,
 	NULL
 };
 
@@ -302,7 +347,7 @@ static int __node_set_memory_tier(int node, int tier)
 
 	memtier = __get_memory_tier_from_id(tier);
 	if (!memtier) {
-		memtier = register_memory_tier(tier);
+		memtier = register_memory_tier(tier, get_rank_from_tier(tier));
 		if (!memtier) {
 			ret = -EINVAL;
 			goto out;
@@ -651,7 +696,8 @@ static int __init memory_tier_init(void)
 	 * Register only default memory tier to hide all empty
 	 * memory tier from sysfs.
 	 */
-	memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
+				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
 	if (!memtier)
 		panic("%s() failed to register memory tier: %d\n", __func__, ret);
 



* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-05-30 12:50       ` Jonathan Cameron
  2022-05-31  1:57         ` Ying Huang
@ 2022-06-07 19:25         ` Tim Chen
  2022-06-08  4:41           ` Aneesh Kumar K V
  1 sibling, 1 reply; 66+ messages in thread
From: Tim Chen @ 2022-06-07 19:25 UTC (permalink / raw)
  To: Jonathan Cameron, Ying Huang
  Cc: Wei Xu, Aneesh Kumar K V, Andrew Morton, Greg Thelen, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Linux MM,
	Jagdish Gediya, Baolin Wang, David Rientjes

On Mon, 2022-05-30 at 13:50 +0100, Jonathan Cameron wrote:
> 
> > When discussed offline, Tim Chen pointed out that with the proposed
> > interface, it's inconvenient to know the position of a given memory tier
> > in all memory tiers.  We must sort "rank" of all memory tiers to know
> > that.  "possible" file can be used for that.  Although "possible" file
> > can be generated with a shell script, it's more convenient to show it
> > directly.
> > 
> > Another way to address the issue is to add memtierN/pos for each memory
> > tier as suggested by Tim.  It's readonly and will show position of
> > "memtierN" in all memory tiers.  It's even better to show the relative
> > position to the default memory tier (DRAM with CPU). That is, the
> > position of DRAM memory tier is 0.
> > 
> > Unlike memory tier device ID or rank, the position is relative and
> > dynamic.
> 
> Hi,
> 
> I'm unconvinced.  This is better done with a shell script than
> by adding ABI we'll have to live with for ever..
> 
> I'm no good at shell scripting but this does the job 
> grep "" tier*/rank | sort -n -k 2 -t : 
> 
> tier2/rank:50
> tier0/rank:100
> tier1/rank:200
> tier3/rank:240
> 
> I'm sure someone more knowledgeable will do it in a simpler fashion still.
> 
> 

You can argue that 

$ cat /sys/devices/system/cpu/cpu1/topology/core_siblings
f
$ cat /sys/devices/system/cpu/cpu1/topology/core_siblings_list
0-3

provide exactly the same information and we should get rid of
core_siblings_list.  I think core_siblings_list exists to make
it easier for a human, so he/she doesn't have to parse the mask,
or write a script to find out the ids of CPUs who are siblings.

I think in the same spirit, having an interface that allows a
human to quickly see the hierarchical relationship of tiers
relative to each other is helpful.

Tim



* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-06  9:02                       ` Aneesh Kumar K V
@ 2022-06-08  1:24                         ` Ying Huang
  0 siblings, 0 replies; 66+ messages in thread
From: Ying Huang @ 2022-06-08  1:24 UTC (permalink / raw)
  To: Aneesh Kumar K V
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes, linux-mm,
	akpm

On Mon, 2022-06-06 at 14:32 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 2:22 PM, Ying Huang wrote:
> ....
> > > > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > > > explicit memory tiers"). With that, there will not be any regression in
> > > > > between the patch series.
> > > > > 
> > > > 
> > > > Thanks!  Please do that.  And I think you can add sysfs interface after
> > > > that patch too.  That is, in [1/7]
> > > > 
> > > 
> > > I am not sure why you insist on moving the sysfs interfaces later. They are
> > > introduced based on the helpers added. It makes patch review easier to
> > > look at both the helpers and the user of the helper together in a patch.
> > 
> > Yes.  We should introduce a function and its user in one patch for
> > review.  But this doesn't mean that we should introduce the user space
> > interface as the first step.  I think the user space interface should
> > output correct information when we expose it.
> > 
> 
> If you look at this patchset we are not exposing any wrong information.
> 
> patch 1 -> adds the ability to register the memory tiers and expose details 
> of registered memory tiers. At this point the patchset only supports the 
> DRAM tier and hence only one tier is shown

But inside the kernel, we actually work with 2 tiers and demote/promote
pages between them.  With the information from your interface, users
would think that there is no demotion/promotion in the kernel because
only 1 tier is shown.

> patch 2 -> adds a per-node memtier attribute. So only DRAM nodes show the 
> details, because the patchset has not yet introduced a slower memory 
> tier like PMEM.
> 
> patch 4 -> introducing demotion. Will make that patch 5
> 
> patch 5 -> adds dax kmem NUMA nodes as a slower memory tier. Now this 
> becomes patch 4, at which point we will correctly show two memory tiers 
> in the system.
> 
> 
> > > > +struct memory_tier {
> > > > +	nodemask_t nodelist;
> > > > +};
> > > > 
> > > > And struct device can be added after the kernel has switched the
> > > > implementation based on explicit memory tiers.
> > > > 
> > > > +struct memory_tier {
> > > > +	struct device dev;
> > > > +	nodemask_t nodelist;
> > > > +};
> > > > 
> > > 
> > > 
> > > Can you elaborate on this? or possibly review the v5 series indicating
> > > what change you are suggesting here?
> > > 
> > > 
> > > > But I don't think it's a good idea to have "struct device" embedded in
> > > > "struct memory_tier".  We don't have "struct device" embedded in "struct
> > > > pgdata_list"...
> > > > 
> > > 
> > > I avoided creating an array for memory_tier (memory_tier[]) so that we
> > > can keep it dynamic. Keeping dev embedded in struct memory_tier simplifies
> > > the life cycle management of that dynamic list. We free the struct
> > > memory_tier allocation via the device release function (memtier->dev.release
> > > = memory_tier_device_release)
> > > 
> > > Why do you think it is not a good idea?
> > 
> > I think that we shouldn't bind our kernel internal implementation with
> > user space interface too much.  Yes.  We can expose kernel internal
> > implementation to user space in a direct way.  I suggest you to follow
> > the style of "struct pglist_data" and "struct node".  If we decouple
> > "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> > we can refer to "struct memory_tier" without depending on all device
> > core.  Memory tier should be accessible inside the kernel even without a
> > user interface.  And memory tier isn't a device in concept.
> > 
> 
> memory_tiers are different from pglist_data and struct node in that we 
> also allow the creation of them from userspace.

I don't think that there's much difference.  struct pglist_data and
struct node can be created/destroyed dynamically too.  Please take a
look at

  __try_online_node()
  register_one_node()
  try_offline_node()
  unregister_one_node()

> That is, the lifetime of 
> a memory tier is driven from userspace, and it is much easier to manage 
> them via the sysfs file lifetime mechanism than to invent an 
> independent and more complex way of doing the same.

You need to manage the lifetime of struct memory_tier in the kernel too,
because there are kernel users.  And even if you use the device core
lifetime mechanism, you don't need to embed struct device in struct
memory_tier; you can free a separate struct memory_tier in the "release"
callback of struct device.

> > For life cycle management, I think that we can do that without sysfs
> > too.
> > 
> 
> unless there are specific details that you think will be broken by 
> embedding struct device inside struct memory_tier, IMHO I still consider 
> the embedded implementation much simpler and in accordance with other 
> kernel design patterns.

In concept, struct memory_tier isn't a device, although we expose it as
a device in sysfs; that's just an implementation detail.  So I think
it's better to make struct memory_tier independent of struct device if
possible.

By not embedding struct device in struct memory_tier, it's much easier
to dereference struct memory_tier directly in inline functions in ".h". 
We don't need to introduce one accessor function for each field of
struct memory_tier for that.

Best Regards,
Huang, Ying




* Re: RFC: Memory Tiering Kernel Interfaces (v3)
  2022-06-07 19:25         ` Tim Chen
@ 2022-06-08  4:41           ` Aneesh Kumar K V
  0 siblings, 0 replies; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  4:41 UTC (permalink / raw)
  To: Tim Chen, Jonathan Cameron, Ying Huang
  Cc: Wei Xu, Andrew Morton, Greg Thelen, Yang Shi, Davidlohr Bueso,
	Tim C Chen, Brice Goglin, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Alistair Popple, Dan Williams, Feng Tang, Linux MM,
	Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 12:55 AM, Tim Chen wrote:
> On Mon, 2022-05-30 at 13:50 +0100, Jonathan Cameron wrote:
>>
>>> When discussed offline, Tim Chen pointed out that with the proposed
>>> interface, it's inconvenient to know the position of a given memory tier
>>> in all memory tiers.  We must sort "rank" of all memory tiers to know
>>> that.  "possible" file can be used for that.  Although "possible" file
>>> can be generated with a shell script, it's more convenient to show it
>>> directly.
>>>
>>> Another way to address the issue is to add memtierN/pos for each memory
>>> tier as suggested by Tim.  It's readonly and will show position of
>>> "memtierN" in all memory tiers.  It's even better to show the relative
>>> position to the default memory tier (DRAM with CPU). That is, the
>>> position of DRAM memory tier is 0.
>>>
>>> Unlike memory tier device ID or rank, the position is relative and
>>> dynamic.
>>
>> Hi,
>>
>> I'm unconvinced.  This is better done with a shell script than
>> by adding ABI we'll have to live with for ever..
>>
>> I'm no good at shell scripting but this does the job
>> grep "" tier*/rank | sort -n -k 2 -t :
>>
>> tier2/rank:50
>> tier0/rank:100
>> tier1/rank:200
>> tier3/rank:240
>>
>> I'm sure someone more knowledgeable will do it in a simpler fashion still.
>>
>>
> 
> You can argue that
> 
> $ cat /sys/devices/system/cpu/cpu1/topology/core_siblings
> f
> $ cat /sys/devices/system/cpu/cpu1/topology/core_siblings_list
> 0-3
> 
> provide exactly the same information and we should get rid of
> core_siblings_list.  I think core_siblings_list exists to make
> it easier for a human, so he/she doesn't have to parse the mask,
> or write a script to find out the ids of CPUs who are siblings.
> 
> I think in the same spirit, having an interface that allows a
> human to quickly see the hierarchical relationship of tiers
> relative to each other is helpful.
> 

We can add that later if we find applications requiring it. I kind of 
have the feeling that we are adding too much based on possible ways 
memory tiers could be used in the future. For now we can look at doing 
the bare minimum to address the current constraints and drive more 
user-visible changes later based on application requirements.

-aneesh



* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-05-27 12:25   ` [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
  2022-06-02  6:07     ` Ying Huang
@ 2022-06-08  7:16     ` Ying Huang
  2022-06-08  8:24       ` Aneesh Kumar K V
  1 sibling, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-08  7:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:

[snip]

> 
> +static int __init memory_tier_init(void)
> +{
> +	int ret;
> +
> +	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> +	if (ret)
> +		panic("%s() failed to register subsystem: %d\n", __func__, ret);

I don't think we should go panic for failing to register subsys and
device for memory tiers.  Just pr_err() should be enough.

Best Regards,
Huang, Ying

> +
> +	/*
> +	 * Register only default memory tier to hide all empty
> +	 * memory tier from sysfs.
> +	 */
> +	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
> +	if (ret)
> +		panic("%s() failed to register memory tier: %d\n", __func__, ret);
> +
> +	/*
> +	 * CPU only nodes are not part of memory tiers.
> +	 */
> +	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);
> +
> +#endif	/* CONFIG_TIERED_MEMORY */




* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-05-27 12:25   ` [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
       [not found]     ` <20220527151531.00002a0c@Huawei.com>
@ 2022-06-08  7:18     ` Ying Huang
  2022-06-08  8:25       ` Aneesh Kumar K V
  1 sibling, 1 reply; 66+ messages in thread
From: Ying Huang @ 2022-06-08  7:18 UTC (permalink / raw)
  To: Aneesh Kumar K.V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 
> Add support to read/write the memory tier index for a NUMA node.
> 
> /sys/devices/system/node/nodeN/memtier
> 
> where N = node id
> 
> When read, it lists the memory tier that the node belongs to.
> 
> When written, the kernel moves the node into the specified
> memory tier, the tier assignment of all other nodes are not
> affected.
> 
> If the memory tier does not exist, writing to the above file
> creates the tier and assigns the NUMA node to that tier.
> 
> mutex memory_tier_lock is introduced to protect memory tier
> related changes as they can happen from sysfs as well as on hot
> plug events.
> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  drivers/base/node.c     |  35 ++++++++++++++
>  include/linux/migrate.h |   4 +-
>  mm/migrate.c            | 103 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 141 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ec8bb24a5a22..cf4a58446d8c 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -20,6 +20,7 @@
>  #include <linux/pm_runtime.h>
>  #include <linux/swap.h>
>  #include <linux/slab.h>
> +#include <linux/migrate.h>
>  
> 
> 
> 
>  static struct bus_type node_subsys = {
>  	.name = "node",
> @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
>  }
>  static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>  
> 
> 
> 
> +#ifdef CONFIG_TIERED_MEMORY
> +static ssize_t memtier_show(struct device *dev,
> +			    struct device_attribute *attr,
> +			    char *buf)
> +{
> +	int node = dev->id;
> +
> +	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> +}
> +
> +static ssize_t memtier_store(struct device *dev,
> +			     struct device_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	unsigned long tier;
> +	int node = dev->id;
> +
> +	int ret = kstrtoul(buf, 10, &tier);
> +	if (ret)
> +		return ret;
> +
> +	ret = node_reset_memory_tier(node, tier);
> +	if (ret)
> +		return ret;
> +
> +	return count;
> +}
> +
> +static DEVICE_ATTR_RW(memtier);
> +#endif
> +
>  static struct attribute *node_dev_attrs[] = {
>  	&dev_attr_meminfo.attr,
>  	&dev_attr_numastat.attr,
>  	&dev_attr_distance.attr,
>  	&dev_attr_vmstat.attr,
> +#ifdef CONFIG_TIERED_MEMORY
> +	&dev_attr_memtier.attr,
> +#endif
>  	NULL
>  };
>  
> 
> 
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 0ec653623565..d37d1d5dee82 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -177,13 +177,15 @@ enum memory_tier_type {
>  };
>  
> 
> 
> 
>  int next_demotion_node(int node);
> -
>  extern void migrate_on_reclaim_init(void);
>  #ifdef CONFIG_HOTPLUG_CPU
>  extern void set_migration_target_nodes(void);
>  #else
>  static inline void set_migration_target_nodes(void) {}
>  #endif
> +int node_get_memory_tier(int node);
> +int node_set_memory_tier(int node, int tier);
> +int node_reset_memory_tier(int node, int tier);
>  #else
>  #define numa_demotion_enabled	false
>  static inline int next_demotion_node(int node)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f28ee93fb017..304559ba3372 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
>  	.dev_name = "memtier",
>  };
>  
> 
> 
> 
> +DEFINE_MUTEX(memory_tier_lock);
>  static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
>  
> 
> 
> 
>  static ssize_t nodelist_show(struct device *dev,
> @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
>  	NULL,
>  };
>  
> 
> 
> 
> +static int __node_get_memory_tier(int node)
> +{
> +	int tier;
> +
> +	for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
> +		if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
> +			return tier;
> +	}
> +
> +	return -1;
> +}
> +
> +int node_get_memory_tier(int node)
> +{
> +	int tier;
> +
> +	/*
> +	 * Make sure memory tier is not unregistered
> +	 * while it is being read.
> +	 */
> +	mutex_lock(&memory_tier_lock);
> +
> +	tier = __node_get_memory_tier(node);
> +
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return tier;
> +}
> +
> +int __node_set_memory_tier(int node, int tier)
> +{
> +	int ret = 0;
> +	/*
> +	 * As register_memory_tier() for new tier can fail,
> +	 * try it before modifying existing tier. register
> +	 * tier makes tier visible in sysfs.
> +	 */
> +	if (!memory_tiers[tier]) {
> +		ret = register_memory_tier(tier);
> +		if (ret) {
> +			goto out;
> +		}
> +	}
> +
> +	node_set(node, memory_tiers[tier]->nodelist);
> +
> +out:
> +	return ret;
> +}
> +
> +int node_reset_memory_tier(int node, int tier)

I think "reset" isn't a good name here.  Maybe something like "change"
or "move"?

Best Regards,
Huang, Ying

> +{
> +	int current_tier, ret = 0;
> +
> +	mutex_lock(&memory_tier_lock);
> +
> +	current_tier = __node_get_memory_tier(node);
> +	if (current_tier == tier)
> +		goto out;
> +
> +	if (current_tier != -1 )
> +		node_clear(node, memory_tiers[current_tier]->nodelist);
> +
> +	ret = __node_set_memory_tier(node, tier);
> +
> +	if (!ret) {
> +		if (nodes_empty(memory_tiers[current_tier]->nodelist))
> +			unregister_memory_tier(current_tier);
> +	} else {
> +		/* reset it back to older tier */
> +		ret = __node_set_memory_tier(node, current_tier);
> +	}
> +out:
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return ret;
> +}
> +
> +int node_set_memory_tier(int node, int tier)
> +{
> +	int current_tier, ret = 0;
> +
> +	if (tier >= MAX_MEMORY_TIERS)
> +		return -EINVAL;
> +
> +	mutex_lock(&memory_tier_lock);
> +	current_tier = __node_get_memory_tier(node);
> +	/*
> +	 * if node is already part of the tier proceed with the
> +	 * current tier value, because we might want to establish
> +	 * new migration paths now. The node might be added to a tier
> +	 * before it was made part of N_MEMORY, hence establish_migration_targets
> +	 * will have skipped this node.
> +	 */
> +	if (current_tier != -1)
> +		tier = current_tier;
> +	ret = __node_set_memory_tier(node, tier);
> +	mutex_unlock(&memory_tier_lock);
> +
> +	return ret;
> +}
> +
>  /*
>   * node_demotion[] example:
>   *




* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-08  7:16     ` Ying Huang
@ 2022-06-08  8:24       ` Aneesh Kumar K V
  2022-06-08  8:27         ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  8:24 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 12:46 PM, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> 
> [snip]
> 
>>
>> +static int __init memory_tier_init(void)
>> +{
>> +	int ret;
>> +
>> +	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
>> +	if (ret)
>> +		panic("%s() failed to register subsystem: %d\n", __func__, ret);
> 
> I don't think we should go panic for failing to register subsys and
> device for memory tiers.  Just pr_err() should be enough.
> 

So you are suggesting we continue to work with memory tiers with no 
userspace interface?

>> +
>> +	/*
>> +	 * Register only the default memory tier to hide all empty
>> +	 * memory tiers from sysfs.
>> +	 */
>> +	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
>> +	if (ret)
>> +		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>> +
>> +	/*
>> +	 * CPU-only nodes are not part of memory tiers.
>> +	 */
>> +	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
>> +
>> +	return 0;
>> +}
>> +subsys_initcall(memory_tier_init);
>> +
>> +#endif	/* CONFIG_TIERED_MEMORY */
> 
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-06-08  7:18     ` Ying Huang
@ 2022-06-08  8:25       ` Aneesh Kumar K V
  2022-06-08  8:29         ` Ying Huang
  0 siblings, 1 reply; 66+ messages in thread
From: Aneesh Kumar K V @ 2022-06-08  8:25 UTC (permalink / raw)
  To: Ying Huang, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On 6/8/22 12:48 PM, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>
>> Add support to read/write the memory tier index for a NUMA node.
>>
>> /sys/devices/system/node/nodeN/memtier
>>
>> where N = node id
>>
>> When read, it lists the memory tier that the node belongs to.
>>
>> When written, the kernel moves the node into the specified
>> memory tier; the tier assignment of all other nodes is not
>> affected.
>>
>> If the memory tier does not exist, writing to the above file
>> creates the tier and assigns the NUMA node to that tier.
>>
>> A mutex, memory_tier_lock, is introduced to protect memory
>> tier related changes, as they can happen from sysfs as well
>> as on hotplug events.
>>
>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   drivers/base/node.c     |  35 ++++++++++++++
>>   include/linux/migrate.h |   4 +-
>>   mm/migrate.c            | 103 ++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 141 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index ec8bb24a5a22..cf4a58446d8c 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/pm_runtime.h>
>>   #include <linux/swap.h>
>>   #include <linux/slab.h>
>> +#include <linux/migrate.h>
>>
>>   static struct bus_type node_subsys = {
>>   	.name = "node",
>> @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
>>   }
>>   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
>>
>> +#ifdef CONFIG_TIERED_MEMORY
>> +static ssize_t memtier_show(struct device *dev,
>> +			    struct device_attribute *attr,
>> +			    char *buf)
>> +{
>> +	int node = dev->id;
>> +
>> +	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
>> +}
>> +
>> +static ssize_t memtier_store(struct device *dev,
>> +			     struct device_attribute *attr,
>> +			     const char *buf, size_t count)
>> +{
>> +	unsigned long tier;
>> +	int node = dev->id;
>> +
>> +	int ret = kstrtoul(buf, 10, &tier);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = node_reset_memory_tier(node, tier);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return count;
>> +}
>> +
>> +static DEVICE_ATTR_RW(memtier);
>> +#endif
>> +
>>   static struct attribute *node_dev_attrs[] = {
>>   	&dev_attr_meminfo.attr,
>>   	&dev_attr_numastat.attr,
>>   	&dev_attr_distance.attr,
>>   	&dev_attr_vmstat.attr,
>> +#ifdef CONFIG_TIERED_MEMORY
>> +	&dev_attr_memtier.attr,
>> +#endif
>>   	NULL
>>   };
>>
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index 0ec653623565..d37d1d5dee82 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -177,13 +177,15 @@ enum memory_tier_type {
>>   };
>>
>>   int next_demotion_node(int node);
>> -
>>   extern void migrate_on_reclaim_init(void);
>>   #ifdef CONFIG_HOTPLUG_CPU
>>   extern void set_migration_target_nodes(void);
>>   #else
>>   static inline void set_migration_target_nodes(void) {}
>>   #endif
>> +int node_get_memory_tier(int node);
>> +int node_set_memory_tier(int node, int tier);
>> +int node_reset_memory_tier(int node, int tier);
>>   #else
>>   #define numa_demotion_enabled	false
>>   static inline int next_demotion_node(int node)
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index f28ee93fb017..304559ba3372 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
>>   	.dev_name = "memtier",
>>   };
>>
>> +DEFINE_MUTEX(memory_tier_lock);
>>   static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
>>
>>   static ssize_t nodelist_show(struct device *dev,
>> @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
>>   	NULL,
>>   };
>>
>> +static int __node_get_memory_tier(int node)
>> +{
>> +	int tier;
>> +
>> +	for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
>> +		if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
>> +			return tier;
>> +	}
>> +
>> +	return -1;
>> +}
>> +
>> +int node_get_memory_tier(int node)
>> +{
>> +	int tier;
>> +
>> +	/*
>> +	 * Make sure memory tier is not unregistered
>> +	 * while it is being read.
>> +	 */
>> +	mutex_lock(&memory_tier_lock);
>> +
>> +	tier = __node_get_memory_tier(node);
>> +
>> +	mutex_unlock(&memory_tier_lock);
>> +
>> +	return tier;
>> +}
>> +
>> +int __node_set_memory_tier(int node, int tier)
>> +{
>> +	int ret = 0;
>> +	/*
>> +	 * As register_memory_tier() for a new tier can fail,
>> +	 * try it before modifying the existing tier. Registering
>> +	 * a tier makes it visible in sysfs.
>> +	 */
>> +	if (!memory_tiers[tier]) {
>> +		ret = register_memory_tier(tier);
>> +		if (ret)
>> +			goto out;
>> +	}
>> +
>> +	node_set(node, memory_tiers[tier]->nodelist);
>> +
>> +out:
>> +	return ret;
>> +}
>> +
>> +int node_reset_memory_tier(int node, int tier)
> 
> I think "reset" isn't a good name here.  Maybe something like "change"
> or "move"?
> 

how about node_update_memory_tier()?

-aneesh

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers
  2022-06-08  8:24       ` Aneesh Kumar K V
@ 2022-06-08  8:27         ` Ying Huang
  0 siblings, 0 replies; 66+ messages in thread
From: Ying Huang @ 2022-06-08  8:27 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Wed, 2022-06-08 at 13:54 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:46 PM, Ying Huang wrote:
> > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > 
> > [snip]
> > 
> > > 
> > > +static int __init memory_tier_init(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > > +	if (ret)
> > > +		panic("%s() failed to register subsystem: %d\n", __func__, ret);
> > 
> > I don't think we should go panic for failing to register subsys and
> > device for memory tiers.  Just pr_err() should be enough.
> > 
> 
> So you are suggesting we continue to work with memory tiers with no 
> userspace interface?

Yes.  We don't need to panic system for this.

Best Regards,
Huang, Ying

> > > +
> > > +	/*
> > > +	 * Register only the default memory tier to hide all empty
> > > +	 * memory tiers from sysfs.
> > > +	 */
> > > +	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
> > > +	if (ret)
> > > +		panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > > +
> > > +	/*
> > > +	 * CPU-only nodes are not part of memory tiers.
> > > +	 */
> > > +	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
> > > +
> > > +	return 0;
> > > +}
> > > +subsys_initcall(memory_tier_init);
> > > +
> > > +#endif	/* CONFIG_TIERED_MEMORY */
> > 
> > 
> 



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs
  2022-06-08  8:25       ` Aneesh Kumar K V
@ 2022-06-08  8:29         ` Ying Huang
  0 siblings, 0 replies; 66+ messages in thread
From: Ying Huang @ 2022-06-08  8:29 UTC (permalink / raw)
  To: Aneesh Kumar K V, linux-mm, akpm
  Cc: Greg Thelen, Yang Shi, Davidlohr Bueso, Tim C Chen, Brice Goglin,
	Michal Hocko, Linux Kernel Mailing List, Hesham Almatary,
	Dave Hansen, Jonathan Cameron, Alistair Popple, Dan Williams,
	Feng Tang, Jagdish Gediya, Baolin Wang, David Rientjes

On Wed, 2022-06-08 at 13:55 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:48 PM, Ying Huang wrote:
> > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > 
> > > Add support to read/write the memory tier index for a NUMA node.
> > > 
> > > /sys/devices/system/node/nodeN/memtier
> > > 
> > > where N = node id
> > > 
> > > When read, it lists the memory tier that the node belongs to.
> > >
> > > When written, the kernel moves the node into the specified
> > > memory tier; the tier assignment of all other nodes is not
> > > affected.
> > >
> > > If the memory tier does not exist, writing to the above file
> > > creates the tier and assigns the NUMA node to that tier.
> > >
> > > A mutex, memory_tier_lock, is introduced to protect memory
> > > tier related changes, as they can happen from sysfs as well
> > > as on hotplug events.
> > > 
> > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > ---
> > >   drivers/base/node.c     |  35 ++++++++++++++
> > >   include/linux/migrate.h |   4 +-
> > >   mm/migrate.c            | 103 ++++++++++++++++++++++++++++++++++++++++
> > >   3 files changed, 141 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > > index ec8bb24a5a22..cf4a58446d8c 100644
> > > --- a/drivers/base/node.c
> > > +++ b/drivers/base/node.c
> > > @@ -20,6 +20,7 @@
> > >   #include <linux/pm_runtime.h>
> > >   #include <linux/swap.h>
> > >   #include <linux/slab.h>
> > > +#include <linux/migrate.h>
> > >
> > >   static struct bus_type node_subsys = {
> > >   	.name = "node",
> > > @@ -560,11 +561,45 @@ static ssize_t node_read_distance(struct device *dev,
> > >   }
> > >   static DEVICE_ATTR(distance, 0444, node_read_distance, NULL);
> > >
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +static ssize_t memtier_show(struct device *dev,
> > > +			    struct device_attribute *attr,
> > > +			    char *buf)
> > > +{
> > > +	int node = dev->id;
> > > +
> > > +	return sysfs_emit(buf, "%d\n", node_get_memory_tier(node));
> > > +}
> > > +
> > > +static ssize_t memtier_store(struct device *dev,
> > > +			     struct device_attribute *attr,
> > > +			     const char *buf, size_t count)
> > > +{
> > > +	unsigned long tier;
> > > +	int node = dev->id;
> > > +
> > > +	int ret = kstrtoul(buf, 10, &tier);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	ret = node_reset_memory_tier(node, tier);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +static DEVICE_ATTR_RW(memtier);
> > > +#endif
> > > +
> > >   static struct attribute *node_dev_attrs[] = {
> > >   	&dev_attr_meminfo.attr,
> > >   	&dev_attr_numastat.attr,
> > >   	&dev_attr_distance.attr,
> > >   	&dev_attr_vmstat.attr,
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +	&dev_attr_memtier.attr,
> > > +#endif
> > >   	NULL
> > >   };
> > >
> > > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > > index 0ec653623565..d37d1d5dee82 100644
> > > --- a/include/linux/migrate.h
> > > +++ b/include/linux/migrate.h
> > > @@ -177,13 +177,15 @@ enum memory_tier_type {
> > >   };
> > >
> > >   int next_demotion_node(int node);
> > > -
> > >   extern void migrate_on_reclaim_init(void);
> > >   #ifdef CONFIG_HOTPLUG_CPU
> > >   extern void set_migration_target_nodes(void);
> > >   #else
> > >   static inline void set_migration_target_nodes(void) {}
> > >   #endif
> > > +int node_get_memory_tier(int node);
> > > +int node_set_memory_tier(int node, int tier);
> > > +int node_reset_memory_tier(int node, int tier);
> > >   #else
> > >   #define numa_demotion_enabled	false
> > >   static inline int next_demotion_node(int node)
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index f28ee93fb017..304559ba3372 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -2132,6 +2132,7 @@ static struct bus_type memory_tier_subsys = {
> > >   	.dev_name = "memtier",
> > >   };
> > >
> > > +DEFINE_MUTEX(memory_tier_lock);
> > >   static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
> > >
> > >   static ssize_t nodelist_show(struct device *dev,
> > > @@ -2225,6 +2226,108 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
> > >   	NULL,
> > >   };
> > >
> > > +static int __node_get_memory_tier(int node)
> > > +{
> > > +	int tier;
> > > +
> > > +	for (tier = 0; tier < MAX_MEMORY_TIERS; tier++) {
> > > +		if (memory_tiers[tier] && node_isset(node, memory_tiers[tier]->nodelist))
> > > +			return tier;
> > > +	}
> > > +
> > > +	return -1;
> > > +}
> > > +
> > > +int node_get_memory_tier(int node)
> > > +{
> > > +	int tier;
> > > +
> > > +	/*
> > > +	 * Make sure memory tier is not unregistered
> > > +	 * while it is being read.
> > > +	 */
> > > +	mutex_lock(&memory_tier_lock);
> > > +
> > > +	tier = __node_get_memory_tier(node);
> > > +
> > > +	mutex_unlock(&memory_tier_lock);
> > > +
> > > +	return tier;
> > > +}
> > > +
> > > +int __node_set_memory_tier(int node, int tier)
> > > +{
> > > +	int ret = 0;
> > > +	/*
> > > +	 * As register_memory_tier() for a new tier can fail,
> > > +	 * try it before modifying the existing tier. Registering
> > > +	 * a tier makes it visible in sysfs.
> > > +	 */
> > > +	if (!memory_tiers[tier]) {
> > > +		ret = register_memory_tier(tier);
> > > +		if (ret)
> > > +			goto out;
> > > +	}
> > > +
> > > +	node_set(node, memory_tiers[tier]->nodelist);
> > > +
> > > +out:
> > > +	return ret;
> > > +}
> > > +
> > > +int node_reset_memory_tier(int node, int tier)
> > 
> > I think "reset" isn't a good name here.  Maybe something like "change"
> > or "move"?
> > 
> 
> how about node_update_memory_tier()?

That sounds OK to me.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2022-06-08  9:11 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-26 21:22 RFC: Memory Tiering Kernel Interfaces (v3) Wei Xu
2022-05-27  2:58 ` Ying Huang
2022-05-27 14:05   ` Hesham Almatary
2022-05-27 16:25     ` Wei Xu
2022-05-27 12:25 ` [RFC PATCH v4 0/7] mm/demotion: Memory tiers and demotion Aneesh Kumar K.V
2022-05-27 12:25   ` [RFC PATCH v4 1/7] mm/demotion: Add support for explicit memory tiers Aneesh Kumar K.V
2022-06-02  6:07     ` Ying Huang
2022-06-06  2:49       ` Ying Huang
2022-06-06  3:56         ` Aneesh Kumar K V
2022-06-06  5:33           ` Ying Huang
2022-06-06  6:01             ` Aneesh Kumar K V
2022-06-06  6:27               ` Aneesh Kumar K.V
2022-06-06  7:53                 ` Ying Huang
2022-06-06  8:01                   ` Aneesh Kumar K V
2022-06-06  8:52                     ` Ying Huang
2022-06-06  9:02                       ` Aneesh Kumar K V
2022-06-08  1:24                         ` Ying Huang
2022-06-08  7:16     ` Ying Huang
2022-06-08  8:24       ` Aneesh Kumar K V
2022-06-08  8:27         ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 2/7] mm/demotion: Expose per node memory tier to sysfs Aneesh Kumar K.V
     [not found]     ` <20220527151531.00002a0c@Huawei.com>
2022-06-03  8:40       ` Aneesh Kumar K V
2022-06-06 14:59         ` Jonathan Cameron
2022-06-06 16:01           ` Aneesh Kumar K V
2022-06-06 16:16             ` Jonathan Cameron
2022-06-06 16:39               ` Aneesh Kumar K V
2022-06-06 17:46                 ` Aneesh Kumar K.V
2022-06-08  7:18     ` Ying Huang
2022-06-08  8:25       ` Aneesh Kumar K V
2022-06-08  8:29         ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 3/7] mm/demotion: Build demotion targets based on explicit memory tiers Aneesh Kumar K.V
2022-05-30  3:35     ` [mm/demotion] 8ebccd60c2: BUG:sleeping_function_called_from_invalid_context_at_mm/compaction.c kernel test robot
2022-05-27 12:25   ` [RFC PATCH v4 4/7] mm/demotion/dax/kmem: Set node's memory tier to MEMORY_TIER_PMEM Aneesh Kumar K.V
2022-06-01  6:29     ` Bharata B Rao
2022-06-01 13:49       ` Aneesh Kumar K V
2022-06-02  6:36         ` Bharata B Rao
2022-06-03  9:04           ` Aneesh Kumar K V
2022-06-06 10:11             ` Bharata B Rao
2022-06-06 10:16               ` Aneesh Kumar K V
2022-06-06 11:54                 ` Aneesh Kumar K.V
2022-06-06 12:09                   ` Bharata B Rao
2022-06-06 13:00                     ` Aneesh Kumar K V
2022-05-27 12:25   ` [RFC PATCH v4 5/7] mm/demotion: Add support to associate rank with memory tier Aneesh Kumar K.V
     [not found]     ` <20220527154557.00002c56@Huawei.com>
2022-05-27 15:45       ` Aneesh Kumar K V
2022-05-30 12:36         ` Jonathan Cameron
2022-06-02  6:41     ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 6/7] mm/demotion: Add support for removing node from demotion memory tiers Aneesh Kumar K.V
2022-06-02  6:43     ` Ying Huang
2022-05-27 12:25   ` [RFC PATCH v4 7/7] mm/demotion: Demote pages according to allocation fallback order Aneesh Kumar K.V
2022-06-02  7:35     ` Ying Huang
2022-06-03 15:09       ` Aneesh Kumar K V
2022-06-06  0:43         ` Ying Huang
2022-06-06  4:07           ` Aneesh Kumar K V
2022-06-06  5:26             ` Ying Huang
2022-06-06  6:21               ` Aneesh Kumar K.V
2022-06-06  7:42                 ` Ying Huang
2022-06-06  8:02                   ` Aneesh Kumar K V
2022-06-06  8:06                     ` Ying Huang
2022-06-06 17:07               ` Yang Shi
2022-05-27 13:40 ` RFC: Memory Tiering Kernel Interfaces (v3) Aneesh Kumar K V
2022-05-27 16:30   ` Wei Xu
2022-05-29  4:31     ` Ying Huang
2022-05-30 12:50       ` Jonathan Cameron
2022-05-31  1:57         ` Ying Huang
2022-06-07 19:25         ` Tim Chen
2022-06-08  4:41           ` Aneesh Kumar K V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).