linux-fsdevel.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found] <CGME20230221014114epcas2p1687db1d75765a8f9ed0b3495eab1154d@epcas2p1.samsung.com>
@ 2023-02-21  1:41 ` Kyungsan Kim
  2023-02-27 23:14   ` Dan Williams
                     ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-02-21  1:41 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, linux-fsdevel, linux-cxl, a.manzanares, viacheslav.dubeyko

CXL is a promising technology that leads to fundamental changes in computing architecture.
To facilitate the adoption and widespread use of CXL memory, we are developing a memory tiering solution called SMDK[1][2].
Using SMDK and CXL RAM devices, our team has been working with industry and academic partners over the last year.
Also, thanks to many researchers' efforts, CXL adoption is gradually moving forward from basic enablement to real-world composite use cases.
At this point, based on the research and experience gained working on SMDK, we would like to suggest a session at LSF/MM/BPF this year
to propose possible Linux MM changes along with a brief overview of SMDK.

Adam Manzanares kindly advised me that LSF/MM/BPF discussions preferably focus on implementation details for a given problem where consensus already exists.
Considering the adoption stage of CXL technology, however, let me suggest a design-level discussion of the two MM extensions of SMDK this year.
Once we reach design consensus with participants, we hope to continue with follow-up discussions covering additional implementation details.

 
1. A new zone, ZONE_EXMEM
We added ZONE_EXMEM to manage CXL RAM device(s) separately from ZONE_NORMAL, which serves conventional DRAM, for the three reasons below.

1) CXL RAM has many characteristics that differ from conventional DRAM, because a CXL device inherits and extends the PCIe specification.
e.g. frequency range, pluggability, link speed/width negotiation, host/device flow control, power throttling, channel-interleaving methodology, and error handling.
The primary use case of CXL RAM is likely to be system RAM.
However, to deal with these hardware differences properly, correspondingly different MM algorithms are needed.

2) Historically, zones have been added to reflect the evolution of CPU, IO, and memory devices.
e.g. ZONE_DMA(32), ZONE_HIGHMEM, ZONE_DEVICE, and ZONE_MOVABLE (see the sketch after this paragraph).
Each zone applies different MM policies, such as for page reclaim, compaction, migration, and fragmentation handling.
At first, we tried to reuse the existing zones ZONE_DEVICE and ZONE_MOVABLE for CXL RAM.
However, the purpose and implementation of those zones do not fit CXL RAM.
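
For orientation, here is a rough sketch of where the proposed zone could sit in enum zone_type; the existing entries follow include/linux/mmzone.h, while ZONE_EXMEM and its config symbol are only the proposal in this mail, not the actual SMDK patch.

enum zone_type {
#ifdef CONFIG_ZONE_DMA
        ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
        ZONE_DMA32,
#endif
        ZONE_NORMAL,            /* conventional DDR DRAM */
#ifdef CONFIG_HIGHMEM
        ZONE_HIGHMEM,
#endif
        ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
        ZONE_DEVICE,
#endif
#ifdef CONFIG_ZONE_EXMEM
        ZONE_EXMEM,             /* proposed: extended volatile memory such as CXL RAM */
#endif
        __MAX_NR_ZONES
};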

3) Industry is preparing CXL-capable systems that connect dozens of CXL devices in a single server.
When each CXL device becomes a separate node, an administrator/programmer needs to be aware of, and manually control, all of the nodes using third-party software such as numactl and libnuma (illustrated in the sketch after this paragraph).
ZONE_EXMEM allows CXL RAM devices to be assembled into the single ZONE_EXMEM zone, and provides an abstraction to userspace by managing the devices seamlessly.
The zone can also interleave the assembled devices in software to aggregate bandwidth.
We would like to discuss whether this can coexist with HW interleaving, analogous to SW/HW RAID 0.
For reference, please see the node partition part of the picture[3].
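
As a concrete illustration of the status quo rather than of SMDK itself, the snippet below binds an allocation to a CXL expander exposed as a memory-only NUMA node via libnuma (build with -lnuma); the node id 2 is only an assumption for this example, and it is exactly this hard-coded node awareness that the abstraction above is meant to remove.

#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        int cxl_node = 2;               /* assumed id of a CXL memory-only node */
        size_t len = 16UL << 20;
        void *p;

        if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not supported on this system\n");
                return 1;
        }
        if (cxl_node > numa_max_node()) {
                fprintf(stderr, "node %d is not present\n", cxl_node);
                return 1;
        }
        /* explicit, node-id-dependent placement on the CXL expander */
        p = numa_alloc_onnode(len, cxl_node);
        if (!p)
                return 1;
        memset(p, 0, len);              /* fault the pages in on that node */
        numa_free(p, len);
        return 0;
}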


2. User/Kernelspace Programmable Interface
A memory tiering solution typically attempts to place hot data on near memory and cold data on far memory as accurately as possible.[4][5][6][7]
We observed that the hotness/coldness of data is determined by the memory access pattern of the running application and/or kernel context.
Hence, a running context needs an identifier to distinguish near memory from far memory.
When CXL RAM is exposed as a NUMA node, a node id can more or less function as a CXL identifier.
However, a node id is limited in that it is ephemeral information that varies dynamically with the online status of the CXL topology and system sockets.
For this reason, we provide programmable interfaces for userspace and kernelspace contexts to explicitly (de)allocate memory from DRAM and CXL RAM regardless of such system changes.
Specifically, MAP_EXMEM and GFP_EXMEM flags were added to the mmap() syscall and the kmalloc() family, respectively.
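
To make the proposed userspace interface concrete, here is a minimal sketch; MAP_EXMEM is the flag proposed above and is not in mainline, so the numeric value below is purely a placeholder to keep the sketch self-contained, and on an SMDK kernel the real definition would come from the uapi headers.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_EXMEM
#define MAP_EXMEM 0x800000      /* hypothetical bit for the proposed flag */
#endif

int main(void)
{
        size_t len = 64UL << 20;        /* 64 MiB intended for CXL RAM (ZONE_EXMEM) */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXMEM, -1, 0);

        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 0, len);      /* touch the pages so they are really allocated */
        munmap(p, len);
        return 0;
}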

Thanks to Adam Manzanares for reviewing this CFP thoroughly.


[1]SMDK: https://github.com/openMPDK/SMDK
[2]SMT: Software-defined Memory Tiering for Heterogeneous Computing systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
[3]SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
[4]TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
[5]TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
[6]Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
[7]Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-02-21  1:41 ` [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL Kyungsan Kim
@ 2023-02-27 23:14   ` Dan Williams
       [not found]     ` <CGME20230228043551epcas2p3085444899b00b106c2901e1f51814d2c@epcas2p3.samsung.com>
  2023-03-03  6:07   ` Huang, Ying
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2023-02-27 23:14 UTC (permalink / raw)
  To: Kyungsan Kim, lsf-pc
  Cc: linux-mm, linux-fsdevel, linux-cxl, a.manzanares, viacheslav.dubeyko

Please be sure to log this in the submission spreadsheet as well. From the
CFP:

---

1) Fill out the following Google form to request attendance and
suggest any topics

        https://forms.gle/VKVXjWGBHZbnsz226

In previous years we have accidentally missed people's attendance
requests because they either didn't cc lsf-pc@ or we simply missed them
in the flurry of emails we get.  Our community is large and our
volunteers are busy; filling this out will help us make sure we don't
miss anybody.


Kyungsan Kim wrote:
> CXL is a promising technology that leads to fundamental changes in
> computing architecture.  To facilitate adoption and widespread of CXL
> memory, we are developing a memory tiering solution, called
> SMDK[1][2].  Using SMDK and CXL RAM device, our team has been working
> with industry and academic partners over last year.  Also, thanks to
> many researcher's effort, CXL adoption stage is gradually moving
> forward from basic enablement to real-world composite usecases.  At
> this moment, based on the researches and experiences gained working on
> SMDK, we would like to suggest a session at LSF/MM/BFP this year to
> propose possible Linux MM changes with a brief of SMDK.
> 
> Adam Manzanares kindly adviced me that it is preferred to discuss
> implementation details on given problem and consensus at LSF/MM/BFP.
> Considering the adoption stage of CXL technology, however, let me
> suggest a design level discussion on the two MM expansions of SMDK
> this year.  When we have design consensus with participants, we want
> to continue follow-up discussions with additional implementation
> details, hopefully.
> 
>  
> 1. A new zone, ZONE_EXMEM We added ZONE_EXMEM to manage CXL RAM
> device(s), separated from ZONE_NORMAL for usual DRAM due to the three
> reasons below.
> 
> 1) a CXL RAM has many different characteristics with conventional DRAM
> because a CXL device inherits and expands PCIe specification.  ex)
> frequency range, pluggability, link speed/width negotiation,
> host/device flow control, power throttling, channel-interleaving
> methodology, error handling, and etc.  It is likely that the primary
> usecase of CXL RAM would be System RAM.  However, to deal with the
> hardware differences properly, different MM algorithms are needed
> accordingly.
> 
> 2) Historically, zone has been expanded by reflecting the evolution of
> CPU, IO, and memory devices.  ex) ZONE_DMA(32), ZONE_HIGHMEM,
> ZONE_DEVICE, and ZONE_MOVABLE.  Each zone applies different MM
> algorithms such as page reclaim, compaction, migration, and
> fragmentation.  At first, we tried reuse of existing zones,
> ZONE_DEVICE and ZONE_MOVABLE, for CXL RAM purpose.  However, the
> purpose and implementation of the zones are not fit for CXL RAM.
> 
> 3) Industry is preparing a CXL-capable system that connects dozens of
> CXL devices in a server system.  When a CXL device becomes a separate
> node, an administrator/programmer needs to be aware of and manually
> control all nodes using 3rd party software, such as numactl and
> libnuma.  ZONE_EXMEM allows the assemble of CXL RAM devices into the
> single ZONE_EXMEM zone, and provides an abstraction to userspace by
> seamlessly managing the devices.  Also, the zone is able to interleave
> assembled devices in a software way to lead to aggregated bandwidth.
> We would like to suggest if it is co-existable with HW interleaving
> like SW/HW raid0.  To help understanding, please refer to the node
> partition part of the picture[3].
> 
> 
> 2. User/Kernelspace Programmable Interface In terms of a memory
> tiering solution, it is typical that the solution attempts to locate
> hot data on near memory, and cold data on far memory as accurately as
> possible.[4][5][6][7] We noticed that the hot/coldness of data is
> determined by the memory access pattern of running application and/or
> kernel context.  Hence, a running context needs a near/far memory
> identifier to determine near/far memory.  When CXL RAM(s) is
> manipulated as a NUMA node, a node id can be function as a CXL
> identifier more or less.  However, the node id has limitation in that
> it is an ephemeral information that dynamically varies according to
> online status of CXL topology and system socket.  In this sense, we
> provides programmable interfaces for userspace and kernelspace context
> to explicitly (de)allocate memory from DRAM and CXL RAM regardless of
> a system change.  Specifically, MAP_EXMEM and GFP_EXMEM flags were
> added to mmap() syscall and kmalloc() siblings, respectively.
> 
> Thanks to Adam Manzanares for reviewing this CFP thoroughly.
> 
> 
> [1]SMDK: https://github.com/openMPDK/SMDK
> [2]SMT: Software-defined Memory Tiering for Heterogeneous Computing systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
> [3]SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
> [4]TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
> [5]TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
> [6]Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
> [7]Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf



^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]     ` <CGME20230228043551epcas2p3085444899b00b106c2901e1f51814d2c@epcas2p3.samsung.com>
@ 2023-02-28  4:35       ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-02-28  4:35 UTC (permalink / raw)
  To: dan.j.williams
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko

Thank you, Dan, for the kind reminder about the submission.
I filled out the form with topic suggestions and required attendees.
Hopefully, we can elaborate on the topics with a wider range of opinions, revisiting the related previous kernel designs.


>Please be sure to log this in the submission spreadsheet as well. From the
>CFP:
>
>---
>
>1) Fill out the following Google form to request attendance and
>suggest any topics
>
>        https://forms.gle/VKVXjWGBHZbnsz226
>
>In previous years we have accidentally missed people's attendance
>requests because they either didn't cc lsf-pc@ or we simply missed them
>in the flurry of emails we get.  Our community is large and our
>volunteers are busy, filling this out will help us make sure we don't
>miss anybody.
>
>
>Kyungsan Kim wrote:
>> CXL is a promising technology that leads to fundamental changes in
>> computing architecture.  To facilitate adoption and widespread of CXL
>> memory, we are developing a memory tiering solution, called
>> SMDK[1][2].  Using SMDK and CXL RAM device, our team has been working
>> with industry and academic partners over last year.  Also, thanks to
>> many researcher's effort, CXL adoption stage is gradually moving
>> forward from basic enablement to real-world composite usecases.  At
>> this moment, based on the researches and experiences gained working on
>> SMDK, we would like to suggest a session at LSF/MM/BFP this year to
>> propose possible Linux MM changes with a brief of SMDK.
>>
>> Adam Manzanares kindly adviced me that it is preferred to discuss
>> implementation details on given problem and consensus at LSF/MM/BFP.
>> Considering the adoption stage of CXL technology, however, let me
>> suggest a design level discussion on the two MM expansions of SMDK
>> this year.  When we have design consensus with participants, we want
>> to continue follow-up discussions with additional implementation
>> details, hopefully.
>>
>> 
>> 1. A new zone, ZONE_EXMEM We added ZONE_EXMEM to manage CXL RAM
>> device(s), separated from ZONE_NORMAL for usual DRAM due to the three
>> reasons below.
>>
>> 1) a CXL RAM has many different characteristics with conventional DRAM
>> because a CXL device inherits and expands PCIe specification.  ex)
>> frequency range, pluggability, link speed/width negotiation,
>> host/device flow control, power throttling, channel-interleaving
>> methodology, error handling, and etc.  It is likely that the primary
>> usecase of CXL RAM would be System RAM.  However, to deal with the
>> hardware differences properly, different MM algorithms are needed
>> accordingly.
>>
>> 2) Historically, zone has been expanded by reflecting the evolution of
>> CPU, IO, and memory devices.  ex) ZONE_DMA(32), ZONE_HIGHMEM,
>> ZONE_DEVICE, and ZONE_MOVABLE.  Each zone applies different MM
>> algorithms such as page reclaim, compaction, migration, and
>> fragmentation.  At first, we tried reuse of existing zones,
>> ZONE_DEVICE and ZONE_MOVABLE, for CXL RAM purpose.  However, the
>> purpose and implementation of the zones are not fit for CXL RAM.
>>
>> 3) Industry is preparing a CXL-capable system that connects dozens of
>> CXL devices in a server system.  When a CXL device becomes a separate
>> node, an administrator/programmer needs to be aware of and manually
>> control all nodes using 3rd party software, such as numactl and
>> libnuma.  ZONE_EXMEM allows the assemble of CXL RAM devices into the
>> single ZONE_EXMEM zone, and provides an abstraction to userspace by
>> seamlessly managing the devices.  Also, the zone is able to interleave
>> assembled devices in a software way to lead to aggregated bandwidth.
>> We would like to suggest if it is co-existable with HW interleaving
>> like SW/HW raid0.  To help understanding, please refer to the node
>> partition part of the picture[3].
>>
>>
>> 2. User/Kernelspace Programmable Interface In terms of a memory
>> tiering solution, it is typical that the solution attempts to locate
>> hot data on near memory, and cold data on far memory as accurately as
>> possible.[4][5][6][7] We noticed that the hot/coldness of data is
>> determined by the memory access pattern of running application and/or
>> kernel context.  Hence, a running context needs a near/far memory
>> identifier to determine near/far memory.  When CXL RAM(s) is
>> manipulated as a NUMA node, a node id can be function as a CXL
>> identifier more or less.  However, the node id has limitation in that
>> it is an ephemeral information that dynamically varies according to
>> online status of CXL topology and system socket.  In this sense, we
>> provides programmable interfaces for userspace and kernelspace context
>> to explicitly (de)allocate memory from DRAM and CXL RAM regardless of
>> a system change.  Specifically, MAP_EXMEM and GFP_EXMEM flags were
>> added to mmap() syscall and kmalloc() siblings, respectively.
>>
>> Thanks to Adam Manzanares for reviewing this CFP thoroughly.
>>
>>
>> [1]SMDK: https://github.com/openMPDK/SMDK
>> [2]SMT: Software-defined Memory Tiering for Heterogeneous Computing systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
>> [3]SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
>> [4]TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
>> [5]TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
>> [6]Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
>> [7]Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-02-21  1:41 ` [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL Kyungsan Kim
  2023-02-27 23:14   ` Dan Williams
@ 2023-03-03  6:07   ` Huang, Ying
       [not found]     ` <CGME20230322043354epcas2p2227bcad190a470d635b92f92587dc69e@epcas2p2.samsung.com>
  2023-03-30 22:02   ` Dragan Stancevic
       [not found]   ` <CGME20230414084120epcas2p37f105901350410772a3115a5a490c215@epcas2p3.samsung.com>
  3 siblings, 1 reply; 66+ messages in thread
From: Huang, Ying @ 2023-03-03  6:07 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, Aneesh Kumar K.V, Jonathan Cameron,
	Johannes Weiner, Wei Xu, Yang Shi

Hi, Kyungsan,

Kyungsan Kim <ks0204.kim@samsung.com> writes:

> CXL is a promising technology that leads to fundamental changes in computing architecture.
> To facilitate adoption and widespread of CXL memory, we are developing a memory tiering solution, called SMDK[1][2].
> Using SMDK and CXL RAM device, our team has been working with industry and academic partners over last year.
> Also, thanks to many researcher's effort, CXL adoption stage is gradually moving forward from basic enablement to real-world composite usecases.
> At this moment, based on the researches and experiences gained working on SMDK, we would like to suggest a session at LSF/MM/BFP this year
> to propose possible Linux MM changes with a brief of SMDK.
>
> Adam Manzanares kindly adviced me that it is preferred to discuss implementation details on given problem and consensus at LSF/MM/BFP.
> Considering the adoption stage of CXL technology, however, let me suggest a design level discussion on the two MM expansions of SMDK this year.
> When we have design consensus with participants, we want to continue follow-up discussions with additional implementation details, hopefully.
>
>  
> 1. A new zone, ZONE_EXMEM
> We added ZONE_EXMEM to manage CXL RAM device(s), separated from ZONE_NORMAL for usual DRAM due to the three reasons below.
>
> 1) a CXL RAM has many different characteristics with conventional DRAM because a CXL device inherits and expands PCIe specification.
> ex) frequency range, pluggability, link speed/width negotiation, host/device flow control, power throttling, channel-interleaving methodology, error handling, and etc.
> It is likely that the primary usecase of CXL RAM would be System RAM.
> However, to deal with the hardware differences properly, different MM algorithms are needed accordingly.
>
> 2) Historically, zone has been expanded by reflecting the evolution of CPU, IO, and memory devices.
> ex) ZONE_DMA(32), ZONE_HIGHMEM, ZONE_DEVICE, and ZONE_MOVABLE.
> Each zone applies different MM algorithms such as page reclaim, compaction, migration, and fragmentation.
> At first, we tried reuse of existing zones, ZONE_DEVICE and ZONE_MOVABLE, for CXL RAM purpose.
> However, the purpose and implementation of the zones are not fit for CXL RAM.
>
> 3) Industry is preparing a CXL-capable system that connects dozens of CXL devices in a server system.
> When a CXL device becomes a separate node, an administrator/programmer needs to be aware of and manually control all nodes using 3rd party software, such as numactl and libnuma.
> ZONE_EXMEM allows the assemble of CXL RAM devices into the single ZONE_EXMEM zone, and provides an abstraction to userspace by seamlessly managing the devices.
> Also, the zone is able to interleave assembled devices in a software way to lead to aggregated bandwidth.
> We would like to suggest if it is co-existable with HW interleaving like SW/HW raid0.
> To help understanding, please refer to the node partition part of the picture[3].

In addition to CXL memory, we may have other kinds of memory in the
system, for example, HBM (High Bandwidth Memory), memory on FPGA cards,
memory on GPU cards, etc.  I guess that we need to consider them
together.  Do we need to add one zone type for each kind of memory?

>
> 2. User/Kernelspace Programmable Interface
> In terms of a memory tiering solution, it is typical that the solution attempts to locate hot data on near memory, and cold data on far memory as accurately as possible.[4][5][6][7]
> We noticed that the hot/coldness of data is determined by the memory access pattern of running application and/or kernel context.
> Hence, a running context needs a near/far memory identifier to determine near/far memory. 
> When CXL RAM(s) is manipulated as a NUMA node, a node id can be function as a CXL identifier more or less.
> However, the node id has limitation in that it is an ephemeral information that dynamically varies according to online status of CXL topology and system socket.
> In this sense, we provides programmable interfaces for userspace and kernelspace context to explicitly (de)allocate memory from DRAM and CXL RAM regardless of a system change.
> Specifically, MAP_EXMEM and GFP_EXMEM flags were added to mmap() syscall and kmalloc() siblings, respectively.

In addition to NUMA nodes, we have defined the following interfaces to
expose information about the different kinds of memory in the system.

https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html#abi-sys-devices-virtual-memory-tiering

Best Regards,
Huang, Ying

> Thanks to Adam Manzanares for reviewing this CFP thoroughly.
>
>
> [1]SMDK: https://github.com/openMPDK/SMDK
> [2]SMT: Software-defined Memory Tiering for Heterogeneous Computing systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
> [3]SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
> [4]TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
> [5]TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
> [6]Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
> [7]Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]     ` <CGME20230322043354epcas2p2227bcad190a470d635b92f92587dc69e@epcas2p2.samsung.com>
@ 2023-03-22  4:33       ` Kyungsan Kim
  2023-03-22 22:03         ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-22  4:33 UTC (permalink / raw)
  To: ying.huang
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

Hi Huang Ying,

I apologize for the late reply; it was due to my personal schedule.
Thank you for sharing your viewpoint and the information.


>Hi, Kyungsan,
>
>Kyungsan Kim <ks0204.kim@samsung.com> writes:
>
>> CXL is a promising technology that leads to fundamental changes in computing architecture.
>> To facilitate adoption and widespread of CXL memory, we are developing a memory tiering solution, called SMDK[1][2].
>> Using SMDK and CXL RAM device, our team has been working with industry and academic partners over last year.
>> Also, thanks to many researcher's effort, CXL adoption stage is gradually moving forward from basic enablement to real-world composite usecases.
>> At this moment, based on the researches and experiences gained working on SMDK, we would like to suggest a session at LSF/MM/BFP this year
>> to propose possible Linux MM changes with a brief of SMDK.
>>
>> Adam Manzanares kindly adviced me that it is preferred to discuss implementation details on given problem and consensus at LSF/MM/BFP.
>> Considering the adoption stage of CXL technology, however, let me suggest a design level discussion on the two MM expansions of SMDK this year.
>> When we have design consensus with participants, we want to continue follow-up discussions with additional implementation details, hopefully.
>>
>> 
>> 1. A new zone, ZONE_EXMEM
>> We added ZONE_EXMEM to manage CXL RAM device(s), separated from ZONE_NORMAL for usual DRAM due to the three reasons below.
>>
>> 1) a CXL RAM has many different characteristics with conventional DRAM because a CXL device inherits and expands PCIe specification.
>> ex) frequency range, pluggability, link speed/width negotiation, host/device flow control, power throttling, channel-interleaving methodology, error handling, and etc.
>> It is likely that the primary usecase of CXL RAM would be System RAM.
>> However, to deal with the hardware differences properly, different MM algorithms are needed accordingly.
>>
>> 2) Historically, zone has been expanded by reflecting the evolution of CPU, IO, and memory devices.
>> ex) ZONE_DMA(32), ZONE_HIGHMEM, ZONE_DEVICE, and ZONE_MOVABLE.
>> Each zone applies different MM algorithms such as page reclaim, compaction, migration, and fragmentation.
>> At first, we tried reuse of existing zones, ZONE_DEVICE and ZONE_MOVABLE, for CXL RAM purpose.
>> However, the purpose and implementation of the zones are not fit for CXL RAM.
>>
>> 3) Industry is preparing a CXL-capable system that connects dozens of CXL devices in a server system.
>> When a CXL device becomes a separate node, an administrator/programmer needs to be aware of and manually control all nodes using 3rd party software, such as numactl and libnuma.
>> ZONE_EXMEM allows the assemble of CXL RAM devices into the single ZONE_EXMEM zone, and provides an abstraction to userspace by seamlessly managing the devices.
>> Also, the zone is able to interleave assembled devices in a software way to lead to aggregated bandwidth.
>> We would like to suggest if it is co-existable with HW interleaving like SW/HW raid0.
>> To help understanding, please refer to the node partition part of the picture[3].
>
>In addition to CXL memory, we may have other kind of memory in the
>system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>memory in GPU card, etc.  I guess that we need to consider them
>together.  Do we need to add one zone type for each kind of memory?

We also don't think a new zone is needed for every single memory device.
Our viewpoint is that ZONE_NORMAL alone is no longer enough to manage multiple volatile memory devices, given the growing number of device types.
Including CXL DRAM, we think ZONE_EXMEM can be used to represent extended volatile memories that have different HW characteristics.
 
>
>>
>> 2. User/Kernelspace Programmable Interface
>> In terms of a memory tiering solution, it is typical that the solution attempts to locate hot data on near memory, and cold data on far memory as accurately as possible.[4][5][6][7]
>> We noticed that the hot/coldness of data is determined by the memory access pattern of running application and/or kernel context.
>> Hence, a running context needs a near/far memory identifier to determine near/far memory.
>> When CXL RAM(s) is manipulated as a NUMA node, a node id can be function as a CXL identifier more or less.
>> However, the node id has limitation in that it is an ephemeral information that dynamically varies according to online status of CXL topology and system socket.
>> In this sense, we provides programmable interfaces for userspace and kernelspace context to explicitly (de)allocate memory from DRAM and CXL RAM regardless of a system change.
>> Specifically, MAP_EXMEM and GFP_EXMEM flags were added to mmap() syscall and kmalloc() siblings, respectively.
>
>In addition to NUMA node, we have defined the following interfaces to
>expose information about different kind of memory in the system.
>
>https://www.kernel.org/doc/html/latest/admin-guide/abi-testing.html#abi-sys-devices-virtual-memory-tiering
>
>Best Regards,
>Huang, Ying

The sysfs interface looks useful for prioritizing groups of fast/slow memory nodes using lists of node ids.
We would say it is complementary to the programmable interfaces we suggested, roughly as pictured below (see also the sketch that follows).

                 User/Kernel context (MAP_EXMEM/GFP_EXMEM)
                                     |
                 +-------------------+-------------------+
                 |                                       |
[sysfs/memory_tier0 - DDR Node list]   [sysfs/memory_tier1 - CXL Node list]
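
For example, a sketch only, assuming a kernel that exposes the memory tiering sysfs linked above: a userspace policy could discover which nodes each tier contains before deciding where to place data.

#include <stdio.h>
#include <glob.h>

int main(void)
{
        glob_t g;
        size_t i;

        /* each memory_tierN directory carries a "nodelist" attribute */
        if (glob("/sys/devices/virtual/memory_tiering/memory_tier*/nodelist",
                 0, NULL, &g) != 0)
                return 1;
        for (i = 0; i < g.gl_pathc; i++) {
                char buf[128] = "";
                FILE *f = fopen(g.gl_pathv[i], "r");

                if (!f)
                        continue;
                if (fgets(buf, sizeof(buf), f))
                        printf("%s: %s", g.gl_pathv[i], buf);
                fclose(f);
        }
        globfree(&g);
        return 0;
}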

>
>> Thanks to Adam Manzanares for reviewing this CFP thoroughly.
>>
>>
>> [1]SMDK: https://github.com/openMPDK/SMDK
>> [2]SMT: Software-defined Memory Tiering for Heterogeneous Computing systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
>> [3]SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
>> [4]TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
>> [5]TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
>> [6]Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
>> [7]Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-22  4:33       ` FW: " Kyungsan Kim
@ 2023-03-22 22:03         ` Dan Williams
       [not found]           ` <CGME20230323105106epcas2p39ea8de619622376a4698db425c6a6fb3@epcas2p3.samsung.com>
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2023-03-22 22:03 UTC (permalink / raw)
  To: Kyungsan Kim, ying.huang
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

Kyungsan Kim wrote:
[..]
> >In addition to CXL memory, we may have other kind of memory in the
> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
> >memory in GPU card, etc.  I guess that we need to consider them
> >together.  Do we need to add one zone type for each kind of memory?
> 
> We also don't think a new zone is needed for every single memory
> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
> manage multiple volatile memory devices due to the increased device
> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
> represent extended volatile memories that have different HW
> characteristics.

Some advice for the LSF/MM discussion, the rationale will need to be
more than "we think the ZONE_EXMEM can be used to represent extended
volatile memories that have different HW characteristics". It needs to
be along the lines of "yes, to date Linux has been able to describe DDR
with NUMA effects, PMEM with high write overhead, and HBM with improved
bandwidth not necessarily latency, all without adding a new ZONE, but a
new ZONE is absolutely required now to enable use case FOO, or address
unfixable NUMA problem BAR." Without FOO and BAR to discuss, the code
maintainability concern of "fewer degrees of freedom in the ZONE
dimension" starts to dominate.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]           ` <CGME20230323105106epcas2p39ea8de619622376a4698db425c6a6fb3@epcas2p3.samsung.com>
@ 2023-03-23 10:51             ` Kyungsan Kim
  2023-03-23 12:25               ` David Hildenbrand
                                 ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-23 10:51 UTC (permalink / raw)
  To: dan.j.williams
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, ying.huang

I appreciate Dan's careful advice.

>Kyungsan Kim wrote:
>[..]
>> >In addition to CXL memory, we may have other kind of memory in the
>> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>> >memory in GPU card, etc.  I guess that we need to consider them
>> >together.  Do we need to add one zone type for each kind of memory?
>> 
>> We also don't think a new zone is needed for every single memory
>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>> manage multiple volatile memory devices due to the increased device
>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>> represent extended volatile memories that have different HW
>> characteristics.
>
>Some advice for the LSF/MM discussion, the rationale will need to be
>more than "we think the ZONE_EXMEM can be used to represent extended
>volatile memories that have different HW characteristics". It needs to
>be along the lines of "yes, to date Linux has been able to describe DDR
>with NUMA effects, PMEM with high write overhead, and HBM with improved
>bandwidth not necessarily latency, all without adding a new ZONE, but a
>new ZONE is absolutely required now to enable use case FOO, or address
>unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>maintainability concern of "fewer degress of freedom in the ZONE
>dimension" starts to dominate.

One problem we experienced occurred in the combination of the hot-remove and kernelspace allocation use cases.
ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel pages reside there all the time.
ZONE_MOVABLE allows hot-remove thanks to page migration, but it only allows userspace allocation.
As a workaround, we allocated kernel contexts out of ZONE_MOVABLE by adding the __GFP_MOVABLE flag (sketched below).
In that case, oopses and system hangs occasionally occurred because pages in ZONE_MOVABLE can be migrated or swapped out.
We resolved the issue using ZONE_EXMEM, which allows selectively choosing between the two use cases.
As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe-based one, allowing hot-pluggability, different RAS, and extended connectivity.
So we thought that adding a new zone and managing the new features separately could be a graceful approach.
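
To make the failure mode concrete, the fragment below is only a sketch of that workaround pattern, not the actual SMDK change: the kernel keeps a long-lived pointer to a page that the MM is still allowed to migrate, which is where the occasional oopses and hangs come from.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Sketch of the hack: a pinned kernel buffer backed by a movable page. */
static void *grab_movable_buffer(struct page **pagep)
{
        /* __GFP_HIGHMEM | __GFP_MOVABLE (both in GFP_HIGHUSER_MOVABLE) is
         * what makes ZONE_MOVABLE eligible for this allocation */
        struct page *page = alloc_pages(GFP_HIGHUSER_MOVABLE, 0);

        if (!page)
                return NULL;
        *pagep = page;
        /* direct-map pointer (64-bit, no highmem assumed) held indefinitely,
         * yet compaction or memory offlining may still migrate the page */
        return page_address(page);
}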

Kindly let me know if you have any advice or comments on our thoughts.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-23 10:51             ` RE(2): " Kyungsan Kim
@ 2023-03-23 12:25               ` David Hildenbrand
       [not found]                 ` <CGME20230324090923epcas2p2710ba4dc8157f9141c03104cf66e9d26@epcas2p2.samsung.com>
  2023-03-24  0:41               ` RE(2): " Huang, Ying
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2023-03-23 12:25 UTC (permalink / raw)
  To: Kyungsan Kim, dan.j.williams
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, ying.huang

On 23.03.23 11:51, Kyungsan Kim wrote:
> I appreciate dan for the careful advice.
> 
>> Kyungsan Kim wrote:
>> [..]
>>>> In addition to CXL memory, we may have other kind of memory in the
>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>> together.  Do we need to add one zone type for each kind of memory?
>>>
>>> We also don't think a new zone is needed for every single memory
>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>> manage multiple volatile memory devices due to the increased device
>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>> represent extended volatile memories that have different HW
>>> characteristics.
>>
>> Some advice for the LSF/MM discussion, the rationale will need to be
>> more than "we think the ZONE_EXMEM can be used to represent extended
>> volatile memories that have different HW characteristics". It needs to
>> be along the lines of "yes, to date Linux has been able to describe DDR
>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>> new ZONE is absolutely required now to enable use case FOO, or address
>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>> maintainability concern of "fewer degress of freedom in the ZONE
>> dimension" starts to dominate.
> 
> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.

That sounds like a bad hack :) .

> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.

I once raised the idea of a ZONE_PREFER_MOVABLE [1]; maybe that's 
similar to what you have in mind here. In general, adding new zones is 
frowned upon.

> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
> 
> Kindly let me know any advice or comment on our thoughts.

[1] https://www.lkml.org/lkml/2020/9/9/667

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-23 10:51             ` RE(2): " Kyungsan Kim
  2023-03-23 12:25               ` David Hildenbrand
@ 2023-03-24  0:41               ` Huang, Ying
       [not found]                 ` <CGME20230324084808epcas2p354865d38dccddcb5cd46b17610345a5f@epcas2p3.samsung.com>
  2023-03-24 14:55               ` RE(2): " Matthew Wilcox
  2023-03-26  7:21               ` Mike Rapoport
  3 siblings, 1 reply; 66+ messages in thread
From: Huang, Ying @ 2023-03-24  0:41 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: dan.j.williams, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko

Kyungsan Kim <ks0204.kim@samsung.com> writes:

> I appreciate dan for the careful advice.
>
>>Kyungsan Kim wrote:
>>[..]
>>> >In addition to CXL memory, we may have other kind of memory in the
>>> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>> >memory in GPU card, etc.  I guess that we need to consider them
>>> >together.  Do we need to add one zone type for each kind of memory?
>>> 
>>> We also don't think a new zone is needed for every single memory
>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>> manage multiple volatile memory devices due to the increased device
>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>> represent extended volatile memories that have different HW
>>> characteristics.
>>
>>Some advice for the LSF/MM discussion, the rationale will need to be
>>more than "we think the ZONE_EXMEM can be used to represent extended
>>volatile memories that have different HW characteristics". It needs to
>>be along the lines of "yes, to date Linux has been able to describe DDR
>>with NUMA effects, PMEM with high write overhead, and HBM with improved
>>bandwidth not necessarily latency, all without adding a new ZONE, but a
>>new ZONE is absolutely required now to enable use case FOO, or address
>>unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>maintainability concern of "fewer degress of freedom in the ZONE
>>dimension" starts to dominate.
>
> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.

Sorry, I don't get your idea.  You want a memory range that

 1. can be hot-removed
 2. allows kernel context allocation

This appears impossible to me.  Why can't you just use ZONE_MOVABLE?

Best Regards,
Huang, Ying

> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>
> Kindly let me know any advice or comment on our thoughts.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                 ` <CGME20230324084808epcas2p354865d38dccddcb5cd46b17610345a5f@epcas2p3.samsung.com>
@ 2023-03-24  8:48                   ` Kyungsan Kim
  2023-03-24 13:46                     ` Gregory Price
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-24  8:48 UTC (permalink / raw)
  To: ying.huang
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

>Kyungsan Kim <ks0204.kim@samsung.com> writes:
>
>> I appreciate dan for the careful advice.
>>
>>>Kyungsan Kim wrote:
>>>[..]
>>>> >In addition to CXL memory, we may have other kind of memory in the
>>>> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>> >memory in GPU card, etc.  I guess that we need to consider them
>>>> >together.  Do we need to add one zone type for each kind of memory?
>>>> 
>>>> We also don't think a new zone is needed for every single memory
>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>> manage multiple volatile memory devices due to the increased device
>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>> represent extended volatile memories that have different HW
>>>> characteristics.
>>>
>>>Some advice for the LSF/MM discussion, the rationale will need to be
>>>more than "we think the ZONE_EXMEM can be used to represent extended
>>>volatile memories that have different HW characteristics". It needs to
>>>be along the lines of "yes, to date Linux has been able to describe DDR
>>>with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>new ZONE is absolutely required now to enable use case FOO, or address
>>>unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>maintainability concern of "fewer degress of freedom in the ZONE
>>>dimension" starts to dominate.
>>
>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>
>Sorry, I don't get your idea.  You want the memory range
>
> 1. can be hot-removed
> 2. allow kernel context allocation
>
>This appears impossible for me.  Why cannot you just use ZONE_MOVABLE?

Indeed, we tried that approach. It was possible to allocate a kernel context from ZONE_MOVABLE using __GFP_MOVABLE.
However, we think it is bad practice for two reasons.
1. It occasionally causes oopses and system hangs, because kernel pages may be migrated during swap or compaction.
2. The design intention of ZONE_MOVABLE is literally that its pages be movable, so we thought allocating kernel contexts from that zone hurts the intention.

A kernel context allocated out of ZONE_EXMEM is unmovable:
  a kernel context - alloc_pages(GFP_EXMEM, order)
A user context allocated out of ZONE_EXMEM is movable:
  a user context - mmap(..., MAP_EXMEM, ...) - syscall - alloc_pages(GFP_EXMEM | __GFP_MOVABLE, order)
This is how ZONE_EXMEM supports the two cases.
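
A kernel-side sketch of those two paths follows, assuming the proposed GFP_EXMEM bit steers allocations into ZONE_EXMEM; GFP_EXMEM is not in mainline, so the placeholder definition below exists only to make the sketch explicit.

#include <linux/gfp.h>
#include <linux/mm.h>

#ifndef GFP_EXMEM
#define GFP_EXMEM GFP_KERNEL    /* placeholder: the real flag would also select ZONE_EXMEM */
#endif

static void exmem_two_paths_sketch(void)
{
        /* kernel context: pinned, unmovable pages from ZONE_EXMEM */
        struct page *kpage = alloc_pages(GFP_EXMEM, 0);

        /* user context: the mmap(MAP_EXMEM) fault path would instead request
         * movable pages from the same zone */
        struct page *upage = alloc_pages(GFP_EXMEM | __GFP_MOVABLE, 0);

        if (kpage)
                __free_pages(kpage, 0);
        if (upage)
                __free_pages(upage, 0);
}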

>
>Best Regards,
>Huang, Ying
>
>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>
>> Kindly let me know any advice or comment on our thoughts.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                 ` <CGME20230324090923epcas2p2710ba4dc8157f9141c03104cf66e9d26@epcas2p2.samsung.com>
@ 2023-03-24  9:09                   ` Kyungsan Kim
  2023-03-24  9:12                     ` David Hildenbrand
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-24  9:09 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

Thank you, David Hildenbrand, for your interest in this topic.

>>
>>> Kyungsan Kim wrote:
>>> [..]
>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>
>>>> We also don't think a new zone is needed for every single memory
>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>> manage multiple volatile memory devices due to the increased device
>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>> represent extended volatile memories that have different HW
>>>> characteristics.
>>>
>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>> volatile memories that have different HW characteristics". It needs to
>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>> new ZONE is absolutely required now to enable use case FOO, or address
>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>> maintainability concern of "fewer degress of freedom in the ZONE
>>> dimension" starts to dominate.
>>
>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.

>That sounds like a bad hack :) .
I agree with you.

>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.

>I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>similar to what you have in mind here. In general, adding new zones is
>frowned upon.

Actually, we have already studied your idea and think it is similar to ours in two aspects.
1. ZONE_PREFER_MOVABLE allows kernelspace allocation using a new zone.
2. ZONE_PREFER_MOVABLE reduces fragmentation by splitting zones and ordering allocation requests between them.

We think ZONE_EXMEM also reduces fragmentation,
because it is a separate zone and handles page allocations as movable by default.

>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>
>> Kindly let me know any advice or comment on our thoughts.
>
>[1] https://www.lkml.org/lkml/2020/9/9/667
>
>--
>Thanks,
>
>David / dhildenb

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-24  9:09                   ` RE(4): " Kyungsan Kim
@ 2023-03-24  9:12                     ` David Hildenbrand
       [not found]                       ` <CGME20230324092731epcas2p315c348bd76ef9fc84bffdb158e4c1aa4@epcas2p3.samsung.com>
  0 siblings, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2023-03-24  9:12 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

On 24.03.23 10:09, Kyungsan Kim wrote:
> Thank you David Hinderbrand for your interest on this topic.
> 
>>>
>>>> Kyungsan Kim wrote:
>>>> [..]
>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>
>>>>> We also don't think a new zone is needed for every single memory
>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>> manage multiple volatile memory devices due to the increased device
>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>> represent extended volatile memories that have different HW
>>>>> characteristics.
>>>>
>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>> volatile memories that have different HW characteristics". It needs to
>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>> dimension" starts to dominate.
>>>
>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
> 
>> That sounds like a bad hack :) .
> I consent you.
> 
>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
> 
>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>> similar to what you have in mind here. In general, adding new zones is
>> frowned upon.
> 
> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
> 
> We think ZONE_EXMEM also helps less fragmentation.
> Because it is a separated zone and handles a page allocation as movable by default.

So how is it different in a way that would justify a different (more 
confusing, IMHO) name? :) Of course, names don't matter that much, but I'd 
be interested in which other aspects would make that zone "special".

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                       ` <CGME20230324092731epcas2p315c348bd76ef9fc84bffdb158e4c1aa4@epcas2p3.samsung.com>
@ 2023-03-24  9:27                         ` Kyungsan Kim
  2023-03-24  9:30                           ` David Hildenbrand
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-24  9:27 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

>On 24.03.23 10:09, Kyungsan Kim wrote:
>> Thank you David Hinderbrand for your interest on this topic.
>> 
>>>>
>>>>> Kyungsan Kim wrote:
>>>>> [..]
>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>
>>>>>> We also don't think a new zone is needed for every single memory
>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>> represent extended volatile memories that have different HW
>>>>>> characteristics.
>>>>>
>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>> volatile memories that have different HW characteristics". It needs to
>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>> dimension" starts to dominate.
>>>>
>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>> 
>>> That sounds like a bad hack :) .
>> I consent you.
>> 
>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>> 
>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>> similar to what you have in mind here. In general, adding new zones is
>>> frowned upon.
>> 
>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>> 
>> We think ZONE_EXMEM also helps less fragmentation.
>> Because it is a separated zone and handles a page allocation as movable by default.
>
>So how is it different that it would justify a different (more confusing 
>IMHO) name? :) Of course, names don't matter that much, but I'd be 
>interested in which other aspect that zone would be "special".

FYI, I first named it ZONE_CXLMEM, but we thought it would need to cover other extended memory types as well,
so I changed it to ZONE_EXMEM.
We also would like to point out the "special" aspect of the zone, which is different from ZONE_NORMAL for traditional DDR DRAM.
Of course, the symbol name matters more or less for representing that nicely.
Do you prefer ZONE_SPECIAL? :)

>
>-- 
>Thanks,
>
>David / dhildenb
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-24  9:27                         ` RE(2): " Kyungsan Kim
@ 2023-03-24  9:30                           ` David Hildenbrand
       [not found]                             ` <CGME20230324095031epcas2p284095ae90b25a47360b5098478dffdaa@epcas2p2.samsung.com>
  0 siblings, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2023-03-24  9:30 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

On 24.03.23 10:27, Kyungsan Kim wrote:
>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>> Thank you David Hinderbrand for your interest on this topic.
>>>
>>>>>
>>>>>> Kyungsan Kim wrote:
>>>>>> [..]
>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>
>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>> represent extended volatile memories that have different HW
>>>>>>> characteristics.
>>>>>>
>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>> dimension" starts to dominate.
>>>>>
>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>
>>>> That sounds like a bad hack :) .
>>> I consent you.
>>>
>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>
>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>> similar to what you have in mind here. In general, adding new zones is
>>>> frowned upon.
>>>
>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>
>>> We think ZONE_EXMEM also helps less fragmentation.
>>> Because it is a separated zone and handles a page allocation as movable by default.
>>
>> So how is it different that it would justify a different (more confusing
>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>> interested in which other aspect that zone would be "special".
> 
> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
> So I changed it as ZONE_EXMEM.
> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
> Of course, a symbol naming is important more or less to represent it very nicely, though.
> Do you prefer ZONE_SPECIAL? :)

I called it ZONE_PREFER_MOVABLE. If you studied that approach there must 
be a good reason to name it differently?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE(3): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                             ` <CGME20230324095031epcas2p284095ae90b25a47360b5098478dffdaa@epcas2p2.samsung.com>
@ 2023-03-24  9:50                               ` Kyungsan Kim
  2023-03-24 13:08                                 ` Jørgen Hansen
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-24  9:50 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

>On 24.03.23 10:27, Kyungsan Kim wrote:
>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>> Thank you David Hinderbrand for your interest on this topic.
>>>>
>>>>>>
>>>>>>> Kyungsan Kim wrote:
>>>>>>> [..]
>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>>
>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>> characteristics.
>>>>>>>
>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>>> dimension" starts to dominate.
>>>>>>
>>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>
>>>>> That sounds like a bad hack :) .
>>>> I consent you.
>>>>
>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>
>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>> frowned upon.
>>>>
>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>
>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>
>>> So how is it different that it would justify a different (more confusing
>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>> interested in which other aspect that zone would be "special".
>>
>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>> So I changed it as ZONE_EXMEM.
>> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>> Do you prefer ZONE_SPECIAL? :)
>
>I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>be a good reason to name it differently?
>

The intention of ZONE_EXMEM is a separate logical management dimension arising from the HW differences of extended memory devices.
Although ZONE_EXMEM takes the movability and fragmentation aspects into account, they are not all that ZONE_EXMEM considers.
That is why it is named as it is.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE(3): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-24  9:50                               ` RE(3): " Kyungsan Kim
@ 2023-03-24 13:08                                 ` Jørgen Hansen
  2023-03-24 22:33                                   ` David Hildenbrand
       [not found]                                   ` <CGME20230331113147epcas2p12655777fec6839f7070ffcc446e3581b@epcas2p1.samsung.com>
  0 siblings, 2 replies; 66+ messages in thread
From: Jørgen Hansen @ 2023-03-24 13:08 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: david, lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams


> On 24 Mar 2023, at 10.50, Kyungsan Kim <ks0204.kim@samsung.com> wrote:
> 
>> On 24.03.23 10:27, Kyungsan Kim wrote:
>>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>>> Thank you David Hinderbrand for your interest on this topic.
>>>>> 
>>>>>>> 
>>>>>>>> Kyungsan Kim wrote:
>>>>>>>> [..]
>>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>>> 
>>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>>> characteristics.
>>>>>>>> 
>>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>>>> dimension" starts to dominate.
>>>>>>> 
>>>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>> 
>>>>>> That sounds like a bad hack :) .
>>>>> I consent you.
>>>>> 
>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>> 
>>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>>> frowned upon.
>>>>> 
>>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>> 
>>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>> 
>>>> So how is it different that it would justify a different (more confusing
>>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>>> interested in which other aspect that zone would be "special".
>>> 
>>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>>> So I changed it as ZONE_EXMEM.
>>> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
>>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>>> Do you prefer ZONE_SPECIAL? :)
>> 
>> I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>> be a good reason to name it differently?
>> 
> 
> The intention of ZONE_EXMEM is a separated logical management dimension originated from the HW diffrences of extended memory devices.
> Althought the ZONE_EXMEM considers the movable and frementation aspect, it is not all what ZONE_EXMEM considers.
> So it is named as it.

Given that CXL memory devices can potentially cover a wide range of technologies with quite different latency and bandwidth metrics, will one zone serve as the management vehicle that you seek? If a system contains both CXL attached DRAM and, let's say, a byte-addressable CXL SSD - both used as (different) byte addressable tiers in a tiered memory hierarchy - allocating memory from ZONE_EXMEM doesn't really tell you much about what you get. So the client would still need an orthogonal method to characterize the desired performance characteristics. This method could be combined with a fabric independent zone such as ZONE_PREFER_MOVABLE to address the kernel allocation issue. At the same time, this new zone could also be useful in other cases, such as virtio-mem.

Thanks,
Jorgen


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-24  8:48                   ` RE(4): " Kyungsan Kim
@ 2023-03-24 13:46                     ` Gregory Price
       [not found]                       ` <CGME20230331113417epcas2p20a886e1712dbdb1f8eec03a2ac0a47e2@epcas2p2.samsung.com>
  0 siblings, 1 reply; 66+ messages in thread
From: Gregory Price @ 2023-03-24 13:46 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: ying.huang, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, dan.j.williams

On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
> 
> Indeed, we tried the approach. It was able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
> However, we think it would be a bad practice for the 2 reasons.
> 1. It causes oops and system hang occasionally due to kernel page migration while swap or compaction. 
> 2. Literally, the design intention of ZONE_MOVABLE is to a page movable. So, we thought allocating a kernel context from the zone hurts the intention.
> 
> Allocating a kernel context out of ZONE_EXMEM is unmovable.
>   a kernel context -  alloc_pages(GFP_EXMEM,)

What is the specific use case of this?  If the answer is flexibility in
low-memory situations, why wouldn't the kernel simply change to free up
ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
allocate as needed?

I could see allocating kernel memory from local memory expanders
(directly attached to local CXL port), but I can't think of a case where
it would be preferable for kernel resources to live on remote memory.
Since local memory expanders are static devices, there shouldn't be a
great need for hotplug, which means the memory could be mapped
ZONE_NORMAL without issue.

> Allocating a user context out of ZONE_EXMEM is movable.
>   a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
> This is how ZONE_EXMEM supports the two cases.
> 

Is it intended for a user to explicitly request MAP_EXMEM for it to get
used at all?  As in, if I simply mmap() without MAP_EXMEM, will it
remain unutilized?

~Gregory

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-23 10:51             ` RE(2): " Kyungsan Kim
  2023-03-23 12:25               ` David Hildenbrand
  2023-03-24  0:41               ` RE(2): " Huang, Ying
@ 2023-03-24 14:55               ` Matthew Wilcox
  2023-03-24 17:49                 ` Matthew Wilcox
  2023-03-26  7:21               ` Mike Rapoport
  3 siblings, 1 reply; 66+ messages in thread
From: Matthew Wilcox @ 2023-03-24 14:55 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: dan.j.williams, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, ying.huang

On Thu, Mar 23, 2023 at 07:51:05PM +0900, Kyungsan Kim wrote:
> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.

No, that's not true.  You can allocate kernel memory from ZONE_MOVABLE.
You have to be careful when you do that, but eg filesystems put symlinks
and directories in ZONE_MOVABLE, and zswap allocates memory from
ZONE_MOVABLE.  Of course, then you have to be careful that the kernel
doesn't try to move it while you're accessing it.  That's the tradeoff.

> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.

I think you mean "migrated".  It can't be swapped unless you put the
page on the LRU list, inviting the kernel to swap it.

> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.

This sounds dangerously confused.  Do you want the EXMEM to be removable
or not?  If you do, then allocations from it have to be movable.  If
you don't, why go to all this trouble?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-24 14:55               ` RE(2): " Matthew Wilcox
@ 2023-03-24 17:49                 ` Matthew Wilcox
       [not found]                   ` <CGME20230331113715epcas2p13127b95af4000ec1ed96a2e9d89b7444@epcas2p1.samsung.com>
       [not found]                   ` <CGME20230331113845epcas2p313118617918ae2bf634c3c475fc5dbd8@epcas2p3.samsung.com>
  0 siblings, 2 replies; 66+ messages in thread
From: Matthew Wilcox @ 2023-03-24 17:49 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: dan.j.williams, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, ying.huang

On Fri, Mar 24, 2023 at 02:55:02PM +0000, Matthew Wilcox wrote:
> No, that's not true.  You can allocate kernel memory from ZONE_MOVABLE.
> You have to be careful when you do that, but eg filesystems put symlinks
> and directories in ZONE_MOVABLE, and zswap allocates memory from
> ZONE_MOVABLE.  Of course, then you have to be careful that the kernel
> doesn't try to move it while you're accessing it.  That's the tradeoff.

I want to talk a little bit about what it would take to use MOVABLE
allocations for slab.

Initially, one might presume that it is impossible to have slab use a
movable allocation.  Usually, we need a relatively complex mechanism of
reference counting where one takes a reference on the page, uses it,
then puts the reference.  Then migration can check the page reference
and if it's unused, it knows it's safe to migrate (much handwaving here,
of course it's more complex).

The general case of kmalloc slabs cannot use MOVABLE allocations.
The API has no concept of "this pointer is temporarily not in use",
so we can never migrate any slab which has allocated objects.

But for slab caches, individual objects may have access rules which allow
them to be moved.  For example, we might be able to migrate every dentry
in a slab, then RCU-free the slab.  Similarly for radix_tree_nodes.

There was some work along these lines a few years ago:
https://lore.kernel.org/all/20190603042637.2018-16-tobin@kernel.org/

There are various practical problems with that patchset, but they can
be overcome with sufficient work.  The question is: Why do we need to do
this work?  What is the high-level motivation to make slab caches movable?

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-24 13:08                                 ` Jørgen Hansen
@ 2023-03-24 22:33                                   ` David Hildenbrand
       [not found]                                     ` <CGME20230331114220epcas2p2d5734efcbdd8956f861f8e7178cd5288@epcas2p2.samsung.com>
       [not found]                                   ` <CGME20230331113147epcas2p12655777fec6839f7070ffcc446e3581b@epcas2p1.samsung.com>
  1 sibling, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2023-03-24 22:33 UTC (permalink / raw)
  To: Jørgen Hansen, Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams

On 24.03.23 14:08, Jørgen Hansen wrote:
> 
>> On 24 Mar 2023, at 10.50, Kyungsan Kim <ks0204.kim@samsung.com> wrote:
>>
>>> On 24.03.23 10:27, Kyungsan Kim wrote:
>>>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>>>> Thank you David Hinderbrand for your interest on this topic.
>>>>>>
>>>>>>>>
>>>>>>>>> Kyungsan Kim wrote:
>>>>>>>>> [..]
>>>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>>>>
>>>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>>>> characteristics.
>>>>>>>>>
>>>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>>>>> dimension" starts to dominate.
>>>>>>>>
>>>>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>
>>>>>>> That sounds like a bad hack :) .
>>>>>> I consent you.
>>>>>>
>>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>>
>>>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>>>> frowned upon.
>>>>>>
>>>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>>>
>>>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>>>
>>>>> So how is it different that it would justify a different (more confusing
>>>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>>>> interested in which other aspect that zone would be "special".
>>>>
>>>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>>>> So I changed it as ZONE_EXMEM.
>>>> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
>>>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>>>> Do you prefer ZONE_SPECIAL? :)
>>>
>>> I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>>> be a good reason to name it differently?
>>>
>>
>> The intention of ZONE_EXMEM is a separated logical management dimension originated from the HW diffrences of extended memory devices.
>> Althought the ZONE_EXMEM considers the movable and frementation aspect, it is not all what ZONE_EXMEM considers.
>> So it is named as it.
> 
> Given that CXL memory devices can potentially cover a wide range of technologies with quite different latency and bandwidth metrics, will one zone serve as the management vehicle that you seek? If a system contains both CXL attached DRAM and, let say, a byte-addressable CXL SSD - both used as (different) byte addressable tiers in a tiered memory hierarchy, allocating memory from the ZONE_EXMEM doesn’t really tell you much about what you get. So the client would still need an orthogonal method to characterize the desired performance characteristics. This method could be combined with a fabric independent zone such as ZONE_PREFER_MOVABLE to address the kernel allocation issue. At the same time, this new zone could also be useful in other cases, such as virtio-mem.

Yes. I still did not get a satisfying answer to my original question: 
what would be the differences between both zones from a MM point of 
view? We can discuss that in the session, of course.

Regarding performance differences, I thought the idea was to go with 
different nodes to express (and model) such.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-23 10:51             ` RE(2): " Kyungsan Kim
                                 ` (2 preceding siblings ...)
  2023-03-24 14:55               ` RE(2): " Matthew Wilcox
@ 2023-03-26  7:21               ` Mike Rapoport
  2023-03-30 22:03                 ` Dragan Stancevic
       [not found]                 ` <CGME20230331114526epcas2p2b6f1d4c8c1c0b2e3c12a425b6e48c0d8@epcas2p2.samsung.com>
  3 siblings, 2 replies; 66+ messages in thread
From: Mike Rapoport @ 2023-03-26  7:21 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: dan.j.williams, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, ying.huang

Hi,

On Thu, Mar 23, 2023 at 07:51:05PM +0900, Kyungsan Kim wrote:
> I appreciate dan for the careful advice.
> 
> >Kyungsan Kim wrote:
> >[..]
> >> >In addition to CXL memory, we may have other kind of memory in the
> >> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
> >> >memory in GPU card, etc.  I guess that we need to consider them
> >> >together.  Do we need to add one zone type for each kind of memory?
> >> 
> >> We also don't think a new zone is needed for every single memory
> >> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
> >> manage multiple volatile memory devices due to the increased device
> >> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
> >> represent extended volatile memories that have different HW
> >> characteristics.
> >
> >Some advice for the LSF/MM discussion, the rationale will need to be
> >more than "we think the ZONE_EXMEM can be used to represent extended
> >volatile memories that have different HW characteristics". It needs to
> >be along the lines of "yes, to date Linux has been able to describe DDR
> >with NUMA effects, PMEM with high write overhead, and HBM with improved
> >bandwidth not necessarily latency, all without adding a new ZONE, but a
> >new ZONE is absolutely required now to enable use case FOO, or address
> >unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
> >maintainability concern of "fewer degress of freedom in the ZONE
> >dimension" starts to dominate.
> 
> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.

This still does not describe what the use cases are that require having
kernel allocations on CXL.mem.

I believe it's important to start with an explanation of *why* it is important to
have kernel allocations on removable devices.
 
> Kindly let me know any advice or comment on our thoughts.
> 
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-02-21  1:41 ` [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL Kyungsan Kim
  2023-02-27 23:14   ` Dan Williams
  2023-03-03  6:07   ` Huang, Ying
@ 2023-03-30 22:02   ` Dragan Stancevic
       [not found]     ` <CGME20230331114649epcas2p23d52cd1d224085e6192a0aaf22948e3e@epcas2p2.samsung.com>
       [not found]   ` <CGME20230414084120epcas2p37f105901350410772a3115a5a490c215@epcas2p3.samsung.com>
  3 siblings, 1 reply; 66+ messages in thread
From: Dragan Stancevic @ 2023-03-30 22:02 UTC (permalink / raw)
  To: Kyungsan Kim, lsf-pc
  Cc: linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, nil-migration

On 2/20/23 19:41, Kyungsan Kim wrote:
> CXL is a promising technology that leads to fundamental changes in computing architecture.
> To facilitate adoption and widespread of CXL memory, we are developing a memory tiering solution, called SMDK[1][2].
> Using SMDK and CXL RAM device, our team has been working with industry and academic partners over last year.
> Also, thanks to many researcher's effort, CXL adoption stage is gradually moving forward from basic enablement to real-world composite usecases.
> At this moment, based on the researches and experiences gained working on SMDK, we would like to suggest a session at LSF/MM/BFP this year
> to propose possible Linux MM changes with a brief of SMDK.
> 
> Adam Manzanares kindly adviced me that it is preferred to discuss implementation details on given problem and consensus at LSF/MM/BFP.
> Considering the adoption stage of CXL technology, however, let me suggest a design level discussion on the two MM expansions of SMDK this year.
> When we have design consensus with participants, we want to continue follow-up discussions with additional implementation details, hopefully.
> 
>   
> 1. A new zone, ZONE_EXMEM
> We added ZONE_EXMEM to manage CXL RAM device(s), separated from ZONE_NORMAL for usual DRAM due to the three reasons below.

Hi Kyungsan-

I read through your links and I am very interested in this 
talk/discussion from the perspective of cloud/virtualization hypervisor 
workloads.

The problem that I am starting to tackle is clustering of hypervisors 
over cxl.mem for high availability of virtual machines. Or live 
migration of virtual machines between hypervisors using cxl.mem [1].


So I was wondering, with regard to ZONE_EXMEM, has any thought been 
given to shared memory across virtual hierarchies [2], where you 
have cxl.mem access over CXL switches via multiple VH connections? It 
seems to me that there might be a need to differentiate between direct 
cxl.mem and switched cxl.mem, at least from the point of view where 
multiple hypervisors share the memory over a switch and would 
potentially have to synchronize state/metadata about the memory.


[1] A high-level explanation is at http://nil-migration.org
[2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51, 
figure 1-4, black color scheme circle(3) and bars.


--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-26  7:21               ` Mike Rapoport
@ 2023-03-30 22:03                 ` Dragan Stancevic
  2023-04-03  8:44                   ` Mike Rapoport
       [not found]                 ` <CGME20230331114526epcas2p2b6f1d4c8c1c0b2e3c12a425b6e48c0d8@epcas2p2.samsung.com>
  1 sibling, 1 reply; 66+ messages in thread
From: Dragan Stancevic @ 2023-03-30 22:03 UTC (permalink / raw)
  To: Mike Rapoport, Kyungsan Kim
  Cc: dan.j.williams, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, ying.huang, nil-migration

On 3/26/23 02:21, Mike Rapoport wrote:
> Hi,
> 
> [..]
>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
> 
> This still does not describe what are the use cases that require having
> kernel allocations on CXL.mem.
> 
> I believe it's important to start with explanation *why* it is important to
> have kernel allocations on removable devices.

Hi Mike,

not speaking for Kyungsan here, but I am starting to tackle hypervisor 
clustering and VM migration over cxl.mem [1].

And in my mind, at least one reason I can think of for having kernel 
allocations from cxl.mem devices is where you have multiple VH 
connections sharing the memory [2]. For example, you have a user 
space application stored in cxl.mem, and then you want the metadata 
about this process/application that the kernel keeps on one hypervisor 
to be "passed on" to another hypervisor. So basically, the same way 
processors in a single hypervisor cooperate on memory, you extend that 
across processors that span physical hypervisors. If that makes 
sense...


[1] A high-level explanation is at http://nil-migration.org
[2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51, 
figure 1-4, black color scheme circle(3) and bars.

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: RE: RE(3): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                                   ` <CGME20230331113147epcas2p12655777fec6839f7070ffcc446e3581b@epcas2p1.samsung.com>
@ 2023-03-31 11:31                                     ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-31 11:31 UTC (permalink / raw)
  To: Jorgen.Hansen
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Hi Jorgen Hansen.
Thank you for joining this topic and sharing your thoughts.
I'm sorry for the late reply; our team has been busy with some major tasks this week.

>> On 24 Mar 2023, at 10.50, Kyungsan Kim <ks0204.kim@samsung.com> wrote:
>> 
>>> On 24.03.23 10:27, Kyungsan Kim wrote:
>>>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>>>> Thank you David Hinderbrand for your interest on this topic.
>>>>>> 
>>>>>>>> 
>>>>>>>>> Kyungsan Kim wrote:
>>>>>>>>> [..]
>>>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>>>> 
>>>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>>>> characteristics.
>>>>>>>>> 
>>>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>>>>> dimension" starts to dominate.
>>>>>>>> 
>>>>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>> 
>>>>>>> That sounds like a bad hack :) .
>>>>>> I consent you.
>>>>>> 
>>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>> 
>>>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>>>> frowned upon.
>>>>>> 
>>>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>>> 
>>>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>>> 
>>>>> So how is it different that it would justify a different (more confusing
>>>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>>>> interested in which other aspect that zone would be "special".
>>>> 
>>>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>>>> So I changed it as ZONE_EXMEM.
>>>> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
>>>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>>>> Do you prefer ZONE_SPECIAL? :)
>>> 
>>> I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>>> be a good reason to name it differently?
>>> 
>> 
>> The intention of ZONE_EXMEM is a separated logical management dimension originated from the HW diffrences of extended memory devices.
>> Althought the ZONE_EXMEM considers the movable and frementation aspect, it is not all what ZONE_EXMEM considers.
>> So it is named as it.
>
>Given that CXL memory devices can potentially cover a wide range of technologies with quite different latency and bandwidth metrics, will one zone serve as the management vehicle that you seek? If a system contains both CXL attached DRAM and, let say, a byte-addressable CXL SSD - both used as (different) byte addressable tiers in a tiered memory hierarchy, allocating memory from the ZONE_EXMEM doesn’t really tell you much about what you get. So the client would still need an orthogonal method to characterize the desired performance characteristics. 

I agree that a heterogeneous system may adopt multiple types of extended memory devices.
We think ZONE_EXMEM can apply a different management algorithm to each extended memory type.
What we have in mind is ZONE_NORMAL : ZONE_EXMEM = 1 : N, where N is the number of HW device types.
ZONE_NORMAL is for conventional DDR DRAM in the DIMM form factor, while ZONE_EXMEM is for extended memories (CXL DRAM, CXL SSD, etc.) in other form factors such as EDSFF.

We think the movable attribute is a requirement for a CXL DRAM device.
However, there are other SW concerns, namely implicit allocation and unintended migration, that arise from the CXL HW differences.
So I'm not sure whether it is possible, or desirable, to cover these matters with a combination of the ZONE_MOVABLE and ZONE_PREFER_MOVABLE designs.
Let me point out again that we proposed ZONE_EXMEM for the separate logical management of extended memory devices.

Specifically, we think performance metrics would be handled not at the zone level but at the node level.


>This method could be combined with a fabric independent zone such as ZONE_PREFER_MOVABLE to address the kernel allocation issue. At the same time, this new zone could also be useful in other cases, such as virtio-mem.

We agree with your thought. Along with the adoption of CXL memory pools and fabrics, virtualization SW layers will be added.
Considering not only the bare-metal OS but also memory inflation/deflation between the bare-metal OS and a hypervisor, we think ZONE_EXMEM can be useful as an identifier for CXL memory.


>
>Thanks,
>Jorgen

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: RE: RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                       ` <CGME20230331113417epcas2p20a886e1712dbdb1f8eec03a2ac0a47e2@epcas2p2.samsung.com>
@ 2023-03-31 11:34                         ` Kyungsan Kim
  2023-03-31 15:53                           ` Gregory Price
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-31 11:34 UTC (permalink / raw)
  To: gregory.price
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Hi Gregory Price. 
Thank you for joining this topic and sharing your viewpoint.
I'm sorry for the late reply; our team has been busy with some major tasks this week.

>On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
>> 
>> Indeed, we tried the approach. It was able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
>> However, we think it would be a bad practice for the 2 reasons.
>> 1. It causes oops and system hang occasionally due to kernel page migration while swap or compaction. 
>> 2. Literally, the design intention of ZONE_MOVABLE is to a page movable. So, we thought allocating a kernel context from the zone hurts the intention.
>> 
>> Allocating a kernel context out of ZONE_EXMEM is unmovable.
>>   a kernel context -  alloc_pages(GFP_EXMEM,)
>
>What is the specific use case of this?  If the answer is flexibility in
>low-memory situations, why wouldn't the kernel simply change to free up
>ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
>allocate as needed?
>
>I could see allocating kernel memory from local memory expanders
>(directly attached to local CXL port), but I can't think of a case where
>it would be preferable for kernel resources to live on remote memory.

We have been considering kernelspace memory tiering cases.
The memory tiering we assume places hot data in fast memory and cold data in slow memory.
We regard zswap, pagecache, and Meta's TPP (page promotion/demotion among nodes) as kernelspace memory tiering cases.

>Since local memory expanders are static devices, there shouldn't be a
>great need for hotplug, which means the memory could be mapped
>ZONE_NORMAL without issue.
>

IMHO, hot-add/remove is one of the key features of CXL because of its composability aspect.
Right now, CXL device and system connectivity is limited.
But industry is preparing CXL-capable systems that allow more than 10 CXL channels on the backplane, pluggable via EDSFF.
Moreover, as CXL topologies progress from direct-attached to switch, multi-level switch, and fabric connections,
I think the hot-add/remove usecase will become more important.


>> Allocating a user context out of ZONE_EXMEM is movable.
>>   a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
>> This is how ZONE_EXMEM supports the two cases.
>> 
>
>Is it intended for a user to explicitly request MAP_EXMEM for it to get
>used at all?  As in, if i simply mmap() without MAP_EXMEM, will it
>remain unutilized?

Our intention is to allow the 3 cases below.
1. Explicit DDR allocation - mmap(,,MAP_NORMAL,)
 : allocation from ZONE_NORMAL or ZONE_MOVABLE, or the allocation fails.
2. Explicit CXL allocation - mmap(,,MAP_EXMEM,)
 : allocation from ZONE_EXMEM, or the allocation fails.
3. Implicit memory allocation - mmap(,,,)
 : allocation from ZONE_NORMAL, ZONE_MOVABLE, or ZONE_EXMEM. In other words, it does not matter whether DDR or CXL DRAM is used.

Among these, case 3 is similar to vanilla kernel operation in that the allocation request traverses multiple zones or nodes.
We think this can be either good or bad from the mmap caller's point of view.
It is good because memory gets allocated, but it can be bad because the caller has no idea which memory type was allocated.
The latter would hurt QoS metrics or userspace memory tiering, which expects to distinguish near and far memory. A minimal userspace sketch of these cases is included below.
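
Just to illustrate, here is a minimal userspace sketch of the cases above,
assuming the proposed MAP_NORMAL/MAP_EXMEM mmap flags. These flags are not
upstream, and the numeric values below are placeholders only.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_EXMEM
#define MAP_EXMEM  0x200000   /* placeholder value for the proposed flag */
#endif
#ifndef MAP_NORMAL
#define MAP_NORMAL 0x400000   /* placeholder value for the proposed flag */
#endif

int main(void)
{
	size_t len = 2UL << 20;

	/* Case 2: explicit CXL allocation; fails if ZONE_EXMEM cannot serve it. */
	void *cxl = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_EXMEM, -1, 0);
	if (cxl == MAP_FAILED) {
		perror("mmap(MAP_EXMEM)");
		return 1;
	}
	memset(cxl, 0, len);   /* fault the pages in from ZONE_EXMEM */

	/* Case 3: implicit allocation; the kernel may pick DDR or CXL DRAM. */
	void *any = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (any != MAP_FAILED)
		munmap(any, len);

	munmap(cxl, len);
	return 0;
}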

>
>~Gregory

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                   ` <CGME20230331113715epcas2p13127b95af4000ec1ed96a2e9d89b7444@epcas2p1.samsung.com>
@ 2023-03-31 11:37                     ` Kyungsan Kim
  2023-03-31 12:54                       ` Matthew Wilcox
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-31 11:37 UTC (permalink / raw)
  To: willy
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Hi Matthew Wilcox. 
We appreciate you joining this topic and refining our sentences.

>On Thu, Mar 23, 2023 at 07:51:05PM +0900, Kyungsan Kim wrote:
>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>
>No, that's not true.  You can allocate kernel memory from ZONE_MOVABLE.
>You have to be careful when you do that, but eg filesystems put symlinks
>and directories in ZONE_MOVABLE, and zswap allocates memory from
>ZONE_MOVABLE.  Of course, then you have to be careful that the kernel
>doesn't try to move it while you're accessing it.  That's the tradeoff.

You are correct.
In fact, the intention of that sentence was to explain, in general terms, the movability preference of kernel and user contexts.
We have been aware that a kernel context can allocate from ZONE_MOVABLE
using GFP_MOVABLE together with the movable callbacks migrate_page(), putback_page(), and isolate_page().
We had studied that the z3fold/zsmalloc allocators used by zswap also allocate from ZONE_MOVABLE.
But we were not aware that symlinks and directories are allocated from ZONE_MOVABLE.
Thank you for letting us know about the additional cases.

Let me revise that part. In regards to page movability,
a kernel context prefers unmovable pages in general, but some kernel contexts are movable, such as symlinks, directories, and zswap;
a user context prefers movable pages in general, but some user contexts are unmovable, such as DMA buffers.
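
For reference, here is a minimal sketch of the callback registration
mentioned above, roughly following what zsmalloc does; the my_* names
are placeholders, not existing kernel code.

#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Sketch only: a driver making its kernel-allocated pages movable. */
static bool my_isolate(struct page *page, isolate_mode_t mode)
{
	/* Lock out concurrent users of the page here. */
	return true;
}

static int my_migrate(struct page *dst, struct page *src,
		      enum migrate_mode mode)
{
	memcpy(page_address(dst), page_address(src), PAGE_SIZE);
	/* Repoint any internal references from src to dst here. */
	return MIGRATEPAGE_SUCCESS;
}

static void my_putback(struct page *page)
{
	/* Isolation/migration was aborted; make the page usable again. */
}

static const struct movable_operations my_mops = {
	.isolate_page = my_isolate,
	.migrate_page = my_migrate,
	.putback_page = my_putback,
};

/* Pages are allocated with __GFP_MOVABLE and then registered
 * (the page is typically locked when calling __SetPageMovable). */
static void my_make_page_movable(struct page *page)
{
	__SetPageMovable(page, &my_mops);
}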

>
>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>
>I think you mean "migrated".  It can't be swapped unless you put the
>page on the LRU list, inviting the kernel to swap it.

"migrated" is correct.

>
>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>
>This sounds dangerously confused.  Do you want the EXMEM to be removable
>or not?  If you do, then allocations from it have to be movable.  If
>you don't, why go to all this trouble?

I'm sorry for the confusion. We will try to explain our thinking more clearly.
We think a CXL DRAM device should be removable, in line with its pluggable HW nature.
From the MM point of view, we think a page of CXL DRAM can be either movable or unmovable,
and a user or kernel context should be able to determine which. Thus, we think dedicating CXL DRAM to either ZONE_NORMAL or ZONE_MOVABLE is not enough.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                   ` <CGME20230331113845epcas2p313118617918ae2bf634c3c475fc5dbd8@epcas2p3.samsung.com>
@ 2023-03-31 11:38                     ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-31 11:38 UTC (permalink / raw)
  To: willy
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On Fri, Mar 24, 2023 at 02:55:02PM +0000, Matthew Wilcox wrote:
>> No, that's not true.  You can allocate kernel memory from ZONE_MOVABLE.
>> You have to be careful when you do that, but eg filesystems put symlinks
>> and directories in ZONE_MOVABLE, and zswap allocates memory from
>> ZONE_MOVABLE.  Of course, then you have to be careful that the kernel
>> doesn't try to move it while you're accessing it.  That's the tradeoff.
>
>I want to talk a little bit about what it would take to use MOVABLE
>allocations for slab.
>
>Initially, one might presume that it is impossible to have slab use a
>movable allocation.  Usually, we need a relatively complex mechanism of
>reference counting where one takes a reference on the page, uses it,
>then puts the reference.  Then migration can check the page reference
>and if it's unused, it knows it's safe to migrate (much handwaving here,
>of course it's more complex).
>
>The general case of kmalloc slabs cannot use MOVABLE allocations.
>The API has no concept of "this pointer is temporarily not in use",
>so we can never migrate any slab which has allocated objects.
>
>But for slab caches, individual objects may have access rules which allow
>them to be moved.  For example, we might be able to migrate every dentry
>in a slab, then RCU-free the slab.  Similarly for radix_tree_nodes.
>
>There was some work along these lines a few years ago:
>https://lore.kernel.org/all/20190603042637.2018-16-tobin@kernel.org/
>
>There are various practical problems with that patchset, but they can
>be overcome with sufficient work.  The question is: Why do we need to do
>this work?  What is the high-level motivation to make slab caches movable?


Thank you for sharing the case with us.
We studied your explanation and the patchset. Let me summarize our understanding:
a kernel context is migratable at a certain point when the reference count of its page is traceable and has dropped to 0.

As I answered previously, our intention is that the attribute of a CXL DRAM page can be movable as well as unmovable,
and the allocating context should be able to determine which.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                                     ` <CGME20230331114220epcas2p2d5734efcbdd8956f861f8e7178cd5288@epcas2p2.samsung.com>
@ 2023-03-31 11:42                                       ` Kyungsan Kim
  2023-03-31 13:42                                         ` Matthew Wilcox
  2023-04-03  8:28                                         ` David Hildenbrand
  0 siblings, 2 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-31 11:42 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On 24.03.23 14:08, Jørgen Hansen wrote:
>> 
>>> On 24 Mar 2023, at 10.50, Kyungsan Kim <ks0204.kim@samsung.com> wrote:
>>>
>>>> On 24.03.23 10:27, Kyungsan Kim wrote:
>>>>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>>>>> Thank you David Hinderbrand for your interest on this topic.
>>>>>>>
>>>>>>>>>
>>>>>>>>>> Kyungsan Kim wrote:
>>>>>>>>>> [..]
>>>>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>>>>>
>>>>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>>>>> characteristics.
>>>>>>>>>>
>>>>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>>>>>> dimension" starts to dominate.
>>>>>>>>>
>>>>>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>>
>>>>>>>> That sounds like a bad hack :) .
>>>>>>> I consent you.
>>>>>>>
>>>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>>>
>>>>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>>>>> frowned upon.
>>>>>>>
>>>>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>>>>
>>>>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>>>>
>>>>>> So how is it different that it would justify a different (more confusing
>>>>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>>>>> interested in which other aspect that zone would be "special".
>>>>>
>>>>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>>>>> So I changed it as ZONE_EXMEM.
>>>>> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
>>>>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>>>>> Do you prefer ZONE_SPECIAL? :)
>>>>
>>>> I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>>>> be a good reason to name it differently?
>>>>
>>>
>>> The intention of ZONE_EXMEM is a separated logical management dimension originated from the HW diffrences of extended memory devices.
>>> Althought the ZONE_EXMEM considers the movable and frementation aspect, it is not all what ZONE_EXMEM considers.
>>> So it is named as it.
>> 
>> Given that CXL memory devices can potentially cover a wide range of technologies with quite different latency and bandwidth metrics, will one zone serve as the management vehicle that you seek? If a system contains both CXL attached DRAM and, let say, a byte-addressable CXL SSD - both used as (different) byte addressable tiers in a tiered memory hierarchy, allocating memory from the ZONE_EXMEM doesn’t really tell you much about what you get. So the client would still need an orthogonal method to characterize the desired performance characteristics. This method could be combined with a fabric independent zone such as ZONE_PREFER_MOVABLE to address the kernel allocation issue. At the same time, this new zone could also be useful in other cases, such as virtio-mem.
>
>Yes. I still did not get a satisfying answer to my original question: 
>what would be the differences between both zones from a MM point of 
>view? We can discuss that in the session, of course.
>
>Regarding performance differences, I thought the idea was to go with 
>different nodes to express (and model) such.
>

From an MM point of view, on the movability aspect, a kernel context is not allocated from ZONE_EXMEM unless GFP_EXMEM is used explicitly.
In contrast, if we understand the design of ZONE_PREFER_MOVABLE correctly, a kernel context can be allocated from ZONE_PREFER_MOVABLE implicitly, as a fallback of a ZONE_NORMAL allocation.
However, the movable attribute is not our only concern.
In addition, we have experienced page allocation and migration issues on heterogeneous memories.

Given our experiences/design and industry's viewpoints/inquiries,
I will prepare a few slides for the session to explain
  1. Usecase - user/kernelspace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
  2. Issue - movability (movable/unmovable), allocation (explicit/implicit), migration (intended/unintended)
  3. HW - topology (direct, switch, fabric), features (pluggability, error handling, etc.)


>-- 
>Thanks,
>
>David / dhildenb

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                 ` <CGME20230331114526epcas2p2b6f1d4c8c1c0b2e3c12a425b6e48c0d8@epcas2p2.samsung.com>
@ 2023-03-31 11:45                   ` Kyungsan Kim
  2023-04-04  8:31                     ` Mike Rapoport
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-31 11:45 UTC (permalink / raw)
  To: rppt
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Thank you, Mike Rapoport, for participating in the discussion and adding your thoughts.

>Hi,
>
>On Thu, Mar 23, 2023 at 07:51:05PM +0900, Kyungsan Kim wrote:
>> I appreciate dan for the careful advice.
>> 
>> >Kyungsan Kim wrote:
>> >[..]
>> >> >In addition to CXL memory, we may have other kind of memory in the
>> >> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>> >> >memory in GPU card, etc.  I guess that we need to consider them
>> >> >together.  Do we need to add one zone type for each kind of memory?
>> >> 
>> >> We also don't think a new zone is needed for every single memory
>> >> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>> >> manage multiple volatile memory devices due to the increased device
>> >> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>> >> represent extended volatile memories that have different HW
>> >> characteristics.
>> >
>> >Some advice for the LSF/MM discussion, the rationale will need to be
>> >more than "we think the ZONE_EXMEM can be used to represent extended
>> >volatile memories that have different HW characteristics". It needs to
>> >be along the lines of "yes, to date Linux has been able to describe DDR
>> >with NUMA effects, PMEM with high write overhead, and HBM with improved
>> >bandwidth not necessarily latency, all without adding a new ZONE, but a
>> >new ZONE is absolutely required now to enable use case FOO, or address
>> >unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>> >maintainability concern of "fewer degress of freedom in the ZONE
>> >dimension" starts to dominate.
>> 
>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>
>This still does not describe what are the use cases that require having
>kernel allocations on CXL.mem. 
>
>I believe it's important to start with explanation *why* it is important to
>have kernel allocations on removable devices.
> 

In general, a memory system with DDR/CXL DRAM will have near and far memory.
Also, we think the kernel already includes memory tiering mechanisms - Meta's TPP, zswap, and the pagecache.
Some kernel contexts would prefer fast memory - for example, hot data with temporal locality, or data needed for fast processing such as metadata or indexes.
Others would be fine with slow memory - for example, a zswap page that is only touched while swapping.

>> Kindly let me know any advice or comment on our thoughts.
>> 
>> 
>
>-- 
>Sincerely yours,
>Mike.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Re: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]     ` <CGME20230331114649epcas2p23d52cd1d224085e6192a0aaf22948e3e@epcas2p2.samsung.com>
@ 2023-03-31 11:46       ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-03-31 11:46 UTC (permalink / raw)
  To: dragan
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Hi Dragan Stancevic,
Thank you for your interest and for joining the discussion.

>On 2/20/23 19:41, Kyungsan Kim wrote:
>> CXL is a promising technology that leads to fundamental changes in computing architecture.
>> To facilitate adoption and widespread of CXL memory, we are developing a memory tiering solution, called SMDK[1][2].
>> Using SMDK and CXL RAM device, our team has been working with industry and academic partners over last year.
>> Also, thanks to many researcher's effort, CXL adoption stage is gradually moving forward from basic enablement to real-world composite usecases.
>> At this moment, based on the researches and experiences gained working on SMDK, we would like to suggest a session at LSF/MM/BFP this year
>> to propose possible Linux MM changes with a brief of SMDK.
>> 
>> Adam Manzanares kindly adviced me that it is preferred to discuss implementation details on given problem and consensus at LSF/MM/BFP.
>> Considering the adoption stage of CXL technology, however, let me suggest a design level discussion on the two MM expansions of SMDK this year.
>> When we have design consensus with participants, we want to continue follow-up discussions with additional implementation details, hopefully.
>> 
>>   
>> 1. A new zone, ZONE_EXMEM
>> We added ZONE_EXMEM to manage CXL RAM device(s), separated from ZONE_NORMAL for usual DRAM due to the three reasons below.
>
>Hi Kyungsan-
>
>I read through your links and I am very interested in this 
>talk/discussion from the perspective of cloud/virtualization hypervisor 
>loads.
>
>The problem that I am starting to tackle is clustering of hypervisors 
>over cxl.mem for high availability of virtual machines. Or live 
>migration of virtual machines between hypervisors using cxl.mem [1].
>
>
>So I was wondering, with regards to the ZONE_XMEM, has any thought been 
>given to the shared memory across virtual hierarchies [2], where you 
>have cxl.mem access over cxl switches by multiple VH connections. It 
>seems to me that there might be a need for differentiation of direct 
>cxl.mem and switched cxl.mem. At least from the point of view where you 
>have multiple hypervisors sharing the memory over a switch. Where they 
>would potentially have to synchronize state/metadata about the memory.

In general, we expect that more SW layers (baremetal, virtualization, orchestration) will become involved
as CXL topologies progress (direct-attached, switch/multi-level switch, rack-scale/inter-rack-scale with fabric).
We think ZONE_EXMEM can be used as a static CXL identifier in the interaction between hypervisor and host OS, for memory inflation/deflation, a transcendent memory interface (frontswap/cleancache)[1], and isolation.


[1] https://lwn.net/Articles/454795

>
>[1] A high-level explanation is at http://nil-migration.org/
>[2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51, 
>figure 1-4, black color scheme circle(3) and bars.
>
>
>--
>Peace can only come as a natural consequence
>of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-31 11:37                     ` Kyungsan Kim
@ 2023-03-31 12:54                       ` Matthew Wilcox
       [not found]                         ` <CGME20230405020027epcas2p4682d43446a493385b60c39a1dbbf07d6@epcas2p4.samsung.com>
  0 siblings, 1 reply; 66+ messages in thread
From: Matthew Wilcox @ 2023-03-31 12:54 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On Fri, Mar 31, 2023 at 08:37:15PM +0900, Kyungsan Kim wrote:
> >> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
> >
> >This sounds dangerously confused.  Do you want the EXMEM to be removable
> >or not?  If you do, then allocations from it have to be movable.  If
> >you don't, why go to all this trouble?
> 
> I'm sorry to make you confused. We will try more to clearly explain our thought.
> We think the CXL DRAM device should be removable along with HW pluggable nature.
> For MM point of view, we think a page of CXL DRAM can be both movable and unmovable. 
> An user or kernel context should be able to determine it. Thus, we think dedication on the ZONE_NORMAL or the ZONE_MOVABLE is not enough.

No, this is not the right approach.  If CXL is to be hot-pluggable,
then all CXL allocations must be movable.  If even one allocation on a
device is not movable, then the device cannot be removed.  ZONE_EXMEM
feels like a solution in search of a problem.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-31 11:42                                       ` Kyungsan Kim
@ 2023-03-31 13:42                                         ` Matthew Wilcox
  2023-03-31 15:56                                           ` Frank van der Linden
       [not found]                                           ` <CGME20230405020121epcas2p2d9d39c151b6c5ab9e568ab9e2ab826ce@epcas2p2.samsung.com>
  2023-04-03  8:28                                         ` David Hildenbrand
  1 sibling, 2 replies; 66+ messages in thread
From: Matthew Wilcox @ 2023-03-31 13:42 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: david, lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
> Given our experiences/design and industry's viewpoints/inquiries,
> I will prepare a few slides in the session to explain 
>   1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
>   2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
>   3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)

I think you'll find everybody else in the room understands these issues
rather better than you do.  This is hardly the first time that we've
talked about CXL, and CXL is not the first time that people have
proposed disaggregated memory, nor heterogenous latency/bandwidth
systems.  All the previous attempts have failed, and I expect this
one to fail too.  Maybe there's something novel that means this time
it really will work, so any slides you do should focus on that.

A more profitable discussion might be:

1. Should we have the page allocator return pages from CXL or should
   CXL memory be allocated another way?
2. Should there be a way for userspace to indicate that it prefers CXL
   memory when it calls mmap(), or should it always be at the discretion
   of the kernel?
3. Do we continue with the current ZONE_DEVICE model, or do we come up
   with something new?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE: RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-31 11:34                         ` Kyungsan Kim
@ 2023-03-31 15:53                           ` Gregory Price
       [not found]                             ` <CGME20230405020257epcas2p11b253f8c97a353890b96e6ae6eb515d3@epcas2p1.samsung.com>
  0 siblings, 1 reply; 66+ messages in thread
From: Gregory Price @ 2023-03-31 15:53 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On Fri, Mar 31, 2023 at 08:34:17PM +0900, Kyungsan Kim wrote:
> Hi Gregory Price. 
> Thank you for joining this topic and share your viewpoint.
> I'm sorry for late reply due to some major tasks of our team this week.
> 
> >On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
> >> 
> >> Indeed, we tried the approach. It was able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
> >> However, we think it would be a bad practice for the 2 reasons.
> >> 1. It causes oops and system hang occasionally due to kernel page migration while swap or compaction. 
> >> 2. Literally, the design intention of ZONE_MOVABLE is to a page movable. So, we thought allocating a kernel context from the zone hurts the intention.
> >> 
> >> Allocating a kernel context out of ZONE_EXMEM is unmovable.
> >>   a kernel context -  alloc_pages(GFP_EXMEM,)
> >
> >What is the specific use case of this?  If the answer is flexibility in
> >low-memory situations, why wouldn't the kernel simply change to free up
> >ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
> >allocate as needed?
> >
> >I could see allocating kernel memory from local memory expanders
> >(directly attached to local CXL port), but I can't think of a case where
> >it would be preferable for kernel resources to live on remote memory.
> 
> We have thought kernelspace memory tiering cases.
> What memory tiering we assumes is to locate a hot data in fast memory and a cold data in slow memory.
> We think zswap, pagecache, and Meta TPP(page promotion/demotion among nodes) as the kernelspace memory tiering cases.
>

So, to clarify, when you say "kernel space memory tiering cases", do you
mean "to support a kernel-space controlled memory tiering service" or do
you mean "tiering of kernel memory"?

Because if it's the former, then rather than adding a new zone, a better
proposal would seem to be extending the numa system with additional
"cost/feature" attributes, instead of modifying the zone of the memory
blocks backing the node.

Note that memory zones can apply to individual blocks within a node, and
not the entire node uniformly.  So when making tiering decisions, it
seems more expedient to investigate a node rather than a block.
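
For what it's worth, some building blocks already exist: the kernel exports
HMAT-derived access characteristics per node under sysfs, which a tiering
policy could consult directly. A minimal userspace sketch (node1 is just an
example id; the path is the existing ABI from
Documentation/admin-guide/mm/numaperf.rst):

#include <stdio.h>

int main(void)
{
        /* per-node access class 0 latency as seen from the
         * best-performing initiators, reported in nanoseconds */
        FILE *f = fopen("/sys/devices/system/node/node1/access0/initiators/read_latency", "r");
        unsigned long lat_ns = 0;

        if (f && fscanf(f, "%lu", &lat_ns) == 1)
                printf("node1 read latency: %lu ns\n", lat_ns);
        if (f)
                fclose(f);
        return 0;
}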


> >Since local memory expanders are static devices, there shouldn't be a
> >great need for hotplug, which means the memory could be mapped
> >ZONE_NORMAL without issue.
> >
> 
> IMHO, we think hot-add/remove is one of the key feature of CXL due to the composability aspect.
> Right now, CXL device and system connection is limited. 
> But industry is preparing a CXL capable system that allows more than 10 CXL channels at backplane, pluggable with EDSFF. 
> Not only that, along with the progress of CXL topology - from direct-attached to switch, multi-level switch, and fabric connection -
> I think the hot-add/remove usecase would become more important.
> 
> 

Hot add/remove is fairly well represented by ZONE_MOVABLE. What I think
is confusing many people is that the new zone is intended to be
hot-pluggable *and* usable by the kernel for kernel resources/memory,
which are presently exclusive properties.

The underlying question is what situation is being hit in which kernel
memory wants to be located in ZONE_MOVABLE/ZONE_EXMEM that cannot simply
be serviced by demoting other, movable memory to these regions.

The concept being that kernel allocations are a higher-priority
allocation than userland, and as such should have priority in DRAM.

For example - there is at least one paper that examined the cost of
placing page tables on CXL Memory Expansion (on the local CXL complex,
not remote) and found the cost is significant.  Page tables are likely
the single largest allocation the kernel will make to service large
memory structures, so the answer to this problem is not necessarily to
place that memory in CXL as well, but to use larger page sizes (which is
less wasteful as memory usage is high and memory is abundant).

I just don't understand what kernel resources would meet the following
attributes:

1) Do not have major system performance impacts in high-latency memory
2) Are sufficiently large to warrant tiering
and
3) Are capable of being moved (i.e. no pinned areas, no dma areas, etc)

> >> Allocating a user context out of ZONE_EXMEM is movable.
> >>   a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
> >> This is how ZONE_EXMEM supports the two cases.
> >> 

So if MAP_EXMEM is not used, EXMEM would not be used?

That seems counter intuitive.  If an allocation via mmap would be
eligible for ZONE_MOVABLE, why wouldn't it be eligible for ZONE_EXMEM?

I believe this is another reason why some folks are confused what the
distinction between MOVABLE and EXMEM are.  They seem to ultimately
reduce to whether the memory can be moved.

> >
> >Is it intended for a user to explicitly request MAP_EXMEM for it to get
> >used at all?  As in, if i simply mmap() without MAP_EXMEM, will it
> >remain unutilized?
> 
> Our intention is to allow below 3 cases
> 1. Explicit DDR allocation - mmap(,,MAP_NORMAL,)
>  : allocation from ZONE_NORMAL or ZONE_MOVABLE, or allocation fails.
> 2. Explicit CXL allocation - mmap(,,MAP_EXMEM,)
>  : allocation from ZONE_EXMEM, of allocation fails.
> 3. Implicit Memory allocation - mmap(,,,) 
>  : allocation from ZONE_NORMAL, ZONE_MOVABLE, or ZONE_EXMEM. In other words, no matter where DDR or CXL DRAM.
> 
> Among that, 3 is similar with vanilla kernel operation in that the allocation request traverses among multiple zones or nodes.
> We think it would be good or bad for the mmap caller point of view.
> It is good because memory is allocated, while it could be bad because the caller does not have idea of allocated memory type.
> The later would hurt QoS metrics or userspace memory tiering operation, which expects near/far memory.
> 

For what it's worth, mmap is not the correct api for userland to provide
kernel hints on data placement.  That would be madvise and friends.

But further, allocation of memory from userland must be ok with having
its memory moved/swapped/whatever unless additional assistance from the
kernel is provided (page pinning, mlock, whatever) to ensure it will
not be moved.  Presumably this is done to ensure the kernel can make
runtime adjustments to protect itself from being denied memory and
causing instability and/or full system faults.


I think you need to clarify your intents for this zone, in particular
your intent for exactly what data can and cannot live in this zone and
the reasons for this.  "To assist kernel tiering operations" is very
vague and not a description of what memory is and is not allowed in the
zone.

~Gregory

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-31 13:42                                         ` Matthew Wilcox
@ 2023-03-31 15:56                                           ` Frank van der Linden
  2023-04-03  8:34                                             ` David Hildenbrand
       [not found]                                             ` <CGME20230405020631epcas2p1c85058b28a70bbd46d587e78a9c9c7ad@epcas2p1.samsung.com>
       [not found]                                           ` <CGME20230405020121epcas2p2d9d39c151b6c5ab9e568ab9e2ab826ce@epcas2p2.samsung.com>
  1 sibling, 2 replies; 66+ messages in thread
From: Frank van der Linden @ 2023-03-31 15:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Kyungsan Kim, david, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, dan.j.williams, seungjun.ha,
	wj28.lee

On Fri, Mar 31, 2023 at 6:42 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
> > Given our experiences/design and industry's viewpoints/inquiries,
> > I will prepare a few slides in the session to explain
> >   1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
> >   2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
> >   3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
>
> I think you'll find everybody else in the room understands these issues
> rather better than you do.  This is hardly the first time that we've
> talked about CXL, and CXL is not the first time that people have
> proposed disaggregated memory, nor heterogenous latency/bandwidth
> systems.  All the previous attempts have failed, and I expect this
> one to fail too.  Maybe there's something novel that means this time
> it really will work, so any slides you do should focus on that.
>
> A more profitable discussion might be:
>
> 1. Should we have the page allocator return pages from CXL or should
>    CXL memory be allocated another way?
> 2. Should there be a way for userspace to indicate that it prefers CXL
>    memory when it calls mmap(), or should it always be at the discretion
>    of the kernel?
> 3. Do we continue with the current ZONE_DEVICE model, or do we come up
>    with something new?
>
>

Point 2 is what I proposed talking about here:
https://lore.kernel.org/linux-mm/a80a4d4b-25aa-a38a-884f-9f119c03a1da@google.com/T/

With the current cxl-as-numa-node model, an application can express a
preference through mbind(). But that also means that mempolicy and
madvise (e.g. MADV_COLD) are starting to overlap if the intention is
to use cxl as a second tier for colder memory.  Are these the right
abstractions? Might it be more flexible to attach properties to memory
ranges, and have applications hint which properties they prefer?
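
To make the overlap concrete, here is a minimal userspace sketch of the two
interfaces side by side (node 2 standing in for a CXL-backed node is an
assumption; MADV_COLD needs kernel 5.4+ and matching headers; build with
-lnuma, error handling omitted):

#include <numaif.h>      /* mbind(), MPOL_PREFERRED   */
#include <sys/mman.h>    /* mmap(), madvise()         */
#include <stdlib.h>

#define CXL_NODE 2       /* assumption: CXL DRAM exposed as node 2 */

int main(void)
{
        size_t len = 64UL << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << CXL_NODE;

        /* mempolicy route: ask for placement on the CXL node */
        mbind(buf, len, MPOL_PREFERRED, &nodemask,
              sizeof(nodemask) * 8, 0);

        /* madvise route: mark the range cold so reclaim/demotion
         * treats it as second-tier memory first */
        madvise(buf, len, MADV_COLD);
        return 0;
}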

It's an interesting discussion, and I hope it'll be touched on at
LSF/MM, happy to participate there.

- Frank

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-04 17:58                       ` Adam Manzanares
@ 2023-04-01 10:51                         ` Gregory Price
  2023-04-04 18:59                           ` [External] " Viacheslav A.Dubeyko
  0 siblings, 1 reply; 66+ messages in thread
From: Gregory Price @ 2023-04-01 10:51 UTC (permalink / raw)
  To: Adam Manzanares
  Cc: Mike Rapoport, Kyungsan Kim, lsf-pc, linux-mm, linux-fsdevel,
	linux-cxl, viacheslav.dubeyko, dan.j.williams, seungjun.ha,
	wj28.lee

On Tue, Apr 04, 2023 at 05:58:05PM +0000, Adam Manzanares wrote:
> On Tue, Apr 04, 2023 at 11:31:08AM +0300, Mike Rapoport wrote:
> > 
> > The point of zswap IIUC is to have small and fast swap device and
> > compression is required to better utilize DRAM capacity at expense of CPU
> > time.
> > 
> > Presuming CXL memory will have larger capacity than DRAM, why not skip the
> > compression and use CXL as a swap device directly?
> 
> I like to shy away from saying CXL memory should be used for swap. I see a 
> swap device as storing pages in a manner that is no longer directly addressable
> by the cpu. 
> 
> Migrating pages to a CXL device is a reasonable approach and I believe we
> have the ability to do this in the page reclaim code. 
> 

The argument is "why do you need swap if memory itself is elastic", and
I think there are open questions about how performant using large
amounts of high-latency memory is.

Think 1us-1.5us+ cross-rack attached memory.

Does it make sense to use that as CPU-addressable memory and migrate it on
first use?  Isn't that just swap with more steps?  What happens if we
just use it as swap, is the performance all that different?

I think there's a reasonable argument for exploring the idea at the
higher ends of the latency spectrum.  And the simplicity of using an
existing system (swap) to implement a form of proto-tiering is rather
attractive in my opinion.
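
As a strawman, the "just use it as swap" experiment needs no new
infrastructure at all, assuming the far CXL capacity is surfaced as a
pmem-style block device (/dev/pmem1 is an assumed name) and mkswap(8) has
been run on it:

#include <sys/swap.h>    /* swapon(), SWAP_FLAG_* */
#include <stdio.h>

int main(void)
{
        /* high priority so this device is preferred over disk swap */
        int prio = 100;

        if (swapon("/dev/pmem1",
                   SWAP_FLAG_PREFER |
                   ((prio << SWAP_FLAG_PRIO_SHIFT) & SWAP_FLAG_PRIO_MASK)))
                perror("swapon");
        return 0;
}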

~Gregory

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [External] RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-04 18:59                           ` [External] " Viacheslav A.Dubeyko
@ 2023-04-01 11:51                             ` Gregory Price
  2023-04-04 21:09                               ` Viacheslav A.Dubeyko
                                                 ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Gregory Price @ 2023-04-01 11:51 UTC (permalink / raw)
  To: Viacheslav A.Dubeyko
  Cc: Adam Manzanares, Mike Rapoport, Kyungsan Kim, lsf-pc, linux-mm,
	linux-fsdevel, linux-cxl, dan.j.williams, seungjun.ha, wj28.lee

On Tue, Apr 04, 2023 at 11:59:22AM -0700, Viacheslav A.Dubeyko wrote:
> 
> 
> > On Apr 1, 2023, at 3:51 AM, Gregory Price <gregory.price@memverge.com> wrote:
> > 
> > On Tue, Apr 04, 2023 at 05:58:05PM +0000, Adam Manzanares wrote:
> >> On Tue, Apr 04, 2023 at 11:31:08AM +0300, Mike Rapoport wrote:
> >>> 
> >>> The point of zswap IIUC is to have small and fast swap device and
> >>> compression is required to better utilize DRAM capacity at expense of CPU
> >>> time.
> >>> 
> >>> Presuming CXL memory will have larger capacity than DRAM, why not skip the
> >>> compression and use CXL as a swap device directly?
> >> 
> >> I like to shy away from saying CXL memory should be used for swap. I see a 
> >> swap device as storing pages in a manner that is no longer directly addressable
> >> by the cpu. 
> >> 
> >> Migrating pages to a CXL device is a reasonable approach and I believe we
> >> have the ability to do this in the page reclaim code. 
> >> 
> > 
> > The argument is "why do you need swap if memory itself is elastic", and
> > I think there are open questions about how performant using large
> > amounts of high-latency memory is.
> > 
> > Think 1us-1.5us+ cross-rack attached memory.
> > 
> > Does it make sense to use that as CPU-addressible and migrate it on
> > first use?  Isn't that just swap with more steps?  What happens if we
> > just use it as swap, is the performance all that different?
> > 
> > I think there's a reasonable argument for exploring the idea at the
> > higher ends of the latency spectrum.  And the simplicity of using an
> > existing system (swap) to implement a form of proto-tiering is rather
> > attractive in my opinion.
> > 
> 
> I think the problem with swap that we need to take into account the additional
> latency of swap-in/swap-out logic. I assume that this logic is expensive enough.
> And if we considering the huge graph, for example, I am afraid the swap-in/swap-out
> logic could be expensive. So, the question here is about use-case. Which use-case could
> have benefits to employ the swap as a big space of high-latency memory? I see your point
> that such swap could be faster than persistent storage. But which use-case can be happy
> user of this space of high-latency memory?
> 
> Thanks,
> Slava.
> 

Just spitballing here - to me this problem is two fold:

I think the tiering use case and the swap use case are exactly the same.
If tiering is sufficiently valuable, there exists a spectrum of compute
density (cpu:dram:cxl:far-cxl) where simply using far-cxl as fast-swap
becomes easier and less expensive than a complex tiering system.

So rather than a single use-case question, it reads like a tiering
question to me:

1) Where on the 1us-20us (far cxl : nvme) spectrum does it make sense to
   switch from a swap mechanism to simply byte-addressable memory?
   There's a point, somewhere, where promote on first access (effectively
   swap) is the same performance as active tiering (for a given workload).

   If that point is under 2us, there's a good chance that a high-latency
   CXL swap-system would be a major win for any workload on any cloud-based
   system.  It's simple, clean, and reclaim doesn't have to worry about the
   complexities of hotpluggable memory zones.


Beyond that, to your point, what use-case is happy with this class of
memory, and in what form?

2) This is likely obscured by the fact that many large-memory
   applications avoid swap like the plague by sharding data and creating
   clusters. So it's hard to answer this until it's tested, and you
   can't test it unless you make it... woo!

   Bit of a chicken/egg in here.  I don't know that anyone can say
   definitively what workload can make use of it, but that doesn't mean
   there isn't one.  So in the spectrum of risk/reward, at least
   enabling some simple mechanism for the sake of exploration feels
   exciting to say the least.


More generally, I think a cxl-swap (cswap? ;V) would be useful exactly to
help identify when watch-and-wait tiering becomes more performant than
promote-on-first-use.  If you can't beat a simple fast-swap, why bother?

Again, I think this is narrowly applicable to high-latency CXL. My gut
tells me that anything under 1us is better used in a byte-addressable
manner, but once you start hitting 1us "It makes me go hmmm..."

I concede this is largely conjecture until someone tests it out, but
certainly a fun thing to discuss.

~Gregory

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-31 11:42                                       ` Kyungsan Kim
  2023-03-31 13:42                                         ` Matthew Wilcox
@ 2023-04-03  8:28                                         ` David Hildenbrand
       [not found]                                           ` <CGME20230405020916epcas2p24cf04f5354c12632eba50b64b217e403@epcas2p2.samsung.com>
  1 sibling, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2023-04-03  8:28 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On 31.03.23 13:42, Kyungsan Kim wrote:
>> On 24.03.23 14:08, Jørgen Hansen wrote:
>>>
>>>> On 24 Mar 2023, at 10.50, Kyungsan Kim <ks0204.kim@samsung.com> wrote:
>>>>
>>>>> On 24.03.23 10:27, Kyungsan Kim wrote:
>>>>>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>>>>>> Thank you David Hinderbrand for your interest on this topic.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Kyungsan Kim wrote:
>>>>>>>>>>> [..]
>>>>>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>>>>>>
>>>>>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>>>>>> characteristics.
>>>>>>>>>>>
>>>>>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>>>>>>> dimension" starts to dominate.
>>>>>>>>>>
>>>>>>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>>>
>>>>>>>>> That sounds like a bad hack :) .
>>>>>>>> I consent you.
>>>>>>>>
>>>>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>>>>
>>>>>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>>>>>> frowned upon.
>>>>>>>>
>>>>>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>>>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>>>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>>>>>
>>>>>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>>>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>>>>>
>>>>>>> So how is it different that it would justify a different (more confusing
>>>>>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>>>>>> interested in which other aspect that zone would be "special".
>>>>>>
>>>>>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>>>>>> So I changed it as ZONE_EXMEM.
>>>>>> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
>>>>>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>>>>>> Do you prefer ZONE_SPECIAL? :)
>>>>>
>>>>> I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>>>>> be a good reason to name it differently?
>>>>>
>>>>
>>>> The intention of ZONE_EXMEM is a separated logical management dimension originated from the HW diffrences of extended memory devices.
>>>> Althought the ZONE_EXMEM considers the movable and frementation aspect, it is not all what ZONE_EXMEM considers.
>>>> So it is named as it.
>>>
>>> Given that CXL memory devices can potentially cover a wide range of technologies with quite different latency and bandwidth metrics, will one zone serve as the management vehicle that you seek? If a system contains both CXL attached DRAM and, let say, a byte-addressable CXL SSD - both used as (different) byte addressable tiers in a tiered memory hierarchy, allocating memory from the ZONE_EXMEM doesn’t really tell you much about what you get. So the client would still need an orthogonal method to characterize the desired performance characteristics. This method could be combined with a fabric independent zone such as ZONE_PREFER_MOVABLE to address the kernel allocation issue. At the same time, this new zone could also be useful in other cases, such as virtio-mem.
>>
>> Yes. I still did not get a satisfying answer to my original question:
>> what would be the differences between both zones from a MM point of
>> view? We can discuss that in the session, of course.
>>
>> Regarding performance differences, I thought the idea was to go with
>> different nodes to express (and model) such.
>>
> 
>  From a MM point of view on the movability aspect, a kernel context is not allocated from ZONE_EXMEM without using GFP_EXMEM explicitly.
> In contrast, if we understand the design of ZONE_PREFER_MOVABLE correctly, a kernel context can be allocated from ZONE_PREFER_MOVABLE implicitly as the fallback of ZONE_NORMAL allocation.
> However, the movable attribute is not all we are concerning.
> In addition, we experienced page allocation and migration issue on the heterogeneous memories.
> 
> Given our experiences/design and industry's viewpoints/inquiries,
> I will prepare a few slides in the session to explain
>    1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
>    2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
>    3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)

Yes, especially a motivation for GFP_EXMEM and ZONE_EXMEM would be
great. New GFP flags and zones will very likely face a lot of upstream
pushback. So we need a clear motivation and a discussion of alternatives
(and why this memory has to be treated as so special but still wants to
be managed by the buddy allocator).

Willy raises some very good points.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-31 15:56                                           ` Frank van der Linden
@ 2023-04-03  8:34                                             ` David Hildenbrand
       [not found]                                               ` <CGME20230405021655epcas2p2364b1f56dcde629bbd05bc796c2896aa@epcas2p2.samsung.com>
       [not found]                                             ` <CGME20230405020631epcas2p1c85058b28a70bbd46d587e78a9c9c7ad@epcas2p1.samsung.com>
  1 sibling, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2023-04-03  8:34 UTC (permalink / raw)
  To: Frank van der Linden, Matthew Wilcox
  Cc: Kyungsan Kim, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, dan.j.williams, seungjun.ha,
	wj28.lee

On 31.03.23 17:56, Frank van der Linden wrote:
> On Fri, Mar 31, 2023 at 6:42 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
>>> Given our experiences/design and industry's viewpoints/inquiries,
>>> I will prepare a few slides in the session to explain
>>>    1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
>>>    2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
>>>    3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
>>
>> I think you'll find everybody else in the room understands these issues
>> rather better than you do.  This is hardly the first time that we've
>> talked about CXL, and CXL is not the first time that people have
>> proposed disaggregated memory, nor heterogenous latency/bandwidth
>> systems.  All the previous attempts have failed, and I expect this
>> one to fail too.  Maybe there's something novel that means this time
>> it really will work, so any slides you do should focus on that.
>>
>> A more profitable discussion might be:
>>
>> 1. Should we have the page allocator return pages from CXL or should
>>     CXL memory be allocated another way?
>> 2. Should there be a way for userspace to indicate that it prefers CXL
>>     memory when it calls mmap(), or should it always be at the discretion
>>     of the kernel?
>> 3. Do we continue with the current ZONE_DEVICE model, or do we come up
>>     with something new?
>>
>>
> 
> Point 2 is what I proposed talking about here:
> https://lore.kernel.org/linux-mm/a80a4d4b-25aa-a38a-884f-9f119c03a1da@google.com/T/
> 
> With the current cxl-as-numa-node model, an application can express a
> preference through mbind(). But that also means that mempolicy and
> madvise (e.g. MADV_COLD) are starting to overlap if the intention is
> to use cxl as a second tier for colder memory.  Are these the right
> abstractions? Might it be more flexible to attach properties to memory
> ranges, and have applications hint which properties they prefer?

I think history told us that the discussions always go like "but user 
space wants more control, let's give user space all the power", and a 
couple of months later we get "but we cannot possibly enlighten all 
applications, and user space does not have sufficient information: we 
need the kernel to handle this transparently."

It seems to be a steady back and forth. Most probably we want something
in between: the cxl-as-numa-node model is already a pretty good and
simple abstraction. Avoiding too many new special user-space knobs is
most probably the way to go.

Interesting discussion, I agree. And we had plenty of similar ones 
already with PMEM and NUMA in general.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-30 22:03                 ` Dragan Stancevic
@ 2023-04-03  8:44                   ` Mike Rapoport
  2023-04-04  4:27                     ` Dragan Stancevic
  0 siblings, 1 reply; 66+ messages in thread
From: Mike Rapoport @ 2023-04-03  8:44 UTC (permalink / raw)
  To: Dragan Stancevic
  Cc: Kyungsan Kim, dan.j.williams, lsf-pc, linux-mm, linux-fsdevel,
	linux-cxl, a.manzanares, viacheslav.dubeyko, ying.huang,
	nil-migration

Hi Dragan,

On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
> On 3/26/23 02:21, Mike Rapoport wrote:
> > Hi,
> > 
> > [..] >> One problem we experienced was occured in the combination of
> hot-remove and kerelspace allocation usecases.
> > > ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
> > > ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
> > > Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
> > > In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
> > > We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
> > > As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
> > > So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
> > 
> > This still does not describe what are the use cases that require having
> > kernel allocations on CXL.mem.
> > 
> > I believe it's important to start with explanation *why* it is important to
> > have kernel allocations on removable devices.
> 
> Hi Mike,
> 
> not speaking for Kyungsan here, but I am starting to tackle hypervisor
> clustering and VM migration over cxl.mem [1].
> 
> And in my mind, at least one reason that I can think of having kernel
> allocations from cxl.mem devices is where you have multiple VH connections
> sharing the memory [2]. Where for example you have a user space application
> stored in cxl.mem, and then you want the metadata about this
> process/application that the kernel keeps on one hypervisor be "passed on"
> to another hypervisor. So basically the same way processors in a single
> hypervisors cooperate on memory, you extend that across processors that span
> over physical hypervisors. If that makes sense...

Let me reiterate to make sure I understand your example.
If we focus on VM usecase, your suggestion is to store VM's memory and
associated KVM structures on a CXL.mem device shared by several nodes.  
Even putting aside the aspect of keeping KVM structures on presumably
slower memory, what ZONE_EXMEM will provide that cannot be accomplished
with having the cxl memory in a memoryless node and using that node to
allocate VM metadata?
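
To illustrate the alternative: with the CXL range onlined as its own node,
the existing node-aware allocators already let the kernel place such
metadata there. A minimal sketch (cxl_nid = 2 and the allocation size are
placeholders):

#include <linux/module.h>
#include <linux/slab.h>

static int cxl_nid = 2;     /* assumption: CXL memory onlined as node 2 */
static void *kvm_meta;

static int __init cxl_meta_demo_init(void)
{
        /* node-aware allocation: no new zone or GFP flag involved */
        kvm_meta = kzalloc_node(4096, GFP_KERNEL, cxl_nid);
        return kvm_meta ? 0 : -ENOMEM;
}

static void __exit cxl_meta_demo_exit(void)
{
        kfree(kvm_meta);
}

module_init(cxl_meta_demo_init);
module_exit(cxl_meta_demo_exit);
MODULE_LICENSE("GPL");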
 
> [1] A high-level explanation is at http://nil-migration.org
> [2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51, figure
> 1-4, black color scheme circle(3) and bars.
> 
> --
> Peace can only come as a natural consequence
> of universal enlightenment -Dr. Nikola Tesla
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-03  8:44                   ` Mike Rapoport
@ 2023-04-04  4:27                     ` Dragan Stancevic
  2023-04-04  6:47                       ` Huang, Ying
       [not found]                       ` <CGME20230405101840epcas2p4c92037ceba77dfe963d24791a9058450@epcas2p4.samsung.com>
  0 siblings, 2 replies; 66+ messages in thread
From: Dragan Stancevic @ 2023-04-04  4:27 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kyungsan Kim, dan.j.williams, lsf-pc, linux-mm, linux-fsdevel,
	linux-cxl, a.manzanares, viacheslav.dubeyko, ying.huang,
	nil-migration

Hi Mike,

On 4/3/23 03:44, Mike Rapoport wrote:
> Hi Dragan,
> 
> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>> On 3/26/23 02:21, Mike Rapoport wrote:
>>> Hi,
>>>
>>> [..] >> One problem we experienced was occured in the combination of
>> hot-remove and kerelspace allocation usecases.
>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>>
>>> This still does not describe what are the use cases that require having
>>> kernel allocations on CXL.mem.
>>>
>>> I believe it's important to start with explanation *why* it is important to
>>> have kernel allocations on removable devices.
>>
>> Hi Mike,
>>
>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>> clustering and VM migration over cxl.mem [1].
>>
>> And in my mind, at least one reason that I can think of having kernel
>> allocations from cxl.mem devices is where you have multiple VH connections
>> sharing the memory [2]. Where for example you have a user space application
>> stored in cxl.mem, and then you want the metadata about this
>> process/application that the kernel keeps on one hypervisor be "passed on"
>> to another hypervisor. So basically the same way processors in a single
>> hypervisors cooperate on memory, you extend that across processors that span
>> over physical hypervisors. If that makes sense...
> 
> Let me reiterate to make sure I understand your example.
> If we focus on VM usecase, your suggestion is to store VM's memory and
> associated KVM structures on a CXL.mem device shared by several nodes.

Yes correct. That is what I am exploring, two different approaches:

Approach 1: Use CXL.mem for VM migration between hypervisors. In this 
approach the VM and the metadata executes/resides on a traditional NUMA 
node (cpu+dram) and only uses CXL.mem to transition between hypervisors. 
It's not kept permanently there. So basically on hypervisor A you would 
do something along the lines of migrate_pages into cxl.mem and then on 
hypervisor B you would migrate_pages from cxl.mem and onto the regular 
NUMA node (cpu+dram).

Approach 2: Use CXL.mem to cluster hypervisors to improve high 
availability of VMs. In this approach the VM and metadata would be kept 
in CXL.mem permanently and each hypervisor accessing this shared memory 
could have the potential to schedule/run the VM if the other hypervisor 
experienced a failure.
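
For Approach 1, a minimal userspace sketch of the hand-off on the sending 
hypervisor (the node ids are assumptions; the VM is identified by its 
QEMU/KVM pid; build with -lnuma):

#include <numaif.h>      /* migrate_pages()                  */
#include <stdio.h>
#include <stdlib.h>

#define DRAM_NODE 0      /* assumption: local cpu+dram node  */
#define CXL_NODE  2      /* assumption: shared cxl.mem node  */

int main(int argc, char **argv)
{
        int pid = argc > 1 ? atoi(argv[1]) : 0;   /* VM (QEMU) pid */
        unsigned long from = 1UL << DRAM_NODE;
        unsigned long to   = 1UL << CXL_NODE;

        /* hypervisor A: push the VM's pages onto the shared CXL node;
         * hypervisor B would do the inverse (swap 'from' and 'to') */
        if (migrate_pages(pid, 8 * sizeof(from), &from, &to) < 0)
                perror("migrate_pages");
        return 0;
}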

> Even putting aside the aspect of keeping KVM structures on presumably
> slower memory, 

Totally agree, the presumption of memory speed is duly noted. As far as I am 
aware, CXL.mem at this point has higher latency than DRAM, and switched 
CXL.mem has an additional latency. That may or may not change in the 
future, but even with actual CXL induced latency I think there are 
benefits to the approaches.

In the example #1 above, I think even if you had a very noisy VM that is 
dirtying pages at a high rate, once migrate_pages has occurred, it 
wouldn't have to be quiesced for the migration to happen. A migration 
could basically occur in-between the CPU slices, once VCPU is done with 
it's slice on hypervisor A, the next slice could be on hypervisor B.

And the example #2 above, you are trading memory speed for 
high-availability. Where either hypervisor A or B could run the CPU load 
of the VM. You could even have a VM where some of the VCPUs are 
executing on hypervisor A and others on hypervisor B to be able to shift 
CPU load across hypervisors in quasi real-time.


> what ZONE_EXMEM will provide that cannot be accomplished
> with having the cxl memory in a memoryless node and using that node to
> allocate VM metadata?

It has crossed my mind to perhaps use NUMA node distance for the two 
approaches above. But I think that is not sufficient because we can have 
varying distance, and distance in itself doesn't indicate 
switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly 
speaking just for myself here, with the two approaches above, the 
crucial differentiator in order for #1 and #2 to work would be that 
switched/shared CXL.mem would have to be indicated as such in a way. 
Because switched memory would have to be treated and formatted in some 
kind of ABI way that would allow hypervisors to cooperate and follow 
certain protocols when using this memory.


I can't answer what ZONE_EXMEM will provide since we haven't seen 
Kyungsan's talk yet; that's why I myself was very curious to find out 
more about the ZONE_EXMEM proposal and whether it includes some provisions for 
CXL switched/shared memory.

To me, I don't think it makes a difference if pages are coming from 
ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was if 
I could allocate from or migrate_pages to (ZONE_EXMEM | type 
"SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's 
the typing. That's what I meant with my initial response but I guess it 
wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in my 
case, this is where you'd have kernel allocations on CXL.mem"


Sorry if it got long, hope that makes sense... :)


>   
>> [1] A high-level explanation is at http://nil-migration.org
>> [2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51, figure
>> 1-4, black color scheme circle(3) and bars.
>>



--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-04  4:27                     ` Dragan Stancevic
@ 2023-04-04  6:47                       ` Huang, Ying
  2023-04-06 22:27                         ` Dragan Stancevic
       [not found]                       ` <CGME20230405101840epcas2p4c92037ceba77dfe963d24791a9058450@epcas2p4.samsung.com>
  1 sibling, 1 reply; 66+ messages in thread
From: Huang, Ying @ 2023-04-04  6:47 UTC (permalink / raw)
  To: Dragan Stancevic
  Cc: Mike Rapoport, Kyungsan Kim, dan.j.williams, lsf-pc, linux-mm,
	linux-fsdevel, linux-cxl, a.manzanares, viacheslav.dubeyko,
	nil-migration

Dragan Stancevic <dragan@stancevic.com> writes:

> Hi Mike,
>
> On 4/3/23 03:44, Mike Rapoport wrote:
>> Hi Dragan,
>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>> Hi,
>>>>
>>>> [..] >> One problem we experienced was occured in the combination of
>>> hot-remove and kerelspace allocation usecases.
>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>>>
>>>> This still does not describe what are the use cases that require having
>>>> kernel allocations on CXL.mem.
>>>>
>>>> I believe it's important to start with explanation *why* it is important to
>>>> have kernel allocations on removable devices.
>>>
>>> Hi Mike,
>>>
>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>> clustering and VM migration over cxl.mem [1].
>>>
>>> And in my mind, at least one reason that I can think of having kernel
>>> allocations from cxl.mem devices is where you have multiple VH connections
>>> sharing the memory [2]. Where for example you have a user space application
>>> stored in cxl.mem, and then you want the metadata about this
>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>> to another hypervisor. So basically the same way processors in a single
>>> hypervisors cooperate on memory, you extend that across processors that span
>>> over physical hypervisors. If that makes sense...
>> Let me reiterate to make sure I understand your example.
>> If we focus on VM usecase, your suggestion is to store VM's memory and
>> associated KVM structures on a CXL.mem device shared by several nodes.
>
> Yes correct. That is what I am exploring, two different approaches:
>
> Approach 1: Use CXL.mem for VM migration between hypervisors. In this
> approach the VM and the metadata executes/resides on a traditional
> NUMA node (cpu+dram) and only uses CXL.mem to transition between
> hypervisors. It's not kept permanently there. So basically on
> hypervisor A you would do something along the lines of migrate_pages
> into cxl.mem and then on hypervisor B you would migrate_pages from
> cxl.mem and onto the regular NUMA node (cpu+dram).
>
> Approach 2: Use CXL.mem to cluster hypervisors to improve high
> availability of VMs. In this approach the VM and metadata would be
> kept in CXL.mem permanently and each hypervisor accessing this shared
> memory could have the potential to schedule/run the VM if the other
> hypervisor experienced a failure.
>
>> Even putting aside the aspect of keeping KVM structures on presumably
>> slower memory, 
>
> Totally agree, presumption of memory speed dully noted. As far as I am
> aware, CXL.mem at this point has higher latency than DRAM, and
> switched CXL.mem has an additional latency. That may or may not change
> in the future, but even with actual CXL induced latency I think there
> are benefits to the approaches.
>
> In the example #1 above, I think even if you had a very noisy VM that
> is dirtying pages at a high rate, once migrate_pages has occurred, it 
> wouldn't have to be quiesced for the migration to happen. A migration
> could basically occur in-between the CPU slices, once VCPU is done
> with it's slice on hypervisor A, the next slice could be on hypervisor
> B.
>
> And the example #2 above, you are trading memory speed for
> high-availability. Where either hypervisor A or B could run the CPU
> load of the VM. You could even have a VM where some of the VCPUs are 
> executing on hypervisor A and others on hypervisor B to be able to
> shift CPU load across hypervisors in quasi real-time.
>
>
>> what ZONE_EXMEM will provide that cannot be accomplished
>> with having the cxl memory in a memoryless node and using that node to
>> allocate VM metadata?
>
> It has crossed my mind to perhaps use NUMA node distance for the two
> approaches above. But I think that is not sufficient because we can
> have varying distance, and distance in itself doesn't indicate 
> switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly
> speaking just for myself here, with the two approaches above, the 
> crucial differentiator in order for #1 and #2 to work would be that
> switched/shared CXL.mem would have to be indicated as such in a way. 
> Because switched memory would have to be treated and formatted in some
> kind of ABI way that would allow hypervisors to cooperate and follow 
> certain protocols when using this memory.
>
>
> I can't answer what ZONE_EXMEM will provide since we haven's seen
> Kyungsan's talk yet, that's why I myself was very curious to find out 
> more about ZONE_EXMEM proposal and if it includes some provisions for
> CXL switched/shared memory.
>
> To me, I don't think it makes a difference if pages are coming from
> ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was
> if I could allocate from or migrate_pages to (ZONE_EXMEM | type 
> "SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's
> the typing. That's what I meant with my initial response but I guess
> it wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in
> my case, this is where you'd have kernel allocations on CXL.mem"
>

We have two choices here.

a) Put CXL.mem in a separate NUMA node, with an existing ZONE type
(normal or movable).  Then you can migrate pages there with
move_pages(2) or migrate_pages(2), or you can run your workload on the
CXL.mem with numactl (a minimal sketch follows below).

b) Put CXL.mem in an existing NUMA node, with a new ZONE type.  To
control your workloads in user space, you need a set of new ABIs.

Is there anything you cannot do with a)?
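
As a user-space sketch of a), treating the CXL node id (1) purely as an
assumption about the machine and using only existing interfaces (build
with -lnuma):

/*
 * Move another process's (or our own) pages from the DRAM node to the
 * CXL node using the existing migrate_pages(2) path via libnuma.
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	int pid = (argc > 1) ? atoi(argv[1]) : 0;	/* 0 == calling process */
	struct bitmask *from = numa_parse_nodestring("0");	/* DRAM node */
	struct bitmask *to   = numa_parse_nodestring("1");	/* CXL node (assumed) */

	if (numa_available() < 0 || !from || !to) {
		fprintf(stderr, "libnuma not usable on this system\n");
		return 1;
	}

	if (numa_migrate_pages(pid, from, to) < 0) {
		perror("numa_migrate_pages");
		return 1;
	}
	return 0;
}

Or, without writing any code, "numactl --membind=1 ./workload" runs the
workload out of that node directly.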

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-03-31 11:45                   ` RE: RE(2): " Kyungsan Kim
@ 2023-04-04  8:31                     ` Mike Rapoport
  2023-04-04 17:58                       ` Adam Manzanares
  0 siblings, 1 reply; 66+ messages in thread
From: Mike Rapoport @ 2023-04-04  8:31 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On Fri, Mar 31, 2023 at 08:45:25PM +0900, Kyungsan Kim wrote:
> Thank you Mike Rapoport for participating discussion and adding your thought.
> 
> >Hi,
> >
> >On Thu, Mar 23, 2023 at 07:51:05PM +0900, Kyungsan Kim wrote:
> >> I appreciate dan for the careful advice.
> >> 
> >> >Kyungsan Kim wrote:
> >> >[..]
> >> >> >In addition to CXL memory, we may have other kind of memory in the
> >> >> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
> >> >> >memory in GPU card, etc.  I guess that we need to consider them
> >> >> >together.  Do we need to add one zone type for each kind of memory?
> >> >> 
> >> >> We also don't think a new zone is needed for every single memory
> >> >> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
> >> >> manage multiple volatile memory devices due to the increased device
> >> >> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
> >> >> represent extended volatile memories that have different HW
> >> >> characteristics.
> >> >
> >> >Some advice for the LSF/MM discussion, the rationale will need to be
> >> >more than "we think the ZONE_EXMEM can be used to represent extended
> >> >volatile memories that have different HW characteristics". It needs to
> >> >be along the lines of "yes, to date Linux has been able to describe DDR
> >> >with NUMA effects, PMEM with high write overhead, and HBM with improved
> >> >bandwidth not necessarily latency, all without adding a new ZONE, but a
> >> >new ZONE is absolutely required now to enable use case FOO, or address
> >> >unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
> >> >maintainability concern of "fewer degress of freedom in the ZONE
> >> >dimension" starts to dominate.
> >> 
> >> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
> >> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
> >> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
> >> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
> >> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
> >> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
> >> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
> >> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
> >
> >This still does not describe what are the use cases that require having
> >kernel allocations on CXL.mem. 
> >
> >I believe it's important to start with explanation *why* it is important to
> >have kernel allocations on removable devices.
> > 
> 
> In general, a memory system with DDR/CXL DRAM will have near/far memory.
> And, we think kernel already includes memory tiering solutions - Meta TPP, zswap, and pagecache.
> Some kernel contexts would prefer fast memory. For example, a hot data with time locality or a data for fast processing such as metadata or indexing.
> Others would enough with slow memory. For example, a zswap page which is being used while swapping. 

The point of zswap, IIUC, is to have a small and fast swap device;
compression is required to better utilize DRAM capacity at the expense of
CPU time.

Presuming CXL memory will have larger capacity than DRAM, why not skip the
compression and use CXL as a swap device directly?

And even supposing there's an advantage in putting zswap on CXL memory,
why does that warrant a new zone rather than node-based APIs?

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: RE: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-04  8:31                     ` Mike Rapoport
@ 2023-04-04 17:58                       ` Adam Manzanares
  2023-04-01 10:51                         ` Gregory Price
  0 siblings, 1 reply; 66+ messages in thread
From: Adam Manzanares @ 2023-04-04 17:58 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Kyungsan Kim, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On Tue, Apr 04, 2023 at 11:31:08AM +0300, Mike Rapoport wrote:
> On Fri, Mar 31, 2023 at 08:45:25PM +0900, Kyungsan Kim wrote:
> > Thank you Mike Rapoport for participating discussion and adding your thought.
> > 
> > >Hi,
> > >
> > >On Thu, Mar 23, 2023 at 07:51:05PM +0900, Kyungsan Kim wrote:
> > >> I appreciate dan for the careful advice.
> > >> 
> > >> >Kyungsan Kim wrote:
> > >> >[..]
> > >> >> >In addition to CXL memory, we may have other kind of memory in the
> > >> >> >system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
> > >> >> >memory in GPU card, etc.  I guess that we need to consider them
> > >> >> >together.  Do we need to add one zone type for each kind of memory?
> > >> >> 
> > >> >> We also don't think a new zone is needed for every single memory
> > >> >> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
> > >> >> manage multiple volatile memory devices due to the increased device
> > >> >> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
> > >> >> represent extended volatile memories that have different HW
> > >> >> characteristics.
> > >> >
> > >> >Some advice for the LSF/MM discussion, the rationale will need to be
> > >> >more than "we think the ZONE_EXMEM can be used to represent extended
> > >> >volatile memories that have different HW characteristics". It needs to
> > >> >be along the lines of "yes, to date Linux has been able to describe DDR
> > >> >with NUMA effects, PMEM with high write overhead, and HBM with improved
> > >> >bandwidth not necessarily latency, all without adding a new ZONE, but a
> > >> >new ZONE is absolutely required now to enable use case FOO, or address
> > >> >unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
> > >> >maintainability concern of "fewer degress of freedom in the ZONE
> > >> >dimension" starts to dominate.
> > >> 
> > >> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
> > >> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
> > >> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
> > >> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
> > >> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
> > >> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
> > >> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
> > >> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
> > >
> > >This still does not describe what are the use cases that require having
> > >kernel allocations on CXL.mem. 
> > >
> > >I believe it's important to start with explanation *why* it is important to
> > >have kernel allocations on removable devices.
> > > 
> > 
> > In general, a memory system with DDR/CXL DRAM will have near/far memory.
> > And, we think kernel already includes memory tiering solutions - Meta TPP, zswap, and pagecache.
> > Some kernel contexts would prefer fast memory. For example, a hot data with time locality or a data for fast processing such as metadata or indexing.
> > Others would enough with slow memory. For example, a zswap page which is being used while swapping. 
> 
> The point of zswap IIUC is to have small and fast swap device and
> compression is required to better utilize DRAM capacity at expense of CPU
> time.
> 
> Presuming CXL memory will have larger capacity than DRAM, why not skip the
> compression and use CXL as a swap device directly?

I'd like to shy away from saying CXL memory should be used for swap. I see a 
swap device as storing pages in a manner that is no longer directly addressable
by the CPU.

Migrating pages to a CXL device is a reasonable approach, and I believe we
have the ability to do this in the page reclaim code.
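
As a concrete sketch of leaning on reclaim rather than swap (assuming a
kernel recent enough to expose the reclaim-time demotion knob, and a CXL
node already set up as a demotion target), turning it on is just a sysfs
write:

#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/kernel/mm/numa/demotion_enabled";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror(knob);	/* older kernel, or knob not built in */
		return 1;
	}
	/* With this enabled, reclaim demotes cold pages to the far node
	 * instead of swapping them out. */
	fputs("1\n", f);
	fclose(f);
	return 0;
}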

> 
> And even supposing there's an advantage in putting zswap on CXL memory,
> why that can be done with node-based APIs but warrants a new zone?
> 
> -- 
> Sincerely yours,
> Mike.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [External] RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-01 10:51                         ` Gregory Price
@ 2023-04-04 18:59                           ` Viacheslav A.Dubeyko
  2023-04-01 11:51                             ` Gregory Price
  0 siblings, 1 reply; 66+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-04-04 18:59 UTC (permalink / raw)
  To: Gregory Price
  Cc: Adam Manzanares, Mike Rapoport, Kyungsan Kim, lsf-pc, linux-mm,
	linux-fsdevel, linux-cxl, dan.j.williams, seungjun.ha, wj28.lee



> On Apr 1, 2023, at 3:51 AM, Gregory Price <gregory.price@memverge.com> wrote:
> 
> On Tue, Apr 04, 2023 at 05:58:05PM +0000, Adam Manzanares wrote:
>> On Tue, Apr 04, 2023 at 11:31:08AM +0300, Mike Rapoport wrote:
>>> 
>>> The point of zswap IIUC is to have small and fast swap device and
>>> compression is required to better utilize DRAM capacity at expense of CPU
>>> time.
>>> 
>>> Presuming CXL memory will have larger capacity than DRAM, why not skip the
>>> compression and use CXL as a swap device directly?
>> 
>> I like to shy away from saying CXL memory should be used for swap. I see a 
>> swap device as storing pages in a manner that is no longer directly addressable
>> by the cpu. 
>> 
>> Migrating pages to a CXL device is a reasonable approach and I believe we
>> have the ability to do this in the page reclaim code. 
>> 
> 
> The argument is "why do you need swap if memory itself is elastic", and
> I think there are open questions about how performant using large
> amounts of high-latency memory is.
> 
> Think 1us-1.5us+ cross-rack attached memory.
> 
> Does it make sense to use that as CPU-addressible and migrate it on
> first use?  Isn't that just swap with more steps?  What happens if we
> just use it as swap, is the performance all that different?
> 
> I think there's a reasonable argument for exploring the idea at the
> higher ends of the latency spectrum.  And the simplicity of using an
> existing system (swap) to implement a form of proto-tiering is rather
> attractive in my opinion.
> 

I think the problem with swap is that we need to take into account the additional
latency of the swap-in/swap-out logic, which I assume is expensive enough.
If we consider a huge graph, for example, I am afraid the swap-in/swap-out
overhead could be significant. So the question here is about the use case: which
use case would benefit from employing swap as a big space of high-latency memory?
I see your point that such a swap could be faster than persistent storage, but
which use case would be a happy user of this space of high-latency memory?

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [External] RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-01 11:51                             ` Gregory Price
@ 2023-04-04 21:09                               ` Viacheslav A.Dubeyko
       [not found]                               ` <642cb7ec58c71_21a829453@dwillia2-xfh.jf.intel.com.notmuch>
       [not found]                               ` <CGME20230405101843epcas2p2c819c8d60b2a9a776124c2b4bc25af14@epcas2p2.samsung.com>
  2 siblings, 0 replies; 66+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-04-04 21:09 UTC (permalink / raw)
  To: Gregory Price
  Cc: Adam Manzanares, Mike Rapoport, Kyungsan Kim, lsf-pc, linux-mm,
	linux-fsdevel, linux-cxl, dan.j.williams, seungjun.ha, wj28.lee



> On Apr 1, 2023, at 4:51 AM, Gregory Price <gregory.price@memverge.com> wrote:
> 
> On Tue, Apr 04, 2023 at 11:59:22AM -0700, Viacheslav A.Dubeyko wrote:
>> 
>> 
>>> On Apr 1, 2023, at 3:51 AM, Gregory Price <gregory.price@memverge.com> wrote:
>>> 
>>> On Tue, Apr 04, 2023 at 05:58:05PM +0000, Adam Manzanares wrote:
>>>> On Tue, Apr 04, 2023 at 11:31:08AM +0300, Mike Rapoport wrote:
>>>>> 
>>>>> The point of zswap IIUC is to have small and fast swap device and
>>>>> compression is required to better utilize DRAM capacity at expense of CPU
>>>>> time.
>>>>> 
>>>>> Presuming CXL memory will have larger capacity than DRAM, why not skip the
>>>>> compression and use CXL as a swap device directly?
>>>> 
>>>> I like to shy away from saying CXL memory should be used for swap. I see a 
>>>> swap device as storing pages in a manner that is no longer directly addressable
>>>> by the cpu. 
>>>> 
>>>> Migrating pages to a CXL device is a reasonable approach and I believe we
>>>> have the ability to do this in the page reclaim code. 
>>>> 
>>> 
>>> The argument is "why do you need swap if memory itself is elastic", and
>>> I think there are open questions about how performant using large
>>> amounts of high-latency memory is.
>>> 
>>> Think 1us-1.5us+ cross-rack attached memory.
>>> 
>>> Does it make sense to use that as CPU-addressible and migrate it on
>>> first use?  Isn't that just swap with more steps?  What happens if we
>>> just use it as swap, is the performance all that different?
>>> 
>>> I think there's a reasonable argument for exploring the idea at the
>>> higher ends of the latency spectrum.  And the simplicity of using an
>>> existing system (swap) to implement a form of proto-tiering is rather
>>> attractive in my opinion.
>>> 
>> 
>> I think the problem with swap that we need to take into account the additional
>> latency of swap-in/swap-out logic. I assume that this logic is expensive enough.
>> And if we considering the huge graph, for example, I am afraid the swap-in/swap-out
>> logic could be expensive. So, the question here is about use-case. Which use-case could
>> have benefits to employ the swap as a big space of high-latency memory? I see your point
>> that such swap could be faster than persistent storage. But which use-case can be happy
>> user of this space of high-latency memory?
>> 
>> Thanks,
>> Slava.
>> 
> 
> Just spitballing here - to me this problem is two fold:
> 
> I think the tiering use case and the swap use case are exactly the same.
> If tiering is sufficiently valuable, there exists a spectrum of compute
> density (cpu:dram:cxl:far-cxl) where simply using far-cxl as fast-swap
> becomes easier and less expensive than a complex tiering system.
> 
> So rather than a single use-case question, it reads like a tiering
> question to me:
> 
> 1) Where on the 1us-20us (far cxl : nvme) spectrum does it make sense to
>   switch from a swap mechanism to simply byte-addressable memory?
>   There's a point, somewhere, where promote on first access (effectively
>   swap) is the same performance as active tiering (for a given workload).
> 
>   If that point is under 2us, there's a good chance that a high-latency
>   CXL swap-system would be a major win for any workload on any cloud-based
>   system.  It's simple, clean, and reclaim doesn't have to worry about the
>   complexities of hotpluggable memory zones.
> 
> 
> Beyond that, to your point, what use-case is happy with this class of
> memory, and in what form?
> 
> 2) This is likely obscurred by the fact that many large-memory
>   applications avoid swap like the plague by sharding data and creating
>   clusters. So it's hard to answer this until it's tested, and you
>   can't test it unless you make it... woo!
> 
>   Bit of a chicken/egg in here.  I don't know that anyone can say
>   definitively what workload can make use of it, but that doesn't mean
>   there isn't one.  So in the spectrum of risk/reward, at least
>   enabling some simple mechanism for the sake of exploration feels
>   exciting to say the least.
> 
> 
> More generally, I think a cxl-swap (cswap? ;V) would be useful exactly to
> help identify when watch-and-wait tiering becomes more performant than
> promote-on-first-use.  If you can't beat a simple fast-swap, why bother?
> 
> Again, I think this is narrowly applicable to high-latency CXL. My gut
> tells me that anything under 1us is better used in a byte-addressable
> manner, but once you start hitting 1us "It makes me go hmmm..."
> 
> I concede this is largely conjecture until someone tests it out, but
> certainly a fun thing to discess.
> 

OK, I am buying your point. :) But first I need to allocate memory.
The really important point of CXL memory is the opportunity to extend
the memory space, and swap is not addressable memory, so it is useless
for memory space extension. Let's imagine I have a small local DRAM (and
maybe some amount of "fast" CXL) plus huge far CXL as swap space. I cannot
use the swap space for allocation, so this swap looks like useless space.
First, I need to extend my memory by means of "fast" CXL, and if I have
enough "fast" CXL, then I don't need the far CXL memory. OK, there is never
enough memory, but what we are hungry for is addressable memory.

A large-memory application would like to see the whole data set in memory,
which means the data set needs to be addressable. Technically speaking,
it is possible to imagine that the data set can be partially in swap,
but the first step is memory allocation and prefetching data from persistent
memory. And, as far as I can imagine, the memory allocator will be limited by
addressable memory, so I cannot have the whole data set in memory because
the memory allocator stops me.

Thanks,
Slava.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                         ` <CGME20230405020027epcas2p4682d43446a493385b60c39a1dbbf07d6@epcas2p4.samsung.com>
@ 2023-04-05  2:00                           ` Kyungsan Kim
  2023-04-05  4:48                             ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05  2:00 UTC (permalink / raw)
  To: willy
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On Fri, Mar 31, 2023 at 08:37:15PM +0900, Kyungsan Kim wrote:
>> >> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>> >
>> >This sounds dangerously confused.  Do you want the EXMEM to be removable
>> >or not?  If you do, then allocations from it have to be movable.  If
>> >you don't, why go to all this trouble?
>> 
>> I'm sorry to make you confused. We will try more to clearly explain our thought.
>> We think the CXL DRAM device should be removable along with HW pluggable nature.
>> For MM point of view, we think a page of CXL DRAM can be both movable and unmovable. 
>> An user or kernel context should be able to determine it. Thus, we think dedication on the ZONE_NORMAL or the ZONE_MOVABLE is not enough.
>
>No, this is not the right approach.  If CXL is to be hot-pluggable,
>then all CXL allocations must be movable.  If even one allocation on a
>device is not movable, then the device cannot be removed.  ZONE_EXMEM
>feels like a solution in search of a problem

We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
a random allocation of a kernel object through kmalloc() and its siblings makes the entire CXL DRAM unremovable.
Also, not all kernel objects can be allocated from ZONE_MOVABLE.

ZONE_EXMEM does not fix a movability attribute (movable or unmovable); rather, it lets the calling context decide it.
In that aspect it is the same as ZONE_NORMAL, but ZONE_EXMEM works for extended memory devices.
It does not mean that ZONE_EXMEM supports both movability and kernel object allocation at the same time.
In case multiple CXL DRAM channels are connected, we think a memory consumer could dedicate a channel to a movable or unmovable purpose.
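
To make the intended calling convention concrete, here is a rough
illustration only; GFP_EXMEM and GFP_MOVABLE below are identifiers from
our proposal, not mainline Linux, so this does not build against an
upstream tree:

#include <linux/gfp.h>
#include <linux/mm_types.h>

/* Unmovable: kernel metadata pinned on a CXL channel dedicated to it. */
static struct page *exmem_alloc_unmovable(unsigned int order)
{
	return alloc_pages(GFP_EXMEM, order);
}

/*
 * Movable: pages that can be migrated away before the channel is
 * hot-removed.
 */
static struct page *exmem_alloc_movable(unsigned int order)
{
	return alloc_pages(GFP_EXMEM | GFP_MOVABLE, order);
}

The point is that the caller, not the zone, chooses the movability of
each allocation.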


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                                           ` <CGME20230405020121epcas2p2d9d39c151b6c5ab9e568ab9e2ab826ce@epcas2p2.samsung.com>
@ 2023-04-05  2:01                                             ` Kyungsan Kim
  2023-04-05  3:11                                               ` Matthew Wilcox
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05  2:01 UTC (permalink / raw)
  To: willy
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
>> Given our experiences/design and industry's viewpoints/inquiries,
>> I will prepare a few slides in the session to explain 
>>   1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
>>   2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
>>   3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
>
>I think you'll find everybody else in the room understands these issues
>rather better than you do.  This is hardly the first time that we've
>talked about CXL, and CXL is not the first time that people have
>proposed disaggregated memory, nor heterogenous latency/bandwidth
>systems.  All the previous attempts have failed, and I expect this
>one to fail too.  Maybe there's something novel that means this time
>it really will work, so any slides you do should focus on that.
>
>A more profitable discussion might be:
>
>1. Should we have the page allocator return pages from CXL or should
>   CXL memory be allocated another way?
I think yes. Using CXL DRAM through the System RAM interface would be the primary use case in real-world applications with regard to compatibility.
So, for the System RAM interface, we think it should be managed by the Linux MM subsystem (node - zonelist - buddy page allocator).

>2. Should there be a way for userspace to indicate that it prefers CXL
>   memory when it calls mmap(), or should it always be at the discretion
>   of the kernel?
I think yes. Both implicit and explicit ways are meaningful to users, for different purposes.
The dynamic performance variation of CXL DRAM is likely bigger than for other memory types due to topology expansion and link negotiation.
I think that strengthens the need.


>3. Do we continue with the current ZONE_DEVICE model, or do we come up
>   with something new?
In fact, ZONE_DEVICE was our first candidate for CXL DRAM.
But because ZONE_DEVICE is not managed by the buddy allocator, we thought it does not fit the System RAM interface.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: RE: RE(4): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                             ` <CGME20230405020257epcas2p11b253f8c97a353890b96e6ae6eb515d3@epcas2p1.samsung.com>
@ 2023-04-05  2:02                               ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05  2:02 UTC (permalink / raw)
  To: gregory.price
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On Fri, Mar 31, 2023 at 08:34:17PM +0900, Kyungsan Kim wrote:
>> Hi Gregory Price. 
>> Thank you for joining this topic and share your viewpoint.
>> I'm sorry for late reply due to some major tasks of our team this week.
>> 
>> >On Fri, Mar 24, 2023 at 05:48:08PM +0900, Kyungsan Kim wrote:
>> >> 
>> >> Indeed, we tried the approach. It was able to allocate a kernel context from ZONE_MOVABLE using GFP_MOVABLE.
>> >> However, we think it would be a bad practice for the 2 reasons.
>> >> 1. It causes oops and system hang occasionally due to kernel page migration while swap or compaction. 
>> >> 2. Literally, the design intention of ZONE_MOVABLE is to a page movable. So, we thought allocating a kernel context from the zone hurts the intention.
>> >> 
>> >> Allocating a kernel context out of ZONE_EXMEM is unmovable.
>> >>   a kernel context -  alloc_pages(GFP_EXMEM,)
>> >
>> >What is the specific use case of this?  If the answer is flexibility in
>> >low-memory situations, why wouldn't the kernel simply change to free up
>> >ZONE_NORMAL (swapping user memory, migrating user memory, etc) and
>> >allocate as needed?
>> >
>> >I could see allocating kernel memory from local memory expanders
>> >(directly attached to local CXL port), but I can't think of a case where
>> >it would be preferable for kernel resources to live on remote memory.
>> 
>> We have thought kernelspace memory tiering cases.
>> What memory tiering we assumes is to locate a hot data in fast memory and a cold data in slow memory.
>> We think zswap, pagecache, and Meta TPP(page promotion/demotion among nodes) as the kernelspace memory tiering cases.
>>
>
>So, to clarify, when you say "kernel space memory tiering cases", do you
>mean "to support a kernel-space controlled memory tiering service" or do
>you mean "tiering of kernel memory"?

Actually, both.
Borrowing your expression :), we mean a "kernel-space controlled memory tiering service that tiers kernel memory".
For example, during zswap operation (a kernel-space memory tiering case) in the vanilla kernel,
when a user page from CXL DRAM is swapped out, the zbud allocator of zswap can allocate the zswap page from DDR DRAM (tiering of kernel memory).
We think that is odd, because the swapped-out page effectively gets promoted from CXL DRAM (far memory) to DDR DRAM (near memory).

>Because if it's the former, rather than a new zone, it seems like a
>better proposal would be to extend the numa system to add additional
>"cost/feature" attributes, rather than modifying the zone of the memory
>blocks backing the node.
>
>Note that memory zones can apply to individual blocks within a node, and
>not the entire node uniformly.  So when making tiering decisions, it
>seems more expedient to investigate a node rather than a block.
>
>
>> >Since local memory expanders are static devices, there shouldn't be a
>> >great need for hotplug, which means the memory could be mapped
>> >ZONE_NORMAL without issue.
>> >
>> 
>> IMHO, we think hot-add/remove is one of the key feature of CXL due to the composability aspect.
>> Right now, CXL device and system connection is limited. 
>> But industry is preparing a CXL capable system that allows more than 10 CXL channels at backplane, pluggable with EDSFF. 
>> Not only that, along with the progress of CXL topology - from direct-attached to switch, multi-level switch, and fabric connection -
>> I think the hot-add/remove usecase would become more important.
>> 
>> 
>
>Hot add/remove is somewhat fairly represented by ZONE_MOVABLE. What's I
>think confusing many people is that creating a new zone that's intended
>to be hot-pluggable *and* usable by kernel for kernel-resources/memory
>are presently exclusive operations.
>
>The underlying question is what situation is being hit in which kernel
>memory wants to be located in ZONE_MOVABLE/ZONE_EXMEM that cannot simply
>be serviced by demoting other, movable memory to these regions.
>
>The concept being that kernel allocations are a higher-priority
>allocation than userland, and as such should have priority in DRAM.
>
>For example - there is at least one paper that examined the cost of
>placing page tables on CXL Memory Expansion (on the local CXL complex,
>not remote) and found the cost is significant.  Page tables are likely
>the single largest allocation the kernel will make to service large
>memory structures, so the answer to this problem is not necessarily to
>place that memory in CXL as well, but to use larger page sizes (which is
>less wasteful as memory usage is high and memory is abundant).
>
>I just don't understand what kernel resources would meet the following
>attributes:
>
>1) Do not have major system performance impacts in high-latency memory
>2) Are sufficiently large to warrant tiering
>and
>3) Are capable of being moved (i.e. no pinned areas, no dma areas, etc)
>

I agree that all levels of the page table should be on near memory.
In general, data that needs to be handled quickly, such as indexing, prefers near memory.
Candidates for far memory would be data that is less user-interactive and latency-sensitive.
Basically, our approach takes the memory provider's stance, not the memory consumer's stance.

>> >> Allocating a user context out of ZONE_EXMEM is movable.
>> >>   a user context - mmap(,,MAP_EXMEM,) - syscall - alloc_pages(GFP_EXMEM | GFP_MOVABLE,)
>> >> This is how ZONE_EXMEM supports the two cases.
>> >> 
>
>So if MAP_EXMEM is not used, EXMEM would not be used?
>
>That seems counter intuitive.  If an allocation via mmap would be
>eligible for ZONE_MOVABLE, why wouldn't it be eligible for ZONE_EXMEM?
>
>I believe this is another reason why some folks are confused what the
>distinction between MOVABLE and EXMEM are.  They seem to ultimately
>reduce to whether the memory can be moved.

Not really. We intend that EXMEM can be used both implicitly and explicitly.
Please refer to the answer below.

>
>> >
>> >Is it intended for a user to explicitly request MAP_EXMEM for it to get
>> >used at all?  As in, if i simply mmap() without MAP_EXMEM, will it
>> >remain unutilized?
>> 
>> Our intention is to allow below 3 cases
>> 1. Explicit DDR allocation - mmap(,,MAP_NORMAL,)
>>  : allocation from ZONE_NORMAL or ZONE_MOVABLE, or allocation fails.
>> 2. Explicit CXL allocation - mmap(,,MAP_EXMEM,)
>>  : allocation from ZONE_EXMEM, of allocation fails.
>> 3. Implicit Memory allocation - mmap(,,,) 
>>  : allocation from ZONE_NORMAL, ZONE_MOVABLE, or ZONE_EXMEM. In other words, no matter where DDR or CXL DRAM.
>> 
>> Among that, 3 is similar with vanilla kernel operation in that the allocation request traverses among multiple zones or nodes.
>> We think it would be good or bad for the mmap caller point of view.
>> It is good because memory is allocated, while it could be bad because the caller does not have idea of allocated memory type.
>> The later would hurt QoS metrics or userspace memory tiering operation, which expects near/far memory.
>> 
>
>For what it's worth, mmap is not the correct api for userland to provide
>kernel hints on data placement.  That would be madvise and friends.

Yes, our key intention is to provide a hint to userland:
not only via mmap(), but also via mbind(), set_mempolicy(), madvise(), etc.
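
As a minimal sketch of the hinting that is already possible with vanilla
interfaces (the CXL node id 1 and the sizes are assumptions, MADV_COLD
needs Linux 5.4+, and none of our proposed extensions are used here;
build with -lnuma):

#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB region */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL << 1;	/* node 1 assumed to be CXL DRAM */

	if (buf == MAP_FAILED)
		return 1;

	/* Prefer the CXL node for this range, moving existing pages. */
	if (mbind(buf, len, MPOL_PREFERRED, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_MOVE))
		perror("mbind");

	/* Hint that the range is cold, i.e. a demotion candidate. */
	if (madvise(buf, len, MADV_COLD))
		perror("madvise(MADV_COLD)");

	return 0;
}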

>
>But further, allocation of memory from userland must be ok with having
>its memory moved/swapped/whatever unless additional assistance from the
>kernel is provided (page pinning, mlock, whatever) to ensure it will
>not be moved.  Presumably this is done to ensure the kernel can make
>runtime adjustments to protect itself from being denied memory and
>causing instability and/or full system faults.

Yes. In the case of implicit allocation, our proposal is fully compatible with vanilla Linux MM.
Our thought is to provide both explicit and implicit ways.

>
>
>I think you need to clarify your intents for this zone, in particular
>your intent for exactly what data can and cannot live in this zone and
>the reasons for this.  "To assist kernel tiering operations" is very
>vague and not a description of what memory is and is not allowed in the
>zone.

We don't confine what data can live in ZONE_EXMEM.
Our intention is to allow both movable and unmovable allocations from kernel and user contexts,
with the allocation context able to determine the movability.
In other words, ZONE_EXMEM is not intended to confine a use case, but to provide the means to carry out a use case on CXL DRAM.

>
>~Gregory

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                                             ` <CGME20230405020631epcas2p1c85058b28a70bbd46d587e78a9c9c7ad@epcas2p1.samsung.com>
@ 2023-04-05  2:06                                               ` Kyungsan Kim
  2023-04-05  5:00                                                 ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05  2:06 UTC (permalink / raw)
  To: fvdl
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Hi Frank,
Thank you for your interest in this topic and for sharing your opinion.

>On Fri, Mar 31, 2023 at 6:42 AM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
>> > Given our experiences/design and industry's viewpoints/inquiries,
>> > I will prepare a few slides in the session to explain
>> >   1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
>> >   2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
>> >   3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
>>
>> I think you'll find everybody else in the room understands these issues
>> rather better than you do.  This is hardly the first time that we've
>> talked about CXL, and CXL is not the first time that people have
>> proposed disaggregated memory, nor heterogenous latency/bandwidth
>> systems.  All the previous attempts have failed, and I expect this
>> one to fail too.  Maybe there's something novel that means this time
>> it really will work, so any slides you do should focus on that.
>>
>> A more profitable discussion might be:
>>
>> 1. Should we have the page allocator return pages from CXL or should
>>    CXL memory be allocated another way?
>> 2. Should there be a way for userspace to indicate that it prefers CXL
>>    memory when it calls mmap(), or should it always be at the discretion
>>    of the kernel?
>> 3. Do we continue with the current ZONE_DEVICE model, or do we come up
>>    with something new?
>>
>>
>
>Point 2 is what I proposed talking about here:
>https://lore.kernel.org/linux-mm/a80a4d4b-25aa-a38a-884f-9f119c03a1da@google.com/T/
>
>With the current cxl-as-numa-node model, an application can express a
>preference through mbind(). But that also means that mempolicy and
>madvise (e.g. MADV_COLD) are starting to overlap if the intention is
>to use cxl as a second tier for colder memory.  Are these the right
>abstractions? Might it be more flexible to attach properties to memory
>ranges, and have applications hint which properties they prefer?

We also think more userspace hints would be meaningful for the diverse purposes of applications.
The specific interfaces still need to be discussed, though.

FYI, we in fact expanded mbind() and set_mempolicy() as well, to explicitly bind DDR/CXL (a rough sketch follows below):
  - mbind(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM)
  - set_mempolicy(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM)
madvise() is also a candidate for expressing tiering intention.
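
For illustration, the calling convention we have in mind is roughly the
following. MPOL_F_ZONE_EXMEM and MPOL_F_ZONE_NOEXMEM are our proposed
flags, not mainline, so the numeric values below are placeholders only to
keep the sketch self-contained:

#include <numaif.h>
#include <sys/mman.h>

#ifndef MPOL_F_ZONE_EXMEM
#define MPOL_F_ZONE_EXMEM	(1 << 14)	/* placeholder value */
#endif
#ifndef MPOL_F_ZONE_NOEXMEM
#define MPOL_F_ZONE_NOEXMEM	(1 << 15)	/* placeholder value */
#endif

int main(void)
{
	size_t len = 64UL << 20;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL;	/* node 0; the zone flag picks DDR vs CXL */

	if (buf == MAP_FAILED)
		return 1;

	/* Explicitly place this range on CXL (EXMEM) capacity ... */
	mbind(buf, len, MPOL_BIND | MPOL_F_ZONE_EXMEM,
	      &nodemask, sizeof(nodemask) * 8, 0);

	/* ... or explicitly keep subsequent allocations out of EXMEM. */
	set_mempolicy(MPOL_BIND | MPOL_F_ZONE_NOEXMEM,
		      &nodemask, sizeof(nodemask) * 8);
	return 0;
}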

>
>It's an interesting discussion, and I hope it'll be touched on at
>LSF/MM, happy to participate there.
>
>- Frank

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                                           ` <CGME20230405020916epcas2p24cf04f5354c12632eba50b64b217e403@epcas2p2.samsung.com>
@ 2023-04-05  2:09                                             ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05  2:09 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On 31.03.23 13:42, Kyungsan Kim wrote:
>>> On 24.03.23 14:08, Jørgen Hansen wrote:
>>>>
>>>>> On 24 Mar 2023, at 10.50, Kyungsan Kim <ks0204.kim@samsung.com> wrote:
>>>>>
>>>>>> On 24.03.23 10:27, Kyungsan Kim wrote:
>>>>>>>> On 24.03.23 10:09, Kyungsan Kim wrote:
>>>>>>>>> Thank you David Hinderbrand for your interest on this topic.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Kyungsan Kim wrote:
>>>>>>>>>>>> [..]
>>>>>>>>>>>>>> In addition to CXL memory, we may have other kind of memory in the
>>>>>>>>>>>>>> system, for example, HBM (High Bandwidth Memory), memory in FPGA card,
>>>>>>>>>>>>>> memory in GPU card, etc.  I guess that we need to consider them
>>>>>>>>>>>>>> together.  Do we need to add one zone type for each kind of memory?
>>>>>>>>>>>>>
>>>>>>>>>>>>> We also don't think a new zone is needed for every single memory
>>>>>>>>>>>>> device.  Our viewpoint is the sole ZONE_NORMAL becomes not enough to
>>>>>>>>>>>>> manage multiple volatile memory devices due to the increased device
>>>>>>>>>>>>> types.  Including CXL DRAM, we think the ZONE_EXMEM can be used to
>>>>>>>>>>>>> represent extended volatile memories that have different HW
>>>>>>>>>>>>> characteristics.
>>>>>>>>>>>>
>>>>>>>>>>>> Some advice for the LSF/MM discussion, the rationale will need to be
>>>>>>>>>>>> more than "we think the ZONE_EXMEM can be used to represent extended
>>>>>>>>>>>> volatile memories that have different HW characteristics". It needs to
>>>>>>>>>>>> be along the lines of "yes, to date Linux has been able to describe DDR
>>>>>>>>>>>> with NUMA effects, PMEM with high write overhead, and HBM with improved
>>>>>>>>>>>> bandwidth not necessarily latency, all without adding a new ZONE, but a
>>>>>>>>>>>> new ZONE is absolutely required now to enable use case FOO, or address
>>>>>>>>>>>> unfixable NUMA problem BAR." Without FOO and BAR to discuss the code
>>>>>>>>>>>> maintainability concern of "fewer degress of freedom in the ZONE
>>>>>>>>>>>> dimension" starts to dominate.
>>>>>>>>>>>
>>>>>>>>>>> One problem we experienced was occured in the combination of hot-remove and kerelspace allocation usecases.
>>>>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>>>>
>>>>>>>>>> That sounds like a bad hack :) .
>>>>>>>>> I consent you.
>>>>>>>>>
>>>>>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>>>>>
>>>>>>>>>> I once raised the idea of a ZONE_PREFER_MOVABLE [1], maybe that's
>>>>>>>>>> similar to what you have in mind here. In general, adding new zones is
>>>>>>>>>> frowned upon.
>>>>>>>>>
>>>>>>>>> Actually, we have already studied your idea and thought it is similar with us in 2 aspects.
>>>>>>>>> 1. ZONE_PREFER_MOVABLE allows a kernelspace allocation using a new zone
>>>>>>>>> 2. ZONE_PREFER_MOVABLE helps less fragmentation by splitting zones, and ordering allocation requests from the zones.
>>>>>>>>>
>>>>>>>>> We think ZONE_EXMEM also helps less fragmentation.
>>>>>>>>> Because it is a separated zone and handles a page allocation as movable by default.
>>>>>>>>
>>>>>>>> So how is it different that it would justify a different (more confusing
>>>>>>>> IMHO) name? :) Of course, names don't matter that much, but I'd be
>>>>>>>> interested in which other aspect that zone would be "special".
>>>>>>>
>>>>>>> FYI for the first time I named it as ZONE_CXLMEM, but we thought it would be needed to cover other extended memory types as well.
>>>>>>> So I changed it as ZONE_EXMEM.
>>>>>>> We also would like to point out a "special" zone aspeact, which is different from ZONE_NORMAL for tranditional DDR DRAM.
>>>>>>> Of course, a symbol naming is important more or less to represent it very nicely, though.
>>>>>>> Do you prefer ZONE_SPECIAL? :)
>>>>>>
>>>>>> I called it ZONE_PREFER_MOVABLE. If you studied that approach there must
>>>>>> be a good reason to name it differently?
>>>>>>
>>>>>
>>>>> The intention of ZONE_EXMEM is a separated logical management dimension originated from the HW diffrences of extended memory devices.
>>>>> Althought the ZONE_EXMEM considers the movable and frementation aspect, it is not all what ZONE_EXMEM considers.
>>>>> So it is named as it.
>>>>
>>>> Given that CXL memory devices can potentially cover a wide range of technologies with quite different latency and bandwidth metrics, will one zone serve as the management vehicle that you seek? If a system contains both CXL attached DRAM and, let say, a byte-addressable CXL SSD - both used as (different) byte addressable tiers in a tiered memory hierarchy, allocating memory from the ZONE_EXMEM doesn’t really tell you much about what you get. So the client would still need an orthogonal method to characterize the desired performance characteristics. This method could be combined with a fabric independent zone such as ZONE_PREFER_MOVABLE to address the kernel allocation issue. At the same time, this new zone could also be useful in other cases, such as virtio-mem.
>>>
>>> Yes. I still did not get a satisfying answer to my original question:
>>> what would be the differences between both zones from a MM point of
>>> view? We can discuss that in the session, of course.
>>>
>>> Regarding performance differences, I thought the idea was to go with
>>> different nodes to express (and model) such.
>>>
>> 
>>  From a MM point of view on the movability aspect, a kernel context is not allocated from ZONE_EXMEM without using GFP_EXMEM explicitly.
>> In contrast, if we understand the design of ZONE_PREFER_MOVABLE correctly, a kernel context can be allocated from ZONE_PREFER_MOVABLE implicitly as the fallback of ZONE_NORMAL allocation.
>> However, the movable attribute is not all we are concerning.
>> In addition, we experienced page allocation and migration issue on the heterogeneous memories.
>> 
>> Given our experiences/design and industry's viewpoints/inquiries,
>> I will prepare a few slides in the session to explain
>>    1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
>>    2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
>>    3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
>
>Yes, especially a motivation for GFP_EXMEM and ZONE_EXMEM would be 
>great. New GFP flags and zone are very likely a lot of upstream 
>pushback. So we need a clear motivation and discussion of alternatives 
>(and why this memory has to be treated so special but still wants to be 
>managed by the buddy).
>
>Willy raises some very good points.
>

Please find the slides in preparation[1].
To help with clarity, we included the SW blocks and interactions of the proposal.

[1] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL

>-- 
>Thanks,
>
>David / dhildenb

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                                               ` <CGME20230405021655epcas2p2364b1f56dcde629bbd05bc796c2896aa@epcas2p2.samsung.com>
@ 2023-04-05  2:16                                                 ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05  2:16 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On 31.03.23 17:56, Frank van der Linden wrote:
>> On Fri, Mar 31, 2023 at 6:42 AM Matthew Wilcox <willy@infradead.org> wrote:
>>>
>>> On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
>>>> Given our experiences/design and industry's viewpoints/inquiries,
>>>> I will prepare a few slides in the session to explain
>>>>    1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
>>>>    2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
>>>>    3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
>>>
>>> I think you'll find everybody else in the room understands these issues
>>> rather better than you do.  This is hardly the first time that we've
>>> talked about CXL, and CXL is not the first time that people have
>>> proposed disaggregated memory, nor heterogenous latency/bandwidth
>>> systems.  All the previous attempts have failed, and I expect this
>>> one to fail too.  Maybe there's something novel that means this time
>>> it really will work, so any slides you do should focus on that.
>>>
>>> A more profitable discussion might be:
>>>
>>> 1. Should we have the page allocator return pages from CXL or should
>>>     CXL memory be allocated another way?
>>> 2. Should there be a way for userspace to indicate that it prefers CXL
>>>     memory when it calls mmap(), or should it always be at the discretion
>>>     of the kernel?
>>> 3. Do we continue with the current ZONE_DEVICE model, or do we come up
>>>     with something new?
>>>
>>>
>> 
>> Point 2 is what I proposed talking about here:
>> https://lore.kernel.org/linux-mm/a80a4d4b-25aa-a38a-884f-9f119c03a1da@google.com/T/
>> 
>> With the current cxl-as-numa-node model, an application can express a
>> preference through mbind(). But that also means that mempolicy and
>> madvise (e.g. MADV_COLD) are starting to overlap if the intention is
>> to use cxl as a second tier for colder memory.  Are these the right
>> abstractions? Might it be more flexible to attach properties to memory
>> ranges, and have applications hint which properties they prefer?
>
>I think history told us that the discussions always go like "but user 
>space wants more control, let's give user space all the power", and a 
>couple of months later we get "but we cannot possibly enlighten all 
>applications, and user space does not have sufficient information: we 
>need the kernel to handle this transparently."
>
>It seems to be a steady back and forth. Most probably we want something 
>in between: cxl-as-numa-node model is already a pretty good and 
>simplistic abstractions. Avoid too many new special user-space knobs is 
>most probably the way to go.
>
>Interesting discussion, I agree. And we had plenty of similar ones 
>already with PMEM and NUMA in general.
>

Haha, funny sentences. IMHO the two kinds of contradictory needs exist all the time in the real world.
Based on my experience, some userland consumers prefer transparent use, while others are eager for a chance to optimize.
I would also put higher priority on the transparent side, though.
From the point of view of Linux as a general-purpose OS, I believe it has also been a common approach for Linux to support a basic operation, and then provide tunables through APIs or configuration to support as wide a variety of needs as possible.

>-- 
>Thanks,
>
>David / dhildenb

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [External] RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                               ` <642cb7ec58c71_21a829453@dwillia2-xfh.jf.intel.com.notmuch>
@ 2023-04-05  2:34                                 ` Gregory Price
  0 siblings, 0 replies; 66+ messages in thread
From: Gregory Price @ 2023-04-05  2:34 UTC (permalink / raw)
  To: Dan Williams
  Cc: Viacheslav A.Dubeyko, Adam Manzanares, Mike Rapoport,
	Kyungsan Kim, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	seungjun.ha, wj28.lee

On Tue, Apr 04, 2023 at 04:51:08PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> [..]
> > More generally, I think a cxl-swap (cswap? ;V) would be useful exactly to
> > help identify when watch-and-wait tiering becomes more performant than
> > promote-on-first-use.  If you can't beat a simple fast-swap, why bother?
> 
> I think it is instructive to look at what happened with PMEM, i.e.  a
> "pswap" idea never entered the discourse. The moment the memory is not
> byte-addressable, it might as well be an NVME device where it can
> support a queue-depth and async-dma.

touché, but then did pmem hit latencies as high as 1.5-2us?

(I honestly don't know.)

I'm just wondering how useful a 2MB page of memory at 1.5us per fetch
is, and whether we'll find it's almost always beneficial to promote that
page on the first/second/third cache line fetch in some interval.  If you
always promote on first use, it's basically just super-swap - even if
the memory itself is still byte-addressable.
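
Back-of-envelope, where every number is an assumption (64-byte lines,
fully serialized 1.5us misses, roughly 10 GB/s for a one-shot bulk copy):

#include <stdio.h>

int main(void)
{
	double page    = 2.0 * 1024 * 1024;	/* one 2 MiB page               */
	double lines   = page / 64.0;		/* 32768 cache lines            */
	double miss_us = 1.5;			/* assumed far-CXL load latency */
	double copy_bw = 10e9;			/* assumed bulk-copy bandwidth  */

	/* Demand-fetching every line in place: ~49 ms. */
	printf("touch in place: %.1f ms\n", lines * miss_us / 1000.0);

	/* Promoting once with a bulk copy: ~0.2 ms, then DRAM latency. */
	printf("promote once:   %.2f ms\n", page / copy_bw * 1000.0);
	return 0;
}

A real workload touches only a fraction of the page and overlaps misses,
so these are just bounding numbers, but they show why promote-on-first-use
starts to look like swap.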

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Re: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-05  2:01                                             ` Kyungsan Kim
@ 2023-04-05  3:11                                               ` Matthew Wilcox
  0 siblings, 0 replies; 66+ messages in thread
From: Matthew Wilcox @ 2023-04-05  3:11 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On Wed, Apr 05, 2023 at 11:01:21AM +0900, Kyungsan Kim wrote:
> >1. Should we have the page allocator return pages from CXL or should
> >   CXL memory be allocated another way?
> I think yes. Using CXL DRAM as System RAM interface would be the primary use case in real-world application in regards to compatibility.
> So, on the System RAM interface, we think it should be managed by Linux MM subsystem. (Node - Zonelist - buddy page allocator)

I don't think this is the right approach.

> >2. Should there be a way for userspace to indicate that it prefers CXL
> >   memory when it calls mmap(), or should it always be at the discretion
> >   of the kernel?
> I think yes. Both implcit and explict ways are meaningful for users on a different purpose.
> The dynamic performance variation of CXL DRAM is likely bigger than other memory types due to the topology expansion and link negotiation.
> I think it strengthens the needs.

I also disagree with your answer here.

> >3. Do we continue with the current ZONE_DEVICE model, or do we come up
> >   with something new?
> In fact, ZONE_DEVICE was the our first candidate for CXL DRAM.
> But because ZONE_DEVICE is not managed by buddy, we thought it does not fit to provide System RAM interface.

But what you're proposing (separate GFP_EXMEM, ZONE_EXMEM, etc) doesn't
let the buddy allocator satisfy GFP_KERNEL allocations from CXL.  So
what's the point?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-05  2:00                           ` Kyungsan Kim
@ 2023-04-05  4:48                             ` Dan Williams
  2023-04-05 18:12                               ` Matthew Wilcox
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2023-04-05  4:48 UTC (permalink / raw)
  To: Kyungsan Kim, willy
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Kyungsan Kim wrote:
> >On Fri, Mar 31, 2023 at 08:37:15PM +0900, Kyungsan Kim wrote:
> >> >> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
> >> >
> >> >This sounds dangerously confused.  Do you want the EXMEM to be removable
> >> >or not?  If you do, then allocations from it have to be movable.  If
> >> >you don't, why go to all this trouble?
> >> 
> >> I'm sorry to make you confused. We will try more to clearly explain our thought.
> >> We think the CXL DRAM device should be removable along with HW pluggable nature.
> >> For MM point of view, we think a page of CXL DRAM can be both movable and unmovable. 
> >> An user or kernel context should be able to determine it. Thus, we think dedication on the ZONE_NORMAL or the ZONE_MOVABLE is not enough.
> >
> >No, this is not the right approach.  If CXL is to be hot-pluggable,
> >then all CXL allocations must be movable.  If even one allocation on a
> >device is not movable, then the device cannot be removed.  ZONE_EXMEM
> >feels like a solution in search of a problem
> 
> We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
> a random allocation of a kernel object by calling kmalloc() siblings makes the entire CXL DRAM unremovable.
> Also, not all kernel objects can be allocated from ZONE_MOVABLE.
> 
> ZONE_EXMEM does not confine a movability attribute(movable or unmovable), rather it allows a calling context can decide it.
> In that aspect, it is the same with ZONE_NORMAL but ZONE_EXMEM works for extended memory device.
> It does not mean ZONE_EXMEM support both movability and kernel object allocation at the same time.
> In case multiple CXL DRAM channels are connected, we think a memory consumer possibly dedicate a channel for movable or unmovable purpose.
> 

I want to clarify that I expect the number of people doing physical CXL
hotplug of whole devices to be small compared to dynamic capacity
devices (DCD). DCD is a new feature of the CXL 3.0 specification where a
device maps 1 or more thinly provisioned memory regions whose
individual extents get populated and depopulated by a fabric manager.

In that scenario there is a semantic where the fabric manager hands out
100GB to a host and later asks for it back; it is within the protocol that the
host can say "I can give 97GB back now, come back and ask again if you
need that last 3GB".

In other words even pinned pages in ZONE_MOVABLE are not fatal to the
flow. Alternatively, if a deployment needs 100% guarantees that the host
will return all the memory it was assigned when asked, there is always
the option to keep that memory out of the page allocator and just access
it via a device. That's the role device-dax plays for "dedicated" memory
that needs to be set aside from kernel allocations.

This is to say something like ZONE_PREFER_MOVABLE semantics can be
handled within the DCD protocol, where 100% unpluggability is not
necessary and 97% is good enough.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: RE: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-05  2:06                                               ` Re: " Kyungsan Kim
@ 2023-04-05  5:00                                                 ` Dan Williams
  0 siblings, 0 replies; 66+ messages in thread
From: Dan Williams @ 2023-04-05  5:00 UTC (permalink / raw)
  To: Kyungsan Kim, fvdl
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Kyungsan Kim wrote:
> Hi Frank, 
> Thank you for your interest on this topic and remaining your opinion.
> 
> >On Fri, Mar 31, 2023 at 6:42 AM Matthew Wilcox <willy@infradead.org> wrote:
> >>
> >> On Fri, Mar 31, 2023 at 08:42:20PM +0900, Kyungsan Kim wrote:
> >> > Given our experiences/design and industry's viewpoints/inquiries,
> >> > I will prepare a few slides in the session to explain
> >> >   1. Usecase - user/kernespace memory tiering for near/far placement, memory virtualization between hypervisor/baremetal OS
> >> >   2. Issue - movability(movable/unmovable), allocation(explicit/implicit), migration(intented/unintended)
> >> >   3. HW - topology(direct, switch, fabric), feature(pluggability,error-handling,etc)
> >>
> >> I think you'll find everybody else in the room understands these issues
> >> rather better than you do.  This is hardly the first time that we've
> >> talked about CXL, and CXL is not the first time that people have
> >> proposed disaggregated memory, nor heterogenous latency/bandwidth
> >> systems.  All the previous attempts have failed, and I expect this
> >> one to fail too.  Maybe there's something novel that means this time
> >> it really will work, so any slides you do should focus on that.
> >>
> >> A more profitable discussion might be:
> >>
> >> 1. Should we have the page allocator return pages from CXL or should
> >>    CXL memory be allocated another way?
> >> 2. Should there be a way for userspace to indicate that it prefers CXL
> >>    memory when it calls mmap(), or should it always be at the discretion
> >>    of the kernel?
> >> 3. Do we continue with the current ZONE_DEVICE model, or do we come up
> >>    with something new?
> >>
> >>
> >
> >Point 2 is what I proposed talking about here:
> >https://lore.kernel.org/linux-mm/a80a4d4b-25aa-a38a-884f-9f119c03a1da@google.com/T/
> >
> >With the current cxl-as-numa-node model, an application can express a
> >preference through mbind(). But that also means that mempolicy and
> >madvise (e.g. MADV_COLD) are starting to overlap if the intention is
> >to use cxl as a second tier for colder memory.  Are these the right
> >abstractions? Might it be more flexible to attach properties to memory
> >ranges, and have applications hint which properties they prefer?
> 
> We also think more userspace hints would be meaningful for diverse purposes of application.
> Specific intefaces are need to be discussed, though.
> 
> FYI in fact, we expanded mbind() and set_mempolicy() as well to explicitly bind DDR/CXL.
>   - mbind(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM) 
>   - set_mempolicy(,,MPOL_F_ZONE_EXMEM / MPOL_F_ZONE_NOEXMEM)
> madvise() is also a candidate to express tiering intention.

Need to be careful to explain why node numbers are not sufficient,
because the need for new userspace ABI is a high bar.

Recall that ZONE id bits and NUMA id bits are both coming from
page->flags:

#define NODES_PGSHIFT           (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT           (ZONES_PGOFF * (ZONES_WIDTH != 0))
#define ZONES_MASK              ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK              ((1UL << NODES_WIDTH) - 1)

So when people declare that they are on "team ZONE" or "team NUMA" for
this solution they are both on "team page->flags".

Also have a look at the HMEM_REPORTING [1] interface and how it
enumerates performance properties from initiator nodes to target nodes.
There's no similar existing ABI for enumerating the performance of a
ZONE. This is just to point out the momentum behind numbers in
NODES_MASK having more meaning for conveying policy and enumerating
performance than numbers in ZONES_MASK.

[1]: https://www.kernel.org/doc/html/latest/admin-guide/mm/numaperf.html
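
For reference, those initiator-to-target performance properties show up as plain
sysfs attributes under each target node. A minimal userspace sketch that dumps
them for one node might look like the following; the node number is taken from
argv, and the access0 class and its attributes only appear when firmware
provides HMAT data for that node:

/*
 * Dump the HMEM_REPORTING attributes for a target node, as exposed under
 * /sys/devices/system/node/ (see the numaperf document in [1]).
 */
#include <stdio.h>
#include <stdlib.h>

static void show(int node, const char *attr)
{
	char path[256], buf[64];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/access0/initiators/%s",
		 node, attr);
	f = fopen(path, "r");
	if (!f)
		return;		/* no HMAT-described performance data */
	if (fgets(buf, sizeof(buf), f))
		printf("node%d %-16s %s", node, attr, buf);
	fclose(f);
}

int main(int argc, char **argv)
{
	int node = argc > 1 ? atoi(argv[1]) : 0;

	show(node, "read_latency");	/* reported in nanoseconds */
	show(node, "write_latency");
	show(node, "read_bandwidth");
	show(node, "write_bandwidth");
	return 0;
}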

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                       ` <CGME20230405101840epcas2p4c92037ceba77dfe963d24791a9058450@epcas2p4.samsung.com>
@ 2023-04-05 10:18                         ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05 10:18 UTC (permalink / raw)
  To: dragan
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>Hi Mike,
>
>On 4/3/23 03:44, Mike Rapoport wrote:
>> Hi Dragan,
>> 
>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>> Hi,
>>>>
>>>> [..] >> One problem we experienced was occured in the combination of
>>> hot-remove and kerelspace allocation usecases.
>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>>>
>>>> This still does not describe what are the use cases that require having
>>>> kernel allocations on CXL.mem.
>>>>
>>>> I believe it's important to start with explanation *why* it is important to
>>>> have kernel allocations on removable devices.
>>>
>>> Hi Mike,
>>>
>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>> clustering and VM migration over cxl.mem [1].
>>>
>>> And in my mind, at least one reason that I can think of having kernel
>>> allocations from cxl.mem devices is where you have multiple VH connections
>>> sharing the memory [2]. Where for example you have a user space application
>>> stored in cxl.mem, and then you want the metadata about this
>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>> to another hypervisor. So basically the same way processors in a single
>>> hypervisors cooperate on memory, you extend that across processors that span
>>> over physical hypervisors. If that makes sense...
>> 
>> Let me reiterate to make sure I understand your example.
>> If we focus on VM usecase, your suggestion is to store VM's memory and
>> associated KVM structures on a CXL.mem device shared by several nodes.
>
>Yes correct. That is what I am exploring, two different approaches:
>
>Approach 1: Use CXL.mem for VM migration between hypervisors. In this 
>approach the VM and the metadata executes/resides on a traditional NUMA 
>node (cpu+dram) and only uses CXL.mem to transition between hypervisors. 
>It's not kept permanently there. So basically on hypervisor A you would 
>do something along the lines of migrate_pages into cxl.mem and then on 
>hypervisor B you would migrate_pages from cxl.mem and onto the regular 
>NUMA node (cpu+dram).
>
>Approach 2: Use CXL.mem to cluster hypervisors to improve high 
>availability of VMs. In this approach the VM and metadata would be kept 
>in CXL.mem permanently and each hypervisor accessing this shared memory 
>could have the potential to schedule/run the VM if the other hypervisor 
>experienced a failure.
>
>> Even putting aside the aspect of keeping KVM structures on presumably
>> slower memory, 
>
>Totally agree, presumption of memory speed dully noted. As far as I am 
>aware, CXL.mem at this point has higher latency than DRAM, and switched 
>CXL.mem has an additional latency. That may or may not change in the 
>future, but even with actual CXL induced latency I think there are 
>benefits to the approaches.
>
>In the example #1 above, I think even if you had a very noisy VM that is 
>dirtying pages at a high rate, once migrate_pages has occurred, it 
>wouldn't have to be quiesced for the migration to happen. A migration 
>could basically occur in-between the CPU slices, once VCPU is done with 
>it's slice on hypervisor A, the next slice could be on hypervisor B.
>
>And the example #2 above, you are trading memory speed for 
>high-availability. Where either hypervisor A or B could run the CPU load 
>of the VM. You could even have a VM where some of the VCPUs are 
>executing on hypervisor A and others on hypervisor B to be able to shift 
>CPU load across hypervisors in quasi real-time.
>
>
>> what ZONE_EXMEM will provide that cannot be accomplished
>> with having the cxl memory in a memoryless node and using that node to
>> allocate VM metadata?
>
>It has crossed my mind to perhaps use NUMA node distance for the two 
>approaches above. But I think that is not sufficient because we can have 
>varying distance, and distance in itself doesn't indicate 
>switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly 
>speaking just for myself here, with the two approaches above, the 
>crucial differentiator in order for #1 and #2 to work would be that 
>switched/shared CXL.mem would have to be indicated as such in a way. 
>Because switched memory would have to be treated and formatted in some 
>kind of ABI way that would allow hypervisors to cooperate and follow 
>certain protocols when using this memory.
>
>
>I can't answer what ZONE_EXMEM will provide since we haven's seen 
>Kyungsan's talk yet, that's why I myself was very curious to find out 
>more about ZONE_EXMEM proposal and if it includes some provisions for 
>CXL switched/shared memory.
>
>To me, I don't think it makes a difference if pages are coming from 
>ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was if 
>I could allocate from or migrate_pages to (ZONE_EXMEM | type 
>"SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's 
>the typing. That's what I meant with my initial response but I guess it 
>wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in my 
>case, this is where you'd have kernel allocations on CXL.mem"

Hi Dragan, I'm sorry for the late reply; we are trying to answer carefully, though.
ZONE_EXMEM can be movable. A calling context is able to determine the movability (movable/unmovable).

I'm not sure if it is related to the provision you have in mind, but ZONE_EXMEM allows capacity and bandwidth aggregation across multiple CXL DRAM channels.
Multiple CXL DRAM devices can be grouped into a single ZONE_EXMEM, which can then be exposed as a single memory node[1].
As the number of CXL DRAM channels grows through (multi-level) switches and enhanced CXL server systems, we thought the kernel should manage this seamlessly.
Otherwise, userspace would see many nodes, and a 3rd-party tool such as numactl or libnuma would always be needed.
Of course, a CXL switch can do that part, but the HW and SW means each have pros and cons in many ways, so we thought they could co-exist.
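
To illustrate what "a 3rd-party tool would always be needed" means in practice,
here is a minimal libnuma sketch of what an application has to do today when
every CXL DRAM channel shows up as its own CPU-less node; node 2 is only an
assumed example of such a CXL node, not a fixed assignment:

/* Explicitly place a buffer on an assumed CXL node using libnuma. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t sz = 64UL << 20;			/* 64 MiB */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	printf("highest node id: %d\n", numa_max_node());

	buf = numa_alloc_onnode(sz, 2);		/* bind to the "CXL" node 2 */
	if (!buf)
		return 1;
	memset(buf, 0, sz);			/* touch so pages are actually placed */
	numa_free(buf, sz);
	return 0;
}

(Build with: gcc -o cxl_alloc cxl_alloc.c -lnuma.) With a single aggregated
ZONE_EXMEM node, this kind of explicit per-channel node selection would not be needed.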

Also, given the composability expectation of CXL, I think memory sharing among VM/KVM instances fits well with CXL. 
This is just a gut feeling for now, but security and permission matters could possibly be handled at the zone level.

In general, given the CXL nature (PCIe basis) and topology expansions (direct -> switches -> fabrics), 
we carefully expect that more functionality and performance matters will be raised. 
We have proposed ZONE_EXMEM as a separate logical management dimension for extended memory types, as of now CXL DRAM.
To help clarify, please find the slide that explains our proposal[2].

[1] https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
[2] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL

>
>
>Sorry if it got long, hope that makes sense... :)
>
>
>>   
>>> [1] A high-level explanation is at http://nil-migration.org/
>>> [2] Compute Express Link Specification r3.0, v1.0 8/1/22, Page 51, figure
>>> 1-4, black color scheme circle(3) and bars.
>>>
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: [External] RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                               ` <CGME20230405101843epcas2p2c819c8d60b2a9a776124c2b4bc25af14@epcas2p2.samsung.com>
@ 2023-04-05 10:18                                 ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-05 10:18 UTC (permalink / raw)
  To: gregory.price
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On Tue, Apr 04, 2023 at 11:59:22AM -0700, Viacheslav A.Dubeyko wrote:
>> 
>> 
>> > On Apr 1, 2023, at 3:51 AM, Gregory Price <gregory.price@memverge.com> wrote:
>> > 
>> > On Tue, Apr 04, 2023 at 05:58:05PM +0000, Adam Manzanares wrote:
>> >> On Tue, Apr 04, 2023 at 11:31:08AM +0300, Mike Rapoport wrote:
>> >>> 
>> >>> The point of zswap IIUC is to have small and fast swap device and
>> >>> compression is required to better utilize DRAM capacity at expense of CPU
>> >>> time.
>> >>> 
>> >>> Presuming CXL memory will have larger capacity than DRAM, why not skip the
>> >>> compression and use CXL as a swap device directly?
>> >> 
>> >> I like to shy away from saying CXL memory should be used for swap. I see a 
>> >> swap device as storing pages in a manner that is no longer directly addressable
>> >> by the cpu. 
>> >> 
>> >> Migrating pages to a CXL device is a reasonable approach and I believe we
>> >> have the ability to do this in the page reclaim code. 
>> >> 
>> > 
>> > The argument is "why do you need swap if memory itself is elastic", and
>> > I think there are open questions about how performant using large
>> > amounts of high-latency memory is.
>> > 
>> > Think 1us-1.5us+ cross-rack attached memory.
>> > 
>> > Does it make sense to use that as CPU-addressible and migrate it on
>> > first use?  Isn't that just swap with more steps?  What happens if we
>> > just use it as swap, is the performance all that different?
>> > 
>> > I think there's a reasonable argument for exploring the idea at the
>> > higher ends of the latency spectrum.  And the simplicity of using an
>> > existing system (swap) to implement a form of proto-tiering is rather
>> > attractive in my opinion.
>> > 
>> 
>> I think the problem with swap that we need to take into account the additional
>> latency of swap-in/swap-out logic. I assume that this logic is expensive enough.
>> And if we considering the huge graph, for example, I am afraid the swap-in/swap-out
>> logic could be expensive. So, the question here is about use-case. Which use-case could
>> have benefits to employ the swap as a big space of high-latency memory? I see your point
>> that such swap could be faster than persistent storage. But which use-case can be happy
>> user of this space of high-latency memory?
>> 
>> Thanks,
>> Slava.
>> 
>
>Just spitballing here - to me this problem is two fold:
>
>I think the tiering use case and the swap use case are exactly the same.
>If tiering is sufficiently valuable, there exists a spectrum of compute
>density (cpu:dram:cxl:far-cxl) where simply using far-cxl as fast-swap
>becomes easier and less expensive than a complex tiering system.
>
>So rather than a single use-case question, it reads like a tiering
>question to me:
>
>1) Where on the 1us-20us (far cxl : nvme) spectrum does it make sense to
>   switch from a swap mechanism to simply byte-addressable memory?
>   There's a point, somewhere, where promote on first access (effectively
>   swap) is the same performance as active tiering (for a given workload).
>
>   If that point is under 2us, there's a good chance that a high-latency
>   CXL swap-system would be a major win for any workload on any cloud-based
>   system.  It's simple, clean, and reclaim doesn't have to worry about the
>   complexities of hotpluggable memory zones.
>
>
>Beyond that, to your point, what use-case is happy with this class of
>memory, and in what form?
>
>2) This is likely obscurred by the fact that many large-memory
>   applications avoid swap like the plague by sharding data and creating
>   clusters. So it's hard to answer this until it's tested, and you
>   can't test it unless you make it... woo!
>
>   Bit of a chicken/egg in here.  I don't know that anyone can say
>   definitively what workload can make use of it, but that doesn't mean
>   there isn't one.  So in the spectrum of risk/reward, at least
>   enabling some simple mechanism for the sake of exploration feels
>   exciting to say the least.
>
>
>More generally, I think a cxl-swap (cswap? ;V) would be useful exactly to
>help identify when watch-and-wait tiering becomes more performant than
>promote-on-first-use.  If you can't beat a simple fast-swap, why bother?
>
>Again, I think this is narrowly applicable to high-latency CXL. My gut
>tells me that anything under 1us is better used in a byte-addressable
>manner, but once you start hitting 1us "It makes me go hmmm..."
>
>I concede this is largely conjecture until someone tests it out, but
>certainly a fun thing to discess.

In fact, we enabled CXL swap, an OS-level swap interface for CXL DRAM[1].
It was not part of this year's LSF/MM proposal, though.
Like zswap, CXL swap implements frontswap[2], but it does not spend CPU cycles on compression.
The first motivation was to enhance the TMO solution[3].
TMO uses both zswap and disk swap, and we intended to replace the zswap part with CXL swap to seamlessly adopt CXL DRAM in the solution.
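
For illustration, a minimal sketch of such a frontswap backend against the
frontswap API as documented in [2] (the interface has since been reworked in
newer kernels) is shown below. The names exmem_base and exmem_slot(), the 1:1
slot mapping, and the omitted setup of the CXL mapping are assumptions for
illustration only, not the actual SMDK code:

#include <linux/frontswap.h>
#include <linux/highmem.h>
#include <linux/module.h>
#include <linux/string.h>

/* In real code this would be a mapping of the CXL DRAM region; it is left
 * unset here because this is only an illustrative skeleton. */
static void *exmem_base;

/* Hypothetical 1:1 layout: one page-sized slot per swap offset. */
static void *exmem_slot(unsigned type, pgoff_t offset)
{
	return exmem_base + ((unsigned long)offset << PAGE_SHIFT);
}

static int exmem_swap_store(unsigned type, pgoff_t offset, struct page *page)
{
	void *src = kmap_atomic(page);

	/* Plain copy into byte-addressable CXL memory, no compression step. */
	memcpy(exmem_slot(type, offset), src, PAGE_SIZE);
	kunmap_atomic(src);
	return 0;
}

static int exmem_swap_load(unsigned type, pgoff_t offset, struct page *page)
{
	void *dst = kmap_atomic(page);

	memcpy(dst, exmem_slot(type, offset), PAGE_SIZE);
	kunmap_atomic(dst);
	return 0;
}

static void exmem_swap_invalidate_page(unsigned type, pgoff_t offset) { }
static void exmem_swap_invalidate_area(unsigned type) { }
static void exmem_swap_init(unsigned type) { }

static struct frontswap_ops exmem_swap_ops = {
	.init            = exmem_swap_init,
	.store           = exmem_swap_store,
	.load            = exmem_swap_load,
	.invalidate_page = exmem_swap_invalidate_page,
	.invalidate_area = exmem_swap_invalidate_area,
};

static int __init exmem_swap_module_init(void)
{
	frontswap_register_ops(&exmem_swap_ops);
	return 0;
}
module_init(exmem_swap_module_init);
MODULE_LICENSE("GPL");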

We agree that the primary usecase of CXL DRAM is byte-addressable memory. 
So memory expansion with CXL DRAM will significantly relieve memory pressure on a system;
thus memory allocation would fail less and the swapper would be triggered less often by the PFRA.

However, we think the Linux swap mechanism will keep being used for a different purpose. 
Due to CXL topology expansion and the additional SW overheads that come with it, we expect the end-user latency to CXL DRAM to increase. 
In line with the purpose of frontswap, we think a CXL swap interface would fit between a hypervisor and a baremetal OS running more capacity-sensitive, less user-interactive workloads.

Let us share some performance numbers for CXL swap on a CXL-capable testbed.
1. Latency evaluation of the swap in/out logic - swap-in/swap-out latency was 0.56us/1.07us, respectively.
2. CPU utilization - CXL swap on CXL DRAM uses about 14.94x less CPU than zswap because there is no (de)compression; zswap spends around 70~80% of its CPU cycles on the (de)compression logic.
3. QoS - the latency of zswap put/get fluctuated around 1.3~14us, while that of CXL swap out/get was evenly around 0.49~0.94us.

[1] https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#swap
[2] https://www.kernel.org/doc/html/v5.0/vm/frontswap.html
[3] https://arxiv.org/abs/2206.02878

>~Gregory

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Re: Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-05  4:48                             ` Dan Williams
@ 2023-04-05 18:12                               ` Matthew Wilcox
  2023-04-05 19:42                                 ` Dan Williams
  0 siblings, 1 reply; 66+ messages in thread
From: Matthew Wilcox @ 2023-04-05 18:12 UTC (permalink / raw)
  To: Dan Williams
  Cc: Kyungsan Kim, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, seungjun.ha, wj28.lee

On Tue, Apr 04, 2023 at 09:48:41PM -0700, Dan Williams wrote:
> Kyungsan Kim wrote:
> > We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
> > a random allocation of a kernel object by calling kmalloc() siblings makes the entire CXL DRAM unremovable.
> > Also, not all kernel objects can be allocated from ZONE_MOVABLE.
> > 
> > ZONE_EXMEM does not confine a movability attribute(movable or unmovable), rather it allows a calling context can decide it.
> > In that aspect, it is the same with ZONE_NORMAL but ZONE_EXMEM works for extended memory device.
> > It does not mean ZONE_EXMEM support both movability and kernel object allocation at the same time.
> > In case multiple CXL DRAM channels are connected, we think a memory consumer possibly dedicate a channel for movable or unmovable purpose.
> > 
> 
> I want to clarify that I expect the number of people doing physical CXL
> hotplug of whole devices to be small compared to dynamic capacity
> devices (DCD). DCD is a new feature of the CXL 3.0 specification where a
> device maps 1 or more thinly provisioned memory regions that have
> individual extents get populated and depopulated by a fabric manager.
> 
> In that scenario there is a semantic where the fabric manager hands out
> 100G to a host and asks for it back, it is within the protocol that the
> host can say "I can give 97GB back now, come back and ask again if you
> need that last 3GB".

Presumably it can't give back arbitrary chunks of that 100GB?  There's
some granularity that's preferred; maybe on 1GB boundaries or something?

> In other words even pinned pages in ZONE_MOVABLE are not fatal to the
> flow. Alternatively, if a deployment needs 100% guarantees that the host
> will return all the memory it was assigned when asked there is always
> the option to keep that memory out of the page allocator and just access
> it via a device. That's the role device-dax plays for "dedicated" memory
> that needs to be set aside from kernel allocations.
> 
> This is to say something like ZONE_PREFER_MOVABLE semantics can be
> handled within the DCD protocol, where 100% unpluggability is not
> necessary and 97% is good enough.

This certainly makes life better (and rather more like hypervisor
shrinking than like DIMM hotplug), but I think fragmentation may well
mean that "only 3GB of 100GB allocated" still results in being able to
return less than 50% of the memory, depending on granule size and
exactly how the allocations got chunked.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Re: Re: RE(2): FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-05 18:12                               ` Matthew Wilcox
@ 2023-04-05 19:42                                 ` Dan Williams
  2023-04-06 12:27                                   ` David Hildenbrand
  0 siblings, 1 reply; 66+ messages in thread
From: Dan Williams @ 2023-04-05 19:42 UTC (permalink / raw)
  To: Matthew Wilcox, Dan Williams
  Cc: Kyungsan Kim, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, seungjun.ha, wj28.lee

Matthew Wilcox wrote:
> On Tue, Apr 04, 2023 at 09:48:41PM -0700, Dan Williams wrote:
> > Kyungsan Kim wrote:
> > > We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
> > > a random allocation of a kernel object by calling kmalloc() siblings makes the entire CXL DRAM unremovable.
> > > Also, not all kernel objects can be allocated from ZONE_MOVABLE.
> > > 
> > > ZONE_EXMEM does not confine a movability attribute(movable or unmovable), rather it allows a calling context can decide it.
> > > In that aspect, it is the same with ZONE_NORMAL but ZONE_EXMEM works for extended memory device.
> > > It does not mean ZONE_EXMEM support both movability and kernel object allocation at the same time.
> > > In case multiple CXL DRAM channels are connected, we think a memory consumer possibly dedicate a channel for movable or unmovable purpose.
> > > 
> > 
> > I want to clarify that I expect the number of people doing physical CXL
> > hotplug of whole devices to be small compared to dynamic capacity
> > devices (DCD). DCD is a new feature of the CXL 3.0 specification where a
> > device maps 1 or more thinly provisioned memory regions that have
> > individual extents get populated and depopulated by a fabric manager.
> > 
> > In that scenario there is a semantic where the fabric manager hands out
> > 100G to a host and asks for it back, it is within the protocol that the
> > host can say "I can give 97GB back now, come back and ask again if you
> > need that last 3GB".
> 
> Presumably it can't give back arbitrary chunks of that 100GB?  There's
> some granularity that's preferred; maybe on 1GB boundaries or something?

The device picks a granularity that can be tiny per spec, but it makes
the hardware more expensive to track in small extents, so I expect
something reasonable like 1GB, but time will tell once actual devices
start showing up.

> > In other words even pinned pages in ZONE_MOVABLE are not fatal to the
> > flow. Alternatively, if a deployment needs 100% guarantees that the host
> > will return all the memory it was assigned when asked there is always
> > the option to keep that memory out of the page allocator and just access
> > it via a device. That's the role device-dax plays for "dedicated" memory
> > that needs to be set aside from kernel allocations.
> > 
> > This is to say something like ZONE_PREFER_MOVABLE semantics can be
> > handled within the DCD protocol, where 100% unpluggability is not
> > necessary and 97% is good enough.
> 
> This certainly makes life better (and rather more like hypervisor
> shrinking than like DIMM hotplug), but I think fragmentation may well
> result in "only 3GB of 100GB allocated" will result in being able to
> return less than 50% of the memory, depending on granule size and
> exactly how the allocations got chunked.

Agree.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-05 19:42                                 ` Dan Williams
@ 2023-04-06 12:27                                   ` David Hildenbrand
       [not found]                                     ` <CGME20230407093007epcas2p32addf5da24110c3e45c90a15dcde0d01@epcas2p3.samsung.com>
  0 siblings, 1 reply; 66+ messages in thread
From: David Hildenbrand @ 2023-04-06 12:27 UTC (permalink / raw)
  To: Dan Williams, Matthew Wilcox
  Cc: Kyungsan Kim, lsf-pc, linux-mm, linux-fsdevel, linux-cxl,
	a.manzanares, viacheslav.dubeyko, seungjun.ha, wj28.lee

On 05.04.23 21:42, Dan Williams wrote:
> Matthew Wilcox wrote:
>> On Tue, Apr 04, 2023 at 09:48:41PM -0700, Dan Williams wrote:
>>> Kyungsan Kim wrote:
>>>> We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
>>>> a random allocation of a kernel object by calling kmalloc() siblings makes the entire CXL DRAM unremovable.
>>>> Also, not all kernel objects can be allocated from ZONE_MOVABLE.
>>>>
>>>> ZONE_EXMEM does not confine a movability attribute(movable or unmovable), rather it allows a calling context can decide it.
>>>> In that aspect, it is the same with ZONE_NORMAL but ZONE_EXMEM works for extended memory device.
>>>> It does not mean ZONE_EXMEM support both movability and kernel object allocation at the same time.
>>>> In case multiple CXL DRAM channels are connected, we think a memory consumer possibly dedicate a channel for movable or unmovable purpose.
>>>>
>>>
>>> I want to clarify that I expect the number of people doing physical CXL
>>> hotplug of whole devices to be small compared to dynamic capacity
>>> devices (DCD). DCD is a new feature of the CXL 3.0 specification where a
>>> device maps 1 or more thinly provisioned memory regions that have
>>> individual extents get populated and depopulated by a fabric manager.
>>>
>>> In that scenario there is a semantic where the fabric manager hands out
>>> 100G to a host and asks for it back, it is within the protocol that the
>>> host can say "I can give 97GB back now, come back and ask again if you
>>> need that last 3GB".
>>
>> Presumably it can't give back arbitrary chunks of that 100GB?  There's
>> some granularity that's preferred; maybe on 1GB boundaries or something?
> 
> The device picks a granularity that can be tiny per spec, but it makes
> the hardware more expensive to track in small extents, so I expect
> something reasonable like 1GB, but time will tell once actual devices
> start showing up.

It all sounds a lot like virtio-mem using real hardware [I know, there 
are important differences, but for the dynamic aspect there are very 
similar issues to solve]

For virtio-mem, the current best way to support hotplugging of large 
memory to a VM to eventually be able to unplug a big fraction again is 
using a combination of ZONE_MOVABLE and ZONE_NORMAL -- "auto-movable" 
memory onlining policy. What's online to ZONE_MOVABLE can get (fairly) 
reliably unplugged again. What's onlined to ZONE_NORMAL is possibly lost 
forever.

Like (incrementally) hotplugging 1 TiB to a 4 GiB VM. Being able to 
unplug 1 TiB reliably again is pretty much out of scope. But the more 
memory we can reliably get back the better. And the more memory we can 
get in the common case, the better. With a ZONE_NORMAL vs. ZONE_MOVABLE 
ratio of 1:3 one could unplug ~768 GiB again reliably. The remainder 
depends on fragmentation on the actual system and the unplug granularity.

The original plan was to use ZONE_PREFER_MOVABLE as a safety buffer to 
reduce ZONE_NORMAL memory without increasing ZONE_MOVABLE memory (and 
possibly harming the system). The underlying idea was that in many 
setups, memory in ZONE_PREFER_MOVABLE would not get used for 
unmovable allocations and it could, therefore, get unplugged fairly 
reliably in these setups. For all other setups, unmovable allocations 
could leak into ZONE_PREFER_MOVABLE and reduce the amount of memory we 
could unplug again. But the system would try to keep unmovable 
allocations to ZONE_NORMAL, so in most cases with some 
ZONE_PREFER_MOVABLE memory we would perform better than with only 
ZONE_NORMAL.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-04  6:47                       ` Huang, Ying
@ 2023-04-06 22:27                         ` Dragan Stancevic
  2023-04-07  0:58                           ` Huang, Ying
  0 siblings, 1 reply; 66+ messages in thread
From: Dragan Stancevic @ 2023-04-06 22:27 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mike Rapoport, Kyungsan Kim, dan.j.williams, lsf-pc, linux-mm,
	linux-fsdevel, linux-cxl, a.manzanares, viacheslav.dubeyko,
	nil-migration

Hi Ying-

On 4/4/23 01:47, Huang, Ying wrote:
> Dragan Stancevic <dragan@stancevic.com> writes:
> 
>> Hi Mike,
>>
>> On 4/3/23 03:44, Mike Rapoport wrote:
>>> Hi Dragan,
>>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>>> Hi,
>>>>>
>>>>> [..] >> One problem we experienced was occured in the combination of
>>>> hot-remove and kerelspace allocation usecases.
>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>>>>
>>>>> This still does not describe what are the use cases that require having
>>>>> kernel allocations on CXL.mem.
>>>>>
>>>>> I believe it's important to start with explanation *why* it is important to
>>>>> have kernel allocations on removable devices.
>>>>
>>>> Hi Mike,
>>>>
>>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>>> clustering and VM migration over cxl.mem [1].
>>>>
>>>> And in my mind, at least one reason that I can think of having kernel
>>>> allocations from cxl.mem devices is where you have multiple VH connections
>>>> sharing the memory [2]. Where for example you have a user space application
>>>> stored in cxl.mem, and then you want the metadata about this
>>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>>> to another hypervisor. So basically the same way processors in a single
>>>> hypervisors cooperate on memory, you extend that across processors that span
>>>> over physical hypervisors. If that makes sense...
>>> Let me reiterate to make sure I understand your example.
>>> If we focus on VM usecase, your suggestion is to store VM's memory and
>>> associated KVM structures on a CXL.mem device shared by several nodes.
>>
>> Yes correct. That is what I am exploring, two different approaches:
>>
>> Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>> approach the VM and the metadata executes/resides on a traditional
>> NUMA node (cpu+dram) and only uses CXL.mem to transition between
>> hypervisors. It's not kept permanently there. So basically on
>> hypervisor A you would do something along the lines of migrate_pages
>> into cxl.mem and then on hypervisor B you would migrate_pages from
>> cxl.mem and onto the regular NUMA node (cpu+dram).
>>
>> Approach 2: Use CXL.mem to cluster hypervisors to improve high
>> availability of VMs. In this approach the VM and metadata would be
>> kept in CXL.mem permanently and each hypervisor accessing this shared
>> memory could have the potential to schedule/run the VM if the other
>> hypervisor experienced a failure.
>>
>>> Even putting aside the aspect of keeping KVM structures on presumably
>>> slower memory,
>>
>> Totally agree, presumption of memory speed dully noted. As far as I am
>> aware, CXL.mem at this point has higher latency than DRAM, and
>> switched CXL.mem has an additional latency. That may or may not change
>> in the future, but even with actual CXL induced latency I think there
>> are benefits to the approaches.
>>
>> In the example #1 above, I think even if you had a very noisy VM that
>> is dirtying pages at a high rate, once migrate_pages has occurred, it
>> wouldn't have to be quiesced for the migration to happen. A migration
>> could basically occur in-between the CPU slices, once VCPU is done
>> with it's slice on hypervisor A, the next slice could be on hypervisor
>> B.
>>
>> And the example #2 above, you are trading memory speed for
>> high-availability. Where either hypervisor A or B could run the CPU
>> load of the VM. You could even have a VM where some of the VCPUs are
>> executing on hypervisor A and others on hypervisor B to be able to
>> shift CPU load across hypervisors in quasi real-time.
>>
>>
>>> what ZONE_EXMEM will provide that cannot be accomplished
>>> with having the cxl memory in a memoryless node and using that node to
>>> allocate VM metadata?
>>
>> It has crossed my mind to perhaps use NUMA node distance for the two
>> approaches above. But I think that is not sufficient because we can
>> have varying distance, and distance in itself doesn't indicate
>> switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly
>> speaking just for myself here, with the two approaches above, the
>> crucial differentiator in order for #1 and #2 to work would be that
>> switched/shared CXL.mem would have to be indicated as such in a way.
>> Because switched memory would have to be treated and formatted in some
>> kind of ABI way that would allow hypervisors to cooperate and follow
>> certain protocols when using this memory.
>>
>>
>> I can't answer what ZONE_EXMEM will provide since we haven's seen
>> Kyungsan's talk yet, that's why I myself was very curious to find out
>> more about ZONE_EXMEM proposal and if it includes some provisions for
>> CXL switched/shared memory.
>>
>> To me, I don't think it makes a difference if pages are coming from
>> ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was
>> if I could allocate from or migrate_pages to (ZONE_EXMEM | type
>> "SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's
>> the typing. That's what I meant with my initial response but I guess
>> it wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in
>> my case, this is where you'd have kernel allocations on CXL.mem"
>>
> 
> We have 2 choices here.
> 
> a) Put CXL.mem in a separate NUMA node, with an existing ZONE type
> (normal or movable).  Then you can migrate pages there with
> move_pages(2) or migrate_pages(2).  Or you can run your workload on the
> CXL.mem with numactl.
> 
> b) Put CXL.mem in an existing NUMA node, with a new ZONE type.  To
> control your workloads in user space, you need a set of new ABIs.
> Anything you cannot do in a)?

I like the CXL.mem as a NUMA node approach, and I also think it's best to 
do this with move/migrate_pages and numactl, and that a & b are good 
choices.

I think there is an option c too though, which is an amalgamation of a & 
b. Here is my thinking, and please do let me know what you think about 
this approach.

If you think about CXL 3.0 shared/switched memory as a portal for a VM 
to move from one hypervisor to another, I think each switched memory 
should be represented by its own node and have a distinct type so the 
migration path becomes more deterministic. I was thinking along the 
lines that there would be some kind of user space clustering/migration 
app/script that runs on all the hypervisors, which would read, let's say 
/proc/pagetypeinfo to find these "portals":
Node 4, zone Normal, type Switched ....
Node 6, zone Normal, type Switched ....
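
As a minimal sketch, that discovery step could be as simple as the following;
the "Switched" migratetype is of course hypothetical at this point, the parsing
assumes the usual /proc/pagetypeinfo line format, and reading that file may
require root on recent kernels:

/* Print the node ids that expose the hypothetical "Switched" migratetype. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	int node;
	FILE *f = fopen("/proc/pagetypeinfo", "r");

	if (!f)
		return 1;	/* typically needs root on recent kernels */
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "Node %d,", &node) == 1 &&
		    strstr(line, "Switched"))
			printf("portal node: %d\n", node);
	}
	fclose(f);
	return 0;
}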

Then it would build a traversal Graph, find per hypervisor reach and 
critical connections, where critical connections are cross-rack or 
cross-pod, perhaps something along the lines of this pseudo/python code:
class Graph:
	def __init__(self, mydict):
		self.dict = mydict
		self.visited = set()
		self.critical = list()
		self.reach = dict()
		self.id = 0
	def depth_first_search(self, vertex, parent):
		self.visited.add(vertex)
		if vertex not in self.reach:
			self.reach[vertex] = {'id':self.id, 'reach':self.id}
			self.id += 1
		for next_vertex in self.dict[vertex] - {parent}:
			if next_vertex not in self.visited:
				self.depth_first_search(next_vertex, vertex)
			if self.reach[next_vertex]['reach'] < self.reach[vertex]['reach']:
				self.reach[vertex]['reach'] = self.reach[next_vertex]['reach']
		if parent != None and self.reach[vertex]['id'] == self.reach[vertex]['reach']:
			self.critical.append([parent, vertex])
		return self.critical

critical = mygraph.depth_first_search("hostname-foo4", None)

that way you could have a VM migrate between only two hypervisors 
sharing switched memory, or pass through a subset of hypervisors (that 
don't necessarily share switched memory) to reach its destination. This 
may be rack confined, or across a rack or even a pod using critical 
connections.

Long way of saying that if you do a) then the clustering/migration 
script only sees a bunch of nodes and a bunch of normal zones; it 
wouldn't know how to build the "flight-path" and where to send a VM. 
You'd probably have to add an additional interface in the kernel for the 
script to query the paths somehow, where on the other hand pulling 
things from proc/sys is easy.


And then if you do b) and put it in an existing NUMA node with a 
"Switched" type, you could potentially end up with several "Switched" 
types under the same node. So when you numactl/move/migrate pages they 
could go in either direction and you could send some pages through one 
"portal" and others through another "portal", which is not what you want 
to do.

That's why I think the c option might be the most optimal, where each 
switched memory has its own node number. And then displaying type as 
"Switched" just makes it easier to detect and Graph the topology.


And with regards to an ABI, I was referring to an ABI needed between the 
kernels running on separate hypervisors. When hypervisor B boots, it 
needs to detect through an ABI if this switched/shared memory is already 
initialized and if there are VMs in there which are used by another 
hypervisor, say A. Also during the migration, hypervisors A and B would 
have to use this ABI to synchronize the hand-off between the two 
physical hosts. Not an all-inclusive list, but I was referring to those 
types of scenarios.

What do you think?


--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-06 22:27                         ` Dragan Stancevic
@ 2023-04-07  0:58                           ` Huang, Ying
       [not found]                             ` <CGME20230407092950epcas2p12bc20c2952a800cf3f4f1d0b695f67e2@epcas2p1.samsung.com>
  2023-04-07 14:35                             ` Dragan Stancevic
  0 siblings, 2 replies; 66+ messages in thread
From: Huang, Ying @ 2023-04-07  0:58 UTC (permalink / raw)
  To: Dragan Stancevic
  Cc: Mike Rapoport, Kyungsan Kim, dan.j.williams, lsf-pc, linux-mm,
	linux-fsdevel, linux-cxl, a.manzanares, viacheslav.dubeyko,
	nil-migration

Dragan Stancevic <dragan@stancevic.com> writes:

> Hi Ying-
>
> On 4/4/23 01:47, Huang, Ying wrote:
>> Dragan Stancevic <dragan@stancevic.com> writes:
>> 
>>> Hi Mike,
>>>
>>> On 4/3/23 03:44, Mike Rapoport wrote:
>>>> Hi Dragan,
>>>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>>>> Hi,
>>>>>>
>>>>>> [..] >> One problem we experienced was occured in the combination of
>>>>> hot-remove and kerelspace allocation usecases.
>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>>>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>>>>>
>>>>>> This still does not describe what are the use cases that require having
>>>>>> kernel allocations on CXL.mem.
>>>>>>
>>>>>> I believe it's important to start with explanation *why* it is important to
>>>>>> have kernel allocations on removable devices.
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>>>> clustering and VM migration over cxl.mem [1].
>>>>>
>>>>> And in my mind, at least one reason that I can think of having kernel
>>>>> allocations from cxl.mem devices is where you have multiple VH connections
>>>>> sharing the memory [2]. Where for example you have a user space application
>>>>> stored in cxl.mem, and then you want the metadata about this
>>>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>>>> to another hypervisor. So basically the same way processors in a single
>>>>> hypervisors cooperate on memory, you extend that across processors that span
>>>>> over physical hypervisors. If that makes sense...
>>>> Let me reiterate to make sure I understand your example.
>>>> If we focus on VM usecase, your suggestion is to store VM's memory and
>>>> associated KVM structures on a CXL.mem device shared by several nodes.
>>>
>>> Yes correct. That is what I am exploring, two different approaches:
>>>
>>> Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>>> approach the VM and the metadata executes/resides on a traditional
>>> NUMA node (cpu+dram) and only uses CXL.mem to transition between
>>> hypervisors. It's not kept permanently there. So basically on
>>> hypervisor A you would do something along the lines of migrate_pages
>>> into cxl.mem and then on hypervisor B you would migrate_pages from
>>> cxl.mem and onto the regular NUMA node (cpu+dram).
>>>
>>> Approach 2: Use CXL.mem to cluster hypervisors to improve high
>>> availability of VMs. In this approach the VM and metadata would be
>>> kept in CXL.mem permanently and each hypervisor accessing this shared
>>> memory could have the potential to schedule/run the VM if the other
>>> hypervisor experienced a failure.
>>>
>>>> Even putting aside the aspect of keeping KVM structures on presumably
>>>> slower memory,
>>>
>>> Totally agree, presumption of memory speed dully noted. As far as I am
>>> aware, CXL.mem at this point has higher latency than DRAM, and
>>> switched CXL.mem has an additional latency. That may or may not change
>>> in the future, but even with actual CXL induced latency I think there
>>> are benefits to the approaches.
>>>
>>> In the example #1 above, I think even if you had a very noisy VM that
>>> is dirtying pages at a high rate, once migrate_pages has occurred, it
>>> wouldn't have to be quiesced for the migration to happen. A migration
>>> could basically occur in-between the CPU slices, once VCPU is done
>>> with it's slice on hypervisor A, the next slice could be on hypervisor
>>> B.
>>>
>>> And the example #2 above, you are trading memory speed for
>>> high-availability. Where either hypervisor A or B could run the CPU
>>> load of the VM. You could even have a VM where some of the VCPUs are
>>> executing on hypervisor A and others on hypervisor B to be able to
>>> shift CPU load across hypervisors in quasi real-time.
>>>
>>>
>>>> what ZONE_EXMEM will provide that cannot be accomplished
>>>> with having the cxl memory in a memoryless node and using that node to
>>>> allocate VM metadata?
>>>
>>> It has crossed my mind to perhaps use NUMA node distance for the two
>>> approaches above. But I think that is not sufficient because we can
>>> have varying distance, and distance in itself doesn't indicate
>>> switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly
>>> speaking just for myself here, with the two approaches above, the
>>> crucial differentiator in order for #1 and #2 to work would be that
>>> switched/shared CXL.mem would have to be indicated as such in a way.
>>> Because switched memory would have to be treated and formatted in some
>>> kind of ABI way that would allow hypervisors to cooperate and follow
>>> certain protocols when using this memory.
>>>
>>>
>>> I can't answer what ZONE_EXMEM will provide since we haven's seen
>>> Kyungsan's talk yet, that's why I myself was very curious to find out
>>> more about ZONE_EXMEM proposal and if it includes some provisions for
>>> CXL switched/shared memory.
>>>
>>> To me, I don't think it makes a difference if pages are coming from
>>> ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was
>>> if I could allocate from or migrate_pages to (ZONE_EXMEM | type
>>> "SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's
>>> the typing. That's what I meant with my initial response but I guess
>>> it wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in
>>> my case, this is where you'd have kernel allocations on CXL.mem"
>>>
>> We have 2 choices here.
>> a) Put CXL.mem in a separate NUMA node, with an existing ZONE type
>> (normal or movable).  Then you can migrate pages there with
>> move_pages(2) or migrate_pages(2).  Or you can run your workload on the
>> CXL.mem with numactl.
>> b) Put CXL.mem in an existing NUMA node, with a new ZONE type.  To
>> control your workloads in user space, you need a set of new ABIs.
>> Anything you cannot do in a)?
>
> I like the CXL.mem as a NUMA node approach, and also think it's best
> to do this with move/migrate_pages and numactl and those a & b are
> good choices.
>
> I think there is an option c too though, which is an amalgamation of a
> & b. Here is my thinking, and please do let me know what you think
> about this approach.
>
> If you think about CXL 3.0 shared/switched memory as a portal for a VM
> to move from one hypervisor to another, I think each switched memory 
> should be represented by it's own node and have a distinct type so the
> migration path becomes more deterministic. I was thinking along the 
> lines that there would be some kind of user space clustering/migration
> app/script that runs on all the hypervisors. Which would read, let's
> say /proc/pagetypeinfo to find these "portals":
> Node 4, zone Normal, type Switched ....
> Node 6, zone Normal, type Switched ....
>
> Then it would build a traversal Graph, find per hypervisor reach and
> critical connections, where critical connections are cross-rack or 
> cross-pod, perhaps something along the lines of this pseudo/python code:
> class Graph:
> 	def __init__(self, mydict):
> 		self.dict = mydict
> 		self.visited = set()
> 		self.critical = list()
> 		self.reach = dict()
> 		self.id = 0
> 	def depth_first_search(self, vertex, parent):
> 		self.visited.add(vertex)
> 		if vertex not in self.reach:
> 			self.reach[vertex] = {'id':self.id, 'reach':self.id}
> 			self.id += 1
> 		for next_vertex in self.dict[vertex] - {parent}:
> 			if next_vertex not in self.visited:
> 				self.depth_first_search(next_vertex, vertex)
> 			if self.reach[next_vertex]['reach'] < self.reach[vertex]['reach']:
> 				self.reach[vertex]['reach'] = self.reach[next_vertex]['reach']
> 		if parent != None and self.reach[vertex]['id'] == self.reach[vertex]['reach']:
> 			self.critical.append([parent, vertex])
> 		return self.critical
>
> critical = mygraph.depth_first_search("hostname-foo4", None)
>
> that way you could have a VM migrate between only two hypervisors
> sharing switched memory, or pass through a subset of hypervisors (that 
> don't necessarily share switched memory) to reach it's
> destination. This may be rack confined, or across a rack or even a pod
> using critical connections.
>
> Long way of saying that if you do a) then the clustering/migration
> script only sees a bunch of nodes and a bunch of normal zones it 
> wouldn't know how to build the "flight-path" and where to send a
> VM. You'd probably have to add an additional interface in the kernel
> for the script to query the paths somehow, where on the other hand
> pulling things from proc/sys is easy.
>
>
> And then if you do b) and put it in an existing NUMA and with a
> "Switched" type, you could potentially end up with several "Switched" 
> types under the same node. So when you numactl/move/migrate pages they
> could go in either direction and you could send some pages through one 
> "portal" and others through another "portal", which is not what you
> want to do.
>
> That's why I think the c option might be the most optimal, where each
> switched memory has it's own node number. And then displaying type as 
> "Switched" just makes it easier to detect and Graph the topology.
>
>
> And with regards to an ABI, I was referring to an ABI needed between
> the kernels running on separate hypervisors. When hypervisor B boots,
> it needs to detect through an ABI if this switched/shared memory is
> already initialized and if there are VMs in there which are used by
> another hypervisor, say A. Also during the migration, hypervisors A
> and B would have to use this ABI to synchronize the hand-off between
> the two physical hosts. Not an all-inclusive list, but I was referring
> to those types of scenarios.
>
> What do you think?

It seems unnecessary to add a new zone type to mark a node with some
attribute.  For example, in the following patch, a per-node attribute
can be added and shown in sysfs.

https://lore.kernel.org/linux-mm/20220704135833.1496303-10-martin.fernandez@eclypsium.com/

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                             ` <CGME20230407092950epcas2p12bc20c2952a800cf3f4f1d0b695f67e2@epcas2p1.samsung.com>
@ 2023-04-07  9:29                               ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-07  9:29 UTC (permalink / raw)
  To: ying.huang, dragan
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>Dragan Stancevic <dragan@stancevic.com> writes:
>
>> Hi Ying-
>>
>> On 4/4/23 01:47, Huang, Ying wrote:
>>> Dragan Stancevic <dragan@stancevic.com> writes:
>>> 
>>>> Hi Mike,
>>>>
>>>> On 4/3/23 03:44, Mike Rapoport wrote:
>>>>> Hi Dragan,
>>>>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>>>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> [..] >> One problem we experienced was occured in the combination of
>>>>>> hot-remove and kerelspace allocation usecases.
>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>>>>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>>>>>>
>>>>>>> This still does not describe what are the use cases that require having
>>>>>>> kernel allocations on CXL.mem.
>>>>>>>
>>>>>>> I believe it's important to start with explanation *why* it is important to
>>>>>>> have kernel allocations on removable devices.
>>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>>>>> clustering and VM migration over cxl.mem [1].
>>>>>>
>>>>>> And in my mind, at least one reason that I can think of having kernel
>>>>>> allocations from cxl.mem devices is where you have multiple VH connections
>>>>>> sharing the memory [2]. Where for example you have a user space application
>>>>>> stored in cxl.mem, and then you want the metadata about this
>>>>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>>>>> to another hypervisor. So basically the same way processors in a single
>>>>>> hypervisors cooperate on memory, you extend that across processors that span
>>>>>> over physical hypervisors. If that makes sense...
>>>>> Let me reiterate to make sure I understand your example.
>>>>> If we focus on VM usecase, your suggestion is to store VM's memory and
>>>>> associated KVM structures on a CXL.mem device shared by several nodes.
>>>>
>>>> Yes correct. That is what I am exploring, two different approaches:
>>>>
>>>> Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>>>> approach the VM and the metadata executes/resides on a traditional
>>>> NUMA node (cpu+dram) and only uses CXL.mem to transition between
>>>> hypervisors. It's not kept permanently there. So basically on
>>>> hypervisor A you would do something along the lines of migrate_pages
>>>> into cxl.mem and then on hypervisor B you would migrate_pages from
>>>> cxl.mem and onto the regular NUMA node (cpu+dram).
>>>>
>>>> Approach 2: Use CXL.mem to cluster hypervisors to improve high
>>>> availability of VMs. In this approach the VM and metadata would be
>>>> kept in CXL.mem permanently and each hypervisor accessing this shared
>>>> memory could have the potential to schedule/run the VM if the other
>>>> hypervisor experienced a failure.
>>>>
>>>>> Even putting aside the aspect of keeping KVM structures on presumably
>>>>> slower memory,
>>>>
>>>> Totally agree, presumption of memory speed dully noted. As far as I am
>>>> aware, CXL.mem at this point has higher latency than DRAM, and
>>>> switched CXL.mem has an additional latency. That may or may not change
>>>> in the future, but even with actual CXL induced latency I think there
>>>> are benefits to the approaches.
>>>>
>>>> In the example #1 above, I think even if you had a very noisy VM that
>>>> is dirtying pages at a high rate, once migrate_pages has occurred, it
>>>> wouldn't have to be quiesced for the migration to happen. A migration
>>>> could basically occur in-between the CPU slices, once VCPU is done
>>>> with it's slice on hypervisor A, the next slice could be on hypervisor
>>>> B.
>>>>
>>>> And the example #2 above, you are trading memory speed for
>>>> high-availability. Where either hypervisor A or B could run the CPU
>>>> load of the VM. You could even have a VM where some of the VCPUs are
>>>> executing on hypervisor A and others on hypervisor B to be able to
>>>> shift CPU load across hypervisors in quasi real-time.
>>>>
>>>>
>>>>> what ZONE_EXMEM will provide that cannot be accomplished
>>>>> with having the cxl memory in a memoryless node and using that node to
>>>>> allocate VM metadata?
>>>>
>>>> It has crossed my mind to perhaps use NUMA node distance for the two
>>>> approaches above. But I think that is not sufficient because we can
>>>> have varying distance, and distance in itself doesn't indicate
>>>> switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly
>>>> speaking just for myself here, with the two approaches above, the
>>>> crucial differentiator in order for #1 and #2 to work would be that
>>>> switched/shared CXL.mem would have to be indicated as such in a way.
>>>> Because switched memory would have to be treated and formatted in some
>>>> kind of ABI way that would allow hypervisors to cooperate and follow
>>>> certain protocols when using this memory.
>>>>
>>>>
>>>> I can't answer what ZONE_EXMEM will provide since we haven's seen
>>>> Kyungsan's talk yet, that's why I myself was very curious to find out
>>>> more about ZONE_EXMEM proposal and if it includes some provisions for
>>>> CXL switched/shared memory.
>>>>
>>>> To me, I don't think it makes a difference if pages are coming from
>>>> ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was
>>>> if I could allocate from or migrate_pages to (ZONE_EXMEM | type
>>>> "SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's
>>>> the typing. That's what I meant with my initial response but I guess
>>>> it wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in
>>>> my case, this is where you'd have kernel allocations on CXL.mem"
>>>>
>>> We have 2 choices here.
>>> a) Put CXL.mem in a separate NUMA node, with an existing ZONE type
>>> (normal or movable).  Then you can migrate pages there with
>>> move_pages(2) or migrate_pages(2).  Or you can run your workload on the
>>> CXL.mem with numactl.
>>> b) Put CXL.mem in an existing NUMA node, with a new ZONE type.  To
>>> control your workloads in user space, you need a set of new ABIs.
>>> Anything you cannot do in a)?
>>
>> I like the CXL.mem as a NUMA node approach, and also think it's best
>> to do this with move/migrate_pages and numactl and those a & b are
>> good choices.
>>
>> I think there is an option c too though, which is an amalgamation of a
>> & b. Here is my thinking, and please do let me know what you think
>> about this approach.
>>
>> If you think about CXL 3.0 shared/switched memory as a portal for a VM
>> to move from one hypervisor to another, I think each switched memory 
>> should be represented by it's own node and have a distinct type so the
>> migration path becomes more deterministic. I was thinking along the 
>> lines that there would be some kind of user space clustering/migration
>> app/script that runs on all the hypervisors. Which would read, let's
>> say /proc/pagetypeinfo to find these "portals":
>> Node 4, zone Normal, type Switched ....
>> Node 6, zone Normal, type Switched ....
>>
>> Then it would build a traversal Graph, find per hypervisor reach and
>> critical connections, where critical connections are cross-rack or 
>> cross-pod, perhaps something along the lines of this pseudo/python code:
>> class Graph:
>> 	def __init__(self, mydict):
>> 		self.dict = mydict
>> 		self.visited = set()
>> 		self.critical = list()
>> 		self.reach = dict()
>> 		self.id = 0
>> 	def depth_first_search(self, vertex, parent):
>> 		self.visited.add(vertex)
>> 		if vertex not in self.reach:
>> 			self.reach[vertex] = {'id':self.id, 'reach':self.id}
>> 			self.id += 1
>> 		for next_vertex in self.dict[vertex] - {parent}:
>> 			if next_vertex not in self.visited:
>> 				self.depth_first_search(next_vertex, vertex)
>> 			if self.reach[next_vertex]['reach'] < self.reach[vertex]['reach']:
>> 				self.reach[vertex]['reach'] = self.reach[next_vertex]['reach']
>> 		if parent != None and self.reach[vertex]['id'] == self.reach[vertex]['reach']:
>> 			self.critical.append([parent, vertex])
>> 		return self.critical
>>
>> critical = mygraph.depth_first_search("hostname-foo4", None)
>>
>> that way you could have a VM migrate between only two hypervisors
>> sharing switched memory, or pass through a subset of hypervisors (that 
>> don't necessarily share switched memory) to reach it's
>> destination. This may be rack confined, or across a rack or even a pod
>> using critical connections.
>>
>> Long way of saying that if you do a) then the clustering/migration
>> script only sees a bunch of nodes and a bunch of normal zones it 
>> wouldn't know how to build the "flight-path" and where to send a
>> VM. You'd probably have to add an additional interface in the kernel
>> for the script to query the paths somehow, where on the other hand
>> pulling things from proc/sys is easy.
>>
>>
>> And then if you do b) and put it in an existing NUMA and with a
>> "Switched" type, you could potentially end up with several "Switched" 
>> types under the same node. So when you numactl/move/migrate pages they
>> could go in either direction and you could send some pages through one 
>> "portal" and others through another "portal", which is not what you
>> want to do.
>>
>> That's why I think the c option might be the most optimal, where each
>> switched memory has it's own node number. And then displaying type as 
>> "Switched" just makes it easier to detect and Graph the topology.
>>
>>
>> And with regards to an ABI, I was referring to an ABI needed between
>> the kernels running on separate hypervisors. When hypervisor B boots,
>> it needs to detect through an ABI if this switched/shared memory is
>> already initialized and if there are VMs in there which are used by
>> another hypervisor, say A. Also during the migration, hypervisors A
>> and B would have to use this ABI to synchronize the hand-off between
>> the two physical hosts. Not an all-inclusive list, but I was referring
>> to those types of scenarios.
>>
>> What do you think?
>
>It seems unnecessary to add a new zone type to mark a node with some
>attribute.  For example, in the following patch, a per-node attribute
>can be added and shown in sysfs.
>

Hi Dragan, could you please confirm whether I understand a) and b) correctly?
a = the flow of page move/migration among switched nodes. Here, the switched node is "b", seen as one single node.
b = a node that is composed of multiple CXL DRAM devices under a single- or multi-level switch.

Hi Ying,
ZONE_EXMEM does not only mean adding an attribute to a node; it also provides provisioning across CXL.mem channels.
To be specific, multiple CXL DRAM devices can be composed into ZONE_EXMEM using sysfs or a cli[1], so that userland is able to handle them as a single node.

[1] https://github.com/OpenMPDK/SMDK/wiki/4.-Kernel#n-way-grouping
>https://lore.kernel.org/linux-mm/20220704135833.1496303-10-martin.fernandez@eclypsium.com/
>
>Best Regards,
>Huang, Ying

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]                                     ` <CGME20230407093007epcas2p32addf5da24110c3e45c90a15dcde0d01@epcas2p3.samsung.com>
@ 2023-04-07  9:30                                       ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-07  9:30 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>On 05.04.23 21:42, Dan Williams wrote:
>> Matthew Wilcox wrote:
>>> On Tue, Apr 04, 2023 at 09:48:41PM -0700, Dan Williams wrote:
>>>> Kyungsan Kim wrote:
>>>>> We know the situation. When a CXL DRAM channel is located under ZONE_NORMAL,
>>>>> a random allocation of a kernel object by calling kmalloc() siblings makes the entire CXL DRAM unremovable.
>>>>> Also, not all kernel objects can be allocated from ZONE_MOVABLE.
>>>>>
>>>>> ZONE_EXMEM does not confine a movability attribute(movable or unmovable), rather it allows a calling context can decide it.
>>>>> In that aspect, it is the same with ZONE_NORMAL but ZONE_EXMEM works for extended memory device.
>>>>> It does not mean ZONE_EXMEM support both movability and kernel object allocation at the same time.
>>>>> In case multiple CXL DRAM channels are connected, we think a memory consumer possibly dedicate a channel for movable or unmovable purpose.
>>>>>
>>>>
>>>> I want to clarify that I expect the number of people doing physical CXL
>>>> hotplug of whole devices to be small compared to dynamic capacity
>>>> devices (DCD). DCD is a new feature of the CXL 3.0 specification where a
>>>> device maps 1 or more thinly provisioned memory regions that have
>>>> individual extents get populated and depopulated by a fabric manager.
>>>>
>>>> In that scenario there is a semantic where the fabric manager hands out
>>>> 100G to a host and asks for it back, it is within the protocol that the
>>>> host can say "I can give 97GB back now, come back and ask again if you
>>>> need that last 3GB".
>>>
>>> Presumably it can't give back arbitrary chunks of that 100GB?  There's
>>> some granularity that's preferred; maybe on 1GB boundaries or something?
>> 
>> The device picks a granularity that can be tiny per spec, but it makes
>> the hardware more expensive to track in small extents, so I expect
>> something reasonable like 1GB, but time will tell once actual devices
>> start showing up.
>
>It all sounds a lot like virtio-mem using real hardware [I know, there 
>are important differences, but for the dynamic aspect there are very 
>similar issues to solve]
>
>For virtio-mem, the current best way to support hotplugging of large 
>memory to a VM to eventually be able to unplug a big fraction again is 
>using a combination of ZONE_MOVABLE and ZONE_NORMAL -- "auto-movable" 
>memory onlining policy. What's online to ZONE_MOVABLE can get (fairly) 
>reliably unplugged again. What's onlined to ZONE_NORMAL is possibly lost 
>forever.
>
>Like (incrementally) hotplugging 1 TiB to a 4 GiB VM. Being able to 
>unplug 1 TiB reliably again is pretty much out of scope. But the more 
>memory we can reliably get back the better. And the more memory we can 
>get in the common case, the better. With a ZONE_NORMAL vs. ZONE_MOVABLE 
>ratio of 1:3 one could unplug ~768 GiB again reliably. The remainder 
>depends on fragmentation on the actual system and the unplug granularity.
>
>The original plan was to use ZONE_PREFER_MOVABLE as a safety buffer to 
>reduce ZONE_NORMAL memory without increasing ZONE_MOVABLE memory (and 
>possibly harming the system). The underlying idea was that in many 
>setups that memory in ZONE_PREFER_MOVABLE would not get used for 
>unmovable allocations and it could, therefore, get unplugged fairly 
>reliably in these setups. For all other setups, unmovable allocations 
>could leak into ZONE_PREFER_MOVABLE and reduce the amount of memory we 
>could unplug again. But the system would try to keep unmovable 
>allocations to ZONE_NORMAL, so in most cases with some 
>ZONE_PREFER_MOVABLE memory we would perform better than with only 
>ZONE_NORMAL.

Probably the memory hotplug mechanism would be separated into two stages, physical memory add/remove and logical memory on/offline[1].
We think ZONE_PREFER_MOVABLE could help with logical memory on/offline, but there would be a trade-off between physical add/remove and device utilization.
In the case of ZONE_PREFER_MOVABLE allocation on switched CXL DRAM devices,
when pages are allocated evenly across the physical CXL DRAM devices, it would not help physical memory add/remove.
Meanwhile, when pages are allocated sequentially across the physical CXL DRAM devices, it would be the opposite (see the toy sketch below the references).

ZONE_EXMEM provides provisioning of CXL DRAM devices[2], and we think the ZONE_PREFER_MOVABLE idea can be applied on top of that,
for example as a preferred-movable policy per CXL DRAM device within the zone.

[1] https://docs.kernel.org/admin-guide/mm/memory-hotplug.html#phases-of-memory-hotplug
[2] https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
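
To make the trade-off concrete, here is a toy sketch (not SMDK code, just an
illustration with made-up numbers): with interleaved allocation every device
ends up holding pages, so none can be physically removed, while sequential
fill keeps the tail devices empty and therefore removable.

def allocate(num_devices, pages_per_device, pages_needed, policy):
    # Return per-device page counts after allocating pages_needed pages.
    used = [0] * num_devices
    if policy == "interleave":
        for i in range(pages_needed):
            used[i % num_devices] += 1        # spread evenly across devices
    elif policy == "sequential":
        remaining = pages_needed
        for dev in range(num_devices):
            take = min(remaining, pages_per_device)
            used[dev] += take                 # fill one device before the next
            remaining -= take
            if remaining == 0:
                break
    return used

for policy in ("interleave", "sequential"):
    used = allocate(num_devices=4, pages_per_device=1000, pages_needed=1500, policy=policy)
    removable = sum(1 for u in used if u == 0)
    print(policy, used, "-> hot-removable devices:", removable)

# interleave [375, 375, 375, 375] -> hot-removable devices: 0
# sequential [1000, 500, 0, 0]    -> hot-removable devices: 2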
>
>-- 
>Thanks,
>
>David / dhildenb

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
  2023-04-07  0:58                           ` Huang, Ying
       [not found]                             ` <CGME20230407092950epcas2p12bc20c2952a800cf3f4f1d0b695f67e2@epcas2p1.samsung.com>
@ 2023-04-07 14:35                             ` Dragan Stancevic
  1 sibling, 0 replies; 66+ messages in thread
From: Dragan Stancevic @ 2023-04-07 14:35 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Mike Rapoport, Kyungsan Kim, dan.j.williams, lsf-pc, linux-mm,
	linux-fsdevel, linux-cxl, a.manzanares, viacheslav.dubeyko,
	nil-migration

Hi Ying-


On 4/6/23 19:58, Huang, Ying wrote:
> Dragan Stancevic <dragan@stancevic.com> writes:
> 
>> Hi Ying-
>>
>> On 4/4/23 01:47, Huang, Ying wrote:
>>> Dragan Stancevic <dragan@stancevic.com> writes:
>>>
>>>> Hi Mike,
>>>>
>>>> On 4/3/23 03:44, Mike Rapoport wrote:
>>>>> Hi Dragan,
>>>>> On Thu, Mar 30, 2023 at 05:03:24PM -0500, Dragan Stancevic wrote:
>>>>>> On 3/26/23 02:21, Mike Rapoport wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> [..] >> One problem we experienced was occured in the combination of
>>>>>> hot-remove and kerelspace allocation usecases.
>>>>>>>> ZONE_NORMAL allows kernel context allocation, but it does not allow hot-remove because kernel resides all the time.
>>>>>>>> ZONE_MOVABLE allows hot-remove due to the page migration, but it only allows userspace allocation.
>>>>>>>> Alternatively, we allocated a kernel context out of ZONE_MOVABLE by adding GFP_MOVABLE flag.
>>>>>>>> In case, oops and system hang has occasionally occured because ZONE_MOVABLE can be swapped.
>>>>>>>> We resolved the issue using ZONE_EXMEM by allowing seletively choice of the two usecases.
>>>>>>>> As you well know, among heterogeneous DRAM devices, CXL DRAM is the first PCIe basis device, which allows hot-pluggability, different RAS, and extended connectivity.
>>>>>>>> So, we thought it could be a graceful approach adding a new zone and separately manage the new features.
>>>>>>>
>>>>>>> This still does not describe what are the use cases that require having
>>>>>>> kernel allocations on CXL.mem.
>>>>>>>
>>>>>>> I believe it's important to start with explanation *why* it is important to
>>>>>>> have kernel allocations on removable devices.
>>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> not speaking for Kyungsan here, but I am starting to tackle hypervisor
>>>>>> clustering and VM migration over cxl.mem [1].
>>>>>>
>>>>>> And in my mind, at least one reason that I can think of having kernel
>>>>>> allocations from cxl.mem devices is where you have multiple VH connections
>>>>>> sharing the memory [2]. Where for example you have a user space application
>>>>>> stored in cxl.mem, and then you want the metadata about this
>>>>>> process/application that the kernel keeps on one hypervisor be "passed on"
>>>>>> to another hypervisor. So basically the same way processors in a single
>>>>>> hypervisors cooperate on memory, you extend that across processors that span
>>>>>> over physical hypervisors. If that makes sense...
>>>>> Let me reiterate to make sure I understand your example.
>>>>> If we focus on VM usecase, your suggestion is to store VM's memory and
>>>>> associated KVM structures on a CXL.mem device shared by several nodes.
>>>>
>>>> Yes correct. That is what I am exploring, two different approaches:
>>>>
>>>> Approach 1: Use CXL.mem for VM migration between hypervisors. In this
>>>> approach the VM and the metadata executes/resides on a traditional
>>>> NUMA node (cpu+dram) and only uses CXL.mem to transition between
>>>> hypervisors. It's not kept permanently there. So basically on
>>>> hypervisor A you would do something along the lines of migrate_pages
>>>> into cxl.mem and then on hypervisor B you would migrate_pages from
>>>> cxl.mem and onto the regular NUMA node (cpu+dram).
>>>>
>>>> Approach 2: Use CXL.mem to cluster hypervisors to improve high
>>>> availability of VMs. In this approach the VM and metadata would be
>>>> kept in CXL.mem permanently and each hypervisor accessing this shared
>>>> memory could have the potential to schedule/run the VM if the other
>>>> hypervisor experienced a failure.
>>>>
>>>>> Even putting aside the aspect of keeping KVM structures on presumably
>>>>> slower memory,
>>>>
>>>> Totally agree, presumption of memory speed dully noted. As far as I am
>>>> aware, CXL.mem at this point has higher latency than DRAM, and
>>>> switched CXL.mem has an additional latency. That may or may not change
>>>> in the future, but even with actual CXL induced latency I think there
>>>> are benefits to the approaches.
>>>>
>>>> In the example #1 above, I think even if you had a very noisy VM that
>>>> is dirtying pages at a high rate, once migrate_pages has occurred, it
>>>> wouldn't have to be quiesced for the migration to happen. A migration
>>>> could basically occur in-between the CPU slices, once VCPU is done
>>>> with it's slice on hypervisor A, the next slice could be on hypervisor
>>>> B.
>>>>
>>>> And the example #2 above, you are trading memory speed for
>>>> high-availability. Where either hypervisor A or B could run the CPU
>>>> load of the VM. You could even have a VM where some of the VCPUs are
>>>> executing on hypervisor A and others on hypervisor B to be able to
>>>> shift CPU load across hypervisors in quasi real-time.
>>>>
>>>>
>>>>> what ZONE_EXMEM will provide that cannot be accomplished
>>>>> with having the cxl memory in a memoryless node and using that node to
>>>>> allocate VM metadata?
>>>>
>>>> It has crossed my mind to perhaps use NUMA node distance for the two
>>>> approaches above. But I think that is not sufficient because we can
>>>> have varying distance, and distance in itself doesn't indicate
>>>> switched/shared CXL.mem or non-switched/non-shared CXL.mem. Strictly
>>>> speaking just for myself here, with the two approaches above, the
>>>> crucial differentiator in order for #1 and #2 to work would be that
>>>> switched/shared CXL.mem would have to be indicated as such in a way.
>>>> Because switched memory would have to be treated and formatted in some
>>>> kind of ABI way that would allow hypervisors to cooperate and follow
>>>> certain protocols when using this memory.
>>>>
>>>>
>>>> I can't answer what ZONE_EXMEM will provide since we haven's seen
>>>> Kyungsan's talk yet, that's why I myself was very curious to find out
>>>> more about ZONE_EXMEM proposal and if it includes some provisions for
>>>> CXL switched/shared memory.
>>>>
>>>> To me, I don't think it makes a difference if pages are coming from
>>>> ZONE_NORMAL, or ZONE_EXMEM but the part that I was curious about was
>>>> if I could allocate from or migrate_pages to (ZONE_EXMEM | type
>>>> "SWITCHED/SHARED"). So it's not the zone that is crucial for me,  it's
>>>> the typing. That's what I meant with my initial response but I guess
>>>> it wasn't clear enough, "_if_ ZONE_EXMEM had some typing mechanism, in
>>>> my case, this is where you'd have kernel allocations on CXL.mem"
>>>>
>>> We have 2 choices here.
>>> a) Put CXL.mem in a separate NUMA node, with an existing ZONE type
>>> (normal or movable).  Then you can migrate pages there with
>>> move_pages(2) or migrate_pages(2).  Or you can run your workload on the
>>> CXL.mem with numactl.
>>> b) Put CXL.mem in an existing NUMA node, with a new ZONE type.  To
>>> control your workloads in user space, you need a set of new ABIs.
>>> Anything you cannot do in a)?
>>
>> I like the CXL.mem as a NUMA node approach, and also think it's best
>> to do this with move/migrate_pages and numactl and those a & b are
>> good choices.
>>
>> I think there is an option c too though, which is an amalgamation of a
>> & b. Here is my thinking, and please do let me know what you think
>> about this approach.
>>
>> If you think about CXL 3.0 shared/switched memory as a portal for a VM
>> to move from one hypervisor to another, I think each switched memory
>> should be represented by it's own node and have a distinct type so the
>> migration path becomes more deterministic. I was thinking along the
>> lines that there would be some kind of user space clustering/migration
>> app/script that runs on all the hypervisors. Which would read, let's
>> say /proc/pagetypeinfo to find these "portals":
>> Node 4, zone Normal, type Switched ....
>> Node 6, zone Normal, type Switched ....
>>
>> Then it would build a traversal Graph, find per hypervisor reach and
>> critical connections, where critical connections are cross-rack or
>> cross-pod, perhaps something along the lines of this pseudo/python code:
>> class Graph:
>> 	def __init__(self, mydict):
>> 		self.dict = mydict
>> 		self.visited = set()
>> 		self.critical = list()
>> 		self.reach = dict()
>> 		self.id = 0
>> 	def depth_first_search(self, vertex, parent):
>> 		self.visited.add(vertex)
>> 		if vertex not in self.reach:
>> 			self.reach[vertex] = {'id':self.id, 'reach':self.id}
>> 			self.id += 1
>> 		for next_vertex in self.dict[vertex] - {parent}:
>> 			if next_vertex not in self.visited:
>> 				self.depth_first_search(next_vertex, vertex)
>> 			if self.reach[next_vertex]['reach'] < self.reach[vertex]['reach']:
>> 				self.reach[vertex]['reach'] = self.reach[next_vertex]['reach']
>> 		if parent != None and self.reach[vertex]['id'] == self.reach[vertex]['reach']:
>> 			self.critical.append([parent, vertex])
>> 		return self.critical
>>
>> critical = mygraph.depth_first_search("hostname-foo4", None)
>>
>> that way you could have a VM migrate between only two hypervisors
>> sharing switched memory, or pass through a subset of hypervisors (that
>> don't necessarily share switched memory) to reach it's
>> destination. This may be rack confined, or across a rack or even a pod
>> using critical connections.
>>
>> Long way of saying that if you do a) then the clustering/migration
>> script only sees a bunch of nodes and a bunch of normal zones it
>> wouldn't know how to build the "flight-path" and where to send a
>> VM. You'd probably have to add an additional interface in the kernel
>> for the script to query the paths somehow, where on the other hand
>> pulling things from proc/sys is easy.
>>
>>
>> And then if you do b) and put it in an existing NUMA and with a
>> "Switched" type, you could potentially end up with several "Switched"
>> types under the same node. So when you numactl/move/migrate pages they
>> could go in either direction and you could send some pages through one
>> "portal" and others through another "portal", which is not what you
>> want to do.
>>
>> That's why I think the c option might be the most optimal, where each
>> switched memory has it's own node number. And then displaying type as
>> "Switched" just makes it easier to detect and Graph the topology.
>>
>>
>> And with regards to an ABI, I was referring to an ABI needed between
>> the kernels running on separate hypervisors. When hypervisor B boots,
>> it needs to detect through an ABI if this switched/shared memory is
>> already initialized and if there are VMs in there which are used by
>> another hypervisor, say A. Also during the migration, hypervisors A
>> and B would have to use this ABI to synchronize the hand-off between
>> the two physical hosts. Not an all-inclusive list, but I was referring
>> to those types of scenarios.
>>
>> What do you think?
> 
> It seems unnecessary to add a new zone type to mark a node with some
> attribute.  For example, in the following patch, a per-node attribute
> can be added and shown in sysfs.
> 
> https://lore.kernel.org/linux-mm/20220704135833.1496303-10-martin.fernandez@eclypsium.com/

That's a very good suggestion Ying, thank you, I appreciate it.

So perhaps have switched memory on its own node (option a) and
export a sysfs attribute like "switched". That might also be a good place
to export the hypervisor partners which share the same switched
memory, so the script can build up the connection topology graph (rough sketch below).
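
For illustration, a rough sketch of the script side; the sysfs file names
("switched", "partners") are made up here purely to show how the adjacency
dict consumed by the Graph class earlier in the thread could be built.

import glob
import os
import socket

def build_adjacency():
    # One entry per hypervisor; each hypervisor publishes its own entry and
    # the per-host views get merged into one cluster-wide dict before running
    # depth_first_search().
    adjacency = {socket.gethostname(): set()}
    for node_dir in glob.glob("/sys/devices/system/node/node[0-9]*"):
        switched_path = os.path.join(node_dir, "switched")    # hypothetical
        partners_path = os.path.join(node_dir, "partners")    # hypothetical
        if not os.path.exists(switched_path):
            continue
        with open(switched_path) as f:
            if f.read().strip() != "1":
                continue
        with open(partners_path) as f:
            adjacency[socket.gethostname()].update(f.read().split())
    return adjacency

# mydict = build_adjacency()     # merged across hosts -> Graph(mydict)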


--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 66+ messages in thread

* FW: [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL
       [not found]   ` <CGME20230414084120epcas2p37f105901350410772a3115a5a490c215@epcas2p3.samsung.com>
@ 2023-04-14  8:41     ` Kyungsan Kim
  0 siblings, 0 replies; 66+ messages in thread
From: Kyungsan Kim @ 2023-04-14  8:41 UTC (permalink / raw)
  To: ks0204.kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee,
	hj96.nam

>CXL is a promising technology that leads to fundamental changes in computing architecture.
>To facilitate adoption and widespread of CXL memory, we are developing a memory tiering solution, called SMDK[1][2].
>Using SMDK and CXL RAM device, our team has been working with industry and academic partners over last year.
>Also, thanks to many researcher's effort, CXL adoption stage is gradually moving forward from basic enablement to real-world composite usecases.
>At this moment, based on the researches and experiences gained working on SMDK, we would like to suggest a session at LSF/MM/BFP this year
>to propose possible Linux MM changes with a brief of SMDK.
>
>Adam Manzanares kindly adviced me that it is preferred to discuss implementation details on given problem and consensus at LSF/MM/BFP.
>Considering the adoption stage of CXL technology, however, let me suggest a design level discussion on the two MM expansions of SMDK this year.
>When we have design consensus with participants, we want to continue follow-up discussions with additional implementation details, hopefully.
>
> 
>1. A new zone, ZONE_EXMEM
>We added ZONE_EXMEM to manage CXL RAM device(s), separated from ZONE_NORMAL for usual DRAM due to the three reasons below.
>
>1) a CXL RAM has many different characteristics with conventional DRAM because a CXL device inherits and expands PCIe specification.
>ex) frequency range, pluggability, link speed/width negotiation, host/device flow control, power throttling, channel-interleaving methodology, error handling, and etc.
>It is likely that the primary usecase of CXL RAM would be System RAM.
>However, to deal with the hardware differences properly, different MM algorithms are needed accordingly.
>
>2) Historically, zone has been expanded by reflecting the evolution of CPU, IO, and memory devices.
>ex) ZONE_DMA(32), ZONE_HIGHMEM, ZONE_DEVICE, and ZONE_MOVABLE.
>Each zone applies different MM algorithms such as page reclaim, compaction, migration, and fragmentation.
>At first, we tried reuse of existing zones, ZONE_DEVICE and ZONE_MOVABLE, for CXL RAM purpose.
>However, the purpose and implementation of the zones are not fit for CXL RAM.
>
>3) Industry is preparing a CXL-capable system that connects dozens of CXL devices in a server system.
>When a CXL device becomes a separate node, an administrator/programmer needs to be aware of and manually control all nodes using 3rd party software, such as numactl and libnuma.
>ZONE_EXMEM allows the assemble of CXL RAM devices into the single ZONE_EXMEM zone, and provides an abstraction to userspace by seamlessly managing the devices.
>Also, the zone is able to interleave assembled devices in a software way to lead to aggregated bandwidth.
>We would like to suggest if it is co-existable with HW interleaving like SW/HW raid0.
>To help understanding, please refer to the node partition part of the picture[3].
>
>
>2. User/Kernelspace Programmable Interface
>In terms of a memory tiering solution, it is typical that the solution attempts to locate hot data on near memory, and cold data on far memory as accurately as possible.[4][5][6][7]
>We noticed that the hot/coldness of data is determined by the memory access pattern of running application and/or kernel context.
>Hence, a running context needs a near/far memory identifier to determine near/far memory. 
>When CXL RAM(s) is manipulated as a NUMA node, a node id can be function as a CXL identifier more or less.
>However, the node id has limitation in that it is an ephemeral information that dynamically varies according to online status of CXL topology and system socket.
>In this sense, we provides programmable interfaces for userspace and kernelspace context to explicitly (de)allocate memory from DRAM and CXL RAM regardless of a system change.
>Specifically, MAP_EXMEM and GFP_EXMEM flags were added to mmap() syscall and kmalloc() siblings, respectively.
>
>Thanks to Adam Manzanares for reviewing this CFP thoroughly.
>
>
>[1]SMDK: https://github.com/openMPDK/SMDK
>[2]SMT: Software-defined Memory Tiering for Heterogeneous Computing systems with CXL Memory Expander, https://ieeexplore.ieee.org/document/10032695
>[3]SMDK node partition: https://github.com/OpenMPDK/SMDK/wiki/2.-SMDK-Architecture#memory-partition
>[4]TMO: Transparent Memory Offloading in Datacenters, https://dl.acm.org/doi/10.1145/3503222.3507731
>[5]TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, https://arxiv.org/abs/2206.02878
>[6]Pond: CXL-Based Memory Pooling Systems for Cloud Platforms, https://dl.acm.org/doi/10.1145/3575693.3578835
>[7]Hierarchical NUMA: https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4656/original/Hierarchical_NUMA_Design_Plumbers_2017.pdf

Let us restate the original CFP from a requirements point of view, together with our thoughts on each point.

1) CXL DRAM pluggability
Issue: a random unmovable allocation makes a CXL DRAM unpluggable.
It can happen from userspace, e.g. pinning a DMA buffer, or from kernelspace, e.g. pinning metadata such as struct page, zone, etc.
For this matter, we should separate logical memory on/offline from physical add/remove.
Thought: a CXL DRAM should be usable in a selective manner, pluggable or unpluggable.
But please don't get this wrong: the two are mutually exclusive, so they cannot both apply to a single CXL DRAM channel at the same time.

2) CXL DRAM identifier (API and ABI)
Issue: a user/kernel context has to use the node id of a CXL memory node to access CXL DRAM, explicitly or implicitly.
Thought: a node id is ephemeral information. Userspace and kernelspace memory tiering solutions need an API and/or ABI rather than a node id.

3) Prevention of unintended CXL page migration
Issue: during zswap operation, a page on near memory (DIMM DRAM) is allocated to store a page swapped out of far memory (CXL DRAM).
Our thought: on the swap path, the far memory should not be promoted to near memory accidentally.

4) Too many CXL nodes appearing in userland
Issue: many CXL memory nodes will appear to userland as CXL-capable servers, switches and fabric topologies develop.
Currently, to obtain aggregated bandwidth across the CXL nodes, userland needs to be aware of and manage the nodes using 3rd-party SW such as numactl and libnuma (see the sketch after this list).
Thought: the kernel should provide an abstraction layer so that userland can deal with this seamlessly.
By the way, traditionally a node implies multiple memory channels at the same distance, and a node is the largest management unit in MM, i.e. Node - Zone - Page.
So we think that multiple CXL DRAMs can appear as one node, and the management unit for a single CXL DRAM should therefore be smaller than a node.
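
To make 4) concrete, this is roughly what userland has to do today when every
CXL DRAM shows up as its own memory node; the node ids and the workload path
below are examples only, and a script like this has to rediscover the ids after
any hotplug or topology change.

import subprocess

cxl_nodes = [2, 3, 4, 5]                          # example node ids, not fixed
node_list = ",".join(str(n) for n in cxl_nodes)

# Interleave the workload's memory across the CXL nodes with numactl.
subprocess.run(["numactl", "--interleave=" + node_list, "./workload"], check=True)

With the abstraction suggested above, the same workload would simply target the
single aggregated node instead of a changing list of node ids.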


^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2023-04-14  8:41 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20230221014114epcas2p1687db1d75765a8f9ed0b3495eab1154d@epcas2p1.samsung.com>
2023-02-21  1:41 ` [LSF/MM/BPF TOPIC] SMDK inspired MM changes for CXL Kyungsan Kim
2023-02-27 23:14   ` Dan Williams
     [not found]     ` <CGME20230228043551epcas2p3085444899b00b106c2901e1f51814d2c@epcas2p3.samsung.com>
2023-02-28  4:35       ` Kyungsan Kim
2023-03-03  6:07   ` Huang, Ying
     [not found]     ` <CGME20230322043354epcas2p2227bcad190a470d635b92f92587dc69e@epcas2p2.samsung.com>
2023-03-22  4:33       ` FW: " Kyungsan Kim
2023-03-22 22:03         ` Dan Williams
     [not found]           ` <CGME20230323105106epcas2p39ea8de619622376a4698db425c6a6fb3@epcas2p3.samsung.com>
2023-03-23 10:51             ` RE(2): " Kyungsan Kim
2023-03-23 12:25               ` David Hildenbrand
     [not found]                 ` <CGME20230324090923epcas2p2710ba4dc8157f9141c03104cf66e9d26@epcas2p2.samsung.com>
2023-03-24  9:09                   ` RE(4): " Kyungsan Kim
2023-03-24  9:12                     ` David Hildenbrand
     [not found]                       ` <CGME20230324092731epcas2p315c348bd76ef9fc84bffdb158e4c1aa4@epcas2p3.samsung.com>
2023-03-24  9:27                         ` RE(2): " Kyungsan Kim
2023-03-24  9:30                           ` David Hildenbrand
     [not found]                             ` <CGME20230324095031epcas2p284095ae90b25a47360b5098478dffdaa@epcas2p2.samsung.com>
2023-03-24  9:50                               ` RE(3): " Kyungsan Kim
2023-03-24 13:08                                 ` Jørgen Hansen
2023-03-24 22:33                                   ` David Hildenbrand
     [not found]                                     ` <CGME20230331114220epcas2p2d5734efcbdd8956f861f8e7178cd5288@epcas2p2.samsung.com>
2023-03-31 11:42                                       ` Kyungsan Kim
2023-03-31 13:42                                         ` Matthew Wilcox
2023-03-31 15:56                                           ` Frank van der Linden
2023-04-03  8:34                                             ` David Hildenbrand
     [not found]                                               ` <CGME20230405021655epcas2p2364b1f56dcde629bbd05bc796c2896aa@epcas2p2.samsung.com>
2023-04-05  2:16                                                 ` Kyungsan Kim
     [not found]                                             ` <CGME20230405020631epcas2p1c85058b28a70bbd46d587e78a9c9c7ad@epcas2p1.samsung.com>
2023-04-05  2:06                                               ` Re: " Kyungsan Kim
2023-04-05  5:00                                                 ` Dan Williams
     [not found]                                           ` <CGME20230405020121epcas2p2d9d39c151b6c5ab9e568ab9e2ab826ce@epcas2p2.samsung.com>
2023-04-05  2:01                                             ` Kyungsan Kim
2023-04-05  3:11                                               ` Matthew Wilcox
2023-04-03  8:28                                         ` David Hildenbrand
     [not found]                                           ` <CGME20230405020916epcas2p24cf04f5354c12632eba50b64b217e403@epcas2p2.samsung.com>
2023-04-05  2:09                                             ` Kyungsan Kim
     [not found]                                   ` <CGME20230331113147epcas2p12655777fec6839f7070ffcc446e3581b@epcas2p1.samsung.com>
2023-03-31 11:31                                     ` RE: RE(3): " Kyungsan Kim
2023-03-24  0:41               ` RE(2): " Huang, Ying
     [not found]                 ` <CGME20230324084808epcas2p354865d38dccddcb5cd46b17610345a5f@epcas2p3.samsung.com>
2023-03-24  8:48                   ` RE(4): " Kyungsan Kim
2023-03-24 13:46                     ` Gregory Price
     [not found]                       ` <CGME20230331113417epcas2p20a886e1712dbdb1f8eec03a2ac0a47e2@epcas2p2.samsung.com>
2023-03-31 11:34                         ` Kyungsan Kim
2023-03-31 15:53                           ` Gregory Price
     [not found]                             ` <CGME20230405020257epcas2p11b253f8c97a353890b96e6ae6eb515d3@epcas2p1.samsung.com>
2023-04-05  2:02                               ` Kyungsan Kim
2023-03-24 14:55               ` RE(2): " Matthew Wilcox
2023-03-24 17:49                 ` Matthew Wilcox
     [not found]                   ` <CGME20230331113715epcas2p13127b95af4000ec1ed96a2e9d89b7444@epcas2p1.samsung.com>
2023-03-31 11:37                     ` Kyungsan Kim
2023-03-31 12:54                       ` Matthew Wilcox
     [not found]                         ` <CGME20230405020027epcas2p4682d43446a493385b60c39a1dbbf07d6@epcas2p4.samsung.com>
2023-04-05  2:00                           ` Kyungsan Kim
2023-04-05  4:48                             ` Dan Williams
2023-04-05 18:12                               ` Matthew Wilcox
2023-04-05 19:42                                 ` Dan Williams
2023-04-06 12:27                                   ` David Hildenbrand
     [not found]                                     ` <CGME20230407093007epcas2p32addf5da24110c3e45c90a15dcde0d01@epcas2p3.samsung.com>
2023-04-07  9:30                                       ` Kyungsan Kim
     [not found]                   ` <CGME20230331113845epcas2p313118617918ae2bf634c3c475fc5dbd8@epcas2p3.samsung.com>
2023-03-31 11:38                     ` Re: RE(2): " Kyungsan Kim
2023-03-26  7:21               ` Mike Rapoport
2023-03-30 22:03                 ` Dragan Stancevic
2023-04-03  8:44                   ` Mike Rapoport
2023-04-04  4:27                     ` Dragan Stancevic
2023-04-04  6:47                       ` Huang, Ying
2023-04-06 22:27                         ` Dragan Stancevic
2023-04-07  0:58                           ` Huang, Ying
     [not found]                             ` <CGME20230407092950epcas2p12bc20c2952a800cf3f4f1d0b695f67e2@epcas2p1.samsung.com>
2023-04-07  9:29                               ` Kyungsan Kim
2023-04-07 14:35                             ` Dragan Stancevic
     [not found]                       ` <CGME20230405101840epcas2p4c92037ceba77dfe963d24791a9058450@epcas2p4.samsung.com>
2023-04-05 10:18                         ` Kyungsan Kim
     [not found]                 ` <CGME20230331114526epcas2p2b6f1d4c8c1c0b2e3c12a425b6e48c0d8@epcas2p2.samsung.com>
2023-03-31 11:45                   ` RE: RE(2): " Kyungsan Kim
2023-04-04  8:31                     ` Mike Rapoport
2023-04-04 17:58                       ` Adam Manzanares
2023-04-01 10:51                         ` Gregory Price
2023-04-04 18:59                           ` [External] " Viacheslav A.Dubeyko
2023-04-01 11:51                             ` Gregory Price
2023-04-04 21:09                               ` Viacheslav A.Dubeyko
     [not found]                               ` <642cb7ec58c71_21a829453@dwillia2-xfh.jf.intel.com.notmuch>
2023-04-05  2:34                                 ` Gregory Price
     [not found]                               ` <CGME20230405101843epcas2p2c819c8d60b2a9a776124c2b4bc25af14@epcas2p2.samsung.com>
2023-04-05 10:18                                 ` Kyungsan Kim
2023-03-30 22:02   ` Dragan Stancevic
     [not found]     ` <CGME20230331114649epcas2p23d52cd1d224085e6192a0aaf22948e3e@epcas2p2.samsung.com>
2023-03-31 11:46       ` Kyungsan Kim
     [not found]   ` <CGME20230414084120epcas2p37f105901350410772a3115a5a490c215@epcas2p3.samsung.com>
2023-04-14  8:41     ` FW: " Kyungsan Kim
