linux-cxl.vger.kernel.org archive mirror
* [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
@ 2023-04-07 21:05 Dragan Stancevic
  2023-04-07 22:23 ` James Houghton
                   ` (6 more replies)
  0 siblings, 7 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-07 21:05 UTC (permalink / raw)
  To: lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Hi folks-

if it's not too late for the schedule...

I am starting to tackle VM live migration and hypervisor clustering over 
switched CXL memory[1][2], intended for cloud virtualization types of loads.

I'd be interested in doing a small BoF session with some slides and 
getting into a discussion/brainstorming with other people who deal with 
VM/LM cloud loads. Among other things to discuss would be page 
migrations over switched CXL memory, a shared in-memory ABI to allow VM 
hand-off between hypervisors, etc...

A few of us discussed some of this under the ZONE_XMEM thread, but I 
figured it might be better to start a separate thread.

If there is interest, thank you.


[1]. High-level overview available at http://nil-migration.org/
[2]. Based on CXL spec 3.0

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 21:05 [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dragan Stancevic
@ 2023-04-07 22:23 ` James Houghton
  2023-04-07 23:17   ` David Rientjes
  2023-04-08  0:05 ` Gregory Price
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: James Houghton @ 2023-04-07 22:23 UTC (permalink / raw)
  To: Dragan Stancevic
  Cc: lsf-pc, nil-migration, linux-cxl, linux-mm, David Rientjes

On Fri, Apr 7, 2023 at 5:05 PM Dragan Stancevic <dragan@stancevic.com> wrote:
>
> Hi folks-
>
> if it's not too late for the schedule...
>
> I am starting to tackle VM live migration and hypervisor clustering over
> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>
> I'd be interested in doing a small BoF session with some slides and get
> into a discussion/brainstorming with other people that deal with VM/LM
> cloud loads. Among other things to discuss would be page migrations over
> switched CXL memory, shared in-memory ABI to allow VM hand-off between
> hypervisors, etc...
>
> A few of us discussed some of this under the ZONE_XMEM thread, but I
> figured it might be better to start a separate thread.
>
> If there is interested, thank you.

Hi Dragan,

Thanks for bringing up this topic. I'd be very interested to be part
of this BoF, as I'm also interested in using CXL.mem as a live
migration mechanism.

- James

>
>
> [1]. High-level overview available at http://nil-migration.org/
> [2]. Based on CXL spec 3.0
>
> --
> Peace can only come as a natural consequence
> of universal enlightenment -Dr. Nikola Tesla
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 22:23 ` James Houghton
@ 2023-04-07 23:17   ` David Rientjes
  2023-04-08  1:33     ` Dragan Stancevic
  2023-04-08 16:24     ` Dragan Stancevic
  0 siblings, 2 replies; 40+ messages in thread
From: David Rientjes @ 2023-04-07 23:17 UTC (permalink / raw)
  To: James Houghton, Dragan Stancevic
  Cc: lsf-pc, nil-migration, linux-cxl, linux-mm


On Fri, 7 Apr 2023, James Houghton wrote:

> On Fri, Apr 7, 2023 at 5:05 PM Dragan Stancevic <dragan@stancevic.com> wrote:
> >
> > Hi folks-
> >
> > if it's not too late for the schedule...
> >
> > I am starting to tackle VM live migration and hypervisor clustering over
> > switched CXL memory[1][2], intended for cloud virtualization types of loads.
> >
> > I'd be interested in doing a small BoF session with some slides and get
> > into a discussion/brainstorming with other people that deal with VM/LM
> > cloud loads. Among other things to discuss would be page migrations over
> > switched CXL memory, shared in-memory ABI to allow VM hand-off between
> > hypervisors, etc...
> >
> > A few of us discussed some of this under the ZONE_XMEM thread, but I
> > figured it might be better to start a separate thread.
> >
> > If there is interested, thank you.
> 
> Hi Dragan,
> 
> Thanks for bringing up this topic. I'd be very interested to be part
> of this BoF, as I'm also interested in using CXL.mem as a live
> migration mechanism.
> 

Thanks for cc'ing me, this would be very interesting to talk about.  Count 
me in!

> > [1]. High-level overview available at http://nil-migration.org/
> > [2]. Based on CXL spec 3.0
> >

Dragan: I'm curious about the reference to the CXL 3.0 spec here: is 
there something specific about 3.0 that this work depends on, or should 
we be good to go with 2.0 as well?  (Are you referring to 3.0 for the 
security extensions?)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 21:05 [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dragan Stancevic
  2023-04-07 22:23 ` James Houghton
@ 2023-04-08  0:05 ` Gregory Price
  2023-04-11  0:56   ` Dragan Stancevic
  2023-04-11  6:37   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Huang, Ying
  2023-04-09 17:40 ` Shreyas Shah
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 40+ messages in thread
From: Gregory Price @ 2023-04-08  0:05 UTC (permalink / raw)
  To: Dragan Stancevic; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

On Fri, Apr 07, 2023 at 04:05:31PM -0500, Dragan Stancevic wrote:
> Hi folks-
> 
> if it's not too late for the schedule...
> 
> I am starting to tackle VM live migration and hypervisor clustering over
> switched CXL memory[1][2], intended for cloud virtualization types of loads.
> 
> I'd be interested in doing a small BoF session with some slides and get into
> a discussion/brainstorming with other people that deal with VM/LM cloud
> loads. Among other things to discuss would be page migrations over switched
> CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors,
> etc...
> 
> A few of us discussed some of this under the ZONE_XMEM thread, but I figured
> it might be better to start a separate thread.
> 
> If there is interested, thank you.
> 
> 
> [1]. High-level overview available at http://nil-migration.org/
> [2]. Based on CXL spec 3.0
> 
> --
> Peace can only come as a natural consequence
> of universal enlightenment -Dr. Nikola Tesla

I've been chatting about this with folks offline, so I figure I'll toss 
my thoughts on the issue here.


Some things to consider:


1. If secure-compute is being used, then this mechanism won't work, as
   pages will be pinned, and therefore not movable and excluded from
   using CXL memory at all.

   This issue does not exist with traditional live migration, because
   typically some kind of copy is used from one virtual space to another
   (e.g. RDMA), so pages aren't really migrated in the kernel memory
   block/NUMA node sense.


2. During the migration process, the memory needs to be forced not to be
   migrated to another node by other means (tiering software, swap,
   etc).  The obvious way of doing this would be to migrate and
   temporarily pin the page... but going back to problem #1 we see that
   ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
   troublesome.


3. This is changing the semantics of migration from a virtual memory
   movement to a physical memory movement.  Typically you would expect
   the RDMA process for live migration to work something like...

   a) migration request arrives
   b) source host informs destination host of size requirements
   c) destination host allocates memory and passes a Virtual Address
      back to source host
   d) source host initiates an RDMA from HostA-VA to HostB-VA
   e) CPU task is migrated

   Importantly, the allocation of memory by Host B handles the step of
   creating HVA->HPA mappings, and the Extended/Nested Page
   Tables can simply be flushed and re-created after the VM is fully
   migrated.

   too long; didn't read: live migration is a virtual address operation,
   and node migration is a PHYSICAL address operation; the virtual
   addresses remain the same (see the sketch after this list).

   This is problematic, as it's changing the underlying semantics of the
   migration operation.
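
   A minimal sketch of the node-migration half from userspace, using
   libnuma (node numbers are hypothetical: 0 = local DRAM, 2 = the
   switched CXL.mem node).  It illustrates the point above: only the
   backing physical frames move, the HVAs/GPAs stay the same.

   #include <numa.h>        /* link with -lnuma */
   #include <stdio.h>

   /* Migrate every page of the QEMU process `pid` from node 0 to the
    * CXL.mem node.  This is the PHYSICAL-address operation. */
   static int push_vm_pages_to_cxl(int pid)
   {
           struct bitmask *from = numa_allocate_nodemask();
           struct bitmask *to   = numa_allocate_nodemask();
           int ret;

           numa_bitmask_setbit(from, 0);   /* source: local DRAM        */
           numa_bitmask_setbit(to, 2);     /* target: CXL.mem node      */

           ret = numa_migrate_pages(pid, from, to);
           if (ret < 0)
                   perror("numa_migrate_pages");

           numa_free_nodemask(from);
           numa_free_nodemask(to);
           return ret;
   }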



Problems #1 and #2 are head-scratchers, but maybe solvable.

Problem #3 is the meat and potatoes of the issue in my opinion. So let's
consider that a little more closely.

Generically: NIL Migration is basically a pass by reference operation.

The reference in this case is... the page tables.  You need to know how
to interpret the data in the CXL memory region on the remote host, and
that's a "relative page table translation" (to coin a phrase? I'm not
sure how to best describe it).

That's... complicated to say the least.
1) Pages on the physical hardware do not need to be contiguous
2) The CFMW on source and target host do not need to be mapped at the
   same place
3) There's no pre-allocation in these charts, and migration isn't
   targeted, so having the source-host "expertly place" the data isn't
   possible (right now; I suppose you could make kernel extensions).
4) Similar to problem #2 above, even with a pre-allocate added in, you
   would need to ensure those mappings were pinned during migration,
   lest the target host end up swapping a page or something.



An Option:  Make pages physically contiguous on migration to CXL

In this case, you don't necessarily care about the Host Virtual
Addresses; what you actually care about is the structure of the pages
in memory (are they physically contiguous, or do you need to
reconstruct the contiguity by inspecting the page tables?).

If a migration API were capable of reserving large swaths of contiguous
CXL memory, you could discard individual page information and instead
send page range information, reconstructing the virtual-physical
mappings this way.
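
To make "page range information" concrete, here is a purely hypothetical
hand-off record (nothing like this exists today, names are made up):
one per physically contiguous run parked in the shared CXL region,
which the destination host would walk to rebuild its GPA mappings.

#include <linux/types.h>

/* Hypothetical per-run descriptor for the hand-off. */
struct nil_page_range {
        __u64 guest_phys_addr;   /* GPA this run backs                 */
        __u64 cxl_dpa_offset;    /* offset into the shared CXL pool    */
        __u64 nr_pages;          /* length of the contiguous run       */
        __u32 page_shift;        /* 12 for 4K, 21 for 2M, ...          */
        __u32 flags;             /* dirty/clean, ownership, etc.       */
};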



That's about as far as I've thought about it so far.  Feel free to rip
it apart! :]

~Gregory

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 23:17   ` David Rientjes
@ 2023-04-08  1:33     ` Dragan Stancevic
  2023-04-08 16:24     ` Dragan Stancevic
  1 sibling, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-08  1:33 UTC (permalink / raw)
  To: David Rientjes, James Houghton; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

Hi David-

On 4/7/23 18:17, David Rientjes wrote:
> On Fri, 7 Apr 2023, James Houghton wrote:
> 
>> On Fri, Apr 7, 2023 at 5:05 PM Dragan Stancevic <dragan@stancevic.com> wrote:
>>>
>>> Hi folks-
>>>
>>> if it's not too late for the schedule...
>>>
>>> I am starting to tackle VM live migration and hypervisor clustering over
>>> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>>
>>> I'd be interested in doing a small BoF session with some slides and get
>>> into a discussion/brainstorming with other people that deal with VM/LM
>>> cloud loads. Among other things to discuss would be page migrations over
>>> switched CXL memory, shared in-memory ABI to allow VM hand-off between
>>> hypervisors, etc...
>>>
>>> A few of us discussed some of this under the ZONE_XMEM thread, but I
>>> figured it might be better to start a separate thread.
>>>
>>> If there is interested, thank you.
>>
>> Hi Dragan,
>>
>> Thanks for bringing up this topic. I'd be very interested to be part
>> of this BoF, as I'm also interested in using CXL.mem as a live
>> migration mechanism.
>>
> 
> Thanks for cc'ing me, this would be very interesting to talk about.  Count
> me in!
> 
>>> [1]. High-level overview available at http://nil-migration.org/
>>> [2]. Based on CXL spec 3.0
>>>
> 
> Dragan: I'm curious about the reference to CXL spec 3.0 here, is there
> something specific about 3.0 that this work depends on or should we be
> good-to-go with 2.0 as well?  (Are you referring to 3.0 for security
> extensions?)

I'm referencing 3.0 with regard to switched/shared memory as defined in 
the Compute Express Link Specification r3.0, v1.0, 8/1/22, page 51, 
figure 1-4, black color scheme, circle (3) and bars.


It may be possible to do it with 2.0, but as far as I understand[1] the 
2.0 spec, it might be a lot more involved/clunky. I think the new 3.0 
features make it easier.



[1]. I would love to read the 2.0 spec, but I don't have access to it 
(only 3.0). But that is my understanding from speaking with some CXL 
folks at last year's Plumbers when I floated this idea with them.

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 23:17   ` David Rientjes
  2023-04-08  1:33     ` Dragan Stancevic
@ 2023-04-08 16:24     ` Dragan Stancevic
  1 sibling, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-08 16:24 UTC (permalink / raw)
  To: David Rientjes, James Houghton; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

Hi David-


On 4/7/23 18:17, David Rientjes wrote:
> On Fri, 7 Apr 2023, James Houghton wrote:
> 
>> On Fri, Apr 7, 2023 at 5:05 PM Dragan Stancevic <dragan@stancevic.com> wrote:
>>>
>>> Hi folks-
>>>
>>> if it's not too late for the schedule...
>>>
>>> I am starting to tackle VM live migration and hypervisor clustering over
>>> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>>
>>> I'd be interested in doing a small BoF session with some slides and get
>>> into a discussion/brainstorming with other people that deal with VM/LM
>>> cloud loads. Among other things to discuss would be page migrations over
>>> switched CXL memory, shared in-memory ABI to allow VM hand-off between
>>> hypervisors, etc...
>>>
>>> A few of us discussed some of this under the ZONE_XMEM thread, but I
>>> figured it might be better to start a separate thread.
>>>
>>> If there is interested, thank you.
>>
>> Hi Dragan,
>>
>> Thanks for bringing up this topic. I'd be very interested to be part
>> of this BoF, as I'm also interested in using CXL.mem as a live
>> migration mechanism.
>>
> 
> Thanks for cc'ing me, this would be very interesting to talk about.  Count
> me in!
> 
>>> [1]. High-level overview available at http://nil-migration.org/
>>> [2]. Based on CXL spec 3.0
>>>
> 
> Dragan: I'm curious about the reference to CXL spec 3.0 here, is there
> something specific about 3.0 that this work depends on or should we be
> good-to-go with 2.0 as well?  (Are you referring to 3.0 for security
> extensions?)


Sorry, I hit send too soon and then had hosting provider issues... the 
hypervisor clustering part[1] might not work on CXL 2.0.


[1]. http://nil-migration.org/ds-nil-migration-p12.png



--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 21:05 [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dragan Stancevic
  2023-04-07 22:23 ` James Houghton
  2023-04-08  0:05 ` Gregory Price
@ 2023-04-09 17:40 ` Shreyas Shah
  2023-04-11  1:08   ` Dragan Stancevic
       [not found] ` <CGME20230410030532epcas2p49eae675396bf81658c1a3401796da1d4@epcas2p4.samsung.com>
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Shreyas Shah @ 2023-04-09 17:40 UTC (permalink / raw)
  To: Dragan Stancevic, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Hi Dragan,

The concept of time-sharing the CXL-attached memory across two NUMA nodes for live migration and creating a cluster of VMs to increase the compute capacity is great. 

When and where is the BoF? 


Regards,
Shreyas
 

-----Original Message-----
From: Dragan Stancevic <dragan@stancevic.com> 
Sent: Friday, April 7, 2023 2:06 PM
To: lsf-pc@lists.linux-foundation.org
Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; linux-mm@kvack.org
Subject: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​

Hi folks-

if it's not too late for the schedule...

I am starting to tackle VM live migration and hypervisor clustering over switched CXL memory[1][2], intended for cloud virtualization types of loads.

I'd be interested in doing a small BoF session with some slides and get into a discussion/brainstorming with other people that deal with VM/LM cloud loads. Among other things to discuss would be page migrations over switched CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors, etc...

A few of us discussed some of this under the ZONE_XMEM thread, but I figured it might be better to start a separate thread.

If there is interested, thank you.


[1]. High-level overview available at http://nil-migration.org/ [2]. Based on CXL spec 3.0

--
Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
       [not found] ` <CGME20230410030532epcas2p49eae675396bf81658c1a3401796da1d4@epcas2p4.samsung.com>
@ 2023-04-10  3:05   ` Kyungsan Kim
  2023-04-10 17:46     ` [External] " Viacheslav A.Dubeyko
  2023-04-14  3:27     ` Dragan Stancevic
  0 siblings, 2 replies; 40+ messages in thread
From: Kyungsan Kim @ 2023-04-10  3:05 UTC (permalink / raw)
  To: dragan
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>Hi folks-
>
>if it's not too late for the schedule...
>
>I am starting to tackle VM live migration and hypervisor clustering over
>switched CXL memory[1][2], intended for cloud virtualization types of loads.
>
>I'd be interested in doing a small BoF session with some slides and get
>into a discussion/brainstorming with other people that deal with VM/LM
>cloud loads. Among other things to discuss would be page migrations over
>switched CXL memory, shared in-memory ABI to allow VM hand-off between
>hypervisors, etc...
>
>A few of us discussed some of this under the ZONE_XMEM thread, but I
>figured it might be better to start a separate thread.
>
>If there is interested, thank you.

I would like to join the discussion as well.
Let me kindly suggest that it would be even better if it included the VM/hypervisor data flow as background and the expected kernel interaction.

>
>
>[1]. High-level overview available at http://nil-migration.org/
>[2]. Based on CXL spec 3.0
>
>--
>Peace can only come as a natural consequence
>of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [External] [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
  2023-04-10  3:05   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Kyungsan Kim
@ 2023-04-10 17:46     ` Viacheslav A.Dubeyko
  2023-04-14  3:27     ` Dragan Stancevic
  1 sibling, 0 replies; 40+ messages in thread
From: Viacheslav A.Dubeyko @ 2023-04-10 17:46 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: dragan, lsf-pc, linux-mm, Linux FS Devel, linux-cxl,
	Adam Manzanares, Dan Williams, seungjun.ha, wj28.lee



> On Apr 9, 2023, at 8:05 PM, Kyungsan Kim <ks0204.kim@samsung.com> wrote:
> 
>> Hi folks-
>> 
>> if it's not too late for the schedule...
>> 
>> I am starting to tackle VM live migration and hypervisor clustering over
>> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>> 
>> I'd be interested in doing a small BoF session with some slides and get
>> into a discussion/brainstorming with other people that deal with VM/LM
>> cloud loads. Among other things to discuss would be page migrations over
>> switched CXL memory, shared in-memory ABI to allow VM hand-off between
>> hypervisors, etc...
>> 
>> A few of us discussed some of this under the ZONE_XMEM thread, but I
>> figured it might be better to start a separate thread.
>> 
>> If there is interested, thank you.
> 
> I would like join the discussion as well.
> Let me kindly suggest it would be more great if it includes the data flow of VM/hypervisor as background and kernel interaction expected.
> 

Sounds like an interesting topic to me. I would like to attend the discussion.

Thanks,
Slava.

>> 
>> 
>> [1]. High-level overview available at http://nil-migration.org/
>> [2]. Based on CXL spec 3.0
>> 
>> --
>> Peace can only come as a natural consequence
>> of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-08  0:05 ` Gregory Price
@ 2023-04-11  0:56   ` Dragan Stancevic
  2023-04-11  1:48     ` Gregory Price
  2023-04-11  6:37   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Huang, Ying
  1 sibling, 1 reply; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-11  0:56 UTC (permalink / raw)
  To: Gregory Price; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

Hi Gregory-

On 4/7/23 19:05, Gregory Price wrote:
> On Fri, Apr 07, 2023 at 04:05:31PM -0500, Dragan Stancevic wrote:
>> Hi folks-
>>
>> if it's not too late for the schedule...
>>
>> I am starting to tackle VM live migration and hypervisor clustering over
>> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>
>> I'd be interested in doing a small BoF session with some slides and get into
>> a discussion/brainstorming with other people that deal with VM/LM cloud
>> loads. Among other things to discuss would be page migrations over switched
>> CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors,
>> etc...
>>
>> A few of us discussed some of this under the ZONE_XMEM thread, but I figured
>> it might be better to start a separate thread.
>>
>> If there is interested, thank you.
>>
>>
>> [1]. High-level overview available at http://nil-migration.org/
>> [2]. Based on CXL spec 3.0
>>
>> --
>> Peace can only come as a natural consequence
>> of universal enlightenment -Dr. Nikola Tesla
> 
> I've been chatting about this with folks offline, figure i'll toss my
> thoughts on the issue here.

excellent brain dump, thank you


> Some things to consider:
> 
> 1. If secure-compute is being used, then this mechanism won't work as
>     pages will be pinned, and therefore not movable and excluded from
>     using cxl memory at all.
> 
>     This issue does not exist with traditional live migration, because
>     typically some kind of copy is used from one virtual space to another
>     (i.e. RMDA), so pages aren't really migrated in the kernel memory
>     block/numa node sense.

right, agreed... I don't think we can migrate in all scenarios, such as 
pinning or forms of pass-through, etc

my opinion just to start off, as a base requirement, would be that the 
pages be movable.



> 2. During the migration process, the memory needs to be forced not to be
>     migrated to another node by other means (tiering software, swap,
>     etc).  The obvious way of doing this would be to migrate and
>     temporarily pin the page... but going back to problem #1 we see that
>     ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>     troublesome.

Yeah, true. I'd have to check the code, but I wonder if perhaps we could 
mapcount or refcount the pages upon migration onto CXL switched memory. 
If my memory serves me right, wouldn't move_pages() back off or stall? 
I guess it's TBD how workable or useful that would be, but it's good to 
be thinking of different ways of doing this.
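
As a rough point of reference, this is how that back-off shows up at the
move_pages(2) level today (sketch via libnuma; node 2 is a hypothetical
CXL.mem node): pages the kernel can't isolate/migrate (e.g. elevated
refcount) come back with a negative errno such as -EBUSY in the per-page
status array rather than stalling the whole call.

#include <numaif.h>      /* move_pages(), MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>

/* Try to move `count` pages of process `pid` onto the CXL node and
 * report the ones the kernel backed off on. */
static void move_to_cxl(int pid, void **pages, unsigned long count)
{
        int nodes[count], status[count];
        unsigned long i;

        for (i = 0; i < count; i++)
                nodes[i] = 2;            /* hypothetical CXL.mem node */

        if (move_pages(pid, count, pages, nodes, status, MPOL_MF_MOVE) < 0) {
                perror("move_pages");
                return;
        }

        for (i = 0; i < count; i++)
                if (status[i] < 0)       /* e.g. -EBUSY: busy/refcounted */
                        fprintf(stderr, "page %lu not moved: %d\n", i,
                                status[i]);
}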


> 3. This is changing the semantics of migration from a virtual memory
>     movement to a physical memory movement.  Typically you would expect
>     the RDMA process for live migration to work something like...
> 
>     a) migration request arrives
>     b) source host informs destination host of size requirements
>     c) destination host allocations memory and passes a Virtual Address
>        back to source host
>     d) source host initates an RDMA from HostA-VA to HostB-VA
>     e) CPU task is migrated
> 
>     Importantly, the allocation of memory by Host B handles the important
>     step of creating HVA->HPA mappings, and the Extended/Nested Page
>     Tables can simply be flushed and re-created after the VM is fully
>     migrated.
> 
>     to long didn't read: live migration is a virtual address operation,
>     and node-migration is a PHYSICAL address operation, the virtual
>     addresses remain the same.
> 
>     This is problematic, as it's changing the underlying semantics of the
>     migration operation.

Those are all valid points, but what if you don't need to recreate 
HVA->HPA mappings? If I am understanding the CXL 3.0 spec correctly, 
then neither the virtual addresses nor the physical addresses would have 
to change, because the fabric "virtualizes" host physical addresses and 
the translation is done by the G-FAM/GFD, which has the capability to 
translate multi-host HPAs to its internal DPAs. So if you have two 
hypervisors seeing the device physical address as the same physical 
address, that might work?


> Problem #1 and #2 are head-scratchers, but maybe solvable.
> 
> Problem #3 is the meat and potatos of the issue in my opinion. So lets
> consider that a little more closely.
> 
> Generically: NIL Migration is basically a pass by reference operation.

Yup, agreed


> The reference in this case is... the page tables.  You need to know how
> to interpret the data in the CXL memory region on the remote host, and
> that's a "relative page table translation" (to coin a phrase? I'm not
> sure how to best describe it).

right, coining phrases... I have been thinking of a "super-page" (for 
lack of a better word): a metadata region sitting on the switched 
CXL.mem device that would allow hypervisors to synchronize on various 
aspects, such as the "relative page table translation", host is up, host 
is down, list of peers, who owns what, etc... In a perfect scenario, I 
would love to see the hypervisors cooperating on a switched CXL.mem 
device the same way CPUs on different NUMA nodes cooperate on memory in 
a single hypervisor. If either host can allocate and schedule from this 
space, then the "NIL" aspect of migration is "free".
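
Just to sketch the shape of it (entirely hypothetical, all names made 
up; the actual layout/ABI is exactly the kind of thing to hash out at 
the BoF), I'm picturing a small header at a well-known offset in the 
shared region:

#include <linux/types.h>

/* Hypothetical "super-page" header at the base of the switched
 * CXL.mem region, used by the hypervisors to coordinate. */
struct nil_super_page {
        __u64 magic;             /* identifies the shared ABI          */
        __u32 version;           /* ABI revision                       */
        __u32 nr_peers;          /* hypervisors attached to the region */
        __u64 peer_alive_bitmap; /* host up/down heartbeat bits        */
        __u64 owner_id;          /* which host currently owns a VM     */
        __u64 xlat_table_off;    /* region-relative offset of the
                                  * "relative page table translation"  */
};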


> That's... complicated to say the least.
> 1) Pages on the physical hardware do not need to be contiguous
> 2) The CFMW on source and target host do not need to be mapped at the
>     same place
> 3) There's not pre-allocation in these charts, and migration isn't
>     targeted, so having the source-host "expertly place" the data isn't
>     possible (right now, i suppose you could make kernel extensions).
> 4) Similar to problem #2 above, even with a pre-allocate added in, you
>     would need to ensure those mappings were pinned during migration,
>     lest the target host end up swapping a page or something.
> 
> 
> 
> An Option:  Make pages physically contiguous on migration to CXL
> 
> In this case, you don't necessarily care about the Host Virtual
> Addresses, what you actually care about are the structure of the pages
> in memory (are they physically contiguous? or do you need to
> reconstruct the contiguity by inspecting the page tables?).
> 
> If a migration API were capable of reserving large swaths of contiguous
> CXL memory, you could discard individual page information and instead
> send page range information, reconstructing the virtual-physical
> mappings this way.

yeah, good points, but this is all tricky though... it seems this would 
require quiescing the VM, and that is something I would like to avoid if 
possible. I'd like to see the VM still executing while all of its pages 
are migrated onto the CXL NUMA node on the source hypervisor. And I 
would like to see the VM executing on the destination hypervisor while 
migrate_pages() is moving pages off of CXL. Of course, what you are 
describing above would still be a very fast VM migration, but it would 
require quiescing.



> That's about as far as I've thought about it so far.  Feel free to rip
> it apart! :]

Those are all great thoughts and I appreciate you sharing them. I don't 
have all the answers either :)


> ~Gregory
> 


--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-09 17:40 ` Shreyas Shah
@ 2023-04-11  1:08   ` Dragan Stancevic
  2023-04-11  1:17     ` Shreyas Shah
  0 siblings, 1 reply; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-11  1:08 UTC (permalink / raw)
  To: Shreyas Shah, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Hi Shreyas-

On 4/9/23 12:40, Shreyas Shah wrote:
> Hi Dragon,
> 
> The concept is great to time share the CXL attached memory across two NUMA nodes for live migration and create a cluster of VMs to increase the compute capacity.
> 
> When and where is the BoF?

It's a proposal sent in response to the CFP; it has not been accepted 
yet. The agenda is selected by the program committee based on interest. 
The current proposal is for the LSF/MM/BPF summit[1] running May 8 - 
May 10th, but it's not happening if it isn't approved by the committee.


[1]. https://events.linuxfoundation.org/lsfmm/


> 
> 
> Regards,
> Shreyas
>   
> 
> -----Original Message-----
> From: Dragan Stancevic <dragan@stancevic.com>
> Sent: Friday, April 7, 2023 2:06 PM
> To: lsf-pc@lists.linux-foundation.org
> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; linux-mm@kvack.org
> Subject: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
> 
> Hi folks-
> 
> if it's not too late for the schedule...
> 
> I am starting to tackle VM live migration and hypervisor clustering over switched CXL memory[1][2], intended for cloud virtualization types of loads.
> 
> I'd be interested in doing a small BoF session with some slides and get into a discussion/brainstorming with other people that deal with VM/LM cloud loads. Among other things to discuss would be page migrations over switched CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors, etc...
> 
> A few of us discussed some of this under the ZONE_XMEM thread, but I figured it might be better to start a separate thread.
> 
> If there is interested, thank you.
> 
> 
> [1]. High-level overview available at http://nil-migration.org/ [2]. Based on CXL spec 3.0
> 
> --
> Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla

-- 
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  1:08   ` Dragan Stancevic
@ 2023-04-11  1:17     ` Shreyas Shah
  2023-04-11  1:32       ` Dragan Stancevic
  0 siblings, 1 reply; 40+ messages in thread
From: Shreyas Shah @ 2023-04-11  1:17 UTC (permalink / raw)
  To: Dragan Stancevic, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Thank you, Dragan. 

Btw, we can demonstrate the VM live migration with our FPGA-based CXL memory today. We are not application experts, but we looked at the last diagram in your link and we are confident we can achieve it. 

Will there be any interest from the group? I can present for 15 mins plus Q&A.

Regards,
Shreyas



-----Original Message-----
From: Dragan Stancevic <dragan@stancevic.com> 
Sent: Monday, April 10, 2023 6:09 PM
To: Shreyas Shah <shreyas.shah@elastics.cloud>; lsf-pc@lists.linux-foundation.org
Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​

Hi Shreyas-

On 4/9/23 12:40, Shreyas Shah wrote:
> Hi Dragon,
> 
> The concept is great to time share the CXL attached memory across two NUMA nodes for live migration and create a cluster of VMs to increase the compute capacity.
> 
> When and where is the BoF?

It's a proposal sent under (CFP), it has not been accepted yet. The agenda is selected by the program committee based on interest. The current proposal is for LSF/MM/BPF summit[1] running May 8 - May 10th, but it's not happening if not approved by committee.


[1]. https://events.linuxfoundation.org/lsfmm/


> 
> 
> Regards,
> Shreyas
>   
> 
> -----Original Message-----
> From: Dragan Stancevic <dragan@stancevic.com>
> Sent: Friday, April 7, 2023 2:06 PM
> To: lsf-pc@lists.linux-foundation.org
> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; 
> linux-mm@kvack.org
> Subject: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
> 
> Hi folks-
> 
> if it's not too late for the schedule...
> 
> I am starting to tackle VM live migration and hypervisor clustering over switched CXL memory[1][2], intended for cloud virtualization types of loads.
> 
> I'd be interested in doing a small BoF session with some slides and get into a discussion/brainstorming with other people that deal with VM/LM cloud loads. Among other things to discuss would be page migrations over switched CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors, etc...
> 
> A few of us discussed some of this under the ZONE_XMEM thread, but I figured it might be better to start a separate thread.
> 
> If there is interested, thank you.
> 
> 
> [1]. High-level overview available at http://nil-migration.org/ [2]. 
> Based on CXL spec 3.0
> 
> --
> Peace can only come as a natural consequence of universal 
> enlightenment -Dr. Nikola Tesla

--
--
Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  1:17     ` Shreyas Shah
@ 2023-04-11  1:32       ` Dragan Stancevic
  2023-04-11  4:33         ` Shreyas Shah
  0 siblings, 1 reply; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-11  1:32 UTC (permalink / raw)
  To: Shreyas Shah, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Hi Shreyas-

speaking strictly for myself, sounds interesting


On 4/10/23 20:17, Shreyas Shah wrote:
> Thank you, Dragon.
> 
> Btw, we can demonstrate the VM live migration with our FPGA based CXL memory and our demonstration today. We are not application expert, looked at the last diagram in your link and we are confident we can achieve it.
> 
> Will there be any interest from the group? I can present for 15 mins and Q&A.
> 
> Regards,
> Shreyas
> 
> 
> 
> -----Original Message-----
> From: Dragan Stancevic <dragan@stancevic.com>
> Sent: Monday, April 10, 2023 6:09 PM
> To: Shreyas Shah <shreyas.shah@elastics.cloud>; lsf-pc@lists.linux-foundation.org
> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
> 
> Hi Shreyas-
> 
> On 4/9/23 12:40, Shreyas Shah wrote:
>> Hi Dragon,
>>
>> The concept is great to time share the CXL attached memory across two NUMA nodes for live migration and create a cluster of VMs to increase the compute capacity.
>>
>> When and where is the BoF?
> 
> It's a proposal sent under (CFP), it has not been accepted yet. The agenda is selected by the program committee based on interest. The current proposal is for LSF/MM/BPF summit[1] running May 8 - May 10th, but it's not happening if not approved by committee.
> 
> 
> [1]. https://events.linuxfoundation.org/lsfmm/
> 
> 
>>
>>
>> Regards,
>> Shreyas
>>    
>>
>> -----Original Message-----
>> From: Dragan Stancevic <dragan@stancevic.com>
>> Sent: Friday, April 7, 2023 2:06 PM
>> To: lsf-pc@lists.linux-foundation.org
>> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org;
>> linux-mm@kvack.org
>> Subject: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
>>
>> Hi folks-
>>
>> if it's not too late for the schedule...
>>
>> I am starting to tackle VM live migration and hypervisor clustering over switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>
>> I'd be interested in doing a small BoF session with some slides and get into a discussion/brainstorming with other people that deal with VM/LM cloud loads. Among other things to discuss would be page migrations over switched CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors, etc...
>>
>> A few of us discussed some of this under the ZONE_XMEM thread, but I figured it might be better to start a separate thread.
>>
>> If there is interested, thank you.
>>
>>
>> [1]. High-level overview available at http://nil-migration.org/ [2].
>> Based on CXL spec 3.0
>>
>> --
>> Peace can only come as a natural consequence of universal
>> enlightenment -Dr. Nikola Tesla
> 
> --
> --
> Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla
> 

-- 
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  0:56   ` Dragan Stancevic
@ 2023-04-11  1:48     ` Gregory Price
  2023-04-14  3:32       ` Dragan Stancevic
  0 siblings, 1 reply; 40+ messages in thread
From: Gregory Price @ 2023-04-11  1:48 UTC (permalink / raw)
  To: Dragan Stancevic; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

On Mon, Apr 10, 2023 at 07:56:01PM -0500, Dragan Stancevic wrote:
> Hi Gregory-
> 
> On 4/7/23 19:05, Gregory Price wrote:
> > 3. This is changing the semantics of migration from a virtual memory
> >     movement to a physical memory movement.  Typically you would expect
> >     the RDMA process for live migration to work something like...
> > 
> >     a) migration request arrives
> >     b) source host informs destination host of size requirements
> >     c) destination host allocations memory and passes a Virtual Address
> >        back to source host
> >     d) source host initates an RDMA from HostA-VA to HostB-VA
> >     e) CPU task is migrated
> > 
> >     Importantly, the allocation of memory by Host B handles the important
> >     step of creating HVA->HPA mappings, and the Extended/Nested Page
> >     Tables can simply be flushed and re-created after the VM is fully
> >     migrated.
> > 
> >     to long didn't read: live migration is a virtual address operation,
> >     and node-migration is a PHYSICAL address operation, the virtual
> >     addresses remain the same.
> > 
> >     This is problematic, as it's changing the underlying semantics of the
> >     migration operation.
> 
> Those are all valid points, but what if you don't need to recreate HVA->HPA
> mappings? If I am understanding the CXL 3.0 spec correctly, then both
> virtual addresses and physical addresses wouldn't have to change. Because
> the fabric "virtualizes" host physical addresses and the translation is done
> by the G-FAM/GFD that has the capability to translate multi-host HPAs to
> it's internal DPAs. So if you have two hypervisors seeing device physical
> address as the same physical address, that might work?
> 
> 

Hm.  I hadn't considered the device side translation (decoders), though
that's obviously a tool in the toolbox.  You still have to know how to
slide ranges of data (which you mention below).

> 
> > The reference in this case is... the page tables.  You need to know how
> > to interpret the data in the CXL memory region on the remote host, and
> > that's a "relative page table translation" (to coin a phrase? I'm not
> > sure how to best describe it).
> 
> right, coining phrases... I have been thinking of a "super-page" (for the
> lack of a better word) a metadata region sitting on the switched CXL.mem
> device that would allow hypervisors to synchronize on various aspects, such
> as "relative page table translation", host is up, host is down, list of
> peers, who owns what etc... In a perfect scenario, I would love to see the
> hypervisors cooperating on switched CXL.mem device the same way cpus on
> different numa nodes cooperate on memory in a single hypervisor. If either
> host can allocate and schedule from this space then "NIL" aspect of
> migration is "free".
> 
> 

The core of the problem is still that each of the hosts has to agree on
the location (physically) of this region of memory, which could be
problematic unless you have very strong BIOS and/or kernel driver
controls to ensure certain devices are guaranteed to be mapped into
certain spots in the CFMW.

After that it's a matter of treating this memory as incoherent shared
memory and handling ownership in a safe way.  If the memory is only used
for migrations, then you don't have to worry about performance.

So I agree, as long as shared memory mapped into the same CFMW area is
used, this mechanism is totally sound.

My main concern is that I don't know of a mechanism to ensure that.  I
suppose for those interested, and with special BIOS/EFI, you could do
that - but I think that's going to be a tall ask in a heterogeneous cloud
environment.

> > That's... complicated to say the least.
> > 
> > <... snip ...>
> > 
> > An Option:  Make pages physically contiguous on migration to CXL
> > 
> > In this case, you don't necessarily care about the Host Virtual
> > Addresses, what you actually care about are the structure of the pages
> > in memory (are they physically contiguous? or do you need to
> > reconstruct the contiguity by inspecting the page tables?).
> > 
> > If a migration API were capable of reserving large swaths of contiguous
> > CXL memory, you could discard individual page information and instead
> > send page range information, reconstructing the virtual-physical
> > mappings this way.
> 
> yeah, good points, but this is all tricky though... it seems this would
> require quiescing the VM and that is something I would like to avoid if
> possible. I'd like to see the VM still executing while all of it's pages are
> migrated onto CXL NUMA on the source hypervisor. And I would like to see the
> VM executing on the destination hypervisor while migrate_pages is moving
> pages off of CXL. Of course, what you are describing above would still be a
> very fast VM migration, but would require quiescing.
> 
>

Possibly.  If you're going to quiesce you're probably better off just
snapshotting to shared memory and migrating the snapshot.

Maybe that's the better option for a first-pass migration mechanism.  I
don't know.

Anyway, would love to attend this session.

~Gregory

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  1:32       ` Dragan Stancevic
@ 2023-04-11  4:33         ` Shreyas Shah
  2023-04-14  3:26           ` Dragan Stancevic
  0 siblings, 1 reply; 40+ messages in thread
From: Shreyas Shah @ 2023-04-11  4:33 UTC (permalink / raw)
  To: Dragan Stancevic, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Hi Dragan,

Do you all meet on a weekly Zoom or Teams call? 

Maybe next time I can present if you have a slot.

Regards,
Shreyas
-----Original Message-----
From: Dragan Stancevic <dragan@stancevic.com> 
Sent: Monday, April 10, 2023 6:33 PM
To: Shreyas Shah <shreyas.shah@elastics.cloud>; lsf-pc@lists.linux-foundation.org
Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​

Hi Shreyas-

speaking strictly for myself, sounds interesting


On 4/10/23 20:17, Shreyas Shah wrote:
> Thank you, Dragon.
> 
> Btw, we can demonstrate the VM live migration with our FPGA based CXL memory and our demonstration today. We are not application expert, looked at the last diagram in your link and we are confident we can achieve it.
> 
> Will there be any interest from the group? I can present for 15 mins and Q&A.
> 
> Regards,
> Shreyas
> 
> 
> 
> -----Original Message-----
> From: Dragan Stancevic <dragan@stancevic.com>
> Sent: Monday, April 10, 2023 6:09 PM
> To: Shreyas Shah <shreyas.shah@elastics.cloud>; 
> lsf-pc@lists.linux-foundation.org
> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; 
> linux-mm@kvack.org
> Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
> 
> Hi Shreyas-
> 
> On 4/9/23 12:40, Shreyas Shah wrote:
>> Hi Dragon,
>>
>> The concept is great to time share the CXL attached memory across two NUMA nodes for live migration and create a cluster of VMs to increase the compute capacity.
>>
>> When and where is the BoF?
> 
> It's a proposal sent under (CFP), it has not been accepted yet. The agenda is selected by the program committee based on interest. The current proposal is for LSF/MM/BPF summit[1] running May 8 - May 10th, but it's not happening if not approved by committee.
> 
> 
> [1]. https://events.linuxfoundation.org/lsfmm/
> 
> 
>>
>>
>> Regards,
>> Shreyas
>>    
>>
>> -----Original Message-----
>> From: Dragan Stancevic <dragan@stancevic.com>
>> Sent: Friday, April 7, 2023 2:06 PM
>> To: lsf-pc@lists.linux-foundation.org
>> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; 
>> linux-mm@kvack.org
>> Subject: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
>>
>> Hi folks-
>>
>> if it's not too late for the schedule...
>>
>> I am starting to tackle VM live migration and hypervisor clustering over switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>
>> I'd be interested in doing a small BoF session with some slides and get into a discussion/brainstorming with other people that deal with VM/LM cloud loads. Among other things to discuss would be page migrations over switched CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors, etc...
>>
>> A few of us discussed some of this under the ZONE_XMEM thread, but I figured it might be better to start a separate thread.
>>
>> If there is interested, thank you.
>>
>>
>> [1]. High-level overview available at http://nil-migration.org/ [2].
>> Based on CXL spec 3.0
>>
>> --
>> Peace can only come as a natural consequence of universal 
>> enlightenment -Dr. Nikola Tesla
> 
> --
> --
> Peace can only come as a natural consequence of universal 
> enlightenment -Dr. Nikola Tesla
> 

--
--
Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-08  0:05 ` Gregory Price
  2023-04-11  0:56   ` Dragan Stancevic
@ 2023-04-11  6:37   ` Huang, Ying
  2023-04-11 15:36     ` Gregory Price
  2023-04-14  3:33     ` Dragan Stancevic
  1 sibling, 2 replies; 40+ messages in thread
From: Huang, Ying @ 2023-04-11  6:37 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dragan Stancevic, lsf-pc, nil-migration, linux-cxl, linux-mm

Gregory Price <gregory.price@memverge.com> writes:

[snip]

> 2. During the migration process, the memory needs to be forced not to be
>    migrated to another node by other means (tiering software, swap,
>    etc).  The obvious way of doing this would be to migrate and
>    temporarily pin the page... but going back to problem #1 we see that
>    ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>    troublesome.

Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
avoid moving pages out of the CXL.mem node?  Now, there are gaps in
tiering, but I think they are fixable.
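
For example, something like this minimal sketch (node 2 is a
hypothetical CXL.mem node) would restrict both allocation and later
migration of the guest RAM range to the CXL.mem node; the tiering gaps
mentioned above would still need to be fixed separately:

#include <numa.h>        /* link with -lnuma */
#include <numaif.h>      /* mbind()          */

/* Bind [addr, addr + len) to the CXL.mem node (assumed to be node 2). */
static long bind_range_to_cxl(void *addr, unsigned long len)
{
        struct bitmask *nodes = numa_allocate_nodemask();
        long ret;

        numa_bitmask_setbit(nodes, 2);

        /* MPOL_MF_MOVE migrates already-present pages that violate the
         * policy; afterwards the policy keeps them on the nodemask. */
        ret = mbind(addr, len, MPOL_BIND, nodes->maskp, nodes->size + 1,
                    MPOL_MF_MOVE);
        numa_free_nodemask(nodes);
        return ret;
}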

Best Regards,
Huang, Ying

[snip]

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  6:37   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Huang, Ying
@ 2023-04-11 15:36     ` Gregory Price
  2023-04-12  2:54       ` Huang, Ying
  2023-04-14  3:33     ` Dragan Stancevic
  1 sibling, 1 reply; 40+ messages in thread
From: Gregory Price @ 2023-04-11 15:36 UTC (permalink / raw)
  To: Huang, Ying; +Cc: Dragan Stancevic, lsf-pc, nil-migration, linux-cxl, linux-mm

On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> [snip]
> 
> > 2. During the migration process, the memory needs to be forced not to be
> >    migrated to another node by other means (tiering software, swap,
> >    etc).  The obvious way of doing this would be to migrate and
> >    temporarily pin the page... but going back to problem #1 we see that
> >    ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
> >    troublesome.
> 
> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
> but I think it is fixable.
> 
> Best Regards,
> Huang, Ying
> 
> [snip]

That feels like a hack/bodge rather than a proper solution to me.

Maybe this is an affirmative argument for the creation of an EXMEM
zone, specifically to allow page pinning, but with far more stringent
controls - i.e., the zone is excluded from use by general allocations.

The point of ZONE_MOVABLE is to allow general allocation of userland
data into hotpluggable memory regions.

This memory region is not for general use, and wants to allow pinning
and be hotpluggable under very controlled circumstances.  That seems
like a reasonable argument for the creation of EXMEM.

~Gregory

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 21:05 [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dragan Stancevic
                   ` (3 preceding siblings ...)
       [not found] ` <CGME20230410030532epcas2p49eae675396bf81658c1a3401796da1d4@epcas2p4.samsung.com>
@ 2023-04-11 18:00 ` Dave Hansen
  2023-05-01 23:49   ` Dragan Stancevic
  2023-04-11 18:16 ` RAGHU H
  2023-05-09 15:08 ` Dragan Stancevic
  6 siblings, 1 reply; 40+ messages in thread
From: Dave Hansen @ 2023-04-11 18:00 UTC (permalink / raw)
  To: Dragan Stancevic, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

On 4/7/23 14:05, Dragan Stancevic wrote:
> I'd be interested in doing a small BoF session with some slides and get
> into a discussion/brainstorming with other people that deal with VM/LM
> cloud loads. Among other things to discuss would be page migrations over
> switched CXL memory, shared in-memory ABI to allow VM hand-off between
> hypervisors, etc...

How would 'struct page' or other kernel metadata be handled?

I assume you'd want a really big CXL memory device with as many hosts
connected to it as is feasible.  But, in order to hand the memory off
from one host to another, both would need to have metadata for it at
_some_ point.

So, do all hosts have metadata for the whole CXL memory device all the
time?  Or, would they create the metadata (hotplug) when a VM is
migrated in and destroy it (hot unplug) when a VM is migrated out?

That gets back to the granularity question discussed elsewhere in the
thread.  How would the metadata allocation granularity interact with the
page allocation granularity?  How would fragmentation be avoided so that
hosts don't eat up all their RAM with unused metadata?
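
For scale (back-of-the-envelope, assuming 64-byte struct pages and 4 KiB
base pages): the metadata is ~1.6% of device capacity, so a 1 TiB shared
CXL device costs each host that maps all of it roughly 16 GiB of RAM for
'struct page' alone, used or not.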

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 21:05 [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dragan Stancevic
                   ` (4 preceding siblings ...)
  2023-04-11 18:00 ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dave Hansen
@ 2023-04-11 18:16 ` RAGHU H
  2023-05-09 15:08 ` Dragan Stancevic
  6 siblings, 0 replies; 40+ messages in thread
From: RAGHU H @ 2023-04-11 18:16 UTC (permalink / raw)
  To: Dragan Stancevic; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

Hi Dragan

Please add me to the discussion!

Regards
Raghu

On Sat, Apr 8, 2023 at 2:56 AM Dragan Stancevic <dragan@stancevic.com> wrote:
>
> Hi folks-
>
> if it's not too late for the schedule...
>
> I am starting to tackle VM live migration and hypervisor clustering over
> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>
> I'd be interested in doing a small BoF session with some slides and get
> into a discussion/brainstorming with other people that deal with VM/LM
> cloud loads. Among other things to discuss would be page migrations over
> switched CXL memory, shared in-memory ABI to allow VM hand-off between
> hypervisors, etc...
>
> A few of us discussed some of this under the ZONE_XMEM thread, but I
> figured it might be better to start a separate thread.
>
> If there is interested, thank you.
>
>
> [1]. High-level overview available at http://nil-migration.org/
> [2]. Based on CXL spec 3.0
>
> --
> Peace can only come as a natural consequence
> of universal enlightenment -Dr. Nikola Tesla

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11 15:36     ` Gregory Price
@ 2023-04-12  2:54       ` Huang, Ying
  2023-04-12  8:38         ` David Hildenbrand
  0 siblings, 1 reply; 40+ messages in thread
From: Huang, Ying @ 2023-04-12  2:54 UTC (permalink / raw)
  To: Gregory Price
  Cc: Dragan Stancevic, lsf-pc, nil-migration, linux-cxl, linux-mm

Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> [snip]
>> 
>> > 2. During the migration process, the memory needs to be forced not to be
>> >    migrated to another node by other means (tiering software, swap,
>> >    etc).  The obvious way of doing this would be to migrate and
>> >    temporarily pin the page... but going back to problem #1 we see that
>> >    ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>> >    troublesome.
>> 
>> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
>> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
>> but I think it is fixable.
>> 
>> Best Regards,
>> Huang, Ying
>> 
>> [snip]
>
> That feels like a hack/bodge rather than a proper solution to me.
>
> Maybe this is an affirmative argument for the creation of an EXMEM
> zone.

Let's start with requirements.  What are the requirements for a new zone
type?

> Specifically to allow page pinning, but with far more stringent
> controls -

> i.e. the zone is excluded from use via general allocations.

This can also be controlled via memory policy.  The alternative solution
is to add a per node attribute.

> The point of ZONE_MOVABLE is to allow general allocation of userland
> data into hotpluggable memory regions.

IIUC, one typical requirement of CXL.mem is to be hotpluggable, right?

Best Regards,
Huang, Ying

> This memory region is not for general use, and wants to allow pinning
> and be hotpluggable under very controlled circumstances.  That seems
> like a reasonable argument for the creation of EXMEM.
>
> ~Gregory

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-12  2:54       ` Huang, Ying
@ 2023-04-12  8:38         ` David Hildenbrand
       [not found]           ` <CGME20230412111034epcas2p1b46d2a26b7d3ac5db3b0e454255527b0@epcas2p1.samsung.com>
                             ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: David Hildenbrand @ 2023-04-12  8:38 UTC (permalink / raw)
  To: Huang, Ying, Gregory Price
  Cc: Dragan Stancevic, lsf-pc, nil-migration, linux-cxl, linux-mm

On 12.04.23 04:54, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
>> On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
>>> Gregory Price <gregory.price@memverge.com> writes:
>>>
>>> [snip]
>>>
>>>> 2. During the migration process, the memory needs to be forced not to be
>>>>     migrated to another node by other means (tiering software, swap,
>>>>     etc).  The obvious way of doing this would be to migrate and
>>>>     temporarily pin the page... but going back to problem #1 we see that
>>>>     ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>>>>     troublesome.
>>>
>>> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
>>> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
>>> but I think it is fixable.
>>>
>>> Best Regards,
>>> Huang, Ying
>>>
>>> [snip]
>>
>> That feels like a hack/bodge rather than a proper solution to me.
>>
>> Maybe this is an affirmative argument for the creation of an EXMEM
>> zone.
> 
> Let's start with requirements.  What is the requirements for a new zone
> type?

I'm still scratching my head regarding this. I keep hearing all 
different kinds of statements that just add more confusion: "we want it 
to be hotunpluggable", "we want to allow for long-term pinning of 
memory", "but we still want it to be movable", "we want to place some 
unmovable allocations on it". Huh?

Just to clarify: ZONE_MOVABLE allows for pinning. It just doesn't allow 
for long-term pinning of memory.

For good reason, because long-term pinning of memory is just the worst 
(memory waste, fragmentation, overcommit), and instead of finding new 
ways to *avoid* long-term pinnings, we're coming up with advanced 
concepts to work around the fundamental property of long-term pinnings.

We want all memory to be long-term pinnable and we want all memory to be 
movable/hotunpluggable. That's not going to work.

If you'd ask me today, my prediction is that ZONE_EXMEM is not going to 
happen.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
       [not found]           ` <CGME20230412111034epcas2p1b46d2a26b7d3ac5db3b0e454255527b0@epcas2p1.samsung.com>
@ 2023-04-12 11:10             ` Kyungsan Kim
  2023-04-12 11:26               ` David Hildenbrand
  2023-04-12 15:40               ` Matthew Wilcox
  0 siblings, 2 replies; 40+ messages in thread
From: Kyungsan Kim @ 2023-04-12 11:10 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

>> Gregory Price <gregory.price@memverge.com> writes:
>>
>>> On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
>>>> Gregory Price <gregory.price@memverge.com> writes:
>>>>
>>>> [snip]
>>>>
>>>>> 2. During the migration process, the memory needs to be forced not to be
>>>>>     migrated to another node by other means (tiering software, swap,
>>>>>     etc).  The obvious way of doing this would be to migrate and
>>>>>     temporarily pin the page... but going back to problem #1 we see that
>>>>>     ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>>>>>     troublesome.
>>>>
>>>> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
>>>> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
>>>> but I think it is fixable.
>>>>
>>>> Best Regards,
>>>> Huang, Ying
>>>>
>>>> [snip]
>>>
>>> That feels like a hack/bodge rather than a proper solution to me.
>>>
>>> Maybe this is an affirmative argument for the creation of an EXMEM
>>> zone.
>>
>> Let's start with requirements.  What is the requirements for a new zone
>> type?
>
>I'm stills scratching my head regarding this. I keep hearing all
>different kind of statements that just add more confusions "we want it
>to be hotunpluggable" "we want to allow for long-term pinning memory"
>"but we still want it to be movable" "we want to place some unmovable
>allocations on it". Huh?
>
>Just to clarify: ZONE_MOVABLE allows for pinning. It just doesn't allow
>for long-term pinning of memory.
>
>For good reason, because long-term pinning of memory is just the worst
>(memory waste, fragmentation, overcommit) and instead of finding new
>ways to *avoid* long-term pinnings, we're coming up with advanced
>concepts to work-around the fundamental property of long-term pinnings.
>
>We want all memory to be long-term pinnable and we want all memory to be
>movable/hotunpluggable. That's not going to work.

It looks like there is a misunderstanding about the ZONE_EXMEM argument.
Pinning and pluggability are mutually exclusive, so they cannot happen at the same time.
What we argue is that ZONE_EXMEM does not "confine movability"; an allocation context can determine the movability attribute.
Even one unmovable allocation will make the entire CXL DRAM unpluggable. 
When you look at ZONE_EXMEM just from the movable/unmovable aspect, we think it is the same as ZONE_NORMAL,
but ZONE_EXMEM works on an extended memory, as of now CXL DRAM.

As for why ZONE_EXMEM: it considers not only the pluggability aspect, but also a CXL identifier for the user/kernelspace API, 
the abstraction of multiple CXL DRAM channels, and a zone-unit algorithm for CXL HW characteristics.
The last one is only potential at the moment, though.

As mentioned in the ZONE_EXMEM thread, we are preparing slides to explain our experiences and proposals.
It is not the final version yet[1].
[1] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL

>If you'd ask me today, my prediction is that ZONE_EXMEM is not going to
>happen.
>
>--
>Thanks,
>
>David / dhildenb

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
  2023-04-12 11:10             ` FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Kyungsan Kim
@ 2023-04-12 11:26               ` David Hildenbrand
       [not found]                 ` <CGME20230414084110epcas2p20b90a8d1892110d7ca3ac16290cd4686@epcas2p2.samsung.com>
  2023-04-12 15:40               ` Matthew Wilcox
  1 sibling, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2023-04-12 11:26 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On 12.04.23 13:10, Kyungsan Kim wrote:
>>> Gregory Price <gregory.price@memverge.com> writes:
>>>
>>>> On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
>>>>> Gregory Price <gregory.price@memverge.com> writes:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> 2. During the migration process, the memory needs to be forced not to be
>>>>>>      migrated to another node by other means (tiering software, swap,
>>>>>>      etc).  The obvious way of doing this would be to migrate and
>>>>>>      temporarily pin the page... but going back to problem #1 we see that
>>>>>>      ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>>>>>>      troublesome.
>>>>>
>>>>> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
>>>>> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
>>>>> but I think it is fixable.
>>>>>
>>>>> Best Regards,
>>>>> Huang, Ying
>>>>>
>>>>> [snip]
>>>>
>>>> That feels like a hack/bodge rather than a proper solution to me.
>>>>
>>>> Maybe this is an affirmative argument for the creation of an EXMEM
>>>> zone.
>>>
>>> Let's start with requirements.  What is the requirements for a new zone
>>> type?
>>
>> I'm stills scratching my head regarding this. I keep hearing all
>> different kind of statements that just add more confusions "we want it
>> to be hotunpluggable" "we want to allow for long-term pinning memory"
>> "but we still want it to be movable" "we want to place some unmovable
>> allocations on it". Huh?
>>
>> Just to clarify: ZONE_MOVABLE allows for pinning. It just doesn't allow
>> for long-term pinning of memory.
>>
>> For good reason, because long-term pinning of memory is just the worst
>> (memory waste, fragmentation, overcommit) and instead of finding new
>> ways to *avoid* long-term pinnings, we're coming up with advanced
>> concepts to work-around the fundamental property of long-term pinnings.
>>
>> We want all memory to be long-term pinnable and we want all memory to be
>> movable/hotunpluggable. That's not going to work.
> 
> Looks there is misunderstanding about ZONE_EXMEM argument.
> Pinning and plubbability is mutual exclusive so it can not happen at the same time.
> What we argue is ZONE_EXMEM does not "confine movability". an allocation context can determine the movability attribute.
> Even one unmovable allocation will make the entire CXL DRAM unpluggable.
> When you see ZONE_EXMEM just on movable/unmoable aspect, we think it is the same with ZONE_NORMAL,
> but ZONE_EXMEM works on an extended memory, as of now CXL DRAM.
> 
> Then why ZONE_EXMEM is, ZONE_EXMEM considers not only the pluggability aspect, but CXL identifier for user/kenelspace API,
> the abstraction of multiple CXL DRAM channels, and zone unit algorithm for CXL HW characteristics.
> The last one is potential at the moment, though.
> 
> As mentioned in ZONE_EXMEM thread, we are preparing slides to explain experiences and proposals.
> It it not final version now[1].
> [1] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL

Yes, hopefully we can discuss at LSF/MM also the problems we are trying 
to solve instead of focusing on one solution. [did not have the time to 
look at the slides yet, sorry]

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-12  8:38         ` David Hildenbrand
       [not found]           ` <CGME20230412111034epcas2p1b46d2a26b7d3ac5db3b0e454255527b0@epcas2p1.samsung.com>
@ 2023-04-12 15:15           ` James Bottomley
  2023-05-03 23:42             ` Dragan Stancevic
  2023-04-12 15:26           ` Gregory Price
  2 siblings, 1 reply; 40+ messages in thread
From: James Bottomley @ 2023-04-12 15:15 UTC (permalink / raw)
  To: David Hildenbrand, Huang, Ying, Gregory Price
  Cc: Dragan Stancevic, lsf-pc, nil-migration, linux-cxl, linux-mm

On Wed, 2023-04-12 at 10:38 +0200, David Hildenbrand wrote:
> On 12.04.23 04:54, Huang, Ying wrote:
> > Gregory Price <gregory.price@memverge.com> writes:
[...]
> > > That feels like a hack/bodge rather than a proper solution to me.
> > > 
> > > Maybe this is an affirmative argument for the creation of an
> > > EXMEM zone.
> > 
> > Let's start with requirements.  What is the requirements for a new
> > zone type?
> 
> I'm stills scratching my head regarding this. I keep hearing all 
> different kind of statements that just add more confusions "we want
> it to be hotunpluggable" "we want to allow for long-term pinning
> memory" "but we still want it to be movable" "we want to place some
> unmovable allocations on it". Huh?

This is the essential question about CXL memory itself: what would its
killer app be?  The CXL people (or at least the ones I've talked to)
don't exactly know.  Within IBM I've seen lots of ideas but no actual
concrete applications.  Given the rates at which memory density in
systems is increasing, I'm a bit dubious of the extensible system pool
argument.   Providing extensible memory to VMs sounds a bit more
plausible, particularly as it solves a big part of the local overcommit
problem (although you still have a global one).  I'm not really sure I
buy the VM migration use case: iterative transfer works fine with small
down times, so transferring memory seems to be the least of the problems
with VM migration (it's mostly about problems with
attached devices).  CXL 3.0 is adding sharing primitives for memory so
now we have to ask if there are any multi-node shared memory use cases
for this, but most of us have already been burned by multi-node shared
clusters once in our career and are a bit leery of a second go around.

Is there a use case I left out (or needs expanding)?

James


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-12  8:38         ` David Hildenbrand
       [not found]           ` <CGME20230412111034epcas2p1b46d2a26b7d3ac5db3b0e454255527b0@epcas2p1.samsung.com>
  2023-04-12 15:15           ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ James Bottomley
@ 2023-04-12 15:26           ` Gregory Price
  2023-04-12 15:50             ` David Hildenbrand
  2 siblings, 1 reply; 40+ messages in thread
From: Gregory Price @ 2023-04-12 15:26 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Huang, Ying, Dragan Stancevic, lsf-pc, nil-migration, linux-cxl,
	linux-mm

On Wed, Apr 12, 2023 at 10:38:04AM +0200, David Hildenbrand wrote:
> On 12.04.23 04:54, Huang, Ying wrote:
> > Gregory Price <gregory.price@memverge.com> writes:
> > 
> > > On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
> > > > Gregory Price <gregory.price@memverge.com> writes:
> > > > 
> > > > [snip]
> > > > 
> > > > > 2. During the migration process, the memory needs to be forced not to be
> > > > >     migrated to another node by other means (tiering software, swap,
> > > > >     etc).  The obvious way of doing this would be to migrate and
> > > > >     temporarily pin the page... but going back to problem #1 we see that
> > > > >     ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
> > > > >     troublesome.
> > > > 
> > > > Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
> > > > avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
> > > > but I think it is fixable.
> > > > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > > > [snip]
> > > 
> > > That feels like a hack/bodge rather than a proper solution to me.
> > > 
> > > Maybe this is an affirmative argument for the creation of an EXMEM
> > > zone.
> > 
> > Let's start with requirements.  What is the requirements for a new zone
> > type?
> 
> I'm stills scratching my head regarding this. I keep hearing all different
> kind of statements that just add more confusions "we want it to be
> hotunpluggable" "we want to allow for long-term pinning memory" "but we
> still want it to be movable" "we want to place some unmovable allocations on
> it". Huh?
> 
> Just to clarify: ZONE_MOVABLE allows for pinning. It just doesn't allow for
> long-term pinning of memory.
> 

I apologize for the confusion, this is my fault.  I had assumed that
since dax regions can't be pinned, subsequent nodes backed by a dax
device could not be pinned.  In testing, this turned out not to be the case.

Re: long-term pinning, can you be more explicit as to what is considered
long-term?  Minutes? hours? days? etc

If a migration operation is considered short term, then pinning VM
memory during migration deals with this issue cleanly.

So walking back my statement - given my testing, I don't believe there's
a reason for a new zone.

> For good reason, because long-term pinning of memory is just the worst
> (memory waste, fragmentation, overcommit) and instead of finding new ways to
> *avoid* long-term pinnings, we're coming up with advanced concepts to
> work-around the fundamental property of long-term pinnings.
> 
> We want all memory to be long-term pinnable and we want all memory to be
> movable/hotunpluggable. That's not going to work.
> 
> If you'd ask me today, my prediction is that ZONE_EXMEM is not going to
> happen.
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
  2023-04-12 11:10             ` FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Kyungsan Kim
  2023-04-12 11:26               ` David Hildenbrand
@ 2023-04-12 15:40               ` Matthew Wilcox
       [not found]                 ` <CGME20230414084114epcas2p4754d6c0d3c86a0d6d4e855058562100f@epcas2p4.samsung.com>
  1 sibling, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2023-04-12 15:40 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: david, lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

On Wed, Apr 12, 2023 at 08:10:33PM +0900, Kyungsan Kim wrote:
> Pinning and plubbability is mutual exclusive so it can not happen at the same time.
> What we argue is ZONE_EXMEM does not "confine movability". an allocation context can determine the movability attribute.
> Even one unmovable allocation will make the entire CXL DRAM unpluggable. 
> When you see ZONE_EXMEM just on movable/unmoable aspect, we think it is the same with ZONE_NORMAL,
> but ZONE_EXMEM works on an extended memory, as of now CXL DRAM.
> 
> Then why ZONE_EXMEM is, ZONE_EXMEM considers not only the pluggability aspect, but CXL identifier for user/kenelspace API, 
> the abstraction of multiple CXL DRAM channels, and zone unit algorithm for CXL HW characteristics.
> The last one is potential at the moment, though.
> 
> As mentioned in ZONE_EXMEM thread, we are preparing slides to explain experiences and proposals.
> It it not final version now[1].
> [1] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL

The problem is that you're starting out with a solution.  Tell us what
your requirements are, at a really high level, then walk us through
why ZONE_EXMEM is the best way to satisfy those requirements.

Also, those slides are terrible.  Even at 200% zoom, the text is tiny.

There is no MAP_NORMAL argument to mmap(), there are no GFP flags to
sys_mmap() and calling mmap() does not typically cause alloc_page() to
be called.  I'm not sure that putting your thoughts onto slides is
making them any better organised.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-12 15:26           ` Gregory Price
@ 2023-04-12 15:50             ` David Hildenbrand
  2023-04-12 16:34               ` Gregory Price
  0 siblings, 1 reply; 40+ messages in thread
From: David Hildenbrand @ 2023-04-12 15:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: Huang, Ying, Dragan Stancevic, lsf-pc, nil-migration, linux-cxl,
	linux-mm

On 12.04.23 17:26, Gregory Price wrote:
> On Wed, Apr 12, 2023 at 10:38:04AM +0200, David Hildenbrand wrote:
>> On 12.04.23 04:54, Huang, Ying wrote:
>>> Gregory Price <gregory.price@memverge.com> writes:
>>>
>>>> On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
>>>>> Gregory Price <gregory.price@memverge.com> writes:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> 2. During the migration process, the memory needs to be forced not to be
>>>>>>      migrated to another node by other means (tiering software, swap,
>>>>>>      etc).  The obvious way of doing this would be to migrate and
>>>>>>      temporarily pin the page... but going back to problem #1 we see that
>>>>>>      ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>>>>>>      troublesome.
>>>>>
>>>>> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
>>>>> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
>>>>> but I think it is fixable.
>>>>>
>>>>> Best Regards,
>>>>> Huang, Ying
>>>>>
>>>>> [snip]
>>>>
>>>> That feels like a hack/bodge rather than a proper solution to me.
>>>>
>>>> Maybe this is an affirmative argument for the creation of an EXMEM
>>>> zone.
>>>
>>> Let's start with requirements.  What is the requirements for a new zone
>>> type?
>>
>> I'm stills scratching my head regarding this. I keep hearing all different
>> kind of statements that just add more confusions "we want it to be
>> hotunpluggable" "we want to allow for long-term pinning memory" "but we
>> still want it to be movable" "we want to place some unmovable allocations on
>> it". Huh?
>>
>> Just to clarify: ZONE_MOVABLE allows for pinning. It just doesn't allow for
>> long-term pinning of memory.
>>
> 
> I apologize for the confusion, this is my fault.  I had assumed that
> since dax regions can't be pinned, subsequent nodes backed by a dax
> device could not be pinned.  In testing this, this is not the case.
> 
> Re: long-term pinning, can you be more explicit as to what is considered
> long-term?  Minutes? hours? days? etc

long-term: possibly forever, controlled by user space. In practice, 
anything longer than ~10 seconds ( best guess :) ). There can be 
long-term pinnings that are of very short duration, we just don't know 
what user space is up to and when it will decide to unpin.

Assume user space requests to trigger read/write of a user space page to 
a file: the page is pinned, DMA is started, once DMA completes the page 
is unpinned. Short-term. User space does not control how long the page 
remains pinned.

In contrast:

Example #1: mapping VM guest memory into an IOMMU using vfio for PCI 
passthrough requires pinning the pages. Until user space decides to 
unmap the pages from the IOMMU, the pages will remain pinned. -> long-term

Example #2: mapping a user space address range into an IOMMU to 
repeatedly perform RDMA using that address range requires pinning the 
pages. Until user space decides to unregister that range, the pages 
remain pinned. -> long-term

Example #3: registering a user space address range with io_uring as a 
fixed buffer, such that io_uring OPS can avoid the page table walks by 
simply using the pinned pages that were looked up once. As long as the 
fixed buffer remains registered, the pages stay pinned. -> long-term

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-12 15:50             ` David Hildenbrand
@ 2023-04-12 16:34               ` Gregory Price
  2023-04-14  4:16                 ` Dragan Stancevic
  0 siblings, 1 reply; 40+ messages in thread
From: Gregory Price @ 2023-04-12 16:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Huang, Ying, Dragan Stancevic, lsf-pc, nil-migration, linux-cxl,
	linux-mm

On Wed, Apr 12, 2023 at 05:50:55PM +0200, David Hildenbrand wrote:
> 
> long-term: possibly forever, controlled by user space. In practice, anything
> longer than ~10 seconds ( best guess :) ). There can be long-term pinnings
> that are of very short duration, we just don't know what user space is up to
> and when it will decide to unpin.
> 
> Assume user space requests to trigger read/write of a user space page to a
> file: the page is pinned, DMA is started, once DMA completes the page is
> unpinned. Short-term. User space does not control how long the page remains
> pinned.
> 
> In contrast:
> 
> Example #1: mapping VM guest memory into an IOMMU using vfio for PCI
> passthrough requires pinning the pages. Until user space decides to unmap
> the pages from the IOMMU, the pages will remain pinned. -> long-term
> 
> Example #2: mapping a user space address range into an IOMMU to repeatedly
> perform RDMA using that address range requires pinning the pages. Until user
> space decides to unregister that range, the pages remain pinned. ->
> long-term
> 
> Example #3: registering a user space address range with io_uring as a fixed
> buffer, such that io_uring OPS can avoid the page table walks by simply
> using the pinned pages that were looked up once. As long as the fixed buffer
> remains registered, the pages stay pinned. -> long-term
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

That pretty much precludes live migration from using CXL as a transport
mechanism: since live migration would be a user-initiated process, you
would need what amounts to an atomic move between hosts to ensure pages
are not left pinned.

The more I'm reading, the more I'm inclined to think CXL memory should
not allow pinning at all.


I suppose you could implement a new RDMA feature where the remote host's
CXL memory is temporarily mapped, data is migrated, and then that area
is unmapped. Basically the exact same RDMA mechanism, but using memory
instead of network. This would make the operation kernel-controlled
if pin/unpin is required.


Lots to talk about.

~Gregory

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  4:33         ` Shreyas Shah
@ 2023-04-14  3:26           ` Dragan Stancevic
  0 siblings, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-14  3:26 UTC (permalink / raw)
  To: Shreyas Shah, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

On 4/10/23 23:33, Shreyas Shah wrote:
> Hi Dragon,
> 
> Do you all meet on Zoom or Teams weekly call?
> 
> Maybe next time I can present if you have a slot.

Hi Shreyas-

no, sorry, there is no Zoom call cadence at this time.



> Regards,
> Shreyas
> -----Original Message-----
> From: Dragan Stancevic <dragan@stancevic.com>
> Sent: Monday, April 10, 2023 6:33 PM
> To: Shreyas Shah <shreyas.shah@elastics.cloud>; lsf-pc@lists.linux-foundation.org
> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org; linux-mm@kvack.org
> Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
> 
> Hi Shreyas-
> 
> speaking strictly for myself, sounds interesting
> 
> 
> On 4/10/23 20:17, Shreyas Shah wrote:
>> Thank you, Dragon.
>>
>> Btw, we can demonstrate the VM live migration with our FPGA based CXL memory and our demonstration today. We are not application expert, looked at the last diagram in your link and we are confident we can achieve it.
>>
>> Will there be any interest from the group? I can present for 15 mins and Q&A.
>>
>> Regards,
>> Shreyas
>>
>>
>>
>> -----Original Message-----
>> From: Dragan Stancevic <dragan@stancevic.com>
>> Sent: Monday, April 10, 2023 6:09 PM
>> To: Shreyas Shah <shreyas.shah@elastics.cloud>;
>> lsf-pc@lists.linux-foundation.org
>> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org;
>> linux-mm@kvack.org
>> Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
>>
>> Hi Shreyas-
>>
>> On 4/9/23 12:40, Shreyas Shah wrote:
>>> Hi Dragon,
>>>
>>> The concept is great to time share the CXL attached memory across two NUMA nodes for live migration and create a cluster of VMs to increase the compute capacity.
>>>
>>> When and where is the BoF?
>>
>> It's a proposal sent under (CFP), it has not been accepted yet. The agenda is selected by the program committee based on interest. The current proposal is for LSF/MM/BPF summit[1] running May 8 - May 10th, but it's not happening if not approved by committee.
>>
>>
>> [1]. https://events.linuxfoundation.org/lsfmm/
>>
>>
>>>
>>>
>>> Regards,
>>> Shreyas
>>>     
>>>
>>> -----Original Message-----
>>> From: Dragan Stancevic <dragan@stancevic.com>
>>> Sent: Friday, April 7, 2023 2:06 PM
>>> To: lsf-pc@lists.linux-foundation.org
>>> Cc: nil-migration@lists.linux.dev; linux-cxl@vger.kernel.org;
>>> linux-mm@kvack.org
>>> Subject: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
>>>
>>> Hi folks-
>>>
>>> if it's not too late for the schedule...
>>>
>>> I am starting to tackle VM live migration and hypervisor clustering over switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>>
>>> I'd be interested in doing a small BoF session with some slides and get into a discussion/brainstorming with other people that deal with VM/LM cloud loads. Among other things to discuss would be page migrations over switched CXL memory, shared in-memory ABI to allow VM hand-off between hypervisors, etc...
>>>
>>> A few of us discussed some of this under the ZONE_XMEM thread, but I figured it might be better to start a separate thread.
>>>
>>> If there is interested, thank you.
>>>
>>>
>>> [1]. High-level overview available at http://nil-migration.org/ [2].
>>> Based on CXL spec 3.0
>>>
>>> --
>>> Peace can only come as a natural consequence of universal
>>> enlightenment -Dr. Nikola Tesla
>>
>> --
>> --
>> Peace can only come as a natural consequence of universal
>> enlightenment -Dr. Nikola Tesla
>>
> 
> --
> --
> Peace can only come as a natural consequence of universal enlightenment -Dr. Nikola Tesla
> 

-- 
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
  2023-04-10  3:05   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Kyungsan Kim
  2023-04-10 17:46     ` [External] " Viacheslav A.Dubeyko
@ 2023-04-14  3:27     ` Dragan Stancevic
  1 sibling, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-14  3:27 UTC (permalink / raw)
  To: Kyungsan Kim
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee

Hi Kyungsan-


On 4/9/23 22:05, Kyungsan Kim wrote:
>> Hi folks-
>>
>> if it's not too late for the schedule...
>>
>> I am starting to tackle VM live migration and hypervisor clustering over
>> switched CXL memory[1][2], intended for cloud virtualization types of loads.
>>
>> I'd be interested in doing a small BoF session with some slides and get
>> into a discussion/brainstorming with other people that deal with VM/LM
>> cloud loads. Among other things to discuss would be page migrations over
>> switched CXL memory, shared in-memory ABI to allow VM hand-off between
>> hypervisors, etc...
>>
>> A few of us discussed some of this under the ZONE_XMEM thread, but I
>> figured it might be better to start a separate thread.
>>
>> If there is interested, thank you.
> 
> I would like join the discussion as well.
> Let me kindly suggest it would be more great if it includes the data flow of VM/hypervisor as background and kernel interaction expected.

Thank you for the suggestion. Have you had a chance to check out 
http://nil-migration.org/? I have a high-level data flow between 
hypervisors there, both for VM migration and hypervisor clustering. If 
that is not enough, I can definitely throw more things together. Let me 
know, thank you



>>
>>
>> [1]. High-level overview available at http://nil-migration.org/
>> [2]. Based on CXL spec 3.0
>>
>> --
>> Peace can only come as a natural consequence
>> of universal enlightenment -Dr. Nikola Tesla
> 

-- 
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  1:48     ` Gregory Price
@ 2023-04-14  3:32       ` Dragan Stancevic
  2023-04-14 13:16         ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Jonathan Cameron
  0 siblings, 1 reply; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-14  3:32 UTC (permalink / raw)
  To: Gregory Price; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

Hi Gregory-


On 4/10/23 20:48, Gregory Price wrote:
> On Mon, Apr 10, 2023 at 07:56:01PM -0500, Dragan Stancevic wrote:
>> Hi Gregory-
>>
>> On 4/7/23 19:05, Gregory Price wrote:
>>> 3. This is changing the semantics of migration from a virtual memory
>>>      movement to a physical memory movement.  Typically you would expect
>>>      the RDMA process for live migration to work something like...
>>>
>>>      a) migration request arrives
>>>      b) source host informs destination host of size requirements
>>>      c) destination host allocations memory and passes a Virtual Address
>>>         back to source host
>>>      d) source host initates an RDMA from HostA-VA to HostB-VA
>>>      e) CPU task is migrated
>>>
>>>      Importantly, the allocation of memory by Host B handles the important
>>>      step of creating HVA->HPA mappings, and the Extended/Nested Page
>>>      Tables can simply be flushed and re-created after the VM is fully
>>>      migrated.
>>>
>>>      to long didn't read: live migration is a virtual address operation,
>>>      and node-migration is a PHYSICAL address operation, the virtual
>>>      addresses remain the same.
>>>
>>>      This is problematic, as it's changing the underlying semantics of the
>>>      migration operation.
>>
>> Those are all valid points, but what if you don't need to recreate HVA->HPA
>> mappings? If I am understanding the CXL 3.0 spec correctly, then both
>> virtual addresses and physical addresses wouldn't have to change. Because
>> the fabric "virtualizes" host physical addresses and the translation is done
>> by the G-FAM/GFD that has the capability to translate multi-host HPAs to
>> it's internal DPAs. So if you have two hypervisors seeing device physical
>> address as the same physical address, that might work?
>>
>>
> 
> Hm.  I hadn't considered the device side translation (decoders), though
> that's obviously a tool in the toolbox.  You still have to know how to
> slide ranges of data (which you mention below).

Hmm, do you have any quick thoughts on that?


>>> The reference in this case is... the page tables.  You need to know how
>>> to interpret the data in the CXL memory region on the remote host, and
>>> that's a "relative page table translation" (to coin a phrase? I'm not
>>> sure how to best describe it).
>>
>> right, coining phrases... I have been thinking of a "super-page" (for the
>> lack of a better word) a metadata region sitting on the switched CXL.mem
>> device that would allow hypervisors to synchronize on various aspects, such
>> as "relative page table translation", host is up, host is down, list of
>> peers, who owns what etc... In a perfect scenario, I would love to see the
>> hypervisors cooperating on switched CXL.mem device the same way cpus on
>> different numa nodes cooperate on memory in a single hypervisor. If either
>> host can allocate and schedule from this space then "NIL" aspect of
>> migration is "free".
>>
>>
> 
> The core of the problem is still that each of the hosts has to agree on
> the location (physically) of this region of memory, which could be
> problematic unless you have very strong BIOS and/or kernel driver
> controls to ensure certain devices are guaranteed to be mapped into
> certain spots in the CFMW.

Right, true. The way I am thinking of it, this would be part of the 
data-center ops setup, which at first pass would be a somewhat manual 
setup, the same way as other pre-OS related setup. But later on down 
the road perhaps this could be automated, either through some pre-agreed 
auto-range detection or similar; it's not unusual for DC ops to name 
hypervisors depending on where in the dc/rack/etc. they sit.



> After that it's a matter of treating this memory as incoherent shared
> memory and handling ownership in a safe way.  If the memory is only used
> for migrations, then you don't have to worry about performance.
> 
> So I agree, as long as shared memory mapped into the same CFMW area is
> used, this mechanism is totally sound.
> 
> My main concerns are that I don't know of a mechanism to ensure that.  I
> suppose for those interested, and with special BIOS/EFI, you could do
> that - but I think that's going to be a tall ask in a heterogenous cloud
> environment.

Yeah, I get that. But in my experience even heterogeneous setups have 
some level of homogeneity, whether it's per rack or per pod. As old 
things are sunset and new things are brought in, it gives you these 
segments of homogeneity with more or less advanced features. So at the 
end of the day, if someone wants a feature X they will need to 
understand the feature requirements or limitations. I feel like I deal 
with hardware/feature fragmentation all the time, but that doesn't 
preclude bringing newer things in. You just have to plant it appropriately.


>>> That's... complicated to say the least.
>>>
>>> <... snip ...>
>>>
>>> An Option:  Make pages physically contiguous on migration to CXL
>>>
>>> In this case, you don't necessarily care about the Host Virtual
>>> Addresses, what you actually care about are the structure of the pages
>>> in memory (are they physically contiguous? or do you need to
>>> reconstruct the contiguity by inspecting the page tables?).
>>>
>>> If a migration API were capable of reserving large swaths of contiguous
>>> CXL memory, you could discard individual page information and instead
>>> send page range information, reconstructing the virtual-physical
>>> mappings this way.
>>
>> yeah, good points, but this is all tricky though... it seems this would
>> require quiescing the VM and that is something I would like to avoid if
possible. I'd like to see the VM still executing while all of its pages are
>> migrated onto CXL NUMA on the source hypervisor. And I would like to see the
>> VM executing on the destination hypervisor while migrate_pages is moving
>> pages off of CXL. Of course, what you are describing above would still be a
>> very fast VM migration, but would require quiescing.
>>
>>
> 
> Possibly.  If you're going to quiesce you're probably better off just
> snapshotting to shared memory and migrating the snapshot.

That is exactly my thought too.

> Maybe that's the better option for a first-pass migration mechanism.  I
> don't know.

I definitely see your point about a "canning" and "re-hydration" approach 
as a first pass. I'd be happy with even just a "Hello World" page 
migration as a first pass :)



> 
> Anyway, would love to attend this session.
> 
> ~Gregory
> 

-- 
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11  6:37   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Huang, Ying
  2023-04-11 15:36     ` Gregory Price
@ 2023-04-14  3:33     ` Dragan Stancevic
  2023-04-14  5:35       ` Huang, Ying
  1 sibling, 1 reply; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-14  3:33 UTC (permalink / raw)
  To: Huang, Ying, Gregory Price; +Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

On 4/11/23 01:37, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> [snip]
> 
>> 2. During the migration process, the memory needs to be forced not to be
>>     migrated to another node by other means (tiering software, swap,
>>     etc).  The obvious way of doing this would be to migrate and
>>     temporarily pin the page... but going back to problem #1 we see that
>>     ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>>     troublesome.
> 
> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
> but I think it is fixable.


Hmmm, I don't know about cpusets. For mbind, are you thinking something 
along the lines of MPOL_MF_MOVE_ALL? I guess it does have that 
deterministic placement, but this would have to be called from the 
process itself, unlike migrate_pages(), which takes a pid.

Same for set_mempolicy, right?

I mean I guess, if some of this needs to be added into qemu it's not the 
end of the world...
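
Just to make the by-pid path concrete for myself, I'm picturing roughly 
the following from a management process, via libnuma (untested sketch; 
node numbers are made up, with the CXL.mem region assumed to be node 2):

/* Untested sketch: move all of a QEMU process's pages to the CXL node
 * from outside the process, no qemu changes needed. Link with -lnuma. */
#include <sys/types.h>
#include <stdio.h>
#include <numa.h>

int move_vm_to_cxl(pid_t qemu_pid)
{
	struct bitmask *from = numa_parse_nodestring("0-1"); /* local DRAM */
	struct bitmask *to   = numa_parse_nodestring("2");   /* CXL.mem */

	/* thin wrapper around migrate_pages(2), which takes a pid */
	int ret = numa_migrate_pages(qemu_pid, from, to);
	if (ret < 0)
		perror("numa_migrate_pages");

	numa_free_nodemask(from);
	numa_free_nodemask(to);
	return ret;
}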


> Best Regards,
> Huang, Ying
> 
> [snip]
> 

-- 
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-12 16:34               ` Gregory Price
@ 2023-04-14  4:16                 ` Dragan Stancevic
  0 siblings, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-04-14  4:16 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand
  Cc: Huang, Ying, lsf-pc, nil-migration, linux-cxl, linux-mm

Hi Gregory-


On 4/12/23 11:34, Gregory Price wrote:
> On Wed, Apr 12, 2023 at 05:50:55PM +0200, David Hildenbrand wrote:
>>
>> long-term: possibly forever, controlled by user space. In practice, anything
>> longer than ~10 seconds ( best guess :) ). There can be long-term pinnings
>> that are of very short duration, we just don't know what user space is up to
>> and when it will decide to unpin.
>>
>> Assume user space requests to trigger read/write of a user space page to a
>> file: the page is pinned, DMA is started, once DMA completes the page is
>> unpinned. Short-term. User space does not control how long the page remains
>> pinned.
>>
>> In contrast:
>>
>> Example #1: mapping VM guest memory into an IOMMU using vfio for PCI
>> passthrough requires pinning the pages. Until user space decides to unmap
>> the pages from the IOMMU, the pages will remain pinned. -> long-term
>>
>> Example #2: mapping a user space address range into an IOMMU to repeatedly
>> perform RDMA using that address range requires pinning the pages. Until user
>> space decides to unregister that range, the pages remain pinned. ->
>> long-term
>>
>> Example #3: registering a user space address range with io_uring as a fixed
>> buffer, such that io_uring OPS can avoid the page table walks by simply
>> using the pinned pages that were looked up once. As long as the fixed buffer
>> remains registered, the pages stay pinned. -> long-term
>>
>> -- 
>> Thanks,
>>
>> David / dhildenb
>>
> 
> That pretty much precludes live migration from using CXL as a transport
> mechanism, since live migration would be a user-initiated process, you
> would need what amounts to an atomic move between hosts to ensure pages
> are not left pinned.

Do you really need an atomic move between hosts? I mean, it's not really a 
failure if you are in the process of migrating pages onto the switched 
CXL memory and one of the pages is pulled out of CXL and back onto 
the hypervisor. The running VM's CPUs can do loads and stores from either, 
so it keeps running; it's not affected. It's just that your migration is 
potentially "stalled" or "canceled". You only encounter issues when all 
your pages are on CXL and the other hypervisor is pulling pages out.


> The more i'm reading the more i'm somewhat convinced CXL memory should
> not allow pinning at all.

I think you want to be able to somehow pin the pages on one hypervisor 
and unpin them on the other hypervisor. Or in some other way "pass 
ownership" between the hypervisors. Right? Because of the scenario I 
mention above, if your source hypervisor takes a page out of CXL, then 
your destination hypervisor has a hole in the VM's address space and 
can't run it.
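
Purely hand-waving at this point, but the kind of per-VM handoff record 
I imagine living in that shared metadata region would be something like 
the below. Everything here is invented for illustration, and it assumes 
atomics on the shared region behave sanely across hosts:

/* Conceptual sketch only: ownership record in the shared CXL
 * metadata region ("super-page"). Layout/coherence rules TBD. */
#include <stdint.h>
#include <stdatomic.h>

struct vm_handoff {
	uint64_t vm_id;
	_Atomic uint32_t owner;     /* hypervisor id allowed to run the VM */
	uint32_t state;             /* e.g. RUNNING, MIGRATING, PARKED */
	uint64_t extent_seq[16];    /* shared extents backing guest RAM */
};

/* Source hands the VM to 'dst' only if it still owns it. */
static int handoff_vm(struct vm_handoff *h, uint32_t src, uint32_t dst)
{
	uint32_t expected = src;
	return atomic_compare_exchange_strong(&h->owner, &expected, dst);
}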

> I suppose you could implement a new RDMA feature where the remote host's
> CXL memory is temporarily mapped, data is migrated, and then that area
> is unmapped. Basically the exact same RDMA mechanism, but using memory
> instead of network. This would make the operation a kernel-controlled
> if pin/unpin is required.

That would move us from the shared memory in the CXL 3 spec into the 
sections on direct memory placement I think. Which in order of 
preference is a #2 for me personally and a "backup" plan if #1 shared 
memory doesn't pan out.




> Lots to talk about.
> 
> ~Gregory
> 

-- 
--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-14  3:33     ` Dragan Stancevic
@ 2023-04-14  5:35       ` Huang, Ying
  0 siblings, 0 replies; 40+ messages in thread
From: Huang, Ying @ 2023-04-14  5:35 UTC (permalink / raw)
  To: Dragan Stancevic
  Cc: Gregory Price, lsf-pc, nil-migration, linux-cxl, linux-mm

Dragan Stancevic <dragan@stancevic.com> writes:

> On 4/11/23 01:37, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> [snip]
>> 
>>> 2. During the migration process, the memory needs to be forced not to be
>>>     migrated to another node by other means (tiering software, swap,
>>>     etc).  The obvious way of doing this would be to migrate and
>>>     temporarily pin the page... but going back to problem #1 we see that
>>>     ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>>>     troublesome.
>> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.)
>> to
>> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
>> but I think it is fixable.
>
>
> Hmmm, I don't know about cpusets. For mbind, are you thinking
> something along the lines of MPOL_MF_MOVE_ALL? I guess it does have
> that deterministic placement, but this would have to be called from
> the process itself. Unlike migrate_pages which takes a pid.

You can still use migrate_pages(2).  But after that, if you want to
prevent the pages from being migrated out of CXL.mem, you can use some kind
of memory policy, such as cpusets, mbind(), set_mempolicy().
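
For example, inside QEMU (or a helper with access to the guest RAM
mapping), something like the following could be applied after
migrate_pages(2) has done the placement.  This is an illustrative,
untested sketch: the node number is made up, MPOL_MF_MOVE_ALL needs
CAP_SYS_NICE, and as said above the tiering gaps still need fixing.

/* Bind an already-populated guest RAM range to the CXL node (assumed
 * to be node 2) so policy-respecting paths keep it there.
 * mbind() is declared in <numaif.h>; link with -lnuma. */
#include <numaif.h>
#include <stdio.h>

static int bind_range_to_cxl(void *guest_ram, unsigned long len)
{
	unsigned long nodemask = 1UL << 2;	/* node 2 = CXL.mem */

	if (mbind(guest_ram, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_MOVE_ALL) != 0) {
		perror("mbind");
		return -1;
	}
	return 0;
}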

Best Regards,
Huang, Ying

> Same for set_mempolicy, right?
>
> I mean I guess, if some of this needs to be added into qemu it's not
> the end of the word...
>
>
>> Best Regards,
>> Huang, Ying
>> [snip]
>> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
       [not found]                 ` <CGME20230414084110epcas2p20b90a8d1892110d7ca3ac16290cd4686@epcas2p2.samsung.com>
@ 2023-04-14  8:41                   ` Kyungsan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Kyungsan Kim @ 2023-04-14  8:41 UTC (permalink / raw)
  To: david
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee,
	hj96.nam

>On 12.04.23 13:10, Kyungsan Kim wrote:
>>>> Gregory Price <gregory.price@memverge.com> writes:
>>>>
>>>>> On Tue, Apr 11, 2023 at 02:37:50PM +0800, Huang, Ying wrote:
>>>>>> Gregory Price <gregory.price@memverge.com> writes:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>> 2. During the migration process, the memory needs to be forced not to be
>>>>>>>      migrated to another node by other means (tiering software, swap,
>>>>>>>      etc).  The obvious way of doing this would be to migrate and
>>>>>>>      temporarily pin the page... but going back to problem #1 we see that
>>>>>>>      ZONE_MOVABLE and Pinning are mutually exclusive.  So that's
>>>>>>>      troublesome.
>>>>>>
>>>>>> Can we use memory policy (cpusets, mbind(), set_mempolicy(), etc.) to
>>>>>> avoid move pages out of CXL.mem node?  Now, there are gaps in tiering,
>>>>>> but I think it is fixable.
>>>>>>
>>>>>> Best Regards,
>>>>>> Huang, Ying
>>>>>>
>>>>>> [snip]
>>>>>
>>>>> That feels like a hack/bodge rather than a proper solution to me.
>>>>>
>>>>> Maybe this is an affirmative argument for the creation of an EXMEM
>>>>> zone.
>>>>
>>>> Let's start with requirements.  What is the requirements for a new zone
>>>> type?
>>>
>>> I'm stills scratching my head regarding this. I keep hearing all
>>> different kind of statements that just add more confusions "we want it
>>> to be hotunpluggable" "we want to allow for long-term pinning memory"
>>> "but we still want it to be movable" "we want to place some unmovable
>>> allocations on it". Huh?
>>>
>>> Just to clarify: ZONE_MOVABLE allows for pinning. It just doesn't allow
>>> for long-term pinning of memory.
>>>
>>> For good reason, because long-term pinning of memory is just the worst
>>> (memory waste, fragmentation, overcommit) and instead of finding new
>>> ways to *avoid* long-term pinnings, we're coming up with advanced
>>> concepts to work-around the fundamental property of long-term pinnings.
>>>
>>> We want all memory to be long-term pinnable and we want all memory to be
>>> movable/hotunpluggable. That's not going to work.
>> 
>> Looks there is misunderstanding about ZONE_EXMEM argument.
>> Pinning and plubbability is mutual exclusive so it can not happen at the same time.
>> What we argue is ZONE_EXMEM does not "confine movability". an allocation context can determine the movability attribute.
>> Even one unmovable allocation will make the entire CXL DRAM unpluggable.
>> When you see ZONE_EXMEM just on movable/unmoable aspect, we think it is the same with ZONE_NORMAL,
>> but ZONE_EXMEM works on an extended memory, as of now CXL DRAM.
>> 
>> Then why ZONE_EXMEM is, ZONE_EXMEM considers not only the pluggability aspect, but CXL identifier for user/kenelspace API,
>> the abstraction of multiple CXL DRAM channels, and zone unit algorithm for CXL HW characteristics.
>> The last one is potential at the moment, though.
>> 
>> As mentioned in ZONE_EXMEM thread, we are preparing slides to explain experiences and proposals.
>> It it not final version now[1].
>> [1] https://protect2.fireeye.com/v1/url?k=265f4f76-47d45a59-265ec439-74fe485cbfe7-1e8ec1d2f0c2fd0a&q=1&e=727e97be-fc78-4fa6-990b-a86c256978d1&u=https%3A%2F%2Fgithub.com%2FOpenMPDK%2FSMDK%2Fwiki%2F93.-%255BLSF-MM-BPF-TOPIC%255D-SMDK-inspired-MM-changes-for-CXL
>
>Yes, hopefully we can discuss at LSF/MM also the problems we are trying 
>to solve instead of focusing on one solution. [did not have the time to 
>look at the slides yet, sorry]

For sure. Our purpose at LSF/MM this year is weighted toward sharing the experiences and issues we have had as a CXL provider over the last couple of years.
We don't think our solution is the only way; we are just proposing it.
Hopefully, we can gradually figure out the best way with the experts here.

>
>-- 
>Thanks,
>
>David / dhildenb

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: RE: FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
       [not found]                 ` <CGME20230414084114epcas2p4754d6c0d3c86a0d6d4e855058562100f@epcas2p4.samsung.com>
@ 2023-04-14  8:41                   ` Kyungsan Kim
  0 siblings, 0 replies; 40+ messages in thread
From: Kyungsan Kim @ 2023-04-14  8:41 UTC (permalink / raw)
  To: willy
  Cc: lsf-pc, linux-mm, linux-fsdevel, linux-cxl, a.manzanares,
	viacheslav.dubeyko, dan.j.williams, seungjun.ha, wj28.lee,
	hj96.nam

>On Wed, Apr 12, 2023 at 08:10:33PM +0900, Kyungsan Kim wrote:
>> Pinning and plubbability is mutual exclusive so it can not happen at the same time.
>> What we argue is ZONE_EXMEM does not "confine movability". an allocation context can determine the movability attribute.
>> Even one unmovable allocation will make the entire CXL DRAM unpluggable.
>> When you see ZONE_EXMEM just on movable/unmoable aspect, we think it is the same with ZONE_NORMAL,
>> but ZONE_EXMEM works on an extended memory, as of now CXL DRAM.
>>
>> Then why ZONE_EXMEM is, ZONE_EXMEM considers not only the pluggability aspect, but CXL identifier for user/kenelspace API,
>> the abstraction of multiple CXL DRAM channels, and zone unit algorithm for CXL HW characteristics.
>> The last one is potential at the moment, though.
>>
>> As mentioned in ZONE_EXMEM thread, we are preparing slides to explain experiences and proposals.
>> It it not final version now[1].
>> [1] https://github.com/OpenMPDK/SMDK/wiki/93.-%5BLSF-MM-BPF-TOPIC%5D-SMDK-inspired-MM-changes-for-CXL
>
>The problem is that you're starting out with a solution.  Tell us what
>your requirements are, at a really high level, then walk us through
>why ZONE_EXMEM is the best way to satisfy those requirements.

Thank you for your advice. It makes sense.
We will restate the requirements (use cases and issues) rather than leading with our solution.
A shared understanding of the requirements should come first.
Hopefully we can gradually reach a consensus.

>Also, those slides are terrible.  Even at 200% zoom, the text is tiny.
>
>There is no MAP_NORMAL argument to mmap(), there are no GFP flags to
>sys_mmap() and calling mmap() does not typically cause alloc_page() to
>be called.  I'm not sure that putting your thoughts onto slides is
>making them any better organised.

I'm sorry for the inconvenience. To explain the version of the document: the first slide shows the SMDK kernel, not the vanilla kernel.
In particular, the slide is geared to highlight the flow of the new user/kernel API to implicitly/explicitly access DIMM DRAM or CXL DRAM,
to help understanding in the context of the previous discussion.
We added MAP_NORMAL/MAP_EXMEM to mmap()/sys_mmap(), and GFP_EXMEM/GFP_NORMAL to alloc_pages().
If you mean COW, please assume that mmap() is called with the MAP_POPULATE flag. We wanted to keep the drawing simple to highlight the purpose.
The document is not the final version; we will apply your comments while preparing it.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
  2023-04-14  3:32       ` Dragan Stancevic
@ 2023-04-14 13:16         ` Jonathan Cameron
  0 siblings, 0 replies; 40+ messages in thread
From: Jonathan Cameron @ 2023-04-14 13:16 UTC (permalink / raw)
  To: Dragan Stancevic
  Cc: Gregory Price, lsf-pc, nil-migration, linux-cxl, linux-mm

On Thu, 13 Apr 2023 22:32:48 -0500
Dragan Stancevic <dragan@stancevic.com> wrote:

> Hi Gregory-
> 
> 
> On 4/10/23 20:48, Gregory Price wrote:
> > On Mon, Apr 10, 2023 at 07:56:01PM -0500, Dragan Stancevic wrote:  
> >> Hi Gregory-
> >>
> >> On 4/7/23 19:05, Gregory Price wrote:  
> >>> 3. This is changing the semantics of migration from a virtual memory
> >>>      movement to a physical memory movement.  Typically you would expect
> >>>      the RDMA process for live migration to work something like...
> >>>
> >>>      a) migration request arrives
> >>>      b) source host informs destination host of size requirements
> >>>      c) destination host allocations memory and passes a Virtual Address
> >>>         back to source host
> >>>      d) source host initates an RDMA from HostA-VA to HostB-VA
> >>>      e) CPU task is migrated
> >>>
> >>>      Importantly, the allocation of memory by Host B handles the important
> >>>      step of creating HVA->HPA mappings, and the Extended/Nested Page
> >>>      Tables can simply be flushed and re-created after the VM is fully
> >>>      migrated.
> >>>
> >>>      to long didn't read: live migration is a virtual address operation,
> >>>      and node-migration is a PHYSICAL address operation, the virtual
> >>>      addresses remain the same.
> >>>
> >>>      This is problematic, as it's changing the underlying semantics of the
> >>>      migration operation.  
> >>
> >> Those are all valid points, but what if you don't need to recreate HVA->HPA
> >> mappings? If I am understanding the CXL 3.0 spec correctly, then both
> >> virtual addresses and physical addresses wouldn't have to change.

That's implementation defined if we are talking DCD for this.  I would suggest making
it very clear which particular CXL options you are thinking of using.

A CXL 2.0 approach of binding LDs to different switch vPPBs (virtual ports) probably doesn't
have this problem, but it has its own limitations and is a much heavier-weight thing
to handle.

For DCD, if we assume sharing is used (I'd suggest ignoring other possibilities
for now as there are architectural gaps that I'm not going into, and the same
issues will occur with them anyway)... 
then what you get, if you share via multiple LDs presented to multiple hosts, is
a set of extents (each is a base + size; any number, any size) that have sequence
numbers.

The device may, typically because of fragmentation of the DPA space exposed to
an LD (typically one of those from a device per host), decide to map what was created
as a particular DPA extent pattern (mapped via nice linear decoders into host PA space)
in a different order and with different-sized extents.

So in general you can't assume a spec-compliant CXL type 3 device (probably a multi-head
device in initial deployments) will map anything to any particular location when moving
the memory between hosts.

So ultimately you'd need to translate between:
Page tables on source + DPA extents info.

and

Page tables needed on the destination to land the parts of the DPA extents (via HDM decoders
applying offsets etc.) in the right place in GPA space so the guest gets the right
mapping.

So that will have some complexity and cost associated with it.  Not impossible but
not a simple reuse of tables from source on the destination. 

This is all PA to GPA translation though, and in many cases I'd not expect that
to be particularly dynamic - so it's a step before you do any actual migration,
hence I'm not sure it matters that it might take a bit of maths.
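
To make the maths concrete (purely illustrative, invented structures; the
real lookup comes from the extent list plus the HDM decoder programming on
each host): the same byte of shared capacity is identified by
(extent tag, offset) rather than by a host physical address, and the
destination rebuilds its GPA->HPA mappings from that.

#include <stdint.h>

struct extent_view {
	uint64_t seq;		/* device-assigned extent sequence/tag */
	uint64_t hpa_base;	/* where this host's decoders land it */
	uint64_t len;
};

/* Re-derive the destination-host HPA for a byte the source host knows
 * by its own HPA, given both hosts' views of the same shared extent. */
static uint64_t dst_hpa_for(uint64_t src_hpa,
			    const struct extent_view *src,
			    const struct extent_view *dst)
{
	uint64_t off = src_hpa - src->hpa_base;	/* offset within extent */
	return dst->hpa_base + off;		/* then rebuild GPA->HPA PTEs */
}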


> Because
> >> the fabric "virtualizes" host physical addresses and the translation is done
> >> by the G-FAM/GFD that has the capability to translate multi-host HPAs to
> >> it's internal DPAs. So if you have two hypervisors seeing device physical
> >> address as the same physical address, that might work?
> >>
> >>  
> > 
> > Hm.  I hadn't considered the device side translation (decoders), though
> > that's obviously a tool in the toolbox.  You still have to know how to
> > slide ranges of data (which you mention below).  
> 
> Hmm, do you have any quick thoughts on that?

HDM decoder programming is hard to do in a dynamic fashion (lots of limitations
on what you can do due to ordering restrictions in the spec). I'd ignore it
for this usecase beyond the fact that you get linear offsets from DPA to HPA
that need to be incorporated in your thinking.

> 
> 
> >>> The reference in this case is... the page tables.  You need to know how
> >>> to interpret the data in the CXL memory region on the remote host, and
> >>> that's a "relative page table translation" (to coin a phrase? I'm not
> >>> sure how to best describe it).  
> >>
> >> right, coining phrases... I have been thinking of a "super-page" (for the
> >> lack of a better word) a metadata region sitting on the switched CXL.mem
> >> device that would allow hypervisors to synchronize on various aspects, such
> >> as "relative page table translation", host is up, host is down, list of
> >> peers, who owns what etc... In a perfect scenario, I would love to see the
> >> hypervisors cooperating on switched CXL.mem device the same way cpus on
> >> different numa nodes cooperate on memory in a single hypervisor. If either
> >> host can allocate and schedule from this space then "NIL" aspect of
> >> migration is "free".
> >>
> >>  
> > 
> > The core of the problem is still that each of the hosts has to agree on
> > the location (physically) of this region of memory, which could be
> > problematic unless you have very strong BIOS and/or kernel driver
> > controls to ensure certain devices are guaranteed to be mapped into
> > certain spots in the CFMW.  
> 
> Right, true. The way I am thinking of it is that this would be a part of 
> data-center ops setup which at first pass would be a somewhat of a 
> manual setup same way as other pre-OS related setup. But later on down 
> the road perhaps this could be automated, either through some pre-agreed 
> auto-ranges detection or similar, it's not unusual for dc ops to name 
> hypervisors depending of where in dc/rack/etc they sit etc..
> 

You might be able to constrain particular devices to play nicely with such
a model, but that is outside the scope of the specification, and I'd suggest
that in Linux at least we write the code to deal with the general case, then
maybe have a 'fast path' if the stars align.
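
Reusing the made-up extent structure from the sketch earlier in this mail, the
'stars align' check could be as simple as the following (again illustrative
only, and it assumes both hosts even see the same number of extents):

#include <stdbool.h>

/* Fast path: if both hosts see the shared capacity as an identical extent
 * list behind the same linear decoder offset, the source's PA layout can
 * be reused as-is; otherwise fall back to the general translation. */
static bool extents_identical(const struct dcd_extent *src, uint64_t src_off,
			      const struct dcd_extent *dst, uint64_t dst_off,
			      size_t n)
{
	size_t i;

	if (src_off != dst_off)
		return false;
	for (i = 0; i < n; i++) {
		if (src[i].dpa_base != dst[i].dpa_base ||
		    src[i].len != dst[i].len ||
		    src[i].seq != dst[i].seq)
			return false;
	}
	return true;
}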

Jonathan


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-11 18:00 ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dave Hansen
@ 2023-05-01 23:49   ` Dragan Stancevic
  0 siblings, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-05-01 23:49 UTC (permalink / raw)
  To: Dave Hansen, lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Hi Dave-

sorry, it looks like I missed your email

On 4/11/23 13:00, Dave Hansen wrote:
> On 4/7/23 14:05, Dragan Stancevic wrote:
>> I'd be interested in doing a small BoF session with some slides and get
>> into a discussion/brainstorming with other people that deal with VM/LM
>> cloud loads. Among other things to discuss would be page migrations over
>> switched CXL memory, shared in-memory ABI to allow VM hand-off between
>> hypervisors, etc...
> 
> How would 'struct page' or other kernel metadata be handled?
> 
> I assume you'd want a really big CXL memory device with as many hosts
> connected to it as is feasible.  But, in order to hand the memory off
> from one host to another, both would need to have metadata for it at
> _some_ point.

To be honest, I have not been thinking of this in terms of a "star" 
connection topology where, say, each host in a rack connects to the same 
memory device; I think I'd get bottlenecked on a single device. Evacuating 
a few hypervisors simultaneously might get a bit dicey.

I've been thinking of it more in terms of multiple memory devices per 
rack, connected to various hypervisors to form a hypervisor traversal 
graph[1]. For example, in this graph a VM would migrate across a single 
hop, or a few hops, to reach its destination hypervisor. And for lack of 
a better word, this would be your "migration namespace" for migrating 
the VM across the rack. The critical connections in the graph are 
hostfoo04 and hostfoo09; those are the ones you'd use if you want to pop 
the VM into a different "migration namespace", for example a different 
rack or maybe even a pod.
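
To make the traversal idea a bit more concrete, here is a toy sketch; the 
host names and topology are completely made up, and it glosses over 
everything that actually matters, like which device backs each edge and 
how much free capacity it has:

/* Toy breadth-first search over a made-up hypervisor graph. */
#include <stdio.h>
#include <string.h>

#define NHOSTS 6

static const char *host[NHOSTS] = {
	"hostfoo01", "hostfoo02", "hostfoo03",
	"hostfoo04", "hostfoo05", "hostfoo09",
};

/* adj[i][j] != 0: hosts i and j share at least one switched CXL.mem device */
static const int adj[NHOSTS][NHOSTS] = {
	{ 0, 1, 0, 1, 0, 0 },
	{ 1, 0, 1, 0, 0, 0 },
	{ 0, 1, 0, 0, 1, 0 },
	{ 1, 0, 0, 0, 1, 1 },	/* hostfoo04 also bridges to the next rack */
	{ 0, 0, 1, 1, 0, 0 },
	{ 0, 0, 0, 1, 0, 0 },
};

/* Print the shortest chain of VM hand-offs from src to dst. */
static void migration_path(int src, int dst)
{
	int prev[NHOSTS], queue[NHOSTS], head = 0, tail = 0, i, v;

	memset(prev, -1, sizeof(prev));
	prev[src] = src;
	queue[tail++] = src;

	while (head < tail) {
		v = queue[head++];
		if (v == dst)
			break;
		for (i = 0; i < NHOSTS; i++) {
			if (adj[v][i] && prev[i] < 0) {
				prev[i] = v;
				queue[tail++] = i;
			}
		}
	}
	if (prev[dst] < 0) {
		printf("no path\n");
		return;
	}
	for (v = dst; v != src; v = prev[v])
		printf("%s <- ", host[v]);
	printf("%s\n", host[src]);
}

int main(void)
{
	migration_path(0, 5);	/* hostfoo01 -> hostfoo04 -> hostfoo09 */
	return 0;
}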

Of course, this is quite a ways out since there are no CXL 3.0 devices 
yet. As a first step I would like to get to a point where I can emulate 
this with qemu and just prototype various approaches, but starting with 
a single emulated memory device and two hosts.

> So, do all hosts have metadata for the whole CXL memory device all the
> time?  Or, would they create the metadata (hotplug) when a VM is
> migrated in and destroy it (hot unplug) when a VM is migrated out?

To be honest I have not thought about hot plugging, but it might be 
something for me to keep in mind and ponder. And if you have 
additional thoughts on this, I'd love to hear them.

What I was thinking, and this may or may not be possible, or may only 
be possible to a certain extent, is that my preference would be to keep 
as much of the metadata as possible on the memory device itself and have 
the hypervisors cooperate through some kind of ownership mechanism.
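
Just to sketch the kind of thing I have in mind (the layout and names 
here are purely made up, and this glosses over whether cross-host 
atomics on a shared CXL region are actually usable, which is an open 
question in its own right):

/* Hypothetical per-page record living in a metadata region on the
 * shared device itself; every attached hypervisor maps the same region. */
#include <stdint.h>
#include <stdbool.h>

#define XMEM_OWNER_NONE	0ULL

struct xmem_page_meta {
	uint64_t owner;			/* hypervisor id, 0 == unowned  */
	uint64_t gpa;			/* where the guest expects it   */
	uint64_t flags;
} __attribute__((aligned(64)));		/* one record per cacheline     */

/* Claim an unowned page; whoever wins the compare-and-swap owns it. */
static bool xmem_claim(struct xmem_page_meta *m, uint64_t my_id)
{
	uint64_t expected = XMEM_OWNER_NONE;

	return __atomic_compare_exchange_n(&m->owner, &expected, my_id,
					   false, __ATOMIC_ACQUIRE,
					   __ATOMIC_RELAXED);
}

/* Hand a page we own directly to a peer instead of freeing it. */
static bool xmem_transfer(struct xmem_page_meta *m, uint64_t my_id,
			  uint64_t peer_id)
{
	uint64_t expected = my_id;

	return __atomic_compare_exchange_n(&m->owner, &expected, peer_id,
					   false, __ATOMIC_RELEASE,
					   __ATOMIC_RELAXED);
}

The "super-page" would then just be an array of records like these, plus 
whatever cluster-level state (peer list, liveness, etc.) turns out to be 
needed.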


> That gets back to the granularity question discussed elsewhere in the
> thread.  How would the metadata allocation granularity interact with the
> page allocation granularity?  How would fragmentation be avoided so that
> hosts don't eat up all their RAM with unused metadata?

Yeah, this is something I am still running through my head. Even if we 
have this "ownership cooperation", is it based on pages? What happens to 
sub-page allocations: do we move them through the buckets, or do we 
attach ownership to sub-page allocations too? In my ideal world, you'd 
have two hypervisors cooperating over this memory as transparently as 
CPUs in a single system collaborating across NUMA nodes. A lot to think 
about, many problems to solve and a lot of work to do. I don't have all 
the answers yet, but I value all input & help.


[1]. https://nil-migration.org/VM-Graph.png

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-12 15:15           ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ James Bottomley
@ 2023-05-03 23:42             ` Dragan Stancevic
  0 siblings, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-05-03 23:42 UTC (permalink / raw)
  To: James Bottomley, David Hildenbrand, Huang, Ying, Gregory Price
  Cc: lsf-pc, nil-migration, linux-cxl, linux-mm

Hi James, sorry, it looks like I missed your email...


On 4/12/23 10:15, James Bottomley wrote:
> On Wed, 2023-04-12 at 10:38 +0200, David Hildenbrand wrote:
>> On 12.04.23 04:54, Huang, Ying wrote:
>>> Gregory Price <gregory.price@memverge.com> writes:
> [...]
>>>> That feels like a hack/bodge rather than a proper solution to me.
>>>>
>>>> Maybe this is an affirmative argument for the creation of an
>>>> EXMEM zone.
>>>
>>> Let's start with requirements.  What is the requirements for a new
>>> zone type?
>>
>> I'm stills scratching my head regarding this. I keep hearing all
>> different kind of statements that just add more confusions "we want
>> it to be hotunpluggable" "we want to allow for long-term pinning
>> memory" "but we still want it to be movable" "we want to place some
>> unmovable allocations on it". Huh?
> 
> This is the essential question about CXL memory itself: what would its
> killer app be?  The CXL people (or at least the ones I've talked to)
> don't exactly know.


I hope it's not something I've said; I'm not claiming VM migration or 
hypervisor clustering is the killer app for CXL. I would never claim 
that. And I'm not one of the CXL folks; you can chuck me into the "CXL 
enthusiasts" bucket... For a bit of context, I'm one of the 
co-authors/architects of VMware's clustered filesystem[1] and I've 
worked on live VM migration as far back as 2003 on the original ESX 
server. Back in the day, we introduced the concept of VM live migration 
into x86 data-center parlance with a combination of a process monitor 
and a clustered filesystem. The basic mechanism we put forward at the 
time was: pre-copy, quiesce, post-copy, un-quiesce. And I think most 
hypervisors that added live migration afterwards use loosely the same 
basic principles; iirc Xen introduced LM four years later in 2007, and 
KVM around the same time or perhaps a year later. Anyway, the point I am 
trying to get to is that it bugged me 20 years ago that we quiesced, and 
it bugs me today :) I think 20 years ago quiescing was an acceptable 
compromise because we couldn't solve it technologically. Maybe, 20-25 
years later, we've reached a point where we can solve it 
technologically. I don't know, but the problem interests me enough to try.


>  Within IBM I've seen lots of ideas but no actual
> concrete applications.  Given the rates at which memory density in
> systems is increasing, I'm a bit dubious of the extensible system pool
> argument.   Providing extensible memory to VMs sounds a bit more
> plausible, particularly as it solves a big part of the local overcommit
> problem (although you still have a global one).  I'm not really sure I
> buy the VM migration use case: iterative transfer works fine with small
> down times so transferring memory seems to be the least of problems
> with the VM migration use case

We do approximately 2.5 million live migrations per year. Some 
migrations take less than a second, some take roughly a second, and 
others on very noisy VMs can take several seconds. Whatever the average 
is, let's say 1 second per live migration: that's cumulatively roughly 
28 days of steal lost to migration per year. As you probably know, live 
migrations are essential for de-fragmenting hypervisors/de-stranding 
resources, and from my perspective I'd like to see them happen more 
often and with a smaller customer impact.
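
(The back-of-the-envelope arithmetic, for what it's worth: 2,500,000 
migrations x ~1 s of steal each = 2,500,000 s, and 2,500,000 s / 86,400 
s per day is roughly 28.9 days per year.)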


> (it's mostly about problems with attached devices).

That is purely dependent on the type of virtualization load. Maybe for 
the cloud you're running, devices are a problem (I'm guessing here). For 
us this is a non-existent problem: we serve approximately 600,000 
customers and don't do any form of pass-through, so it's literally a 
non-issue. What I am starting to tackle with nil-migration is being able 
to migrate live, executing memory instead of frozen memory, which should 
especially help with noisy VMs; in my experience, customers of noisy VMs 
are more likely to notice steal and complain about it. I understand 
everyone has their own workloads, and the devices problem will be solved 
in its own right, but it's out of scope for what I am tackling with 
nil-migration. My main focus at this time is memory and context migration.


> CXL 3.0 is adding sharing primitives for memory so
> now we have to ask if there are any multi-node shared memory use cases
> for this, but most of us have already been burned by multi-node shared
> clusters once in our career and are a bit leery of a second go around.

Having chatted with you at the last LPC, and judging by the combined 
gray hair between us, I'll venture to guess we've both fallen off the 
proverbial bike many times. It's never stopped me from getting back on; 
the issue interests me enough to try.

If you don't mind me asking, what clustering did you work on? Maybe I'm 
familiar with it.


> 
> Is there a use case I left out (or needs expanding)?
> 
> James
> 



[1]. https://en.wikipedia.org/wiki/VMware_VMFS

--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​
  2023-04-07 21:05 [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dragan Stancevic
                   ` (5 preceding siblings ...)
  2023-04-11 18:16 ` RAGHU H
@ 2023-05-09 15:08 ` Dragan Stancevic
  6 siblings, 0 replies; 40+ messages in thread
From: Dragan Stancevic @ 2023-05-09 15:08 UTC (permalink / raw)
  To: lsf-pc; +Cc: nil-migration, linux-cxl, linux-mm

Hey folks-

to those who attended the BoF, I just wanted to say thank you for your 
time and attention, I appreciate it.

It was a bit challenging being remote with choppy sound, and I already 
have hearing loss, so I wasn't able to hear 100% of what people were 
saying.

So if you asked me "how you like them apples" and I said "Yeah, bananas 
are cool", I apologize. Please feel free to email me with any additional 
thoughts or questions.

Thanks


On 4/7/23 16:05, Dragan Stancevic wrote:
> Hi folks-
> 
> if it's not too late for the schedule...
> 
> I am starting to tackle VM live migration and hypervisor clustering over 
> switched CXL memory[1][2], intended for cloud virtualization types of 
> loads.
> 
> I'd be interested in doing a small BoF session with some slides and get 
> into a discussion/brainstorming with other people that deal with VM/LM 
> cloud loads. Among other things to discuss would be page migrations over 
> switched CXL memory, shared in-memory ABI to allow VM hand-off between 
> hypervisors, etc...
> 
> A few of us discussed some of this under the ZONE_XMEM thread, but I 
> figured it might be better to start a separate thread.
> 
> If there is interested, thank you.
> 
> 
> [1]. High-level overview available at http://nil-migration.org/
> [2]. Based on CXL spec 3.0
> 
> -- 
> Peace can only come as a natural consequence
> of universal enlightenment -Dr. Nikola Tesla
> 


--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2023-05-09 15:08 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-07 21:05 [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dragan Stancevic
2023-04-07 22:23 ` James Houghton
2023-04-07 23:17   ` David Rientjes
2023-04-08  1:33     ` Dragan Stancevic
2023-04-08 16:24     ` Dragan Stancevic
2023-04-08  0:05 ` Gregory Price
2023-04-11  0:56   ` Dragan Stancevic
2023-04-11  1:48     ` Gregory Price
2023-04-14  3:32       ` Dragan Stancevic
2023-04-14 13:16         ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Jonathan Cameron
2023-04-11  6:37   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Huang, Ying
2023-04-11 15:36     ` Gregory Price
2023-04-12  2:54       ` Huang, Ying
2023-04-12  8:38         ` David Hildenbrand
     [not found]           ` <CGME20230412111034epcas2p1b46d2a26b7d3ac5db3b0e454255527b0@epcas2p1.samsung.com>
2023-04-12 11:10             ` FW: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Kyungsan Kim
2023-04-12 11:26               ` David Hildenbrand
     [not found]                 ` <CGME20230414084110epcas2p20b90a8d1892110d7ca3ac16290cd4686@epcas2p2.samsung.com>
2023-04-14  8:41                   ` Kyungsan Kim
2023-04-12 15:40               ` Matthew Wilcox
     [not found]                 ` <CGME20230414084114epcas2p4754d6c0d3c86a0d6d4e855058562100f@epcas2p4.samsung.com>
2023-04-14  8:41                   ` Kyungsan Kim
2023-04-12 15:15           ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ James Bottomley
2023-05-03 23:42             ` Dragan Stancevic
2023-04-12 15:26           ` Gregory Price
2023-04-12 15:50             ` David Hildenbrand
2023-04-12 16:34               ` Gregory Price
2023-04-14  4:16                 ` Dragan Stancevic
2023-04-14  3:33     ` Dragan Stancevic
2023-04-14  5:35       ` Huang, Ying
2023-04-09 17:40 ` Shreyas Shah
2023-04-11  1:08   ` Dragan Stancevic
2023-04-11  1:17     ` Shreyas Shah
2023-04-11  1:32       ` Dragan Stancevic
2023-04-11  4:33         ` Shreyas Shah
2023-04-14  3:26           ` Dragan Stancevic
     [not found] ` <CGME20230410030532epcas2p49eae675396bf81658c1a3401796da1d4@epcas2p4.samsung.com>
2023-04-10  3:05   ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory Kyungsan Kim
2023-04-10 17:46     ` [External] " Viacheslav A.Dubeyko
2023-04-14  3:27     ` Dragan Stancevic
2023-04-11 18:00 ` [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory​ Dave Hansen
2023-05-01 23:49   ` Dragan Stancevic
2023-04-11 18:16 ` RAGHU H
2023-05-09 15:08 ` Dragan Stancevic

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).