From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 14 Apr 2023 14:16:47 +0100
From: Jonathan Cameron <jonathan.cameron@huawei.com>
To: Dragan Stancevic
CC: Gregory Price, linux-mm@kvack.org
Subject: Re: [LSF/MM/BPF TOPIC] BoF VM live migration over CXL memory
Message-ID: <20230414141647.000075a6@Huawei.com>
In-Reply-To: <253e7a73-be3c-44d4-1ca3-d0d060313517@stancevic.com>
References: <5d1156eb-02ae-a6cc-54bb-db3df3ca0597@stancevic.com>
 <9d22b56b-80ef-b36f-731b-4b3b588bc4bd@stancevic.com>
 <253e7a73-be3c-44d4-1ca3-d0d060313517@stancevic.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.
X-Mailer: Claws Mail 4.1.0 (GTK 3.24.33; x86_64-w64-mingw32)
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

On Thu, 13 Apr 2023 22:32:48 -0500
Dragan Stancevic wrote:

> Hi Gregory-
>
>
> On 4/10/23 20:48, Gregory Price wrote:
> > On Mon, Apr 10, 2023 at 07:56:01PM -0500, Dragan Stancevic wrote:
> >> Hi Gregory-
> >>
> >> On 4/7/23 19:05, Gregory Price wrote:
> >>> 3. This is changing the semantics of migration from a virtual memory
> >>> movement to a physical memory movement. Typically you would expect
> >>> the RDMA process for live migration to work something like...
> >>>
> >>> a) migration request arrives
> >>> b) source host informs destination host of size requirements
> >>> c) destination host allocates memory and passes a Virtual Address
> >>>    back to source host
> >>> d) source host initiates an RDMA from HostA-VA to HostB-VA
> >>> e) CPU task is migrated
> >>>
> >>> Importantly, the allocation of memory by Host B handles the important
> >>> step of creating HVA->HPA mappings, and the Extended/Nested Page
> >>> Tables can simply be flushed and re-created after the VM is fully
> >>> migrated.
> >>>
> >>> too long; didn't read: live migration is a virtual address operation,
> >>> and node-migration is a PHYSICAL address operation; the virtual
> >>> addresses remain the same.
> >>>
> >>> This is problematic, as it's changing the underlying semantics of the
> >>> migration operation.
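
To spell out why that flow is purely a virtual address operation, here is a
deliberately toy, self-contained sketch (nothing below is a real hypervisor
API; malloc() stands in for allocating host physical backing and memcpy()
for the RDMA transfer). The guest-physical layout is carried across
untouched; only the destination's own backing, and hence its rebuilt
stage-2 tables, is new.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGES   4
#define PAGE_SZ 4096

/* Toy model: the "stage-2 table" is just a GPA -> backing-pointer map;
 * the backing pointer stands in for a host physical page.             */
struct toy_vm {
        uint64_t gpa[PAGES];            /* guest-physical page addresses */
        unsigned char *backing[PAGES];  /* per-page host backing ("HPA") */
};

static void migrate_copy(const struct toy_vm *src, struct toy_vm *dst)
{
        for (int i = 0; i < PAGES; i++) {
                dst->gpa[i] = src->gpa[i];          /* GPA layout unchanged */
                dst->backing[i] = malloc(PAGE_SZ);  /* new backing on dest  */
                memcpy(dst->backing[i], src->backing[i], PAGE_SZ); /* "RDMA" */
        }
}

int main(void)
{
        struct toy_vm src = { 0 }, dst = { 0 };

        for (int i = 0; i < PAGES; i++) {
                src.gpa[i] = 0x40000000ULL + (uint64_t)i * PAGE_SZ;
                src.backing[i] = malloc(PAGE_SZ);
                memset(src.backing[i], i, PAGE_SZ); /* pretend guest data */
        }

        migrate_copy(&src, &dst);

        /* Same guest-physical addresses, different host backing. */
        for (int i = 0; i < PAGES; i++)
                printf("GPA 0x%llx: src backing %p -> dst backing %p\n",
                       (unsigned long long)dst.gpa[i],
                       (void *)src.backing[i], (void *)dst.backing[i]);

        for (int i = 0; i < PAGES; i++) {
                free(src.backing[i]);
                free(dst.backing[i]);
        }
        return 0;
}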
> >> Those are all valid points, but what if you don't need to recreate
> >> HVA->HPA mappings? If I am understanding the CXL 3.0 spec correctly,
> >> then both virtual addresses and physical addresses wouldn't have to
> >> change.

That's implementation defined if we are talking DCD for this. I would
suggest making it very clear which particular CXL options you are thinking
of using. A CXL 2.0 approach of binding LDs to different switch vPPBs
(virtual ports) probably doesn't have this problem, but it has its own
limitations and is a much heavier-weight thing to handle.

For DCD, if we assume sharing is used (I'd suggest ignoring other
possibilities for now, as there are architectural gaps that I'm not going
into and the same issues will occur with them anyway)...

Then what you get if you share on multiple LDs presented to multiple hosts
is a set of extents (each a base + size, any number, any size) that have
sequence numbers.

The device may, typically because of fragmentation of the DPA space exposed
to an LD (typically one LD from a device per host), decide to map what was
created in a particular DPA extent pattern (mapped via nice linear decoders
into host PA space) in a different order and with different-sized extents.
So in general you can't assume a spec-compliant CXL type 3 device (probably
a multi-head device in initial deployments) will map anything to a
particular location when moving the memory between hosts.

So ultimately you'd need to translate between:

  page tables on the source + DPA extents info

and

  the page tables needed on the destination to land the parts of the DPA
  extents (via HDM decoders applying offsets etc.) in the right place in
  GPA space, so the guest gets the right mapping.

That will have some complexity and cost associated with it. Not impossible,
but not a simple reuse of tables from the source on the destination.

This is all PA to GPA translation though, and in many cases I'd not expect
it to be particularly dynamic - it's a step before you do any actual
migration, hence I'm not sure it matters that it might take a bit of maths.

> >> Because the fabric "virtualizes" host physical addresses and the
> >> translation is done by the G-FAM/GFD that has the capability to
> >> translate multi-host HPAs to its internal DPAs. So if you have two
> >> hypervisors seeing device physical address as the same physical
> >> address, that might work?
> >>
> >
> > Hm. I hadn't considered the device side translation (decoders), though
> > that's obviously a tool in the toolbox. You still have to know how to
> > slide ranges of data (which you mention below).
>
> Hmm, do you have any quick thoughts on that?

HDM decoder programming is hard to do in a dynamic fashion (lots of
limitations on what you can do due to ordering restrictions in the spec).
I'd ignore it for this use case beyond the fact that you get linear offsets
from DPA to HPA that need to be incorporated in your thinking.
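
To make that concrete, here is a rough, self-contained sketch of the
translation chain (every structure, number and helper name below is made up
for illustration; it is not the spec's extent record format or any existing
kernel interface): walk from a source HPA to a source DPA via the linear
decoder offset, from DPA to a logical offset within the shared region using
the sequence-number order of the extents, then back out to a destination
DPA and destination HPA.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One extent of shared capacity as seen by one host. The sequence number
 * gives the extent's logical position within the shared region, so the
 * same logical byte may sit at different DPAs on each host.             */
struct extent {
        uint64_t dpa;   /* start in that host's device-physical space    */
        uint64_t len;
        uint64_t seq;   /* logical ordering; arrays below are pre-sorted */
};

/* Logical offset of a DPA within the shared region, or -1 if unmapped. */
static int64_t dpa_to_logical(const struct extent *e, int n, uint64_t dpa)
{
        uint64_t off = 0;

        for (int i = 0; i < n; i++) {
                if (dpa >= e[i].dpa && dpa < e[i].dpa + e[i].len)
                        return (int64_t)(off + (dpa - e[i].dpa));
                off += e[i].len;
        }
        return -1;
}

/* DPA on this host backing a given logical offset, or -1 if unmapped. */
static int64_t logical_to_dpa(const struct extent *e, int n, uint64_t off)
{
        uint64_t base = 0;

        for (int i = 0; i < n; i++) {
                if (off < base + e[i].len)
                        return (int64_t)(e[i].dpa + (off - base));
                base += e[i].len;
        }
        return -1;
}

int main(void)
{
        /* Source host sees one 2 MiB extent; the destination got the same
         * capacity as two fragments handed out in the opposite order.
         * Both arrays are listed in sequence-number order.              */
        struct extent src[] = { { 0x000000, 0x200000, 0 } };
        struct extent dst[] = { { 0x800000, 0x100000, 0 },
                                { 0x500000, 0x100000, 1 } };

        /* Linear HDM decoder mappings: HPA = hpa_base + (dpa - dpa_base). */
        uint64_t src_hpa_base = 0x1000000000ULL, src_dpa_base = 0;
        uint64_t dst_hpa_base = 0x2000000000ULL, dst_dpa_base = 0;

        /* One stage-2 entry on the source: some GPA backed by this HPA. */
        uint64_t gpa = 0x40150000, src_hpa = 0x1000150000ULL;

        uint64_t src_dpa = src_dpa_base + (src_hpa - src_hpa_base);
        int64_t logical  = dpa_to_logical(src, 1, src_dpa);
        int64_t dst_dpa  = logical < 0 ? -1 :
                           logical_to_dpa(dst, 2, (uint64_t)logical);

        if (dst_dpa < 0) {
                fprintf(stderr, "source HPA not backed by a shared extent\n");
                return 1;
        }

        uint64_t dst_hpa = dst_hpa_base + ((uint64_t)dst_dpa - dst_dpa_base);

        printf("GPA 0x%" PRIx64 ": source HPA 0x%" PRIx64
               " -> destination HPA 0x%" PRIx64 "\n", gpa, src_hpa, dst_hpa);
        return 0;
}

The point being that this is pure arithmetic over the two extent lists and
the two decoder offsets, so it can be done once, up front, rather than on
every access.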
> >
> >>> The reference in this case is... the page tables. You need to know how
> >>> to interpret the data in the CXL memory region on the remote host, and
> >>> that's a "relative page table translation" (to coin a phrase? I'm not
> >>> sure how best to describe it).
> >>
> >> right, coining phrases... I have been thinking of a "super-page" (for
> >> lack of a better word): a metadata region sitting on the switched
> >> CXL.mem device that would allow hypervisors to synchronize on various
> >> aspects, such as "relative page table translation", host is up, host is
> >> down, list of peers, who owns what, etc... In a perfect scenario, I
> >> would love to see the hypervisors cooperating on a switched CXL.mem
> >> device the same way CPUs on different NUMA nodes cooperate on memory in
> >> a single hypervisor. If either host can allocate and schedule from this
> >> space, then the "NIL" aspect of migration is "free".
> >>
> >
> > The core of the problem is still that each of the hosts has to agree on
> > the location (physically) of this region of memory, which could be
> > problematic unless you have very strong BIOS and/or kernel driver
> > controls to ensure certain devices are guaranteed to be mapped into
> > certain spots in the CFMW.
>
> Right, true. The way I am thinking of it is that this would be part of
> data-center ops setup, which at first pass would be somewhat of a manual
> setup, the same way as other pre-OS setup. But later on down the road
> perhaps this could be automated, either through some pre-agreed auto-range
> detection or similar; it's not unusual for dc ops to name hypervisors
> depending on where in the dc/rack/etc. they sit.
>

You might be able to constrain particular devices to play nicely with such
a model, but that is out of scope for the specification, and I'd suggest
that in Linux at least we write the code to deal with the general case and
then maybe have a 'fast path' if the stars align.

Jonathan