From mboxrd@z Thu Jan 1 00:00:00 1970
From: Gerry
Subject: Re: [virtio-dev] [PATCH 0/2] introduce virtio-ism: internal shared memory device
Date: Wed, 19 Oct 2022 13:29:32 +0800
In-Reply-To: <361c918e-b628-2428-2219-e2fa58ce3645@redhat.com>
References: <20221017074724.89569-1-xuanzhuo@linux.alibaba.com>
 <1666009602.9397366-1-xuanzhuo@linux.alibaba.com>
 <361c918e-b628-2428-2219-e2fa58ce3645@redhat.com>
To: Jason Wang
Cc: Xuan Zhuo, virtio-dev@lists.oasis-open.org, hans@linux.alibaba.com,
 herongguang@linux.alibaba.com, zmlcc@linux.alibaba.com,
 dust.li@linux.alibaba.com, tonylu@linux.alibaba.com,
 zhenzao@linux.alibaba.com, helinguo@linux.alibaba.com, mst@redhat.com,
 cohuck@redhat.com, Stefan Hajnoczi
Content-Type: text/plain; charset=utf-8

> On Oct 19, 2022, at 11:55, Jason Wang <jasowang@redhat.com> wrote:
>
> On 2022/10/18 16:33, Gerry wrote:
>>
>>> On Oct 18, 2022, at 14:54, Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On Mon, Oct 17, 2022 at 8:31 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>>>>
>>>> On Mon, 17 Oct 2022 16:17:31 +0800, Jason Wang <jasowang@redhat.com> wrote:
>>>>> Adding Stefan.
>>>>>
>>>>> On Mon, Oct 17, 2022 at 3:47 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
>>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> # Background
>>>>>>
>>>>>> Nowadays, there is a common need to accelerate communication between
>>>>>> different VMs and containers, including lightweight virtual-machine-based
>>>>>> containers. One way to achieve this is to colocate them on the same host.
>>>>>> However, the performance of inter-VM communication through the network
>>>>>> stack is not optimal and may also waste extra CPU cycles. This scenario
>>>>>> has been discussed many times, but there is still no generic solution
>>>>>> available [1] [2] [3].
>>>>>>
>>>>>> With a pci-ivshmem + SMC (Shared Memory Communications [4]) based PoC [5],
>>>>>> we found that by changing the communication channel between VMs from TCP
>>>>>> to SMC with shared memory, we can achieve superior performance for a
>>>>>> common socket-based application [5]:
>>>>>>   - latency reduced by about 50%
>>>>>>   - throughput increased by about 300%
>>>>>>   - CPU consumption reduced by about 50%
>>>>>>
>>>>>> Since there is no particularly suitable shared memory management solution
>>>>>> that matches the needs of SMC (see "## Comparison with existing
>>>>>> technology"), and virtio is the standard for communication in the
>>>>>> virtualization world, we want to implement a virtio-ism device based on
>>>>>> virtio, which can support on-demand memory sharing across VMs, containers,
>>>>>> or VM-container. To match the needs of SMC, the virtio-ism device needs
>>>>>> to support:
>>>>>>
>>>>>> 1. Dynamic provision: shared memory regions are dynamically allocated and
>>>>>>    provisioned.
>>>>>> 2. Multi-region management: the shared memory is divided into regions,
>>>>>>    and a peer may allocate one or more regions from the same shared
>>>>>>    memory device.
>>>>>> 3. Permission control: the permission of each region can be set
>>>>>>    separately.
>>>>>
>>>>> Looks like virtio-ROCE
>>>>>
>>>>> https://lore.kernel.org/all/20220511095900.343-1-xieyongji@bytedance.com/T/
>>>>>
>>>>> and virtio-vhost-user can satisfy the requirement?
>>>>>
>>>>>>
>>>>>> # Virtio ism device
>>>>>>
>>>>>> ISM devices provide the ability to share memory between different guests
>>>>>> on a host. Memory that a guest obtains from an ism device can be shared
>>>>>> with multiple peers at the same time, and this sharing relationship can
>>>>>> be dynamically created and released.
>>>>>>
>>>>>> The shared memory obtained from the device is divided into multiple ism
>>>>>> regions for sharing. The ISM device provides a mechanism to notify the
>>>>>> other referrers of an ism region about content update events.
>>>>>>
>>>>>> # Usage (SMC as example)
>>>>>>
>>>>>> Here is one possible use case:
>>>>>>
>>>>>> 1. SMC calls the interface ism_alloc_region() of the ism driver, which
>>>>>>    returns the location of a memory region in the PCI space and a token.
>>>>>> 2. The ism driver mmaps the memory region and returns it to SMC together
>>>>>>    with the token.
>>>>>> 3. SMC passes the token to the connected peer.
>>>>>> 4. The peer calls the ism driver interface ism_attach_region(token) to
>>>>>>    get the location of the shared memory in its PCI space.
>>>>>>
>>>>>> # About hot plugging of the ism device
>>>>>>
>>>>>>   Hot plugging of devices is a heavyweight, failure-prone, time-consuming,
>>>>>>   and poorly scalable operation. So we don't plan to support it for now.
>>>>>>
>>>>>> # Comparison with existing technology
>>>>>>
>>>>>> ## ivshmem or ivshmem 2.0 of Qemu
>>>>>>
>>>>>>   1. ivshmem 1.0 exposes one large piece of memory that is visible to
>>>>>>   every VM using the device, so its security is not sufficient.
>>>>>>
>>>>>>   2. ivshmem 2.0 is shared memory belonging to one VM that is read-only
>>>>>>   to all other VMs using the same ivshmem 2.0 device, which also does
>>>>>>   not meet our needs in terms of security.
>>>>>>
>>>>>> ## vhost-pci and virtiovhostuser
>>>>>>
>>>>>>   Does not support dynamic allocation and is therefore not suitable
>>>>>>   for SMC.
>>>>>
>>>>> I think this is an implementation issue, we can support the VHOST IOTLB
>>>>> message, then the regions could be added/removed on demand.
>>>>
>>>> 1. After the attacker connects with the victim, if the attacker does not
>>>>    dereference the memory, the memory stays occupied under virtiovhostuser.
>>>>    With ism devices, the victim can directly release the reference, and
>>>>    the maliciously referenced region only occupies the attacker's
>>>>    resources.
>>>
>>> Let's define the security boundary here. E.g. do we trust the device or
>>> not? If yes, in the case of virtiovhostuser, can we simply do
>>> VHOST_IOTLB_UNMAP so that we can safely release the memory from the
>>> attacker?
>> Thanks, Jason :)
>> In our design, there are several roles involved:
>> 1) a virtio-ism-smc front-end driver
>> 2) a virtio-ism backend device driver and its associated VMM
>> 3) a global device manager
>> 4) a group of remote/peer virtio-ism backend devices/VMMs
>> 5) a group of remote/peer virtio-ism-smc front-end drivers
>>
>> Among these, we treat 1, 2 and 3 as trusted, and 4 and 5 as untrusted.
>
> It looks to me VIRTIO_ISM_PERM_MANAGE violates what you've described here.
> E.g. what happens if 1 grants this permission to 5?

My mistake, I missed some background information. We split the communication
into a control plane and a data plane. The threat model above is for the
control plane. Once a peer has been granted permission to access a memory
region, it becomes trusted to read/write that memory region.

>
>> Because 4 and 5 are untrusted, we can't guarantee that IOTLB Invalidate
>> requests have been executed as expected.
>
> Interesting, I wonder how this is guaranteed by ISM. Anything that can work
> for ISM but not IOTLB? Note that the only difference for me is the device
> API. We can hook anything that works for ISM to IOTLB.

The difference is who the resource owner is.
For an IOTLB-based design, the guest VM is the resource owner, so it can only
try to reclaim a shared memory region from its peers.
In our design, the device manager is the resource owner, and guest VMs
allocate/free memory regions from the device manager. So a new memory region
is allocated/freed for each SMC connection, and a memory region won't be
reused for SMC connections with different (local, peer) pairs.

>
>> Say when disconnecting an SMC connection, a malicious peer may ignore the
>> IOTLB invalidation request and keep accessing the shared memory region.
>>
>> We have considered the IOTLB-based design but encountered several issues:
>> 1) It depends on the way guest VM memory is provisioned. We need a memory
>>    resource descriptor to support vhost-user IOTLB messages, so we can't
>>    support VMs backed by anonymous memory.
>
> Hypervisor (Qemu) is free to hook the IOTLB message to any kind of memory
> backend, isn't it? E.g. Qemu can choose to implement IOTLB on its own
> instead of forwarding it to another VM.

A memory resource file descriptor is needed to share a memory region among
VMs. If the guest memory is provisioned by anonymously mapped memory, it
can't be shared with other VMs. In other words, vhost may work with process
virtual addresses, but vhost-user always works with file descriptors.
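To illustrate the point, here is the IOTLB message abridged from the kernel
uapi header <linux/vhost_types.h> (as far as I understand, vhost-user carries
the same payload in VHOST_USER_IOTLB_MSG): the mapping is named by a process
virtual address, and no file descriptor appears anywhere in the message.

    #include <linux/types.h>

    /* Abridged from <linux/vhost_types.h>. This is the message behind the
     * VHOST_IOTLB "map"/"unmap" operations discussed above. Note the region
     * is described by uaddr, a virtual address in the backend process, not
     * by a file descriptor that could be forwarded to another VMM. */
    struct vhost_iotlb_msg {
            __u64 iova;   /* guest I/O virtual address */
            __u64 size;   /* length of the mapping */
            __u64 uaddr;  /* backend process virtual address */
    #define VHOST_ACCESS_RO      0x1
    #define VHOST_ACCESS_WO      0x2
    #define VHOST_ACCESS_RW      0x3
            __u8 perm;
    #define VHOST_IOTLB_MISS        1
    #define VHOST_IOTLB_UPDATE      2   /* i.e. "map" */
    #define VHOST_IOTLB_INVALIDATE  3   /* i.e. "unmap" */
    #define VHOST_IOTLB_ACCESS_FAIL 4
            __u8 type;
    };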
>
>> 2) Lack of fine-grained access control on the memory resource descriptor.
>>    When sending a memory resource descriptor to an untrusted peer, we can't
>>    enforce region-based access control. Memfd supports file-level seal
>>    operations but still lacks region-based permission control, and a
>>    hugetlbfs-based fd doesn't support sealing at all.
>
> So in the above, you said 4 and 5 are untrusted. If yes, how can you
> enforce region-based access control (the memory is still mapped by the
> untrusted VMM)? And again, virtio-vhost-user is not limited to
> memfd/hugetlbfs, it can do what you've done in your prototype (hooking
> into /dev/shm).

Let's take an example. Say a VMM provisions 1GB of memory to guest A through
a memfd, of which 1MB is allocated by guest A as shared memory that it wants
to share with VM B. We lack a mechanism to share the memfd with guest B while
restricting guest B to access only that shared 1MB region.
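A minimal sketch of the limitation, using the regular memfd_create(2) and
F_ADD_SEALS API (error paths trimmed): seals attach to the file as a whole,
so the finest granularity available is one seal set per fd.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create an fd-backed region and seal it. Seals apply to the whole
     * file: nothing here can express "read-only for bytes [off, off+1MB)
     * for one particular peer". */
    static int make_sealed_memfd(size_t size)
    {
            int fd = memfd_create("ism-region",
                                  MFD_CLOEXEC | MFD_ALLOW_SEALING);

            if (fd < 0)
                    return -1;
            if (ftruncate(fd, size) < 0 ||
                fcntl(fd, F_ADD_SEALS,
                      F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) < 0) {
                    close(fd);
                    return -1;
            }
            return fd;      /* ready to pass to a peer via SCM_RIGHTS */
    }

This whole-file granularity is also why, in the device manager design below,
every shared region gets its own memfd: per-fd seals then effectively become
per-region permissions.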
>
>> 3) Lack of a reliable way to reclaim granted access permissions from
>>    untrusted peers, as stated above.
>
> It would be better to explain how this "reclaim" works.
>
>> 4) How to implement resource accounting. Say a VM has shared some memory
>>    regions from peers, and those peers exited unexpectedly; that shared
>>    memory will be accounted to the victim VM and may cause unexpected OOM.
>>
>> Based on the above considerations, we adopted another design and introduced
>> the device manager to solve the above issues:
>> 1) The device manager is the owner of the memory buffers.
>
> I don't see the definition of "device manager" in your proposal, this needs
> to be clarified in both the spec and the changelog or the cover letter.

We will add a description of the "device manager" in the next version.

>
>> 2) The device manager creates a memfd for each memory buffer/region and
>>    configures seals according to the requested access permissions.
>
> Ok, but this seems not what you've implemented in your qemu prototype?

Not yet, we are still working on it.
BTW, is a Rust/Golang-based prototype acceptable to the Qemu community, or
must it be written in C?

>
>> 3) When a guest VM reclaims a shared memory buffer, the device manager
>>    will provision a new memfd to the guest VM.
>
> How can this be done for the untrusted peers?

Please refer to the explanations above :)

>
>> And it will take the responsibility to reclaim the old buffer from the
>> peers and eventually release the old buffer.
>> 4) Simplify the control communication channel. Every guest VM only needs
>>    to talk to the device manager and has no need to discover and
>>    communicate with other peers.
>
> Not sure, but it's better not to mandate any model in the application layer.

I made another mistake here; it should be stated as:
Every backend device driver only needs to talk to the device manager and has
no need to discover and communicate with other peers.

It may help to give more information about the design goals we want to
achieve:
1) Extend pci-ivshmem v1/v2 to support more complex usage scenarios like SMC.
2) virtio-ism is a generic shmem device and should support the pci-ivshmem
   v1/v2 usage scenarios.
3) SMC is the first/example usage scenario.
4) Support communications among vm<->vm, vm<->runC container, and
   runC container<->runC container.

Design goal 4 forces us to design virtio-ism instead of a pci-ivshmem v2/v3.
The virtio/vDPA/vDUSE stack provides us a perfect way to implement a
userspace shmem device to support runC containers, and currently there's no
way to emulate PCI devices in userspace.
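To make the SMC usage flow quoted near the top of this mail concrete, here
is a rough sketch of the driver-facing interface we have in mind. All names
and signatures below (struct ism_dev, struct ism_region, the prototypes) are
illustrative only, nothing here is final:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical sketch of "# Usage (SMC as example)". */
    struct ism_dev;                 /* opaque handle to an ism device */

    struct ism_region {
            void     *vaddr;        /* region mapped from the PCI space */
            size_t    len;          /* region size */
            uint64_t  token;        /* capability to hand to the peer */
    };

    /* Steps 1-2: allocate a region in the device's PCI space, mmap it,
     * and learn the token that identifies it. */
    int ism_alloc_region(struct ism_dev *dev, size_t len,
                         struct ism_region *r);

    /* Step 4: the peer attaches the same region using the token it
     * received out of band in step 3 (e.g. over the SMC handshake). */
    int ism_attach_region(struct ism_dev *dev, uint64_t token,
                          struct ism_region *r);

After ism_attach_region() succeeds, both sides read/write r->vaddr directly;
the device's notification mechanism signals content updates to the other
referrers of the region.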
>
> Thanks
>
>>
>> Thanks,
>> Gerry
>>
>>>
>>>>
>>>> 2. The ism device of a VM can be shared with multiple (1000+) VMs at
>>>>    the same time, which is a challenge for virtiovhostuser.
>>>
>>> Please elaborate more on the challenges, anything that makes
>>> virtiovhostuser different?
>>>
>>>>
>>>> 3. The sharing relationships of ism are added dynamically, while
>>>>    virtiovhostuser determines the sharing relationship at startup.
>>>
>>> Not necessarily with the IOTLB API?
>>>
>>>>
>>>> 4. For security issues, the device under virtiovhostuser may mmap more
>>>>    memory, while ism only maps one region to other devices.
>>>
>>> With VHOST_IOTLB_MAP, the map could be done per region.
>>>
>>> Thanks
>>>
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>>
>>>>>> # Design
>>>>>>
>>>>>>   This is a structure diagram based on ism sharing between two VMs.
>>>>>>
>>>>>>    |-------------------------------------------------------------------------------------------------------------|
>>>>>>    | |------------------------------------------------|       |------------------------------------------------| |
>>>>>>    | | Guest                                          |       | Guest                                          | |
>>>>>>    | |                                                |       |                                                | |
>>>>>>    | |   ----------------                             |       |   ----------------                             | |
>>>>>>    | |   |    driver    |     [M1]   [M2]   [M3]      |       |   |    driver    |            [M2]   [M3]      | |
>>>>>>    | |   ----------------       |      |      |       |       |   ----------------              |      |       | |
>>>>>>    | |    |cq|                  |map   |map   |map    |       |    |cq|                         |map   |map    | |
>>>>>>    | |    |  |                  |      |      |       |       |    |  |                         |      |       | |
>>>>>>    | |    |  |              -------------------       |       |    |  |              -------------------       | |
>>>>>>    | |----|--|--------------|  device memory  |-------|       |----|--|--------------|  device memory  |-------| |
>>>>>>    | |    |  |              -------------------       |       |    |  |              -------------------       | |
>>>>>>    | |                                |               |       |                                |               | |
>>>>>>    | |                                |               |       |                                |               | |
>>>>>>    | | Qemu                           |               |       | Qemu                           |               | |
>>>>>>    | |-------------------------------+----------------|       |-------------------------------+----------------| |
>>>>>>    |                                 |                                                        |                  |
>>>>>>    |                                 |                                                        |                  |
>>>>>>    |                                 |---------------------------+----------------------------|                  |
>>>>>>    |                                                             |                                               |
>>>>>>    |                                                             |                                               |
>>>>>>    |                                                --------------------------                                   |
>>>>>>    |                                                 | M1 |   | M2 |   | M3 |                                    |
>>>>>>    |                                                --------------------------                                   |
>>>>>>    |                                                                                                             |
>>>>>>    | HOST                                                                                                        |
>>>>>>    ---------------------------------------------------------------------------------------------------------------
>>>>>>
>>>>>> # POC code
>>>>>>
>>>>>>   Kernel: https://github.com/fengidri/linux-kernel-virtio-ism/commits/ism
>>>>>>   Qemu:   https://github.com/fengidri/qemu/commits/ism
>>>>>>
>>>>>> If there are any problems, please point them out.
>>>>>>
>>>>>> Hope to hear from you, thank you.
>>>>>>
>>>>>> [1] https://projectacrn.github.io/latest/tutorials/enable_ivshmem.html
>>>>>> [2] https://dl.acm.org/doi/10.1145/2847562
>>>>>> [3] https://hal.archives-ouvertes.fr/hal-00368622/document
>>>>>> [4] https://lwn.net/Articles/711071/
>>>>>> [5] https://lore.kernel.org/netdev/20220720170048.20806-1-tonylu@linux.alibaba.com/T/
>>>>>>
>>>>>> Xuan Zhuo (2):
>>>>>>   Reserve device id for ISM device
>>>>>>   virtio-ism: introduce new device virtio-ism
>>>>>>
>>>>>>  content.tex    |   3 +
>>>>>>  virtio-ism.tex | 340 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  2 files changed, 343 insertions(+)
>>>>>>  create mode 100644 virtio-ism.tex
>>>>>>
>>>>>> --
>>>>>> 2.32.0.3.g01195cf9f

2022=E5=B9=B4= 10=E6=9C=8819=E6=97=A5 11:55=EF=BC=8CJason Wang <jasowang@redhat.com> =E5=86=99=E9=81=93= =EF=BC=9A


=E5= =9C=A8 2022/10/18 16:33, Gerry =E5=86=99=E9=81=93:


2022=E5=B9=B410=E6=9C=8818=E6=97=A5 14:54=EF=BC= =8CJason Wang <jasowan= g@redhat.com> =E5=86=99=E9=81=93=EF=BC=9A

On Mon, Oct 17, 2022 at 8:31 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:

On Mon, 17 = Oct 2022 16:17:31 +0800, Jason Wang <jasowang@redhat.com> wrote:
Adding Stefan.


On Mon, Oct 17, 2022 at 3:47 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wr= ote:

Hell= o everyone,

# Background

Nowadays, there is a common scenario to accelerate communication be= tween
different VMs and containers, including light weight vi= rtual machine based
containers. One way to achieve this is to= colocate them on the same host.
However, the performance of = inter-VM communication through network stack is not
optimal a= nd may also waste extra CPU cycles. This scenario has been discussed
many times, but still no generic solution available [1] [2] [3].
With pci-ivshmem + SMC(Shared Memory Communicati= ons: [4]) based PoC[5],
We found that by changing the communi= cation channel between VMs from TCP to SMC
with shared memory= , we can achieve superior performance for a common
socket-bas= ed application[5]:
 - latency reduced by about 50%
 - throughput increased by about 300%
 - C= PU consumption reduced by about 50%

Since ther= e is no particularly suitable shared memory management solution
matches the need for SMC(See ## Comparison with existing technology), an= d virtio
is the standard for communication in the virtualizat= ion world, we want to
implement a virtio-ism device based on = virtio, which can support on-demand
memory sharing across VMs= , containers or VM-container. To match the needs of SMC,
the = virtio-ism device need to support:

1. Dynamic = provision: shared memory regions are dynamically allocated and
  provisioned.
2. Multi-region management: the sh= ared memory is divided into regions,
  and a peer m= ay allocate one or more regions from the same shared memory
&= nbsp; device.
3. Permission control: The permission of e= ach region can be set seperately.

Looks like virtio-ROCE

https://lore.kernel.org/all/20220511095900.343-1-xieyongji@bytedance.com/= T/

and virtio-vhost-user can satisfy the r= equirement?


# Virtio ism device

ISM devi= ces provide the ability to share memory between different guests on a
host. A guest's memory got from ism device can be shared with mult= iple peers at
the same time. This shared relationship can be = dynamically created and released.

The shared m= emory obtained from the device is divided into multiple ism regions
for share. ISM device provides a mechanism to notify other ism regio= n referrers
of content update events.

# Usage (SMC as example)

Maybe there is = one of possible use cases:

1. SMC calls the in= terface ism_alloc_region() of the ism driver to return the
&n= bsp; location of a memory region in the PCI space and a token.
2. The ism driver mmap the memory region and return to SMC with the = token
3. SMC passes the token to the connected peer
3. the peer calls the ism driver interface ism_attach_region(token) t= o
  get the location of the PCI space of the shared= memory


# About hot plugging of= the ism device

  Hot plugging of de= vices is a heavier, possibly failed, time-consuming, and
&nbs= p; less scalable operation. So, we don't plan to support it for now.
# Comparison with existing technology

## ivshmem or ivshmem 2.0 of Qemu

  1. ivshmem 1.0 is a large piece of memory that can be = seen by all devices that
  use this VM, so the secu= rity is not enough.

  2. ivshmem 2.0= is a shared memory belonging to a VM that can be read-only by all
  other VMs that use the ivshmem 2.0 shared memory device, = which also does not
  meet our needs in terms of se= curity.

## vhost-pci and virtiovhostuser

  Does not support dynamic allocation and= therefore not suitable for SMC.

= I think this is an implementation issue, we can support VHOST IOTLB
message then the regions could be added/removed on demand.


1. After the attacker conn= ects with the victim, if the attacker does not
  de= reference memory, the memory will be occupied under virtiovhostuser. In the=
  case of ism devices, the victim can directly rel= ease the reference, and the
  maliciously reference= d region only occupies the attacker's resources
=
Let's define the security boundary here. E.g do we trust the= device or
not? If yes, in the case of virtiovhostuser, can w= e simple do
VHOST_IOTLB_UNMAP then we can safely release the = memory from the
attacker.
Thanks, = Jason:)
In our the design, there are several roles involved:<= br class=3D"">1) a virtio-ism-smc front-end driver
2) a Virti= o-ism backend device driver and its associated vmm
3) a globa= l device manager
4) a group of remote/peer virtio-ism backend= devices/vmms
5) a group of remote/peer virtio-ism-smc front-= end drivers

Among which , we treat 1, 2 and 3 = as trusted, 4 and 5 as untrusted.


It looks to m= e VIRTIO_ISM_PERM_MANAGE violates what you've described here. E.g what happ= ens if 1 grant this permission to 5?

My mistake, missed = some background information.
We split the communication into cont= rol plain and data plain. The above thread model is for control plain. Once= a peer has been granted permissions to access a memory region, it becomes = trusted to read/write the memory region.


=
Because 4 and 5 are tr= usted, we can=E2=80=99t guarantee that IOTLB Invalidate requests have been = executed as expected.


Interesting, I wonder how= this is guaranteed by ISM. Anything that can work for ISM but not IOTLB? N= ote that the only difference for me is the device API. We can hook anything= that works for ISM to IOTLB.
The difference is who is the resource owner.
For IOTLB based design, guest vm is the resource owner, so it could only r= eclaim a shared memory region from peers.
For our design, the dev= ice manager is the resource owner, guest vm allocate/free memory region fro= m the device manager. So for each SMC connection, a new memory region is al= located/freed, a memory region won=E2=80=99t be reused for SMCs connections= with different (local, peer) pairs.


Say when disconnecting a= n SMC connection, a malicious peer may ignore the IOTLB invalidation reques= t and keep access the shared memory region.

We= have considered the IOTLB based design but encountered several issues:
1) It depends on the way to provision guest vm memory. We need a= memory resource descriptor to support vhost-user IOTLB messages, thus can= =E2=80=99t support anonymous memory based vm.

H= ypervisor (Qemu) is free to hook IOTLB message to any kind of memory backen= d, isn't? E.g Qemu can choose to implement IOTLB by its own instead of forw= arding it to another VM.
A memory resource file descriptor is needed to share a memo= ry region among VMs. If the guest memory is provisioned by anonymous mapped= memory, it can=E2=80=99t be shared to other VMs.
In other words,= vhost may work with process virtual address, ghost-user always works with = file descriptors.  



2) Lack of fine-grain acces= s control of memory resource descriptor. When send a memory resource descri= ptor to an untrusted peer, we can=E2=80=99t enforce region based access con= trol. Memfd supports file level seal operations, but still lack of region b= ased permission control. Hugetlbfs based fd doesn=E2=80=99t support seal at= all.


So in the above, you said 4 and 5 are unt= rusted. If yes how you can enforce regioned based access control (the memor= y is still mapped by the untrsuted VMM)? And again, virtio-vhost-user is no= t limited to memfd/hugetlbfs, it can do want you've done in your protoype (= hooking to /dev/shm).
Let=E2=80=99s take an example. Say vmm provisions 1GB memory t= o guest A through a memfd, among which 1MB is allocated by guest A as share= d memory and want to share it with vm B.
We lack of technologies = to share the memfd to guest B and restrict guest B to only access the share= d 1MB region. 



3) Lack of reliable way to reclaim granted = access permissions from untrusted peers, as stated above.


It would be better to explain how this "reclaim" works.

4) How implement resource a= ccounting. Say a vm has shared some memory regions from peers, and those pe= ers exited unexpectedly, then those shared memory will be accounted to the = victim vm, and may cause unexpected OOM.

Based= on the above consideration, we adopted another design and introduced the d= evice manager to solve above issues:
1) the device manager is= the owner of memory buffers.


I don't see the d= efinition "device manager" in your proposal, this needs to be clarified in = both the spec and the changelog or the cover letter.
We will add the description abo= ut =E2=80=9Cdevice manager=E2=80=9D in next version.



2) the dev= ice manager creates a memfd for each memory buffer/region, and configure SE= ALs according to requested access permissions.
<= br style=3D"caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 1= 8px; font-style: normal; font-variant-caps: normal; font-weight: 400; lette= r-spacing: normal; text-align: start; text-indent: 0px; text-transform: non= e; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; = text-decoration: none;" class=3D"">
= Ok, but this seems not what you've implemented in your qemu prototype?
Not yet, we are st= ill working on it.
BTW, is rust/golang based prototype acceptable= to Qemu community? Or it must be written in C?



3) When a = guest vm reclaims a shared memory buffer, the device manager will provision= a new memfd to the guest vm.


How can this be d= one for the untrusted peers?
Please refer to above explanations:)



And it= will take the responsibility to reclaim the old buffer from peer and event= ually release the old buffer.
4) Simplify the control communi= cation channel. Every guest vm only needs to talk with the device manager a= nd no need to discover and communicate with other peers.


Not sure but it's better not mandate any model in the application= layer.
Made a= nother mistake here, it should be stated as

=
Every backend device driver only needs to talk with the device manager= and no need to discover and communicate with other peers.

It may help to give more information about the design g= oal we want to achieve:
1) extend pci-ivshmem v1/v2 to support mo= re complex usage scenarios like SMC.
2) virtio-ism is a generic s= hmem device and should support pci-ivshmem v1/v2 usage scenarios. 
3) SMC is the first/example usage scenario.
4) support com= munications among vm<=E2=80=94>vm, vm<=E2=80=94>run container, = runC container<=E2=80=94>runC container. 

Design goal 4 forces us to design virtio-ism instead of pci-i= vshmem v2/v3. 
The virtio/vDPA/vDUSE stack provides us a per= fect way to implement userspace shmem device to support runC containers.
And currently there=E2=80=99s no way to emulate PCI devices in user= space. 



Thanks



Thanks,
Gerry



2. The ism device of a VM can be shared with multiple = (1000+) VMs at the same
  time, which is a challeng= e for virtiovhostuser

Please elab= orate more the the challenges, anything make
virtiovhostuser = different?


3. The sharing relationship of ism is dynamically increase= d, and virtiovhostuser
  determines the sharing rel= ationship at startup.

Not necessa= rily with IOTLB API?


4. For security issues, the device under virtiov= hostuser may mmap more memory,
  while ism only map= s one region to other devices

Wit= h VHOST_IOTLB_MAP, the map could be done per region.

Thanks


Thanks.


Thanks


# Design
  This is a structure diagram based on ism sharing be= tween two vms.

   |------------= ---------------------------------------------------------------------------= ----------------------|
   | |----------------= --------------------------------|       |----= --------------------------------------------| |
  &= nbsp;| | Guest           =             &nb= sp;            =       |       |= Guest            &n= bsp;            = ;            &n= bsp;    | |
   | |  &= nbsp;           &nbs= p;            &= nbsp;           &nbs= p;        |     = ;  |           =             &nb= sp;            =             | |=
   | |   ----------------  &nb= sp;            =             &nb= sp; |       |   --------------= --             =             &nb= sp;   | |
   | |   | =    driver    |     [M1] &= nbsp; [M2]   [M3]      |  &nbs= p;    |   |    driver  &n= bsp; |           &nb= sp; [M2]   [M3]     | |
&n= bsp;  | |   ----------------     &n= bsp; |      |      |=       |       = |   ----------------        &n= bsp;      |      | &= nbsp;    | |
   | |  =   |cq|          &nbs= p;       |map   |map  &nb= sp;|map    |       |  &nb= sp; |cq|           &= nbsp;           &nbs= p;  |map   |map   | |
 &nb= sp; | |    |  |      &nbs= p;           |  = ;    |      |   &nbs= p;   |       |   &nb= sp;|  |           &n= bsp;            = ;  |      |     &nbs= p;| |
   | |    |  |  = ;            &n= bsp; -------------------     |    &= nbsp;  |    |  |     &nbs= p;          -------------= -------    | |
   | |----|--|--= --------------|  device memory  |-----|     &= nbsp; |----|--|----------------|  device memory   |----= | |
   | |    |  |  &= nbsp;           &nbs= p; -------------------     |    &nb= sp;  |    |  |      =           ---------------= -----    | |
   | |   = ;            &n= bsp;            = ;    |         =       |       |=             &n= bsp;            = ;     |        =         | |
 &nb= sp; | |           &n= bsp;            = ;        |     =           |   &= nbsp;   |         &n= bsp;            = ;         |    =             | |=
   | | Qemu      &nb= sp;            =         |     &= nbsp;         |   &n= bsp;   | Qemu        &nbs= p;            &= nbsp;    |        &n= bsp;       | |
  = ; | |--------------------------------+---------------|   &nb= sp;   |-------------------------------+----------------| |   |        =             &nb= sp;            =  |            &= nbsp;           &nbs= p;            &= nbsp;           &nbs= p;     |        = ;          |
   |          = ;            &n= bsp;           | &nb= sp;            =             &nb= sp;            =             &nb= sp;   |         &nbs= p;        |
 &nb= sp; |           &nbs= p;            &= nbsp;         |---------------= ---------------+------------------------|      &nb= sp;           |
   |        &nb= sp;            =             &nb= sp;            =             &nb= sp;      |      &nbs= p;            &= nbsp;           &nbs= p;           |
   |        &nbs= p;            &= nbsp;           &nbs= p;            &= nbsp;           &nbs= p;      |       = ;            &n= bsp;            = ;           |
   |         = ;            &n= bsp;            = ;            &n= bsp;    --------------------------    &n= bsp;            = ;            &n= bsp;  |
   |     = ;            &n= bsp;            = ;            &n= bsp;         | M1 |  &nbs= p;| M2 |   | M3 |         = ;            &n= bsp;           |
   |        &n= bsp;            = ;            &n= bsp;            = ;     --------------------------    = ;            &n= bsp;            = ;   |
   |    &n= bsp;            = ;            &n= bsp;            = ;            &n= bsp;            = ;            &n= bsp;            = ;            &n= bsp;    |
   | HOST  =             &nb= sp;            =             &nb= sp;            =             &nb= sp;            =             &nb= sp;            =   |
   -----------------------------= ---------------------------------------------------------------------------= -------

# POC code

  Kernel: https://github.com/fengidri/linux-kernel-= virtio-ism/commits/ism
  Qemu: https://github.com/feng= idri/qemu/commits/ism

If there are any pro= blems, please point them out.

Hope to hear fro= m you, thank you.

[1] https://p= rojectacrn.github.io/latest/tutorials/enable_ivshmem.html
[2] https://= dl.acm.org/doi/10.1145/2847562
[3] https://hal.archive= s-ouvertes.fr/hal-00368622/document
[4] https://lwn.net/Articles/711071/=
[5] https://lore.kernel.org/ne= tdev/20220720170048.20806-1-tonylu@linux.alibaba.com/T/
<= br class=3D"">
Xuan Zhuo (2):
 Reserve dev= ice id for ISM device
 virtio-ism: introduce new device = virtio-ism

content.tex    | &nb= sp; 3 +
virtio-ism.tex | 340 +++++++++++++++++++++++++++= ++++++++++++++++++++++
2 files changed, 343 insertions(+)
create mode 100644 virtio-ism.tex

-= -
2.32.0.3.g01195cf9f


---------------------------------------------------------------------=
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org <mailto:virtio-dev-unsubscribe@lists.= oasis-open.org>
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org <mailto:virtio-dev-help@lis= ts.oasis-open.org>



---------------------------------------= ------------------------------
To unsubscribe, e-mail:virtio-d= ev-unsubscribe@lists.oasis-open.org
For additional comman= ds, e-mail:virtio-dev-help@lists.oasis-open.org

--Apple-Mail=_235BC977-D379-4BFA-A2EC-F3DC7FB5D296--