On Tue, Dec 01, 2020 at 02:45:18PM +0100, Stefano Garzarella wrote: > On Tue, Dec 01, 2020 at 12:59:43PM +0000, Stefan Hajnoczi wrote: > > On Fri, Nov 20, 2020 at 07:31:08AM -0500, Michael S. Tsirkin wrote: > > > On Fri, Nov 20, 2020 at 08:45:49AM +0000, Stefan Hajnoczi wrote: > > > > On Thu, Nov 19, 2020 at 5:08 PM Stefan Hajnoczi wrote: > > > > > > > > > > On Thu, Nov 19, 2020 at 4:43 PM Mike Christie > > > > > wrote: > > > > > > > > > > > > On 11/19/20 10:24 AM, Stefan Hajnoczi wrote: > > > > > > > On Thu, Nov 19, 2020 at 4:13 PM Mike Christie > > > > > > > wrote: > > > > > > >> > > > > > > >> On 11/19/20 8:46 AM, Michael S. Tsirkin wrote: > > > > > > >>> On Wed, Nov 18, 2020 at 11:31:17AM +0000, Stefan Hajnoczi wrote: > > > > > > > struct vhost_run_worker_info { > > > > > > > struct timespec *timeout; > > > > > > > sigset_t *sigmask; > > > > > > > > > > > > > > /* List of virtqueues to process */ > > > > > > > unsigned nvqs; > > > > > > > unsigned vqs[]; > > > > > > > }; > > > > > > > > > > > > > > /* This blocks until the timeout is reached, a signal is received, or > > > > > > > the vhost device is destroyed */ > > > > > > > int ret = ioctl(vhost_fd, VHOST_RUN_WORKER, &info); > > > > > > > > > > > > > > As you can see, userspace isn't involved with dealing with the > > > > > > > requests. It just acts as a thread donor to the vhost driver. > > > > > > > > > > > > > > We would want the VHOST_RUN_WORKER calls to be infrequent to avoid the > > > > > > > penalty of switching into the kernel, copying in the arguments, etc. > > > > > > > > > > > > I didn't get this part. Why have the timeout? When the timeout expires, > > > > > > does userspace just call right back down to the kernel or does it do > > > > > > some sort of processing/operation? > > > > > > > > > > > > You could have your worker function run from that ioctl wait for a > > > > > > signal or a wake up call from the vhost_work/poll functions. > > > > > > > > > > An optional timeout argument is common in blocking interfaces like > > > > > poll(2), recvmmsg(2), etc. > > > > > > > > > > Although something can send a signal to the thread instead, > > > > > implementing that in an application is more awkward than passing a > > > > > struct timespec. > > > > > > > > > > Compared to other blocking calls we don't expect > > > > > ioctl(VHOST_RUN_WORKER) to return soon, so maybe the timeout will > > > > > rarely be used and can be dropped from the interface. > > > > > > > > > > BTW the code I posted wasn't a carefully thought out proposal :). The > > > > > details still need to be considered and I'm going to be offline for > > > > > the next week so maybe someone else can think it through in the > > > > > meantime. > > > > > > > > One final thought before I'm offline for a week. If > > > > ioctl(VHOST_RUN_WORKER) is specific to a single vhost device instance > > > > then it's hard to support poll-mode (busy waiting) workers because > > > > each device instance consumes a whole CPU. If we stick to an interface > > > > where the kernel manages the worker threads then it's easier to share > > > > workers between devices for polling. > > > > > > > > > Yes that is the reason vhost did its own reason in the first place. > > > > > > > > > I am vaguely thinking about poll(2) or a similar interface, > > > which can wait for an event on multiple FDs. > > > > I can imagine how using poll(2) would work from a userspace perspective, > > but on the kernel side I don't think it can be implemented cleanly. > > poll(2) is tied to the file_operations->poll() callback and > > read/write/error events. Not to mention there isn't a way to substitue > > the vhost worker thread function instead of scheduling out the current > > thread while waiting for poll fd events. > > > > But maybe ioctl(VHOST_WORKER_RUN) can do it: > > > > struct vhost_run_worker_dev { > > int vhostfd; /* /dev/vhost-TYPE fd */ > > unsigned nvqs; /* number of virtqueues in vqs[] */ > > unsigned vqs[]; /* virtqueues to process */ > > }; > > > > struct vhost_run_worker_info { > > struct timespec *timeout; > > sigset_t *sigmask; > > > > unsigned ndevices; > > struct vhost_run_worker_dev *devices[]; > > }; > > > > In the simple case userspace sets ndevices to 1 and we just handle > > virtqueues for the current device. > > > > In the fancier shared worker thread case the userspace process has the > > vhost fds of all the devices it is processing and passes them to > > ioctl(VHOST_WORKER_RUN) via struct vhost_run_worker_dev elements. > > Which fd will be used for this IOCTL? One of the 'vhostfd' or we should > create a new /dev/vhost-workers (or something similar)? > > Maybe the new device will be cleaner and can be reused also for other stuff > (I'm thinking about vDPA software devices). > > > > > From a security perspective it means the userspace thread has access to > > all vhost devices (because it has their fds). > > > > I'm not sure how the mm is supposed to work. The devices might be > > associated with different userspace processes (guests) and therefore > > have different virtual memory. > > Maybe in this case we should do something similar to io_uring SQPOLL kthread > where kthread_use_mm()/kthread_unuse_mm() is used to switch virtual memory > spaces. > > After writing, I saw that we already do it this in the vhost_worker() in > drivers/vhost/vhost.c > > > > > Just wanted to push this discussion along a little further. I'm buried > > under emails and probably wont be very active over the next few days. > > > > I think ioctl(VHOST_WORKER_RUN) might be the right way and also maybe the > least difficult one. Sending an ioctl API proposal email could help progress this discussion. Interesting questions: 1. How to specify which virtqueues to process (Mike's use case)? 2. How to process multiple devices? Stefan