Re: [RFC PATCH 00/13] x86 User Interrupts support

From: Sohil Mehta <sohil.mehta@intel.com>
To: Chrisma Pakha <cpakha@andrew.cmu.edu>
Cc: Tony Luck <tony.luck@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H . Peter Anvin" <hpa@zytor.com>,
	"Andy Lutomirski" <luto@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	Christian Brauner <christian@brauner.io>,
	Peter Zijlstra <peterz@infradead.org>,
	Shuah Khan <shuah@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	Jonathan Corbet <corbet@lwn.net>, Ashok Raj <ashok.raj@intel.com>,
	Jacob Pan <jacob.jun.pan@linux.intel.com>,
	Gayatri Kammela <gayatri.kammela@intel.com>,
	Zeng Guang <guang.zeng@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Randy E Witt <randy.e.witt@intel.com>,
	Ravi V Shankar <ravi.v.shankar@intel.com>,
	Ramesh Thomas <ramesh.thomas@intel.com>,
	<linux-api@vger.kernel.org>, <linux-arch@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-kselftest@vger.kernel.org>,
	<seth@cmu.edu>, <x86@kernel.org>
Subject: Re: [RFC PATCH 00/13] x86 User Interrupts support
Date: Thu, 6 Jan 2022 18:08:55 -0800	[thread overview]
Message-ID: <7c1038de-0bae-3b87-d4e4-8a30a910ebdd@intel.com> (raw)
In-Reply-To: <9b5a8cf9-5ce6-c6ba-951f-757135d74492@andrew.cmu.edu>

Hi Chrisma,

On 12/22/2021 8:17 AM, Chrisma Pakha wrote:
> 
> The following is our understanding of the proposed User Interrupt.
> 

Thank you for giving this some thought.

> 
> We have been exploring how user-level interrupts (UIs) can be used to
> improve performance and programmability in several different areas:
> e.g., parallel programming, memory management, I/O, and floating-point
> libraries.

Can you please share more details on this? It would really help improve 
the API design.

> 
> # Current Use Cases
> 
> The Current RFC is focused on sending an interrupt from one user-space
> thread (UST) to another user-space thread (UST2UST).  These threads
> could be in different processes, as long as the sender has access to
> the receiver's User Interrupt File Descriptor (uifd).  Based on our
> understanding, UIs are currently targeted as a low overhead
> alternative for the current IPC mechanisms.
> 

That's correct.

> # Preparing for future use cases
> > If someone could point out an example for Kernel to
> user-space thread (K2UST) UI, we would appreciate it.
> 

The idea here is improve the kernel-to-user event notification latency. 
Theoretically, this can be useful when the kernel sees event completion 
on one cpu but it want to signal (notify) a thread actively running on 
some other CPU. The receiver thread can save some cycles by avoiding 
ring transitions to receive the event.

IO_URING is one of the examples for kernel-to-user event notifications. 
We are evaluating whether providing a UINTR based completion mechanism 
can have benefit over eventfd based completions. The benefits in 
practice are yet to be measured and proven.

> In our work, we have also been exploring precise UIs from the
> currently running thread.  We call these CPU to UST (CPU2UST) UIs.
> For example, a SIGSEGV generated by writing to a read-only page, a
> SIGFPE generated by dividing a number by zero.
> 

It is definitely possible in future to delivery CPU events as User 
Interrupts. The hardware architecture for this is still being worked on 
internally.

Though our focus isn't on exceptions being delivered as User Interrupts. 
Do you have details on what type of benefit is expected?

> - QUESTION: Is there is a rough draft/plan that we can refer to that 
> describes the
> current thinking on these three cases.
> 
> - QUESTION: Are there use cases for K2UST, or is K2UST the same as CPU2UST?
> 

No, K2UST isn't the same as CPU2UST. We would expect limited benefits 
from K2UST but on the other hand CPU2UST can provide significant speedup 
since it avoids the kernel completely.

Unfortunately, due to the large scope of the feature, the hardware 
architecture development is happening in stages. I don't have detailed 
plans for each of the sources of User Interrupts.

Here is our rough plan:

1. Provide a common infrastructure to receive User Interrupts. This is 
independent of the source of the interrupt. The intention here is to 
keep the software APIs generic and extendable so that future sources can 
be added without causing much disturbance to the older APIs.

2. Introduce various sources of User Interrupts in stages:

UST2UST - This RFC. Available in the upcoming Sapphire Rapids processor.

K2UST - Also available in upcoming Sapphire Rapids. Working towards 
proving the value before sending something out.

D2UST - Future processor. Hardware architecture being worked on 
internally. Not much to share right now.

CPU2UST - Future processor. Hardware architecture being worked on 
internally. Not much to share right now.

> # Basic Understanding
> 

The overall description you have mentioned below looks good to me. I 
have added some minor comments for clarification.

Also, the abbreviations that you have used are somewhat different from 
the ones I have used in the patches.

> First, we would like to make sure that our understanding of the 
> terminology and the data structures is correct.
> 
> - User Interrupt Vector (UIV): The identity of the user interrupt.
> - User Interrupt Target Table (UITT):
>    This allows the sender to locate the "address" of the receiver 
> through the uifd.

The UITT refers to the 'UPID' address which is different from the uifd 
that you mention below.

> Below outlines our understanding of the current API for UIs.
> 

All of the statements below seem accurate.

However, some of the restrictions below are due to hardware design and 
some are mainly due to the software implementation. The software design 
and APIs might change significantly as this patch series evolves.

Please feel free to provide input wherever you think the APIs can be 
improved.

> - Each thread that can receive UIs has exactly one handler
>    registered with `uintr_register_handler` (a syscall).
> - Each thread that registers a handler calls `uintr_create_fd` for
>    every user-level interrupt vector (UIV) that they expect to receive.
> - The only information delivered to the handler is the UIV.
> - There are 64 UIVs that can be used per thread.

Though only one generic handler is registered with the hardware, an 
application can choose to implement 64 unique sub-handlers in user space 
based on each unique UIV.

> - A thread that wants to send a UI must register the receiver's uifd 
> with `uintr_register_sender`  (a syscall).
>    This returns an index the sender uses to locate the receiver.
> - `_senduipi(index)` sends a user interrupt to a particular destination.
>    The sender's UITT and index determine the destination.
> - A thread uses `_stui` (and `_clui`) to enable (and disable) the 
> reception of UIs.
> - As for now, there is no mechanism to mask a particular UIV.
> - A UI is delivered to the receiver immediately only if it is currently 
> running.
> - If a thread executes the `uintr_wait()`, it will be scheduled only 
> after receiving a UI.
>    There is no guarantee on the delay between the processor receiving 
> the UI and when the thread is scheduled.
> - If a thread is the target of a UI and another thread is running, or 
> the target thread is blocked in the kernel,
>    then the target thread will handle the UI when it is next scheduled.
> - Ordinary interrupts (interrupt delivered with CPL=0) have a higher 
> priority over user interrupts.
> - UI handler only saves general-purpose registers (e.g., do not save 
> floating-point registers).

The saving and restoring of the registers is done by gcc when the muintr 
flag along with the 'interrupt' attribute is used. Applications can 
choose to save floating point registers as part of the interrupt handler 
as well.

To make it easier for applications we are working on implementing a thin 
library that can help with some of this common functionality like saving 
floating point registers or redirecting to 64 sub-handlers.

> - User Interrupts with higher UIV are given a higher priority than those 
> with smaller UIV.
> 
> ## Private UITT
> 
> The Current RFC focuses on a private UITT where each thread has its own
> UITT.  Thus, different threads executing `_senduipi(index1)` with the
> same `index1` may cause different receiver threads to be interrupted.
> 

That's right.

> In many cases, the receiver of an interrupt needs to know which thread
> sent the interrupt. If we understand the proposal correctly, there are
> only 64 user-level interrupt vectors (UIVs), and the UIV is the only
> information transmitted to the receiver. The UIV itself allows the
> receiver to distinguish different senders through careful management
> of the receiver's UIV.
> 

That's correct. User Interrupts mainly provide a door bell mechanism 
with the actual data expected to be shared through some existing mechanism.

If multiple senders want to share the same interrupt vector then they 
would have to rely on some sort of shared memory (or similar) mechanism 
to relay the relevant information to the receiver. This would likely 
come with some latency cost.

> - QUESTION: Given the code below where the same UIV is registered twice:
> ```c
>    uintr_fd1 = uintr_create_fd(vector1, flags)
>    uintr_fd2 = uintr_create_fd(vector1, flags)
> ```
> Would `uintr_fd1` be the same as `uintr_fd2`, or would it be registered 
> with a different index in the UITT table?

In the current design, if the same thread tries to register the same 
vector again the second uintr_create_fd() would fail with a EBUSY error 
code.

> 
> - QUESTION: If it is registered in a different index, would the
>    receiver be able to distinguish the sender if `uintr_fd1` and
>    `uintr_fd2` are used from two different threads?
> 
> - QUESTION: What is the intended future use of the `flags` argument?
> 

In the uintr_create_fd() call, flags would be used to provide options 
such as O_CLOEXEC. In general, I added flags argument to all the system 
calls to keep them extendable when new boolean options need to be added.

> ## Shared UITT
> 
> In the case of the shared UITT model, all the threads share the same
> UITT and thus, if two different threads execute `_senduipi(index)`
> with the same index, they would both cause an interrupt in the
> same destination/receiver.
> 
> - QUESTION: Since both threads use the same entry (same
>    destination/receiver), does this mean that the receiver will not be
>    able to distinguish the sender of the interrupt?
> 

Yes. However this is true even in case of a private UITT. This isn't 
because the senders used the same UITT index rather it is the result of 
the senders generating the same UIVs.

For example, even if a receiver created 2 FDs with 2 unique vectors.

	uintr_fd1 = uintr_create_fd(vector1, flags)
	uintr_fd2 = uintr_create_fd(vector2, flags)

In case of the a private UITT, both sender threads can register 
themselves with uintr_fd1. They might get different uitt indexes 
returned to them. But when they generate a User interrupt using their 
respective index, the end result would be the same. The receiver will 
see the same vector1 being generated. There is no way for the receiver 
to distinguish the sender without some additional information being 
shared somewhere.

> # Multi-threaded parallel programming example
> 
> One of the uses for UIs that we have been exploring is combining the
> message-passing and shared memory models for parallel programming.  In
> our approach, message-passing is used for synchronization and shared
> memory for data sharing.  The message passing part of the programming
> pattern is based loosely on Active Messages (See ISCA92), where a
> particular thread can turn off/on interrupts to ignore incoming
> messages so they can execute critical sections without having to
> notify any other threads in the system.
> 

This look like a good fit for the User IPI (UST2UST) implementation in 
this RFC. Have you had a chance to evaluate the current API design for 
this usage?

Also, is any of the above work publicly available?

> - QUESTION: Is there any data on the performance impact of `_stui` and 
> `_clui`?
> 

_stui and _clui are expected to have very minimal overhead since they 
only modify a local flag. I'll try to measure this next time I am doing 
some performance measurement.

Thanks,
Sohil