linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 00/13] x86 User Interrupts support
@ 2021-09-13 20:01 Sohil Mehta
  2021-09-13 20:01 ` [RFC PATCH 01/13] x86/uintr/man-page: Include man pages draft for reference Sohil Mehta
                   ` (17 more replies)
  0 siblings, 18 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

User Interrupts Introduction
============================

User Interrupts (Uintr) is a hardware technology that enables delivering
interrupts directly to user space.

Today, virtually all communication across privilege boundaries happens by going
through the kernel. These include signals, pipes, remote procedure calls and
hardware interrupt based notifications. User interrupts provide the foundation
for more efficient (low latency and low CPU utilization) versions of these
common operations by avoiding transitions through the kernel.

In the User Interrupts hardware architecture, a receiver is always expected to
be a user space task. However, a user interrupt can be sent by another user
space task, kernel or an external source (like a device).

In addition to the general infrastructure to receive user interrupts, this
series introduces a single source: interrupts from another user task.  These
are referred to as User IPIs.

The first implementation of User IPIs will be in the Intel processor code-named
Sapphire Rapids. Refer Chapter 11 of the Intel Architecture instruction set
extensions for details of the hardware architecture [1].

Series-reviewed-by: Tony Luck <tony.luck@intel.com>

Main goals of this RFC
======================
- Introduce this upcoming technology to the community.
This cover letter includes a hardware architecture summary along with the
software architecture and kernel design choices. This post is a bit long as a
result. Hopefully, it helps answer more questions than it creates :) I am also
planning to talk about User Interrupts next week at the LPC Kernel summit.

- Discuss potential use cases.
We are starting to look at actual usages and libraries (like libevent[2] and
liburing[3]) that can take advantage of this technology. Unfortunately, we
don't have much to share on this right now. We need some help from the
community to identify usages that can benefit from this. We would like to make
sure the proposed APIs work for the eventual consumers.

- Get early feedback on the software architecture.
We are hoping to get some feedback on the direction of overall software
architecture - starting with User IPI, extending it for kernel-to-user
interrupt notifications and external interrupts in the future. 

- Discuss some of the main architecture opens.
There is lot of work that still needs to happen to enable this technology. We
are looking for some input on future patches that would be of interest. Here
are some of the big opens that we are looking to resolve.
* Should Uintr interrupt all blocking system calls like sleep(), read(),
  poll(), etc? If so, should we implement an SA_RESTART type of mechanism
  similar to signals? - Refer Blocking for interrupts section below.

* Should the User Interrupt Target table (UITT) be shared between threads of a
  multi-threaded application or maybe even across processes? - Refer Sharing
  the UITT section below.

Why care about this? - Micro benchmark performance
==================================================
There is a ~9x or higher performance improvement using User IPI over other IPC
mechanisms for event signaling.

Below is the average normalized latency for a 1M ping-pong IPC notifications
with message size=1.

+------------+-------------------------+
| IPC type   |   Relative Latency      |
|            |(normalized to User IPI) |
+------------+-------------------------+
| User IPI   |                     1.0 |
| Signal     |                    14.8 |
| Eventfd    |                     9.7 |
| Pipe       |                    16.3 |
| Domain     |                    17.3 |
+------------+-------------------------+

Results have been estimated based on tests on internal hardware with Linux
v5.14 + User IPI patches.

Original benchmark: https://github.com/goldsborough/ipc-bench
Updated benchmark: https://github.com/intel/uintr-ipc-bench/tree/linux-rfc-v1

*Performance varies by use, configuration and other factors.

How it works underneath? - Hardware Summary
===========================================
User Interrupts is a posted interrupt delivery mechanism. The interrupts are
first posted to a memory location and then delivered to the receiver when they
are running with CPL=3.

Kernel managed architectural data structures
--------------------------------------------
UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector
information and notification state (like an ongoing notification, suppressed
notifications).

UITT: User Interrupt Target Table - Stores UPID pointer and vector information
for interrupt routing on the sender side. Referred by the senduipi instruction.

The interrupt state of each task is referenced via MSRs which are saved and
restored by the kernel during context switch.

Instructions
------------
senduipi <index> - send a user IPI to a target task based on the UITT index.

clui - Mask user interrupts by clearing UIF (User Interrupt Flag).

stui - Unmask user interrupts by setting UIF.

testui - Test current value of UIF.

uiret - return from a user interrupt handler.

User IPI
--------
When a User IPI sender executes 'senduipi <index>', the hardware refers the
UITT table entry pointed by the index and posts the interrupt vector (63-0)
into the receiver's UPID.

If the receiver is running (CPL=3), the sender cpu would send a physical IPI to
the receiver's cpu. On the receiver side this IPI is detected as a User
Interrupt. The User Interrupt handler for the receiver is invoked and the
vector number (63-0) is pushed onto the stack.

Upon execution of 'uiret' in the interrupt handler, the control is transferred
back to instruction that was interrupted.

Refer Chapter 11 of the Intel Architecture instruction set extensions [1] for
more details.

Application interface - Software Architecture
=============================================
User Interrupts (Uintr) is an opt-in feature (unlike signals). Applications
wanting to use Uintr are expected to register themselves with the kernel using
the Uintr related system calls. A Uintr receiver is always a userspace task. A
Uintr sender can be another userspace task, kernel or a device.

1) A receiver can register/unregister an interrupt handler using the Uintr
receiver related syscalls. 
		uintr_register_handler(handler, flags)
		uintr_unregister_handler(flags)

2) A syscall also allows a receiver to register a vector and create a user
interrupt file descriptor - uintr_fd. 
		uintr_fd = uintr_create_fd(vector, flags)

Uintr can be useful in some of the usages where eventfd or signals are used for
frequent userspace event notifications. The semantics of uintr_fd are somewhat
similar to an eventfd() or the write end of a pipe.

3) Any sender with access to uintr_fd can use it to deliver events (in this
case - interrupts) to a receiver. A sender task can manage its connection with
the receiver using the sender related syscalls based on uintr_fd.
		uipi_index = uintr_register_sender(uintr_fd, flags)

Using an FD abstraction provides a secure mechanism to connect with a receiver.
The FD sharing and isolation mechanisms put in place by the kernel would extend
to Uintr as well. 

4a) After the initial setup, a sender task can use the SENDUIPI instruction
along with the uipi_index to generate user IPIs without any kernel
intervention.
		SENDUIPI <uipi_index>

If the receiver is running (CPL=3), then the user interrupt is delivered
directly without a kernel transition. If the receiver isn't running the
interrupt is delivered when the receiver gets context switched back. If the
receiver is blocked in the kernel, the user interrupt is delivered to the
kernel which then unblocks the intended receiver to deliver the interrupt.

4b) If the sender is the kernel or a device, the uintr_fd can be passed onto
the related kernel entity to allow them to setup a connection and then generate
a user interrupt for event delivery. <The exact details of this API are still
being worked upon.>

For details of the user interface and associated system calls refer the Uintr
man-pages draft:
https://github.com/intel/uintr-linux-kernel/tree/rfc-v1/tools/uintr/manpages.
We have also included the same content as patch 1 of this series to make it
easier to review.

Refer the Uintr compiler programming guide [4] for details on Uintr integration
with GCC and Binutils.

Kernel design choices
=====================
Here are some of the reasons and trade-offs for the current design of the APIs.

System call interface
---------------------
Why a system call interface?: The 2 options we considered are using a char
device at /dev or use system calls (current approach). A syscall approach
avoids exposing a core cpu feature through a driver model. Also, we want to
have a user interrupt FD per vector and share a single common interrupt handler
among all vectors. This seems easier for the kernel and userspace to accomplish
using a syscall based approach.

Data sharing using user interrupts: Uintr doesn't include a mechanism to
share/transmit data. The expectation is applications use existing data sharing
mechanisms to share data and use Uintr only for signaling.

An FD for each vector: A uintr_fd is assigned to each vector to allow fine
grained priority and event management by the receiver. The alternative we
considered was to allocate an FD to the interrupt handler and having that
shared with the sender. However, that approach relies on the sender selecting
the vector and moves the vector priority management to the sender. Also, if
multiple senders want to send unique user interrupts they would need to
coordinate the vector selection amongst them.

Extending the APIs: Currently, the system calls are only extendable using the
flags argument. We can add a variable size struct to some of the syscalls if
needed.

Extending existing mechanisms
-----------------------------
Uintr can be beneficial in some of the usages where eventfd() or signals are
used. Since Uintr is hardware-dependent, thread-specific and bypasses the
kernel in the fast path, it makes extending existing mechanisms harder.

Main issues with extending signals:
Signal handlers are defined significantly differently than a User interrupt
handler. An application needs to save/restore registers in a user interrupt
handler and call uiret to return from it. Also, signals can be process directed
(or thread directed) but user interrupts are always thread directed.

Comparison of signals with User Interrupts:
+=====================+===========================+===========================+
|                     | Signals                   | User Interrupts           |
+=====================+===========================+===========================+
| Stacks              | Has alt stacks            | Uses application stack    |
|                     |                           | (alternate stack option   |
|                     |                           | not yet enabled)          |
+---------------------+---------------------------+---------------------------+
| Registers state     | Kernel manages incl.      | App responsible (Use GCC  |
|                     | FPU/XSTATE area           | 'interrupt' attribute for |
|                     |                           | general purpose registers)|
+---------------------+---------------------------+---------------------------+
| Blocking/Masking    | sigprocmask(2)/sa_mask    | CLUI instruction (No per  |
|                     |                           | vector masking)           |
+---------------------+---------------------------+---------------------------+
| Direction           | Uni-directional           | Uni-directional           |
+---------------------+---------------------------+---------------------------+
| Post event          | kill(), signal(),         | SENDUIPI <index> - index  |
|                     | sigqueue(), etc.          | derived from uintr_fd     |
+---------------------+---------------------------+---------------------------+
| Target              | Process-directed or       | Thread-directed           |
|                     | thread-directed           |                           |
+---------------------+---------------------------+---------------------------+
| Fork/inheritance    | Empty signal set          | Nothing is inherited      |
+---------------------+---------------------------+---------------------------+
| Execv               | Pending signals preserved | Nothing is inherited      |
+---------------------+---------------------------+---------------------------+
| Order of delivery   | Undetermined              | High to low vector numbers|
| for multiple signals|                           |                           |
+---------------------+---------------------------+---------------------------+
| Handler re-entry    | All signals except the    | No interrupts can cause   |
|                     | one being handled         | handler re-entry.         |
+---------------------+---------------------------+---------------------------+
| Delivery feedback   | 0 or -1 based on whether  | No feedback on whether the|
|                     | the signal was sent       | interrupt was sent or     |
|                     |                           | received.                 |
+---------------------+---------------------------+---------------------------+

Main issues with extending eventfd():
eventfd() has a counter value that is core to the API. User interrupts can't
have an associated counter since the signaling happens at the user level and
the hardware doesn't have a memory counter mechanism. Also, eventfd can be used
for bi-directional signaling where as uintr_fd is uni-directional.

Comparison of eventfd with uintr_fd:
+====================+======================+==============================+
|                    | Eventfd              | uintr_fd (User Interrupt FD) |
+====================+======================+==============================+
| Object             | Counter - uint64     | Receiver vector information  |
+--------------------+----------------------+------------------------------+
| Post event         | write() to eventfd   | SENDUIPI <index> - index     |
|                    |                      | derived from uintr_fd        |
+--------------------+----------------------+------------------------------+
| Receive event      | read() on eventfd    | Implicit - Handler is        |
|                    |                      | invoked with associated      |
|                    |                      | vector.                      |
+--------------------+----------------------+------------------------------+
| Direction          | Bi-directional       | Uni-directional              |
+--------------------+----------------------+------------------------------+
| Data transmitted   | Counter - uint64     | None                         |
+--------------------+----------------------+------------------------------+
| Waiting for events | Poll() family of     | No per vector wait.          |
|                    | syscalls             | uintr_wait() allows waiting  |
|                    |                      | for all user interrupts      |
+--------------------+----------------------+------------------------------+

Security Model
==============
User Interrupts is designed as an opt-in feature (unlike signals). The security
model for user interrupts is intended to be similar to eventfd(). The general
idea is that any sender with access to uintr_fd would be able to generate the
associated interrupt vector for the receiver task that created the fd.

Untrusted processes
-------------------
The current implementation expects only trusted and cooperating processes to
communicate using user interrupts. Coordination is expected between processes
for a connection teardown. In situations where coordination doesn't happen
(say, due to abrupt process exit), the kernel would end up keeping shared
resources (like UPID) allocated to avoid faults.

Currently, a sender can easily cause a denial of service for the receiver by
generating a storm of user interrupts. A user interrupt handler is invoked with
interrupts disabled, but upon execution of uiret, interrupts get enabled again
by the hardware. This can lead to the handler being invoked again before normal
execution can resume. There isn't a hardware mechanism to mask specific
interrupt vectors. 

To enable untrusted processes to communicate, we need to add a per-vector
masking option through another syscall (or maybe IOCTL). However, this can add
some complexity to the kernel code. A vector can only be masked by modifying
the UITT entries at the source. We need to be careful about races while
removing and restoring the UPID from the UITT.

Resource limits
---------------
The maximum number of receiver-sender connections would be limited by the
maximum number of open file descriptors and the size of the UITT.

The UITT size is chosen as 4kB fixed size arbitrarily right now. We plan to
make it dynamic and configurable in size. RLIMIT_MEMLOCK or ENOMEM should be
triggered when the size limits have been hit.

Main Opens
==========

Blocking for interrupts
-----------------------
User interrupts are delivered to applications immediately if they are running
in userspace. If a receiver task has blocked in the kernel using the placeholder
uintr_wait() syscall, the task would be woken up to deliver the user interrupt.
However, if the task is blocked due to any other blocking calls like read(),
sleep(), etc; the interrupt will only get delivered when the application gets
scheduled again. We need to consider if applications need to receive User
Interrupts as soon as they are posted (similar to signals) when they are
blocked due to some other reason. Adding this capability would likely make the
kernel implementation more complex.

Interrupting system calls using User Interrupts would also mean we need to
consider an SA_RESTART type of mechanism. We also need to evaluate if some of
the signal handler related semantics in the kernel can be reused for User
Interrupts.

Sharing the User Interrupt Target Table (UITT)
----------------------------------------------
The current implementation assigns a unique UITT to each task. This assumes
that User interrupts are used for point-to-point communication between 2 tasks.
Also, this keeps the kernel implementation relatively simple.

However, there are of benefits to sharing the UITT between threads of a
multi-threaded application. One, they would see a consistent view of the UITT.
i.e. SENDUIPI <index> would mean the same on all threads of the application.
Also, each thread doesn't have to register itself using the common uintr_fd.
This would simplify the userspace setup and make efficient use of kernel
memory. The potential downside is that the kernel implementation to allocate,
modify, expand and free the UITT would be more complex.

A similar argument can be made for a set of processes that do a lot of IPC
amongst them. They would prefer to have a shared UITT that lets them target any
process from any process. With the current file descriptor based approach, the
connection setup can be time consuming and somewhat cumbersome. We need to
evaluate if this can be made simpler as well.

Kernel page table isolation (KPTI)
----------------------------------
SENDUIPI is a special ring-3 instruction that makes a supervisor mode memory
access to the UPID and UITT memory. The current patches need KPTI to be
disabled for User IPIs to work. To make User IPI work with KPTI, we need to
allocate these structures from a special memory region that has supervisor
access but it is mapped into userspace. The plan is to implement a mechanism
similar to LDT. 

Processors that support user interrupts are not affected by Meltdown so the
auto mode of KPTI will default to off. Users who want to force enable KPTI will
need to wait for a later version of this patch series to use user interrupts.
Please let us know if you want the development of these patches to be
prioritized (or deprioritized).

FAQs
====
Q: What happens if a process is "surprised" by a user interrupt?
A: For tasks that haven't registered with the kernel and requested for user
interrupts aren't expected or able to receive to user interrupts.

Q: Do user interrupts affect kernel scheduling?
A: No. If a task is blocked waiting for user interrupts, when the kernel
receives a notification on behalf of that task we only put it back on the
runqueue. Delivery of a user interrupt in no way changes the scheduling
priorities of a task.

Q: Does the sender get to know if the interrupt was delivered?
A: No. User interrupts only provides a posted interrupt delivery mechanism. If
applications need to rely on whether the interrupt was delivered they should
consider a userspace mechanism for feedback (like a shared memory counter or a
user interrupt back to the sender).

Q: Why is there no feedback on interrupt delivery?
A: Being a posted interrupt delivery mechanism, the interrupt delivery
happens in 2 steps:
1) The interrupt information is stored in a memory location (UPID).
2) The physical interrupt is delivered to the interrupt receiver.

The 2nd step could happen immediately, after an extended period, or it might
never happen based on the state of the receiver after step 1. (The receiver
could have disabled interrupts, have been context switched out or it might have
crashed during that time.) This makes it very hard for the hardware to reliably
provide feedback upon execution of SENDUIPI.

Q: Can user interrupts be nested?
A: Yes. Using STUI instruction in the interrupt handler would allow new user
interrupts to be delivered. However, there no TPR(thread priority register)
like mechanism to allow only higher priority interrupts. Any user interrupt can
be taken when nesting is enabled.

Q: Can a task receive all pending user interrupts in one go?
A: No. The hardware allows only one vector to be processed at a time. If a task
is interested in knowing all the interrupts that are pending then we could add
a syscall that provides the pending interrupts information.

Q: Do the processes need to be pinned to a cpu?
A: No. User interrupts will be routed correctly to whichever cpu the receiver
is running on. The kernel updates the cpu information in the UPID during
context switch.

Q: Why are UPID and UITT allocated by the kernel?
A: If allocated by user space, applications could misuse the UPID and UITT to
write to unauthorized memory and generate interrupts on any cpu. The UPID and
UITT are allocated by the kernel and accessed by the hardware with supervisor
privilege.

Patch structure for this series
===============================
- Man-pages and Kernel documentation (patch 1,2)
- Hardware enumeration (patch 3, 4)
- User IPI kernel vector reservation (patch 5)
- Syscall interface for interrupt receiver, sender and vector
  management(uintr_fd) (patch 6-12)
- Basic selftests (patch 13)

Along with the patches in this RFC, there are additional tests and samples that
are available at:
https://github.com/intel/uintr-linux-kernel/tree/rfc-v1

Links
=====
[1]: https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
[2]: https://libevent.org/
[3]: https://github.com/axboe/liburing
[4]: https://github.com/intel/uintr-compiler-guide/blob/uintr-gcc-11.1/UINTR-compiler-guide.pdf

Sohil Mehta (13):
  x86/uintr/man-page: Include man pages draft for reference
  Documentation/x86: Add documentation for User Interrupts
  x86/cpu: Enumerate User Interrupts support
  x86/fpu/xstate: Enumerate User Interrupts supervisor state
  x86/irq: Reserve a user IPI notification vector
  x86/uintr: Introduce uintr receiver syscalls
  x86/process/64: Add uintr task context switch support
  x86/process/64: Clean up uintr task fork and exit paths
  x86/uintr: Introduce vector registration and uintr_fd syscall
  x86/uintr: Introduce user IPI sender syscalls
  x86/uintr: Introduce uintr_wait() syscall
  x86/uintr: Wire up the user interrupt syscalls
  selftests/x86: Add basic tests for User IPI

 .../admin-guide/kernel-parameters.txt         |   2 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/user-interrupts.rst         | 107 +++
 arch/x86/Kconfig                              |  12 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   6 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   6 +
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/entry-common.h           |   4 +
 arch/x86/include/asm/fpu/types.h              |  20 +-
 arch/x86/include/asm/fpu/xstate.h             |   3 +-
 arch/x86/include/asm/hardirq.h                |   4 +
 arch/x86/include/asm/idtentry.h               |   5 +
 arch/x86/include/asm/irq_vectors.h            |   6 +-
 arch/x86/include/asm/msr-index.h              |   8 +
 arch/x86/include/asm/processor.h              |   8 +
 arch/x86/include/asm/uintr.h                  |  76 ++
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/cpu/common.c                  |  61 ++
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/fpu/core.c                    |  17 +
 arch/x86/kernel/fpu/xstate.c                  |  20 +-
 arch/x86/kernel/idt.c                         |   4 +
 arch/x86/kernel/irq.c                         |  51 +
 arch/x86/kernel/process.c                     |  10 +
 arch/x86/kernel/process_64.c                  |   4 +
 arch/x86/kernel/uintr_core.c                  | 880 ++++++++++++++++++
 arch/x86/kernel/uintr_fd.c                    | 300 ++++++
 include/linux/syscalls.h                      |   8 +
 include/uapi/asm-generic/unistd.h             |  15 +-
 kernel/sys_ni.c                               |   8 +
 scripts/checksyscalls.sh                      |   6 +
 tools/testing/selftests/x86/Makefile          |  10 +
 tools/testing/selftests/x86/uintr.c           | 147 +++
 tools/uintr/manpages/0_overview.txt           | 265 ++++++
 tools/uintr/manpages/1_register_receiver.txt  | 122 +++
 .../uintr/manpages/2_unregister_receiver.txt  |  62 ++
 tools/uintr/manpages/3_create_fd.txt          | 104 +++
 tools/uintr/manpages/4_register_sender.txt    | 121 +++
 tools/uintr/manpages/5_unregister_sender.txt  |  79 ++
 tools/uintr/manpages/6_wait.txt               |  59 ++
 42 files changed, 2626 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/x86/user-interrupts.rst
 create mode 100644 arch/x86/include/asm/uintr.h
 create mode 100644 arch/x86/kernel/uintr_core.c
 create mode 100644 arch/x86/kernel/uintr_fd.c
 create mode 100644 tools/testing/selftests/x86/uintr.c
 create mode 100644 tools/uintr/manpages/0_overview.txt
 create mode 100644 tools/uintr/manpages/1_register_receiver.txt
 create mode 100644 tools/uintr/manpages/2_unregister_receiver.txt
 create mode 100644 tools/uintr/manpages/3_create_fd.txt
 create mode 100644 tools/uintr/manpages/4_register_sender.txt
 create mode 100644 tools/uintr/manpages/5_unregister_sender.txt
 create mode 100644 tools/uintr/manpages/6_wait.txt


base-commit: 6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 01/13] x86/uintr/man-page: Include man pages draft for reference
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-13 20:01 ` [RFC PATCH 02/13] Documentation/x86: Add documentation for User Interrupts Sohil Mehta
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Included here in plain text format for reference and review.

<Will eventually send the man pages in groff format separately to the
man-pages repository.>

The formatting for the man pages still needs a little bit of work.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 tools/uintr/manpages/0_overview.txt           | 265 ++++++++++++++++++
 tools/uintr/manpages/1_register_receiver.txt  | 122 ++++++++
 .../uintr/manpages/2_unregister_receiver.txt  |  62 ++++
 tools/uintr/manpages/3_create_fd.txt          | 104 +++++++
 tools/uintr/manpages/4_register_sender.txt    | 121 ++++++++
 tools/uintr/manpages/5_unregister_sender.txt  |  79 ++++++
 tools/uintr/manpages/6_wait.txt               |  59 ++++
 7 files changed, 812 insertions(+)
 create mode 100644 tools/uintr/manpages/0_overview.txt
 create mode 100644 tools/uintr/manpages/1_register_receiver.txt
 create mode 100644 tools/uintr/manpages/2_unregister_receiver.txt
 create mode 100644 tools/uintr/manpages/3_create_fd.txt
 create mode 100644 tools/uintr/manpages/4_register_sender.txt
 create mode 100644 tools/uintr/manpages/5_unregister_sender.txt
 create mode 100644 tools/uintr/manpages/6_wait.txt

diff --git a/tools/uintr/manpages/0_overview.txt b/tools/uintr/manpages/0_overview.txt
new file mode 100644
index 000000000000..349538effb15
--- /dev/null
+++ b/tools/uintr/manpages/0_overview.txt
@@ -0,0 +1,265 @@
+UINTR(7)               Miscellaneous Information Manual               UINTR(7)
+
+
+
+NAME
+       Uintr - overview of User Interrupts
+
+DESCRIPTION
+       User Interrupts (Uintr) provides a low latency event delivery and inter
+       process communication mechanism. These events can be delivered directly
+       to userspace without a transition to the kernel.
+
+       In  the  User  Interrupts  hardware  architecture, a receiver is always
+       expected to be a user space task. However, a user interrupt can be sent
+       by  another  user  space  task,  kernel  or  an external source (like a
+       device). The feature that allows another  userspace  task  to  send  an
+       interrupt is referred to as User IPI.
+
+       Uintr  is  a  hardware  dependent,  opt-in  feature. Application aren't
+       expected or able to send or receive  interrupts  unless  they  register
+       themselves with the kernel using the syscall interface described below.
+       It is recommended that applications wanting to use User Interrupts call
+       uintr_register_handler(2) and test whether the call succeeds.
+
+       Hardware  support  for  User  interrupts  may  be  detected using other
+       mechanisms but they could be misleading and are generally not needed:
+        - Using the cpuid instruction (Refer  the  Intel  Software  Developers
+       Manual).
+        -  Checking  for the "uintr" string in /proc/cpuinfo under the "flags"
+       field.
+
+
+       Applications wanting to use Uintr  should  also  be  able  to  function
+       without  it  as well. Uintr support might be unavailable because of any
+       one of the following reasons:
+        - the kernel code does not contain support
+        - the kernel support has been disabled
+        - the hardware does not support it
+
+
+   Uintr syscall interface
+       Applications can use and manage Uintr using the system calls  described
+       here.   The  Uintr  system  calls  are  available only if the kernel is
+       configured with the CONFIG_X86_USER_INTERRUPTS option.
+
+       1)  A  user  application   registers   an   interrupt   handler   using
+       uintr_register_handler(2).  The  registered  interrupt  handler will be
+       invoked when a user interrupt is delivered to  that  thread.  Only  one
+       interrupt  handler  can  be  registered by a particular thread within a
+       process.
+
+       2) Each thread that registered a handler  has  its  own  unique  vector
+       space  of  64  vectors.  The  thread can then use uintr_create_fd(2) to
+       register a vector  and  create  a  user  interrupt  file  descriptor  -
+       uintr_fd.
+
+       3)  The  uintr_fd is only associated with only one Uintr vector.  A new
+       uintr_fd must be created for  each  of  the  64  vectors.  uintr_fd  is
+       automatically  inherited  by forked processes but the receiver can also
+       share the uintr_fd with potential senders using any of the existing  FD
+       sharing  mechanisms  (like pidfd_getfd(2) or socket sendmsg(2)). Access
+       to  uintr_fd  allows  a  sender  to  generate  an  interrupt  with  the
+       associated  vector.  Upon  interrupt delivery, the interrupt handler is
+       invoked with the vector number pushed onto the stack  to  identify  the
+       source of the interrupt.
+
+       4)  Each  thread has a local flag called User Interrupt flag (UIF). The
+       thread can set or clear this flag to enable or disable interrupts.  The
+       default value of UIF is always 0 (Interrupts disabled). A receiver must
+       execute the _stui() intrinsic instruction  at  some  point  (before  or
+       anytime  after  registration)  to  start  receiving user interrupts. To
+       disable interrupts during critical sections the  thread  can  call  the
+       _clui() instruction to clear UIF.
+
+       5a)  For  sending a user IPI, the sender task registers with the kernel
+       using uintr_register_sender(2). The  kernel  would  setup  the  routing
+       tables to connect the sender and receiver. The syscall returns an index
+       that can be used with the 'SENDUIPI <index>' instruction to send a user
+       IPI.   If  the receiver is running, the interrupt is delivered directly
+       to the receiver without any kernel intervention.
+
+       5b) If the sender is the kernel or an external source, the uintr_fd can
+       be  passed  onto the related kernel entity to allow them to connect and
+       generate the user interrupt.  <The exact details of this API are  still
+       being worked upon.>
+
+       6)  The  receiver  can block in the kernel while it is waiting for user
+       interrupts to get delivered using uintr_wait(2). If  the  receiver  has
+       been  context switched out due to other reasons the user interrupt will
+       be delivered when the receiver gets scheduled back in.
+
+       <The behavior when the receiver has made  some  other  blocking  system
+       call like sleep(2) or read(2) is still to be decided. We are evaluating
+       if a thread made another blocking syscall  should  be  interrupted  and
+       woken  up  when a user interrupt arrives for that thread. uintr_wait(2)
+       has been implemented as a placeholder in the meantime.>
+
+       7) The sender and receiver are expected to coordinate and then call the
+       teardown syscalls to terminate the connection:
+         a. A sender unregisters with uintr_unregister_sender(2)
+         b. A vector is unregistered using close(uintr_fd)
+         c. A receiver unregisters with uintr_unregister_handler(2)
+
+       If  the  sender  and  receiver  aren't  able to coordinate, some shared
+       kernel resources between them would  get  freed  later  when  the  file
+       descriptors get released automatically on process exit.
+
+
+       Multi-threaded  applications  need to be careful when using Uintr since
+       it is a thread specific feature. Actions by one thread don't reflect on
+       other threads of the same application.
+
+
+   Toolchain support
+       Support  has  added  to  GCC(11.1)  and Binutils(2.36.1) to enable user
+       interrupt intrinsic instructions and compiler flag (-muintr).
+
+       The "(interrupt)" attribute can be used to compile a function as a user
+       interrupt  handler. In conjunction with the 'muintr' flag, the compiler
+       would:
+         - Generate the entry  and  exit  sequences  for  the  User  interrupt
+       handler
+         - Handle the saving and restoring of registers
+         - Call uiret to return from a user interrupt handler
+
+       User    Interrupts    related   compiler   intrinsic   instructions   -
+       <x86gprintrin.h>:
+
+       _clui() - Disable user interrupts - clear UIF (User Interrupt Flag).
+
+       _stui() - enable user interrupts - set UIF.
+
+       _testui() - test current value of UIF.
+
+       _uiret() - return from a user interrupt handler.
+
+       _senduipi <uipi_index> -  send  a  user  IPI  to  a  target  task.  The
+       uipi_index is obtained using uintr_register_sender(2).
+
+
+   Interrupt handler restrictions
+       There are restrictions on what can be done in a user interrupt handler.
+
+       For  example,  the  handler  and  the functions called from the handler
+       should only use general purpose registers.
+
+       For   details   refer   the   Uintr   compiler    programming    guide.
+       https://github.com/intel/uintr-compiler-guide/blob/uintr-
+       gcc-11.1/UINTR-compiler-guide.pdf
+
+
+CONFORMING TO
+              Uintr related system calls are Linux specific.
+
+EXAMPLES
+   Build
+       To compile this sample an updated toolchain is needed.
+        - Use GCC release 11 or higher &
+        - Use Binutils release 2.36 or higher
+
+       gcc -muintr -mgeneral-regs-only -minline-all-stringops uipi_sample.c -lpthread -o uipi_sample
+
+
+   Run
+       $./uipi_sample
+       Receiver enabled interrupts
+       Sending IPI from sender thread
+            -- User Interrupt handler --
+       Success
+
+
+   Program source
+       #define _GNU_SOURCE
+       #include <pthread.h>
+       #include <stdio.h>
+       #include <stdlib.h>
+       #include <syscall.h>
+       #include <unistd.h>
+       #include <x86gprintrin.h>
+
+       #define __NR_uintr_register_handler     449
+       #define __NR_uintr_unregister_handler   450
+       #define __NR_uintr_create_fd       451
+       #define __NR_uintr_register_sender 452
+       #define __NR_uintr_unregister_sender    453
+
+       #define uintr_register_handler(handler, flags)    syscall(__NR_uintr_register_handler, handler, flags)
+       #define uintr_unregister_handler(flags)      syscall(__NR_uintr_unregister_handler, flags)
+       #define uintr_create_fd(vector, flags)       syscall(__NR_uintr_create_fd, vector, flags)
+       #define uintr_register_sender(fd, flags)     syscall(__NR_uintr_register_sender, fd, flags)
+       #define uintr_unregister_sender(fd, flags)   syscall(__NR_uintr_unregister_sender, fd, flags)
+
+       unsigned int uintr_received;
+       unsigned int uintr_fd;
+
+       void __attribute__ ((interrupt)) uintr_handler(struct __uintr_frame *ui_frame,
+                                    unsigned long long vector)
+       {
+            static const char print[] = "\t-- User Interrupt handler --\n";
+
+            write(STDOUT_FILENO, print, sizeof(print) - 1);
+            uintr_received = 1;
+       }
+
+       void *sender_thread(void *arg)
+       {
+            int uipi_index;
+
+            uipi_index = uintr_register_sender(uintr_fd, 0);
+            if (uipi_index < 0) {
+                 printf("Sender register error\n");
+                 exit(EXIT_FAILURE);
+            }
+
+            printf("Sending IPI from sender thread\n");
+            _senduipi(uipi_index);
+
+            uintr_unregister_sender(uintr_fd, 0);
+
+            return NULL;
+       }
+
+       int main(int argc, char *argv[])
+       {
+            pthread_t pt;
+
+            if (uintr_register_handler(uintr_handler, 0)) {
+                 printf("Interrupt handler register error\n");
+                 exit(EXIT_FAILURE);
+            }
+
+            uintr_fd = uintr_create_fd(0, 0);
+            if (uintr_fd < 0) {
+                 printf("Interrupt vector registration error\n");
+                 exit(EXIT_FAILURE);
+            }
+
+            _stui();
+            printf("Receiver enabled interrupts\n");
+
+            if (pthread_create(&pt, NULL, &sender_thread, NULL)) {
+                 printf("Error creating sender thread\n");
+                 exit(EXIT_FAILURE);
+            }
+
+            /* Do some other work */
+            while (!uintr_received)
+                 usleep(1);
+
+            pthread_join(pt, NULL);
+            close(uintr_fd);
+            uintr_unregister_handler(0);
+
+            printf("Success\n");
+            exit(EXIT_SUCCESS);
+       }
+
+
+NOTES
+       Currently, there is no glibc wrapper for the Uintr related system call;
+       call  the system calls using syscall(2).
+
+
+
+                                                                      UINTR(7)
diff --git a/tools/uintr/manpages/1_register_receiver.txt b/tools/uintr/manpages/1_register_receiver.txt
new file mode 100644
index 000000000000..4b6652c94faa
--- /dev/null
+++ b/tools/uintr/manpages/1_register_receiver.txt
@@ -0,0 +1,122 @@
+uintr_register_handler(2)     System Calls Manual    uintr_register_handler(2)
+
+
+
+NAME
+       uintr_register_handler - register a user interrupt handler
+
+
+SYNOPSIS
+        int uintr_register_handler(u64 handler_address, unsigned int flags);
+
+
+DESCRIPTION
+       uintr_register_handler()  registers  a  user  interrupt handler for the
+       calling process. In case of multi-threaded processes the user interrupt
+       handler is only registered for the thread that makes this system call.
+
+       The  handler_address  is  the  function  that would be invoked when the
+       process receives a user interrupt. The function should  be  defined  as
+       below:
+
+       void __attribute__ ((interrupt)) ui_handler(struct __uintr_frame *frame,
+                                                   unsigned long long vector)
+
+       For  more  details  and  an  example  for  the handler definition refer
+       uintr(7).
+
+       Providing an invalid handler_address could lead to  undefined  behavior
+       for the process.
+
+       The  flags  argument is reserved for future use.  Currently, it must be
+       specified as 0.
+
+       Each user thread can register only one interrupt handler.  Each  thread
+       that  would  like to be a receiver must register once. The registration
+       is not inherited across forks(2) or when additional threads are created
+       within the same process.
+
+       Each thread within a process gets its own interrupt vector space for 64
+       vectors. The vector number  is  pushed  onto  the  stack  when  a  user
+       interrupt  is  delivered.  Since  the  vector space is per-thread, each
+       receiver can receive up to 64 unique interrupt events.
+
+       For information on creating uintr_fd to register and  manage  interrupt
+       vectors, refer uintr_create_fd(2) system call.
+
+       Once an interrupt handler is registered it cannot be changed before the
+       handler  is  unregistered  via   uintr_unregister_handler(2).   Calling
+       uintr_unregister_handler(2)   would   however  invalidate  the  current
+       interrupt resources registered with the kernel.
+
+       The interrupt handler gets invoked only while the process  is  running.
+       If  the  process  is scheduled out or blocked in the kernel, interrupts
+       will be delivered when the process is scheduled again. <A mechanism  to
+       unblock a process as soon as a user interrupt is posted is being worked
+       upon.>
+
+
+   Interrupt handler restrictions
+       There are restrictions on what can be done in a user interrupt handler.
+
+       For example, the handler and the  functions  called  from  the  handler
+       should only use general purpose registers.
+
+       For    details    refer   the   Uintr   compiler   programming   guide.
+       https://github.com/intel/uintr-compiler-guide/blob/uintr-
+       gcc-11.1/UINTR-compiler-guide.pdf
+
+
+   Security implications
+       A  lot  of security issues that are applicable to signal handlers, also
+       apply to user interrupt handlers.
+
+       The user interrupt handler  by-itself  need  not  be  re-entrant  since
+       interrupts  are automatically disabled when the handler is invoked. But
+       this isn't valid if the handler is shared between multiple  threads  or
+       nested interrupts have been enabled.
+
+       Similar  to  signal handlers, the functions that are called from a user
+       interrupt should be async-signal-safe.  Refer  signal-safety(7)  for  a
+       discussion of async-signal-safe functions.
+
+       It  is  recommended  to  disable  interrupts  using _clui() instruction
+       before executing any privileged code. Doing so  would  prevent  a  user
+       interrupt handler from running at a higher privilege level.
+
+
+RETURN VALUE
+       On  success,  uintr_register_handler()  returns  0.   On  error,  -1 is
+       returned and errno is set to indicate the cause of the error.
+
+
+ERRORS
+       EOPNOTSUPP  Underlying hardware doesn't have support for Uintr.
+
+       EINVAL      flags is not 0.
+
+       EFAULT      handler address is not valid.
+
+       ENOMEM      The system is out of available memory.
+
+       EBUSY       An interrupt handler has already been registered.
+
+
+VERSIONS
+       uintr_register_handler() first appeared in Linux <tbd>.
+
+
+CONFORMING TO
+       uintr_register_handler() is Linux specific.
+
+
+NOTES
+       Currently, there is no glibc wrapper for  this  system  call;  call  it
+       using syscall(2).
+
+       The  user  interrupt  related  system  calls  need  hardware support to
+       generate and receive user interrupts. Refer uintr(7) for details.
+
+
+
+                                                     uintr_register_handler(2)
diff --git a/tools/uintr/manpages/2_unregister_receiver.txt b/tools/uintr/manpages/2_unregister_receiver.txt
new file mode 100644
index 000000000000..dd6981f33597
--- /dev/null
+++ b/tools/uintr/manpages/2_unregister_receiver.txt
@@ -0,0 +1,62 @@
+uintr_unregister_handler(2)   System Calls Manual  uintr_unregister_handler(2)
+
+
+
+NAME
+       uintr_unregister_handler - unregister a user interrupt handler
+
+
+SYNOPSIS
+        int uintr_unregister_handler(unsigned int flags);
+
+
+DESCRIPTION
+       uintr_unregister_handler()  unregisters  a  previously  registered user
+       interrupt handler. If  no  interrupt  handler  was  registered  by  the
+       process uintr_unregister_handler() would return an error.
+
+       Since  interrupt handler is local to a thread, only the thread that has
+       registered      via      uintr_register_handler(2)       can       call
+       uintr_unregister_handler().
+
+       The  interrupt  resources  such as interrupt vectors and uintr_fd, that
+       have been allocated  for  this  thread,  would  be  deactivated.  Other
+       senders posting interrupts to this thread will not be delivered.
+
+       The  kernel  does not automatically close the uintr_fds related to this
+       process/thread   when   uintr_unregister_handler()   is   called.   The
+       application  is  expected  to  close the unused uintr_fds before or the
+       after the handler has been unregistered.
+
+
+RETURN VALUE
+       On success, uintr_unregister_handler() returns  0.   On  error,  -1  is
+       returned and errno is set to indicate the cause of the error.
+
+
+ERRORS
+       EOPNOTSUPP  Underlying hardware doesn't have support for Uintr.
+
+       EINVAL       flags is not 0.
+
+       EINVAL       No registered user interrupt handler.
+
+
+VERSIONS
+       uintr_unregister_handler() first appeared in Linux <tbd>.
+
+
+CONFORMING TO
+       uintr_unregister_handler() is Linux specific.
+
+
+NOTES
+       Currently,  there  is  no  glibc  wrapper for this system call; call it
+       using syscall(2).
+
+       The user interrupt  related  system  calls  need  hardware  support  to
+       generate and receive user interrupts. Refer uintr(7) for details.
+
+
+
+                                                   uintr_unregister_handler(2)
diff --git a/tools/uintr/manpages/3_create_fd.txt b/tools/uintr/manpages/3_create_fd.txt
new file mode 100644
index 000000000000..e90b0dce2703
--- /dev/null
+++ b/tools/uintr/manpages/3_create_fd.txt
@@ -0,0 +1,104 @@
+uintr_create_fd(2)            System Calls Manual           uintr_create_fd(2)
+
+
+
+NAME
+       uintr_create_fd - Create a user interrupt file descriptor - uintr_fd
+
+
+SYNOPSIS
+        int uintr_create_fd(u64 vector, unsigned int flags);
+
+
+DESCRIPTION
+       uintr_create_fd()  allocates  a  new  user  interrupt  file  descriptor
+       (uintr_fd) based on the vector registered by the calling  process.  The
+       uintr_fd  can  be  shared  with other processes and the kernel to allow
+       them to generate interrupts with the associated vector.
+
+       The caller must have registered a handler via uintr_register_handler(2)
+       before attempting to create uintr_fd. The interrupts generated based on
+       this uintr_fd will be delivered only to the thread  that  created  this
+       file  descriptor.  A  unique  uintr_fd  is  generated  for  each vector
+       registered using uintr_create_fd().
+
+       Each thread has a private vector space of 64 vectors ranging from 0-63.
+       Vector number 63 has the highest priority while vector number 0 has the
+       lowest.  If two or more interrupts are pending to be delivered then the
+       interrupt  with  the  higher  vector  number  will  be  delivered first
+       followed by the ones with lower vector numbers. Applications can choose
+       appropriate  vector  numbers  to  prioritize  certain  interrupts  over
+       others.
+
+       Upon interrupt delivery, the handler is invoked with the vector  number
+       pushed  onto  the  stack  to help identify the source of the interrupt.
+       Since the vector space is per-thread, each receiver can receive  up  to
+       64 unique interrupt events.
+
+       A receiver can choose to share the same uintr_fd with multiple senders.
+       Since an interrupt with the same vector number would be delivered,  the
+       receiver  would  need  to  use  other  mechanisms to identify the exact
+       source of the interrupt.
+
+       The flags argument is reserved for future use.  Currently, it  must  be
+       specified as 0.
+
+       close(2)
+             When the file descriptor is no longer required it should be
+             closed.  When all file descriptors associated with the same
+             uintr_fd object have been closed, the resources for object are
+             freed by the kernel.
+
+       fork(2)
+             A copy of the file descriptor created by uintr_create_fd() is
+             inherited by the child produced by fork(2).  The duplicate file
+             descriptor is associated with the same uintr_fd object. The
+             close-on-exec flag (FD_CLOEXEC; see fcntl(2)) is set on the
+             file descriptor returned by uintr_create_fd().
+
+       For  information  on  how  to  generate  interrupts with uintr_fd refer
+       uintr_register_sender(2).
+
+
+RETURN VALUE
+       On success, uintr_create_fd() returns a new uintr_fd  file  descriptor.
+       On  error, -1 is returned and errno is set to indicate the cause of the
+       error.
+
+
+ERRORS
+       EOPNOTSUPP  Underlying hardware doesn't have support for Uintr.
+
+       EINVAL      flags is not 0.
+
+       EFAULT      handler address is not valid.
+
+       EMFILE        The  per-process  limit  on  the  number  of  open   file
+       descriptors has been reached.
+
+       ENFILE        The  system-wide  limit on the total number of open files
+       has been reached.
+
+       ENODEV       Could not mount (internal) anonymous inode device.
+
+       ENOMEM      The system is out of available memory to allocate uintr_fd.
+
+
+VERSIONS
+       uintr_create_fd() first appeared in Linux <tbd>.
+
+
+CONFORMING TO
+       uintr_create_fd() is Linux specific.
+
+
+NOTES
+       Currently, there is no glibc wrapper for  this  system  call;  call  it
+       using syscall(2).
+
+       The  user  interrupt  related  system  calls  need  hardware support to
+       generate and receive user interrupts. Refer uintr(7) for details.
+
+
+
+                                                            uintr_create_fd(2)
diff --git a/tools/uintr/manpages/4_register_sender.txt b/tools/uintr/manpages/4_register_sender.txt
new file mode 100644
index 000000000000..1dc17f4c041f
--- /dev/null
+++ b/tools/uintr/manpages/4_register_sender.txt
@@ -0,0 +1,121 @@
+uintr_register_sender(2)      System Calls Manual     uintr_register_sender(2)
+
+
+
+NAME
+       uintr_register_sender - Register a user inter-process interrupt sender
+
+
+SYNOPSIS
+        int uintr_register_sender(int uintr_fd, unsigned int flags);
+
+
+DESCRIPTION
+       uintr_register_sender() allows a sender process to connect with a Uintr
+       receiver  based  on  the  uintr_fd.  It  returns  a  user   IPI   index
+       (uipi_index)  that  the  sender process can use in conjunction with the
+       SENDUIPI instruction to generate a user IPI.
+
+       When a sender executes 'SENDUIPI  <uipi_index>',  a  user  IPI  can  be
+       delivered by the hardware to the receiver without any intervention from
+       the kernel. Upon IPI delivery, the handler is invoked with  the  vector
+       number,  associated  with  uintr_fd,  pushed  onto  the  stack  to help
+       identify the source of the interrupt.
+
+       If the receiver for the thread is running the hardware  would  directly
+       deliver the user IPI to the receiver. If the receiver is not running or
+       has disabled receiving  interrupts  using  the  STUI  instruction,  the
+       interrupt  will  be stored in memory and delivered when the receiver is
+       able to receive it.
+
+       If the sender tries to send multiple IPIs while  the  receiver  is  not
+       able  to  receive  them then all the IPIs with the same vector would be
+       coalesced.  Only a single IPI per vector would be delivered.
+
+       uintr_register_sender() can be used to connect with multiple uintr_fds.
+       uintr_register_sender()  would  return  a  unique  uipi_index  for each
+       uintr_fd the sender connects with.
+
+       In case of a multi-threaded process, the uipi_index is only  valid  for
+       the thread that registered itself. Other threads would need to register
+       themselves if they intend to be a user IPI sender.  Executing  SENDUIPI
+       on  different threads can have varying results based on the connections
+       that have been setup.
+
+       <We  are  also  considering  an  alternate  approach  where  the   same
+       uipi_index  would  be  valid  for all threads that are part of the same
+       process.  All threads would see consistent SENDUIPI behaviour  in  that
+       case.>
+
+       If    a    process    uses    SENDUIPI    without   registering   using
+       uintr_register_sender() it receives a SIGILL signal. If a process  uses
+       an  illegal  uipi_index, it receives a SIGSEGV signal. See sigaction(2)
+       for details of the information available with that signal.
+
+       The flags argument is reserved for future use.  Currently, it  must  be
+       specified as 0.
+
+       close(2)
+             When the file descriptor is no longer required it should be
+             closed.  When all file descriptors associated with the same
+             uintr_fd object have been closed, the resources for object are
+             freed by the kernel. Freeing the uintr_fd object would also
+             result in the associated uipi_index to be freed.
+
+       fork(2)
+             A copy of uintr_fd is inherited by the child produced by
+             fork(2). However the uipi_index would not get inherited by the
+             child. If the child wants to send a user IPI it would have to
+             explicitly register itself using the uintr_register_sender()
+             system call.
+
+       For    information    on    how    to   unregister   a   sender   refer
+       uintr_unregister_sender(2).
+
+
+RETURN VALUE
+       On success, uintr_register_sender() returns a  new  user  IPI  index  -
+       uipi_index.  On  error, -1 is returned and errno is set to indicate the
+       cause of the error.
+
+
+ERRORS
+       EOPNOTSUPP  Underlying hardware doesn't have support for uintr(7).
+
+       EOPNOTSUPP  uintr_fd does not refer to a Uintr instance.
+
+       EBADF       The uintr_fd passed to the kernel is invalid.
+
+       EINVAL      flags is not 0.
+
+       EISCONN     A connection to this uintr_fd has already been established.
+
+       ECONNRESET  The user interrupt receiver has disabled the connection.
+
+       ESHUTDOWN   The user interrupt receiver has exited the connection.
+
+       ENOSPC       No uipi_index can be allocated. The system has run out  of
+       the available user IPI indexes.
+
+       ENOMEM       The  system  is out of available memory to register a user
+       IPI sender.
+
+
+VERSIONS
+       uintr_register_sender() first appeared in Linux <tbd>.
+
+
+CONFORMING TO
+       uintr_register_sender() is Linux specific.
+
+
+NOTES
+       Currently, there is no glibc wrapper for  this  system  call;  call  it
+       using syscall(2).
+
+       The  user  interrupt  related  system  calls  need  hardware support to
+       generate and receive user interrupts. Refer uintr(7) for details.
+
+
+
+                                                      uintr_register_sender(2)
diff --git a/tools/uintr/manpages/5_unregister_sender.txt b/tools/uintr/manpages/5_unregister_sender.txt
new file mode 100644
index 000000000000..31a8c574dc25
--- /dev/null
+++ b/tools/uintr/manpages/5_unregister_sender.txt
@@ -0,0 +1,79 @@
+uintr_unregister_sender(2)    System Calls Manual   uintr_unregister_sender(2)
+
+
+
+NAME
+       uintr_unregister_sender  -  Unregister  a  user inter-process interrupt
+       sender
+
+
+SYNOPSIS
+        int uintr_unregister_sender(int uintr_fd, unsigned int flags);
+
+
+DESCRIPTION
+       uintr_unregister_sender() unregisters a sender process from a  uintr_fd
+       it had previously connected with. If no connection is present with this
+       uintr_fd the system call return an error.
+
+       The uipi_index that was allocated during uintr_register_sender(2)  will
+       also be freed. If a process tries to use a uipi_index after it has been
+       freed it would receive a SIGSEGV signal.
+
+       In case of a multi-threaded process uintr_unregister_sender() will only
+       disconnect  the thread that makes this call. Other threads can continue
+       to use their connection with the uintr_fd based on their uipi_index.
+
+       <We are considering an  alternate  approach  where  all  threads  in  a
+       process  see  a  consistent  view  of  the  uipi_index.  In  that case,
+       uintr_unregister_sender() would disconnect all threads from uintr_fd.>
+
+       The flags argument is reserved for future use.  Currently, it  must  be
+       specified as 0.
+
+       close(2)
+             When the file descriptor is no longer required it should be
+             closed.  When all file descriptors associated with the same
+             uintr_fd object have been closed, the resources for object are
+             freed by the kernel. Freeing the uintr_fd object would also
+             result in the associated uipi_index to be freed.
+
+       The  behavior  of  uintr_unregister_sender() system call after uintr_fd
+       has been close is undefined.
+
+
+RETURN VALUE
+       On success,  uintr_unregister_sender()  returns  0.  On  error,  -1  is
+       returned and errno is set to indicate the cause of the error.
+
+
+ERRORS
+       EOPNOTSUPP  Underlying hardware doesn't have support for uintr(7).
+
+       EOPNOTSUPP  uintr_fd does not refer to a Uintr instance.
+
+       EBADF       The uintr_fd passed to the kernel is invalid.
+
+       EINVAL      flags is not 0.
+
+       EINVAL      No connection has been setup with this uintr_fd.
+
+
+VERSIONS
+       uintr_unregister_sender() first appeared in Linux <tbd>.
+
+
+CONFORMING TO
+       uintr_unregister_sender() is Linux specific.
+
+
+NOTES
+       Currently,  there  is  no  glibc  wrapper for this system call; call it
+       using syscall(2).
+
+       The user interrupt  related  system  calls  need  hardware  support  to
+       generate and receive user interrupts. Refer uintr(7) for details.
+
+
+
+                                                    uintr_unregister_sender(2)
diff --git a/tools/uintr/manpages/6_wait.txt b/tools/uintr/manpages/6_wait.txt
new file mode 100644
index 000000000000..f281a6ce83aa
--- /dev/null
+++ b/tools/uintr/manpages/6_wait.txt
@@ -0,0 +1,59 @@
+uintr_wait(2)                 System Calls Manual                uintr_wait(2)
+
+
+
+NAME
+       uintr_wait - wait for user interrupts
+
+
+SYNOPSIS
+        int uintr_wait(unsigned int flags);
+
+
+DESCRIPTION
+       uintr_wait()  causes  the  calling process (or thread) to sleep until a
+       user interrupt is delivered.
+
+       uintr_wait() will block in the kernel only when a interrupt handler has
+       been registered using uintr_register_handler(2)
+
+       <uintr_wait() is a placeholder syscall while we decide on the behaviour
+       of blocking system calls like sleep(2) and read(2) being interrupted by
+       uintr(7).>
+
+
+RETURN VALUE
+       uintr_wait()  returns  only  when  a user interrupt is received and the
+       interrupt handler function returned.  In this case, -1 is returned  and
+       errno is set to EINTR.
+
+
+ERRORS
+       EOPNOTSUPP  Underlying hardware doesn't have support for Uintr.
+
+       EOPNOTSUPP  No interrupt handler registered.
+
+       EINVAL        flags is not 0.
+
+       EINTR        A  user  interrupt  was received and the interrupt handler
+       returned.
+
+
+VERSIONS
+       uintr_wait() first appeared in Linux <tbd>.
+
+
+CONFORMING TO
+       uintr_wait() is Linux specific.
+
+
+NOTES
+       Currently, there is no glibc wrapper for  this  system  call;  call  it
+       using syscall(2).
+
+       The  user  interrupt  related  system  calls  need  hardware support to
+       generate and receive user interrupts. Refer uintr(7) for details.
+
+
+
+                                                                 uintr_wait(2)
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 02/13] Documentation/x86: Add documentation for User Interrupts
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
  2021-09-13 20:01 ` [RFC PATCH 01/13] x86/uintr/man-page: Include man pages draft for reference Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-13 20:01 ` [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support Sohil Mehta
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

For now, include just the hardware and software architecture summary.

<This is the same content as the cover letter.

Some of the kernel design details and other information from the cover
letter can eventually be moved here.>

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 Documentation/x86/index.rst           |   1 +
 Documentation/x86/user-interrupts.rst | 107 ++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/x86/user-interrupts.rst

diff --git a/Documentation/x86/index.rst b/Documentation/x86/index.rst
index 383048396336..0d416b02131b 100644
--- a/Documentation/x86/index.rst
+++ b/Documentation/x86/index.rst
@@ -31,6 +31,7 @@ x86-specific Documentation
    tsx_async_abort
    buslock
    usb-legacy-support
+   user-interrupts
    i386/index
    x86_64/index
    sva
diff --git a/Documentation/x86/user-interrupts.rst b/Documentation/x86/user-interrupts.rst
new file mode 100644
index 000000000000..bc90251d6c2e
--- /dev/null
+++ b/Documentation/x86/user-interrupts.rst
@@ -0,0 +1,107 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+User Interrupts (UINTR)
+=======================
+
+Overview
+========
+User Interrupts provides a low latency event delivery and inter process
+communication mechanism. These events can be delivered directly to userspace
+without a transition through the kernel.
+
+In the User Interrupts architecture, a receiver is always expected to be a user
+space task. However, a user interrupt can be sent by another user space task,
+kernel or an external source (like a device). The feature that allows another
+task to send an interrupt is referred to as User IPI.
+
+Hardware Summary
+================
+User Interrupts is a posted interrupt delivery mechanism. The interrupts are
+first posted to a memory location and then delivered to the receiver when they
+are running with CPL=3.
+
+Kernel managed architectural data structures
+--------------------------------------------
+UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector
+information and notification state (like an ongoing notification, suppressed
+notifications).
+
+UITT: User Interrupt Target Table - Stores UPID pointer and vector information
+for interrupt routing on the sender side. Referred by the senduipi instruction.
+
+The interrupt state of each task is referenced via MSRs which are saved and
+restored by the kernel during context switch.
+
+Instructions
+------------
+senduipi <index> - send a user IPI to a target task based on the UITT index.
+
+clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
+
+stui - Unmask user interrupts by setting UIF.
+
+testui - Test current value of UIF.
+
+uiret - return from a user interrupt handler.
+
+User IPI
+--------
+When a User IPI sender executes 'senduipi <index>' the hardware refers the UITT
+table entry pointed by the index and posts the interrupt vector into the
+receiver's UPID.
+
+If the receiver is running the sender cpu would send a physical IPI to the
+receiver's cpu. On the receiver side this IPI is detected as a User Interrupt.
+The User Interrupt handler for the receiver is invoked and the vector number is
+pushed onto the stack.
+
+Upon execution of 'uiret' in the interrupt handler, the control is transferred
+back to instruction that was interrupted.
+
+Refer the Intel Software Developer's Manual for more details.
+
+Software Architecture
+=====================
+User Interrupts (Uintr) is an opt-in feature (unlike signals). Applications
+wanting to use Uintr are expected to register themselves with the kernel using
+the Uintr related system calls. A Uintr receiver is always a userspace task. A
+Uintr sender can be another userspace task, kernel or a device.
+
+1) A receiver can register/unregister an interrupt handler using the Uintr
+receiver related syscalls.
+		uintr_register_handler(handler, flags)
+
+2) A syscall also allows a receiver to register a vector and create a user
+interrupt file descriptor - uintr_fd.
+		uintr_fd = uintr_create_fd(vector, flags)
+
+Uintr can be useful in some of the usages where eventfd or signals are used for
+frequent userspace event notifications. The semantics of uintr_fd are somewhat
+similar to an eventfd() or the write end of a pipe.
+
+3) Any sender with access to uintr_fd can use it to deliver events (in this
+case - interrupts) to a receiver. A sender task can manage its connection with
+the receiver using the sender related syscalls based on uintr_fd.
+		uipi_index = uintr_register_sender(uintr_fd, flags)
+
+Using an FD abstraction provides a secure mechanism to connect with a receiver.
+The FD sharing and isolation mechanisms put in place by the kernel would extend
+to Uintr as well.
+
+4a) After the initial setup, a sender task can use the SENDUIPI instruction to
+generate user IPIs without any kernel intervention.
+		SENDUIPI <uipi_index>
+
+If the receiver is running (CPL=3), then the user interrupt is delivered
+directly without a kernel transition. If the receiver isn't running the
+interrupt is delivered when the receiver gets context switched back. If the
+receiver is blocked in the kernel, the user interrupt is delivered to the
+kernel which then unblocks the intended receiver to deliver the interrupt.
+
+4b) If the sender is the kernel or a device, the uintr_fd can be passed onto
+the related kernel entity to allow them to setup a connection and then generate
+a user interrupt for event delivery. <The exact details of this API are still
+being worked upon.>
+
+Refer the Uintr man-pages for details on the syscall interface.
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
  2021-09-13 20:01 ` [RFC PATCH 01/13] x86/uintr/man-page: Include man pages draft for reference Sohil Mehta
  2021-09-13 20:01 ` [RFC PATCH 02/13] Documentation/x86: Add documentation for User Interrupts Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-23 22:24   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state Sohil Mehta
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

User Interrupts support including user IPIs is enumerated through cpuid.
The 'uintr' flag in /proc/cpuinfo can be used to identify it. The
recommended mechanism for user applications to detect support is calling
the uintr related syscalls.

Use CONFIG_X86_USER_INTERRUPTS to compile with User Interrupts support.
The feature can be disabled at boot time using the 'nouintr' kernel
parameter.

SENDUIPI is a special ring-3 instruction that makes a supervisor mode
memory access to the UPID and UITT memory. Currently, KPTI needs to be
off for User IPIs to work.  Processors that support user interrupts are
not affected by Meltdown so the auto mode of KPTI will default to off.

Users who want to force enable KPTI will need to wait for a later
version of this patch series that is compatible with KPTI. We need to
allocate the UPID and UITT structures from a special memory region that
has supervisor access but it is mapped into userspace. The plan is to
implement a mechanism similar to LDT.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 .../admin-guide/kernel-parameters.txt         |  2 +
 arch/x86/Kconfig                              | 12 ++++
 arch/x86/include/asm/cpufeatures.h            |  1 +
 arch/x86/include/asm/disabled-features.h      |  8 ++-
 arch/x86/include/asm/msr-index.h              |  8 +++
 arch/x86/include/uapi/asm/processor-flags.h   |  2 +
 arch/x86/kernel/cpu/common.c                  | 55 +++++++++++++++++++
 arch/x86/kernel/cpu/cpuid-deps.c              |  1 +
 8 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..471e82be87ff 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3288,6 +3288,8 @@
 
 	nofsgsbase	[X86] Disables FSGSBASE instructions.
 
+	nouintr		[X86-64] Disables User Interrupts support.
+
 	no_console_suspend
 			[HW] Never suspend the console
 			Disable suspending of consoles during suspend and
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 4e001bbbb425..6f7f31e92f3e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1845,6 +1845,18 @@ config X86_INTEL_MEMORY_PROTECTION_KEYS
 
 	  If unsure, say y.
 
+config X86_USER_INTERRUPTS
+	bool "User Interrupts (UINTR)"
+	depends on X86_LOCAL_APIC && X86_64
+	depends on CPU_SUP_INTEL
+	help
+	  User Interrupts are events that can be delivered directly to
+	  userspace without a transition through the kernel. The interrupts
+	  could be generated by another userspace application, kernel or a
+	  device.
+
+	  Refer, Documentation/x86/user-interrupts.rst for details.
+
 choice
 	prompt "TSX enable mode"
 	depends on CPU_SUP_INTEL
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..634e80ee5db5 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -375,6 +375,7 @@
 #define X86_FEATURE_AVX512_4VNNIW	(18*32+ 2) /* AVX-512 Neural Network Instructions */
 #define X86_FEATURE_AVX512_4FMAPS	(18*32+ 3) /* AVX-512 Multiply Accumulation Single precision */
 #define X86_FEATURE_FSRM		(18*32+ 4) /* Fast Short Rep Mov */
+#define X86_FEATURE_UINTR		(18*32+ 5) /* User Interrupts support */
 #define X86_FEATURE_AVX512_VP2INTERSECT (18*32+ 8) /* AVX-512 Intersect for D/Q */
 #define X86_FEATURE_SRBDS_CTRL		(18*32+ 9) /* "" SRBDS mitigation MSR available */
 #define X86_FEATURE_MD_CLEAR		(18*32+10) /* VERW clears CPU buffers */
diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 8f28fafa98b3..27fb1c70ade6 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -65,6 +65,12 @@
 # define DISABLE_SGX	(1 << (X86_FEATURE_SGX & 31))
 #endif
 
+#ifdef CONFIG_X86_USER_INTERRUPTS
+# define DISABLE_UINTR		0
+#else
+# define DISABLE_UINTR		(1 << (X86_FEATURE_UINTR & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -87,7 +93,7 @@
 #define DISABLED_MASK16	(DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
 			 DISABLE_ENQCMD)
 #define DISABLED_MASK17	0
-#define DISABLED_MASK18	0
+#define DISABLED_MASK18	(DISABLE_UINTR)
 #define DISABLED_MASK19	0
 #define DISABLED_MASK_CHECK BUILD_BUG_ON_ZERO(NCAPINTS != 20)
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index a7c413432b33..4fdba281d002 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -375,6 +375,14 @@
 #define MSR_HWP_REQUEST 		0x00000774
 #define MSR_HWP_STATUS			0x00000777
 
+/* User Interrupt interface */
+#define MSR_IA32_UINTR_RR		0x985
+#define MSR_IA32_UINTR_HANDLER		0x986
+#define MSR_IA32_UINTR_STACKADJUST	0x987
+#define MSR_IA32_UINTR_MISC		0x988	/* 39:32-UINV, 31:0-UITTSZ */
+#define MSR_IA32_UINTR_PD		0x989
+#define MSR_IA32_UINTR_TT		0x98a
+
 /* CPUID.6.EAX */
 #define HWP_BASE_BIT			(1<<7)
 #define HWP_NOTIFICATIONS_BIT		(1<<8)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index bcba3c643e63..919ce7f456d4 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -130,6 +130,8 @@
 #define X86_CR4_SMAP		_BITUL(X86_CR4_SMAP_BIT)
 #define X86_CR4_PKE_BIT		22 /* enable Protection Keys support */
 #define X86_CR4_PKE		_BITUL(X86_CR4_PKE_BIT)
+#define X86_CR4_UINTR_BIT	25 /* enable User Interrupts support */
+#define X86_CR4_UINTR		_BITUL(X86_CR4_UINTR_BIT)
 
 /*
  * x86-64 Task Priority Register, CR8
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 0f8885949e8c..55fee930b6d1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -308,6 +308,58 @@ static __always_inline void setup_smep(struct cpuinfo_x86 *c)
 		cr4_set_bits(X86_CR4_SMEP);
 }
 
+static __init int setup_disable_uintr(char *arg)
+{
+	/* No additional arguments expected */
+	if (strlen(arg))
+		return 0;
+
+	/* Do not emit a message if the feature is not present. */
+	if (!boot_cpu_has(X86_FEATURE_UINTR))
+		return 1;
+
+	setup_clear_cpu_cap(X86_FEATURE_UINTR);
+	pr_info_once("x86: 'nouintr' specified, User Interrupts support disabled\n");
+	return 1;
+}
+__setup("nouintr", setup_disable_uintr);
+
+static __always_inline void setup_uintr(struct cpuinfo_x86 *c)
+{
+	/* check the boot processor, plus compile options for UINTR. */
+	if (!cpu_feature_enabled(X86_FEATURE_UINTR))
+		goto disable_uintr;
+
+	/* checks the current processor's cpuid bits: */
+	if (!cpu_has(c, X86_FEATURE_UINTR))
+		goto disable_uintr;
+
+	/*
+	 * User Interrupts currently doesn't support PTI. For processors that
+	 * support User interrupts PTI in auto mode will default to off.  Need
+	 * this check only for users who have force enabled PTI.
+	 */
+	if (boot_cpu_has(X86_FEATURE_PTI)) {
+		pr_info_once("x86: User Interrupts (UINTR) not enabled. Please disable PTI using 'nopti' kernel parameter\n");
+		goto clear_uintr_cap;
+	}
+
+	cr4_set_bits(X86_CR4_UINTR);
+	pr_info_once("x86: User Interrupts (UINTR) enabled\n");
+
+	return;
+
+clear_uintr_cap:
+	setup_clear_cpu_cap(X86_FEATURE_UINTR);
+
+disable_uintr:
+	/*
+	 * Make sure UINTR is disabled in case it was enabled in a
+	 * previous boot (e.g., via kexec).
+	 */
+	cr4_clear_bits(X86_CR4_UINTR);
+}
+
 static __init int setup_disable_smap(char *arg)
 {
 	setup_clear_cpu_cap(X86_FEATURE_SMAP);
@@ -1564,6 +1616,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	setup_smap(c);
 	setup_umip(c);
 
+	/* Set up User Interrupts */
+	setup_uintr(c);
+
 	/* Enable FSGSBASE instructions if available. */
 	if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
 		cr4_set_bits(X86_CR4_FSGSBASE);
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index defda61f372d..6f7eb4af5b4a 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -75,6 +75,7 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_SGX_LC,			X86_FEATURE_SGX	      },
 	{ X86_FEATURE_SGX1,			X86_FEATURE_SGX       },
 	{ X86_FEATURE_SGX2,			X86_FEATURE_SGX1      },
+	{ X86_FEATURE_UINTR,			X86_FEATURE_XSAVES    },
 	{}
 };
 
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (2 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-23 22:34   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector Sohil Mehta
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Enable xstate supervisor support for User Interrupts by default.

The user interrupt state for a task consists of the MSR state and the
User Interrupt Flag (UIF) value. XSAVES and XRSTORS handle saving and
restoring both of these states.

<The supervisor XSTATE code might be reworked based on issues reported
in the past. The Uintr context switching code would also need rework and
additional testing in that regard.>

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/fpu/types.h  | 20 +++++++++++++++++++-
 arch/x86/include/asm/fpu/xstate.h |  3 ++-
 arch/x86/kernel/cpu/common.c      |  6 ++++++
 arch/x86/kernel/fpu/xstate.c      | 20 +++++++++++++++++---
 4 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index f5a38a5f3ae1..b614f1416bea 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -118,7 +118,7 @@ enum xfeature {
 	XFEATURE_RSRVD_COMP_11,
 	XFEATURE_RSRVD_COMP_12,
 	XFEATURE_RSRVD_COMP_13,
-	XFEATURE_RSRVD_COMP_14,
+	XFEATURE_UINTR,
 	XFEATURE_LBR,
 
 	XFEATURE_MAX,
@@ -135,6 +135,7 @@ enum xfeature {
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
+#define XFEATURE_MASK_UINTR		(1 << XFEATURE_UINTR)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
@@ -237,6 +238,23 @@ struct pkru_state {
 	u32				pad;
 } __packed;
 
+/*
+ * State component 14 is supervisor state used for User Interrupts state.
+ * The size of this state is 48 bytes
+ */
+struct uintr_state {
+	u64 handler;
+	u64 stack_adjust;
+	u32 uitt_size;
+	u8  uinv;
+	u8  pad1;
+	u8  pad2;
+	u8  uif_pad3;		/* bit 7 - UIF, bits 6:0 - reserved */
+	u64 upid_addr;
+	u64 uirr;
+	u64 uitt_addr;
+} __packed;
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 109dfcc75299..4dd4e83c0c9d 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -44,7 +44,8 @@
 	(XFEATURE_MASK_USER_SUPPORTED & ~XFEATURE_MASK_PKRU)
 
 /* All currently supported supervisor features */
-#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID)
+#define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
+					    XFEATURE_MASK_UINTR)
 
 /*
  * A supervisor state component may not always contain valuable information,
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 55fee930b6d1..3a0a3f5cfe0f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -334,6 +334,12 @@ static __always_inline void setup_uintr(struct cpuinfo_x86 *c)
 	if (!cpu_has(c, X86_FEATURE_UINTR))
 		goto disable_uintr;
 
+	/* Confirm XSAVE support for UINTR is present. */
+	if (!cpu_has_xfeatures(XFEATURE_MASK_UINTR, NULL)) {
+		pr_info_once("x86: User Interrupts (UINTR) not enabled. XSAVE support for UINTR is missing.\n");
+		goto clear_uintr_cap;
+	}
+
 	/*
 	 * User Interrupts currently doesn't support PTI. For processors that
 	 * support User interrupts PTI in auto mode will default to off.  Need
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c8def1b7f8fb..ab19403effb0 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -38,6 +38,10 @@ static const char *xfeature_names[] =
 	"Processor Trace (unused)"	,
 	"Protection Keys User registers",
 	"PASID state",
+	"unknown xstate feature 11",
+	"unknown xstate feature 12",
+	"unknown xstate feature 13",
+	"User Interrupts registers",
 	"unknown xstate feature"	,
 };
 
@@ -53,6 +57,10 @@ static short xsave_cpuid_features[] __initdata = {
 	X86_FEATURE_INTEL_PT,
 	X86_FEATURE_PKU,
 	X86_FEATURE_ENQCMD,
+	-1,			/* Unknown 11 */
+	-1,			/* Unknown 12 */
+	-1,			/* Unknown 13 */
+	X86_FEATURE_UINTR,
 };
 
 /*
@@ -236,6 +244,7 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
+	print_xstate_feature(XFEATURE_MASK_UINTR);
 }
 
 /*
@@ -372,7 +381,8 @@ static void __init print_xstate_offset_size(void)
 	 XFEATURE_MASK_PKRU |			\
 	 XFEATURE_MASK_BNDREGS |		\
 	 XFEATURE_MASK_BNDCSR |			\
-	 XFEATURE_MASK_PASID)
+	 XFEATURE_MASK_PASID |			\
+	 XFEATURE_MASK_UINTR)
 
 /*
  * setup the xstate image representing the init state
@@ -532,6 +542,7 @@ static void check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PASID,     struct ia32_pasid_state);
+	XCHECK_SZ(sz, nr, XFEATURE_UINTR,     struct uintr_state);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
@@ -539,9 +550,12 @@ static void check_xstate_against_struct(int nr)
 	 * numbers.
 	 */
 	if ((nr < XFEATURE_YMM) ||
-	    (nr >= XFEATURE_MAX) ||
 	    (nr == XFEATURE_PT_UNIMPLEMENTED_SO_FAR) ||
-	    ((nr >= XFEATURE_RSRVD_COMP_11) && (nr <= XFEATURE_LBR))) {
+	    (nr == XFEATURE_RSRVD_COMP_11) ||
+	    (nr == XFEATURE_RSRVD_COMP_12) ||
+	    (nr == XFEATURE_RSRVD_COMP_13) ||
+	    (nr == XFEATURE_LBR) ||
+	    (nr >= XFEATURE_MAX)) {
 		WARN_ONCE(1, "no structure for xstate: %d\n", nr);
 		XSTATE_WARN_ON(1);
 	}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (3 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-23 23:07   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls Sohil Mehta
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

A user interrupt notification vector is used on the receiver's cpu to
identify an interrupt as a user interrupt (and not a kernel interrupt).
Hardware uses the same notification vector to generate an IPI from a
sender's cpu core when the SENDUIPI instruction is executed.

Typically, the kernel shouldn't receive an interrupt with this vector.
However, it is possible that the kernel might receive this vector.

Scenario that can cause the spurious interrupt:

Step	cpu 0 (receiver task)		cpu 1 (sender task)
----	---------------------		-------------------
1	task is running
2					executes SENDUIPI
3					IPI sent
4	context switched out
5	IPI delivered
	(kernel interrupt detected)

A kernel interrupt can be detected, if a receiver task gets scheduled
out after the SENDUIPI-based IPI was sent but before the IPI was
delivered.

The kernel doesn't need to do anything in this case other than receiving
the interrupt and clearing the local APIC. The user interrupt is always
stored in the receiver's UPID before the IPI is generated. When the
receiver gets scheduled back the interrupt would be delivered based on
its UPID.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/hardirq.h     |  3 +++
 arch/x86/include/asm/idtentry.h    |  4 ++++
 arch/x86/include/asm/irq_vectors.h |  5 ++++-
 arch/x86/kernel/idt.c              |  3 +++
 arch/x86/kernel/irq.c              | 33 ++++++++++++++++++++++++++++++
 5 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 275e7fd20310..279afc01f1ac 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -19,6 +19,9 @@ typedef struct {
 	unsigned int kvm_posted_intr_ipis;
 	unsigned int kvm_posted_intr_wakeup_ipis;
 	unsigned int kvm_posted_intr_nested_ipis;
+#endif
+#ifdef CONFIG_X86_USER_INTERRUPTS
+	unsigned int uintr_spurious_count;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..5929a6f9eeee 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -671,6 +671,10 @@ DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_WAKEUP_VECTOR,	sysvec_kvm_posted_intr_wakeup
 DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested_ipi);
 #endif
 
+#ifdef CONFIG_X86_USER_INTERRUPTS
+DECLARE_IDTENTRY_SYSVEC(UINTR_NOTIFICATION_VECTOR,	sysvec_uintr_spurious_interrupt);
+#endif
+
 #if IS_ENABLED(CONFIG_HYPERV)
 DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_hyperv_callback);
 DECLARE_IDTENTRY_SYSVEC(HYPERV_REENLIGHTENMENT_VECTOR,	sysvec_hyperv_reenlightenment);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 43dcb9284208..d26faa504931 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -104,7 +104,10 @@
 #define HYPERV_STIMER0_VECTOR		0xed
 #endif
 
-#define LOCAL_TIMER_VECTOR		0xec
+/* Vector for User interrupt notifications */
+#define UINTR_NOTIFICATION_VECTOR       0xec
+
+#define LOCAL_TIMER_VECTOR		0xeb
 
 #define NR_VECTORS			 256
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index df0fa695bb09..d8c45e0728f0 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -147,6 +147,9 @@ static const __initconst struct idt_data apic_idts[] = {
 	INTG(POSTED_INTR_WAKEUP_VECTOR,		asm_sysvec_kvm_posted_intr_wakeup_ipi),
 	INTG(POSTED_INTR_NESTED_VECTOR,		asm_sysvec_kvm_posted_intr_nested_ipi),
 # endif
+#ifdef CONFIG_X86_USER_INTERRUPTS
+	INTG(UINTR_NOTIFICATION_VECTOR,		asm_sysvec_uintr_spurious_interrupt),
+#endif
 # ifdef CONFIG_IRQ_WORK
 	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
 # endif
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index e28f6a5d14f1..e3c35668c7c5 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -181,6 +181,12 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 		seq_printf(p, "%10u ",
 			   irq_stats(j)->kvm_posted_intr_wakeup_ipis);
 	seq_puts(p, "  Posted-interrupt wakeup event\n");
+#endif
+#ifdef CONFIG_X86_USER_INTERRUPTS
+	seq_printf(p, "%*s: ", prec, "UIS");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", irq_stats(j)->uintr_spurious_count);
+	seq_puts(p, "  User-interrupt spurious event\n");
 #endif
 	return 0;
 }
@@ -325,6 +331,33 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
 }
 #endif
 
+#ifdef CONFIG_X86_USER_INTERRUPTS
+/*
+ * Handler for UINTR_NOTIFICATION_VECTOR.
+ *
+ * The notification vector is used by the cpu to detect a User Interrupt. In
+ * the typical usage, the cpu would handle this interrupt and clear the local
+ * apic.
+ *
+ * However, it is possible that the kernel might receive this vector. This can
+ * happen if the receiver thread was running when the interrupt was sent but it
+ * got scheduled out before the interrupt was delivered. The kernel doesn't
+ * need to do anything other than clearing the local APIC. A pending user
+ * interrupt is always saved in the receiver's UPID which can be referenced
+ * when the receiver gets scheduled back.
+ *
+ * If the kernel receives a storm of these, it could mean an issue with the
+ * kernel's saving and restoring of the User Interrupt MSR state; Specifically,
+ * the notification vector bits in the IA32_UINTR_MISC_MSR.
+ */
+DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_uintr_spurious_interrupt)
+{
+	/* TODO: Add entry-exit tracepoints */
+	ack_APIC_irq();
+	inc_irq_stat(uintr_spurious_count);
+}
+#endif
+
 
 #ifdef CONFIG_HOTPLUG_CPU
 /* A cpu has been removed from cpu_online_mask.  Reset irq affinities. */
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (4 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-23 12:26   ` Greg KH
  2021-09-23 23:52   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 07/13] x86/process/64: Add uintr task context switch support Sohil Mehta
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Any application that wants to receive a user interrupt needs to register
an interrupt handler with the kernel. Add a registration syscall that
sets up the interrupt handler and the related kernel structures for
the task that makes this syscall.

Only one interrupt handler per task can be registered with the
kernel/hardware. Each task has its private interrupt vector space of 64
vectors. The vector registration and the related FD management is
covered later.

Also add an unregister syscall to let a task unregister the interrupt
handler.

The UPID for each receiver task needs to be updated whenever a task gets
context switched or it moves from one cpu to another. This will also be
covered later. The system calls haven't been wired up yet so no real
harm is done if we don't update the UPID right now.

<Code typically in the x86/kernel directory doesn't deal with file
descriptor management. I have kept uintr_fd.c separate to make it easier
to move it somewhere else if needed.>

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/processor.h |   6 +
 arch/x86/include/asm/uintr.h     |  13 ++
 arch/x86/kernel/Makefile         |   1 +
 arch/x86/kernel/uintr_core.c     | 240 +++++++++++++++++++++++++++++++
 arch/x86/kernel/uintr_fd.c       |  58 ++++++++
 5 files changed, 318 insertions(+)
 create mode 100644 arch/x86/include/asm/uintr.h
 create mode 100644 arch/x86/kernel/uintr_core.c
 create mode 100644 arch/x86/kernel/uintr_fd.c

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 9ad2acaaae9b..d229bfac8b4f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -9,6 +9,7 @@ struct task_struct;
 struct mm_struct;
 struct io_bitmap;
 struct vm86;
+struct uintr_receiver;
 
 #include <asm/math_emu.h>
 #include <asm/segment.h>
@@ -529,6 +530,11 @@ struct thread_struct {
 	 */
 	u32			pkru;
 
+#ifdef CONFIG_X86_USER_INTERRUPTS
+	/* User Interrupt state*/
+	struct uintr_receiver	*ui_recv;
+#endif
+
 	/* Floating point and extended processor state */
 	struct fpu		fpu;
 	/*
diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
new file mode 100644
index 000000000000..4f35bd8bd4e0
--- /dev/null
+++ b/arch/x86/include/asm/uintr.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_UINTR_H
+#define _ASM_X86_UINTR_H
+
+#ifdef CONFIG_X86_USER_INTERRUPTS
+
+bool uintr_arch_enabled(void);
+int do_uintr_register_handler(u64 handler);
+int do_uintr_unregister_handler(void);
+
+#endif /* CONFIG_X86_USER_INTERRUPTS */
+
+#endif /* _ASM_X86_UINTR_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8f4e8fa6ed75..060ca9f23e23 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -140,6 +140,7 @@ obj-$(CONFIG_UPROBES)			+= uprobes.o
 obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
 obj-$(CONFIG_TRACING)			+= tracepoint.o
 obj-$(CONFIG_SCHED_MC_PRIO)		+= itmt.o
+obj-$(CONFIG_X86_USER_INTERRUPTS)	+= uintr_fd.o uintr_core.o
 obj-$(CONFIG_X86_UMIP)			+= umip.o
 
 obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
new file mode 100644
index 000000000000..2c6042a6840a
--- /dev/null
+++ b/arch/x86/kernel/uintr_core.c
@@ -0,0 +1,240 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, Intel Corporation.
+ *
+ * Sohil Mehta <sohil.mehta@intel.com>
+ * Jacob Pan <jacob.jun.pan@linux.intel.com>
+ */
+#define pr_fmt(fmt)    "uintr: " fmt
+
+#include <linux/refcount.h>
+#include <linux/sched.h>
+#include <linux/sched/task.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/apic.h>
+#include <asm/fpu/internal.h>
+#include <asm/irq_vectors.h>
+#include <asm/msr.h>
+#include <asm/msr-index.h>
+#include <asm/uintr.h>
+
+/* User Posted Interrupt Descriptor (UPID) */
+struct uintr_upid {
+	struct {
+		u8 status;	/* bit 0: ON, bit 1: SN, bit 2-7: reserved */
+		u8 reserved1;	/* Reserved */
+		u8 nv;		/* Notification vector */
+		u8 reserved2;	/* Reserved */
+		u32 ndst;	/* Notification destination */
+	} nc __packed;		/* Notification control */
+	u64 puir;		/* Posted user interrupt requests */
+} __aligned(64);
+
+/* UPID Notification control status */
+#define UPID_ON		0x0	/* Outstanding notification */
+#define UPID_SN		0x1	/* Suppressed notification */
+
+struct uintr_upid_ctx {
+	struct uintr_upid *upid;
+	refcount_t refs;
+};
+
+struct uintr_receiver {
+	struct uintr_upid_ctx *upid_ctx;
+};
+
+inline bool uintr_arch_enabled(void)
+{
+	return static_cpu_has(X86_FEATURE_UINTR);
+}
+
+static inline bool is_uintr_receiver(struct task_struct *t)
+{
+	return !!t->thread.ui_recv;
+}
+
+static inline u32 cpu_to_ndst(int cpu)
+{
+	u32 apicid = (u32)apic->cpu_present_to_apicid(cpu);
+
+	WARN_ON_ONCE(apicid == BAD_APICID);
+
+	if (!x2apic_enabled())
+		return (apicid << 8) & 0xFF00;
+
+	return apicid;
+}
+
+static void free_upid(struct uintr_upid_ctx *upid_ctx)
+{
+	kfree(upid_ctx->upid);
+	upid_ctx->upid = NULL;
+	kfree(upid_ctx);
+}
+
+/* TODO: UPID needs to be allocated by a KPTI compatible allocator */
+static struct uintr_upid_ctx *alloc_upid(void)
+{
+	struct uintr_upid_ctx *upid_ctx;
+	struct uintr_upid *upid;
+
+	upid_ctx = kzalloc(sizeof(*upid_ctx), GFP_KERNEL);
+	if (!upid_ctx)
+		return NULL;
+
+	upid = kzalloc(sizeof(*upid), GFP_KERNEL);
+
+	if (!upid) {
+		kfree(upid_ctx);
+		return NULL;
+	}
+
+	upid_ctx->upid = upid;
+	refcount_set(&upid_ctx->refs, 1);
+
+	return upid_ctx;
+}
+
+static void put_upid_ref(struct uintr_upid_ctx *upid_ctx)
+{
+	if (refcount_dec_and_test(&upid_ctx->refs))
+		free_upid(upid_ctx);
+}
+
+int do_uintr_unregister_handler(void)
+{
+	struct task_struct *t = current;
+	struct fpu *fpu = &t->thread.fpu;
+	struct uintr_receiver *ui_recv;
+	u64 msr64;
+
+	if (!is_uintr_receiver(t))
+		return -EINVAL;
+
+	pr_debug("recv: Unregister handler and clear MSRs for task=%d\n",
+		 t->pid);
+
+	/*
+	 * TODO: Evaluate usage of fpregs_lock() and get_xsave_addr(). Bugs
+	 * have been reported recently for PASID and WRPKRU.
+	 *
+	 * UPID and ui_recv will be referenced during context switch. Need to
+	 * disable preemption while modifying the MSRs, UPID and ui_recv thread
+	 * struct.
+	 */
+	fpregs_lock();
+
+	/* Clear only the receiver specific state. Sender related state is not modified */
+	if (fpregs_state_valid(fpu, smp_processor_id())) {
+		/* Modify only the relevant bits of the MISC MSR */
+		rdmsrl(MSR_IA32_UINTR_MISC, msr64);
+		msr64 &= ~GENMASK_ULL(39, 32);
+		wrmsrl(MSR_IA32_UINTR_MISC, msr64);
+		wrmsrl(MSR_IA32_UINTR_PD, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_RR, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_HANDLER, 0ULL);
+	} else {
+		struct uintr_state *p;
+
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
+		if (p) {
+			p->handler = 0;
+			p->stack_adjust = 0;
+			p->upid_addr = 0;
+			p->uinv = 0;
+			p->uirr = 0;
+		}
+	}
+
+	ui_recv = t->thread.ui_recv;
+	/*
+	 * Suppress notifications so that no further interrupts are generated
+	 * based on this UPID.
+	 */
+	set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
+
+	put_upid_ref(ui_recv->upid_ctx);
+	kfree(ui_recv);
+	t->thread.ui_recv = NULL;
+
+	fpregs_unlock();
+
+	return 0;
+}
+
+int do_uintr_register_handler(u64 handler)
+{
+	struct uintr_receiver *ui_recv;
+	struct uintr_upid *upid;
+	struct task_struct *t = current;
+	struct fpu *fpu = &t->thread.fpu;
+	u64 misc_msr;
+	int cpu;
+
+	if (is_uintr_receiver(t))
+		return -EBUSY;
+
+	ui_recv = kzalloc(sizeof(*ui_recv), GFP_KERNEL);
+	if (!ui_recv)
+		return -ENOMEM;
+
+	ui_recv->upid_ctx = alloc_upid();
+	if (!ui_recv->upid_ctx) {
+		kfree(ui_recv);
+		pr_debug("recv: alloc upid failed for task=%d\n", t->pid);
+		return -ENOMEM;
+	}
+
+	/*
+	 * TODO: Evaluate usage of fpregs_lock() and get_xsave_addr(). Bugs
+	 * have been reported recently for PASID and WRPKRU.
+	 *
+	 * UPID and ui_recv will be referenced during context switch. Need to
+	 * disable preemption while modifying the MSRs, UPID and ui_recv thread
+	 * struct.
+	 */
+	fpregs_lock();
+
+	cpu = smp_processor_id();
+	upid = ui_recv->upid_ctx->upid;
+	upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
+	upid->nc.ndst = cpu_to_ndst(cpu);
+
+	t->thread.ui_recv = ui_recv;
+
+	if (fpregs_state_valid(fpu, cpu)) {
+		wrmsrl(MSR_IA32_UINTR_HANDLER, handler);
+		wrmsrl(MSR_IA32_UINTR_PD, (u64)ui_recv->upid_ctx->upid);
+
+		/* Set value as size of ABI redzone */
+		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 128);
+
+		/* Modify only the relevant bits of the MISC MSR */
+		rdmsrl(MSR_IA32_UINTR_MISC, misc_msr);
+		misc_msr |= (u64)UINTR_NOTIFICATION_VECTOR << 32;
+		wrmsrl(MSR_IA32_UINTR_MISC, misc_msr);
+	} else {
+		struct xregs_state *xsave;
+		struct uintr_state *p;
+
+		xsave = &fpu->state.xsave;
+		xsave->header.xfeatures |= XFEATURE_MASK_UINTR;
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
+		if (p) {
+			p->handler = handler;
+			p->upid_addr = (u64)ui_recv->upid_ctx->upid;
+			p->stack_adjust = 128;
+			p->uinv = UINTR_NOTIFICATION_VECTOR;
+		}
+	}
+
+	fpregs_unlock();
+
+	pr_debug("recv: task=%d register handler=%llx upid %px\n",
+		 t->pid, handler, upid);
+
+	return 0;
+}
diff --git a/arch/x86/kernel/uintr_fd.c b/arch/x86/kernel/uintr_fd.c
new file mode 100644
index 000000000000..a1a9c105fdab
--- /dev/null
+++ b/arch/x86/kernel/uintr_fd.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2021, Intel Corporation.
+ *
+ * Sohil Mehta <sohil.mehta@intel.com>
+ */
+#define pr_fmt(fmt)	"uintr: " fmt
+
+#include <linux/sched.h>
+#include <linux/syscalls.h>
+
+#include <asm/uintr.h>
+
+/*
+ * sys_uintr_register_handler - setup user interrupt handler for receiver.
+ */
+SYSCALL_DEFINE2(uintr_register_handler, u64 __user *, handler, unsigned int, flags)
+{
+	int ret;
+
+	if (!uintr_arch_enabled())
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	/* TODO: Validate the handler address */
+	if (!handler)
+		return -EFAULT;
+
+	ret = do_uintr_register_handler((u64)handler);
+
+	pr_debug("recv: register handler task=%d flags %d handler %lx ret %d\n",
+		 current->pid, flags, (unsigned long)handler, ret);
+
+	return ret;
+}
+
+/*
+ * sys_uintr_unregister_handler - Teardown user interrupt handler for receiver.
+ */
+SYSCALL_DEFINE1(uintr_unregister_handler, unsigned int, flags)
+{
+	int ret;
+
+	if (!uintr_arch_enabled())
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	ret = do_uintr_unregister_handler();
+
+	pr_debug("recv: unregister handler task=%d flags %d ret %d\n",
+		 current->pid, flags, ret);
+
+	return ret;
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 07/13] x86/process/64: Add uintr task context switch support
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (5 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-24  0:41   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 08/13] x86/process/64: Clean up uintr task fork and exit paths Sohil Mehta
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

User interrupt state is saved and restored using xstate supervisor
feature support. This includes the MSR state and the User Interrupt Flag
(UIF) value.

During context switch update the UPID for a uintr task to reflect the
current state of the task; namely whether the task should receive
interrupt notifications and which cpu the task is currently running on.

XSAVES clears the notification vector (UINV) in the MISC MSR to prevent
interrupts from being recognized in the UIRR MSR while the task is being
context switched. The UINV is restored back when the kernel does an
XRSTORS.

However, this conflicts with the kernel's lazy restore optimization
which skips an XRSTORS if the kernel is scheduling the same user task
back and the underlying MSR state hasn't been modified. Special handling
is needed for a uintr task in the context switch path to keep using this
optimization.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/entry-common.h |  4 ++
 arch/x86/include/asm/uintr.h        |  9 ++++
 arch/x86/kernel/fpu/core.c          |  8 +++
 arch/x86/kernel/process_64.c        |  4 ++
 arch/x86/kernel/uintr_core.c        | 75 +++++++++++++++++++++++++++++
 5 files changed, 100 insertions(+)

diff --git a/arch/x86/include/asm/entry-common.h b/arch/x86/include/asm/entry-common.h
index 14ebd2196569..4e6c4d0912a5 100644
--- a/arch/x86/include/asm/entry-common.h
+++ b/arch/x86/include/asm/entry-common.h
@@ -8,6 +8,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/io_bitmap.h>
 #include <asm/fpu/api.h>
+#include <asm/uintr.h>
 
 /* Check that the stack and regs on entry from user mode are sane. */
 static __always_inline void arch_check_user_regs(struct pt_regs *regs)
@@ -57,6 +58,9 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
 	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	if (static_cpu_has(X86_FEATURE_UINTR))
+		switch_uintr_return();
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
index 4f35bd8bd4e0..f7ccb67014b8 100644
--- a/arch/x86/include/asm/uintr.h
+++ b/arch/x86/include/asm/uintr.h
@@ -8,6 +8,15 @@ bool uintr_arch_enabled(void);
 int do_uintr_register_handler(u64 handler);
 int do_uintr_unregister_handler(void);
 
+/* TODO: Inline the context switch related functions */
+void switch_uintr_prepare(struct task_struct *prev);
+void switch_uintr_return(void);
+
+#else /* !CONFIG_X86_USER_INTERRUPTS */
+
+static inline void switch_uintr_prepare(struct task_struct *prev) {}
+static inline void switch_uintr_return(void) {}
+
 #endif /* CONFIG_X86_USER_INTERRUPTS */
 
 #endif /* _ASM_X86_UINTR_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7ada7bd03a32..e30588bf7ce9 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -95,6 +95,14 @@ EXPORT_SYMBOL(irq_fpu_usable);
  * over the place.
  *
  * FXSAVE and all XSAVE variants preserve the FPU register state.
+ *
+ * When XSAVES is called with XFEATURE_UINTR enabled it
+ * saves the FPU state and clears the interrupt notification
+ * vector byte of the MISC_MSR [bits 39:32]. This is required
+ * to stop detecting additional User Interrupts after we
+ * have saved the FPU state. Before going back to userspace
+ * we would correct this and only program the byte that was
+ * cleared.
  */
 void save_fpregs_to_fpstate(struct fpu *fpu)
 {
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ec0d836a13b1..62b82137db9c 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -53,6 +53,7 @@
 #include <asm/xen/hypervisor.h>
 #include <asm/vdso.h>
 #include <asm/resctrl.h>
+#include <asm/uintr.h>
 #include <asm/unistd.h>
 #include <asm/fsgsbase.h>
 #ifdef CONFIG_IA32_EMULATION
@@ -565,6 +566,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
 		     this_cpu_read(hardirq_stack_inuse));
 
+	if (static_cpu_has(X86_FEATURE_UINTR))
+		switch_uintr_prepare(prev_p);
+
 	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_prepare(prev_fpu, cpu);
 
diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
index 2c6042a6840a..7a29888050ad 100644
--- a/arch/x86/kernel/uintr_core.c
+++ b/arch/x86/kernel/uintr_core.c
@@ -238,3 +238,78 @@ int do_uintr_register_handler(u64 handler)
 
 	return 0;
 }
+
+/* Suppress notifications since this task is being context switched out */
+void switch_uintr_prepare(struct task_struct *prev)
+{
+	struct uintr_upid *upid;
+
+	if (is_uintr_receiver(prev)) {
+		upid = prev->thread.ui_recv->upid_ctx->upid;
+		set_bit(UPID_SN, (unsigned long *)&upid->nc.status);
+	}
+}
+
+/*
+ * Do this right before we are going back to userspace after the FPU has been
+ * reloaded i.e. TIF_NEED_FPU_LOAD is clear.
+ * Called from arch_exit_to_user_mode_prepare() with interrupts disabled.
+ */
+void switch_uintr_return(void)
+{
+	struct uintr_upid *upid;
+	u64 misc_msr;
+
+	if (is_uintr_receiver(current)) {
+		/*
+		 * The XSAVES instruction clears the UINTR notification
+		 * vector(UINV) in the UINT_MISC MSR when user context gets
+		 * saved. Before going back to userspace we need to restore the
+		 * notification vector. XRSTORS would automatically restore the
+		 * notification but we can't be sure that XRSTORS will always
+		 * be called when going back to userspace. Also if XSAVES gets
+		 * called twice the UINV stored in the Xstate buffer will be
+		 * overwritten. Threfore, before going back to userspace we
+		 * always check if the UINV is set and reprogram if needed.
+		 *
+		 * Alternatively, we could combine this with
+		 * switch_fpu_return() and program the MSR whenever we are
+		 * skipping the XRSTORS. We need special precaution to make
+		 * sure the UINV value in the XSTATE buffer doesn't get
+		 * overwritten by calling XSAVES twice.
+		 */
+		WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));
+
+		/* Modify only the relevant bits of the MISC MSR */
+		rdmsrl(MSR_IA32_UINTR_MISC, misc_msr);
+		if (!(misc_msr & GENMASK_ULL(39, 32))) {
+			misc_msr |= (u64)UINTR_NOTIFICATION_VECTOR << 32;
+			wrmsrl(MSR_IA32_UINTR_MISC, misc_msr);
+		}
+
+		/*
+		 * It is necessary to clear the SN bit after we set UINV and
+		 * NDST to avoid incorrect interrupt routing.
+		 */
+		upid = current->thread.ui_recv->upid_ctx->upid;
+		upid->nc.ndst = cpu_to_ndst(smp_processor_id());
+		clear_bit(UPID_SN, (unsigned long *)&upid->nc.status);
+
+		/*
+		 * Interrupts might have accumulated in the UPID while the
+		 * thread was preempted. In this case invoke the hardware
+		 * detection sequence manually by sending a self IPI with UINV.
+		 * Since UINV is set and SN is cleared, any new UINTR
+		 * notifications due to the self IPI or otherwise would result
+		 * in the hardware updating the UIRR directly.
+		 * No real interrupt would be generated as a result of this.
+		 *
+		 * The alternative is to atomically read and clear the UPID and
+		 * program the UIRR. In that case the kernel would need to
+		 * carefully manage the race with the hardware if the UPID gets
+		 * updated after the read.
+		 */
+		if (READ_ONCE(upid->puir))
+			apic->send_IPI_self(UINTR_NOTIFICATION_VECTOR);
+	}
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 08/13] x86/process/64: Clean up uintr task fork and exit paths
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (6 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 07/13] x86/process/64: Add uintr task context switch support Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-24  1:02   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 09/13] x86/uintr: Introduce vector registration and uintr_fd syscall Sohil Mehta
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

The user interrupt MSRs and the user interrupt state is task specific.
During task fork and exit clear the task state, clear the MSRs and
dereference the shared resources.

Some of the memory resources like the UPID are referenced in the file
descriptor and could be in use while the uintr_fd is still valid.
Instead of freeing up  the UPID just dereference it.  Eventually when
every user releases the reference the memory resource will be freed up.

Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/uintr.h |  3 ++
 arch/x86/kernel/fpu/core.c   |  9 ++++++
 arch/x86/kernel/process.c    |  9 ++++++
 arch/x86/kernel/uintr_core.c | 55 ++++++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+)

diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
index f7ccb67014b8..cef4dd81d40e 100644
--- a/arch/x86/include/asm/uintr.h
+++ b/arch/x86/include/asm/uintr.h
@@ -8,12 +8,15 @@ bool uintr_arch_enabled(void);
 int do_uintr_register_handler(u64 handler);
 int do_uintr_unregister_handler(void);
 
+void uintr_free(struct task_struct *task);
+
 /* TODO: Inline the context switch related functions */
 void switch_uintr_prepare(struct task_struct *prev);
 void switch_uintr_return(void);
 
 #else /* !CONFIG_X86_USER_INTERRUPTS */
 
+static inline void uintr_free(struct task_struct *task) {}
 static inline void switch_uintr_prepare(struct task_struct *prev) {}
 static inline void switch_uintr_return(void) {}
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index e30588bf7ce9..c0a54f7aaa2a 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -260,6 +260,7 @@ int fpu_clone(struct task_struct *dst)
 {
 	struct fpu *src_fpu = &current->thread.fpu;
 	struct fpu *dst_fpu = &dst->thread.fpu;
+	struct uintr_state *uintr_state;
 
 	/* The new task's FPU state cannot be valid in the hardware. */
 	dst_fpu->last_cpu = -1;
@@ -284,6 +285,14 @@ int fpu_clone(struct task_struct *dst)
 
 	else
 		save_fpregs_to_fpstate(dst_fpu);
+
+	/* UINTR state is not expected to be inherited (in the current design). */
+	if (static_cpu_has(X86_FEATURE_UINTR)) {
+		uintr_state = get_xsave_addr(&dst_fpu->state.xsave, XFEATURE_UINTR);
+		if (uintr_state)
+			memset(uintr_state, 0, sizeof(*uintr_state));
+	}
+
 	fpregs_unlock();
 
 	set_tsk_thread_flag(dst, TIF_NEED_FPU_LOAD);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 1d9463e3096b..83677f76bd7b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -26,6 +26,7 @@
 #include <linux/elf-randomize.h>
 #include <trace/events/power.h>
 #include <linux/hw_breakpoint.h>
+#include <asm/uintr.h>
 #include <asm/cpu.h>
 #include <asm/apic.h>
 #include <linux/uaccess.h>
@@ -87,6 +88,12 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 #ifdef CONFIG_VM86
 	dst->thread.vm86 = NULL;
 #endif
+
+#ifdef CONFIG_X86_USER_INTERRUPTS
+	/* User Interrupt state is unique for each task */
+	dst->thread.ui_recv = NULL;
+#endif
+
 	return fpu_clone(dst);
 }
 
@@ -103,6 +110,8 @@ void exit_thread(struct task_struct *tsk)
 
 	free_vm86(t);
 
+	uintr_free(tsk);
+
 	fpu__drop(fpu);
 }
 
diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
index 7a29888050ad..a2a13f890139 100644
--- a/arch/x86/kernel/uintr_core.c
+++ b/arch/x86/kernel/uintr_core.c
@@ -313,3 +313,58 @@ void switch_uintr_return(void)
 			apic->send_IPI_self(UINTR_NOTIFICATION_VECTOR);
 	}
 }
+
+/*
+ * This should only be called from exit_thread().
+ * exit_thread() can happen in current context when the current thread is
+ * exiting or it can happen for a new thread that is being created.
+ * For new threads is_uintr_receiver() should fail.
+ */
+void uintr_free(struct task_struct *t)
+{
+	struct uintr_receiver *ui_recv;
+	struct fpu *fpu;
+
+	if (!static_cpu_has(X86_FEATURE_UINTR) || !is_uintr_receiver(t))
+		return;
+
+	if (WARN_ON_ONCE(t != current))
+		return;
+
+	fpu = &t->thread.fpu;
+
+	fpregs_lock();
+
+	if (fpregs_state_valid(fpu, smp_processor_id())) {
+		wrmsrl(MSR_IA32_UINTR_MISC, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_PD, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_RR, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_HANDLER, 0ULL);
+	} else {
+		struct uintr_state *p;
+
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
+		if (p) {
+			p->handler = 0;
+			p->uirr = 0;
+			p->upid_addr = 0;
+			p->stack_adjust = 0;
+			p->uinv = 0;
+		}
+	}
+
+	/* Check: Can a thread be context switched while it is exiting? */
+	ui_recv = t->thread.ui_recv;
+
+	/*
+	 * Suppress notifications so that no further interrupts are
+	 * generated based on this UPID.
+	 */
+	set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
+	put_upid_ref(ui_recv->upid_ctx);
+	kfree(ui_recv);
+	t->thread.ui_recv = NULL;
+
+	fpregs_unlock();
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 09/13] x86/uintr: Introduce vector registration and uintr_fd syscall
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (7 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 08/13] x86/process/64: Clean up uintr task fork and exit paths Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-24 10:33   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls Sohil Mehta
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Each receiving task has its own interrupt vector space of 64 vectors.
For each vector registered by a task create a uintr_fd. Only tasks that
have previously registered a user interrupt handler can register a
vector.

The sender for the user interrupt could be another userspace
application, kernel or an external source (like a device). Any sender
that wants to generate a user interrupt needs access to receiver's
vector number and UPID.  uintr_fd abstracts that information and allows
a sender with access to uintr_fd to connect and generate a user
interrupt. Upon interrupt delivery, the interrupt handler would be
invoked with the associated vector number pushed onto the stack.

Using an FD abstraction automatically provides a secure mechanism to
connect with a receiver. It also makes the tracking and management of
the interrupt vector resource easier for userspace.

uintr_fd can be useful in some of the usages where eventfd is used for
userspace event notifications. Though uintr_fd is nowhere close to a
drop-in replacement, the semantics are meant to be somewhat similar to
an eventfd or the write end of a pipe.

Access to uintr_fd can be achieved in the following ways:
- direct access if the task is part of the same thread group (process)
- inherited by a child process.
- explicitly shared using any of the FD sharing mechanisms.

If the sender is another userspace task, it can use the uintr_fd to send
user IPIs to the receiver. This works in conjunction with the SENDUIPI
instruction. The details related to this are covered later.

The exact APIs for the sender being a kernel or another external source
are still being worked upon. The general idea is that the receiver would
pass the uintr_fd to the kernel by extending some existing API (like
io_uring).

The vector associated with uintr_fd can be unregistered by closing all
references to the uintr_fd.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/uintr.h |  14 ++++
 arch/x86/kernel/uintr_core.c | 129 +++++++++++++++++++++++++++++++++--
 arch/x86/kernel/uintr_fd.c   |  94 +++++++++++++++++++++++++
 3 files changed, 232 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
index cef4dd81d40e..1f00e2a63da4 100644
--- a/arch/x86/include/asm/uintr.h
+++ b/arch/x86/include/asm/uintr.h
@@ -4,9 +4,23 @@
 
 #ifdef CONFIG_X86_USER_INTERRUPTS
 
+struct uintr_upid_ctx {
+	struct task_struct *task;	/* Receiver task */
+	struct uintr_upid *upid;
+	refcount_t refs;
+};
+
+struct uintr_receiver_info {
+	struct uintr_upid_ctx *upid_ctx;	/* UPID context */
+	struct callback_head twork;		/* Task work head */
+	u64 uvec;				/* Vector number */
+};
+
 bool uintr_arch_enabled(void);
 int do_uintr_register_handler(u64 handler);
 int do_uintr_unregister_handler(void);
+int do_uintr_register_vector(struct uintr_receiver_info *r_info);
+void do_uintr_unregister_vector(struct uintr_receiver_info *r_info);
 
 void uintr_free(struct task_struct *task);
 
diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
index a2a13f890139..9dcb9f60e5bc 100644
--- a/arch/x86/kernel/uintr_core.c
+++ b/arch/x86/kernel/uintr_core.c
@@ -11,6 +11,7 @@
 #include <linux/sched.h>
 #include <linux/sched/task.h>
 #include <linux/slab.h>
+#include <linux/task_work.h>
 #include <linux/uaccess.h>
 
 #include <asm/apic.h>
@@ -20,6 +21,8 @@
 #include <asm/msr-index.h>
 #include <asm/uintr.h>
 
+#define UINTR_MAX_UVEC_NR 64
+
 /* User Posted Interrupt Descriptor (UPID) */
 struct uintr_upid {
 	struct {
@@ -36,13 +39,9 @@ struct uintr_upid {
 #define UPID_ON		0x0	/* Outstanding notification */
 #define UPID_SN		0x1	/* Suppressed notification */
 
-struct uintr_upid_ctx {
-	struct uintr_upid *upid;
-	refcount_t refs;
-};
-
 struct uintr_receiver {
 	struct uintr_upid_ctx *upid_ctx;
+	u64 uvec_mask;	/* track active vector per bit */
 };
 
 inline bool uintr_arch_enabled(void)
@@ -69,6 +68,7 @@ static inline u32 cpu_to_ndst(int cpu)
 
 static void free_upid(struct uintr_upid_ctx *upid_ctx)
 {
+	put_task_struct(upid_ctx->task);
 	kfree(upid_ctx->upid);
 	upid_ctx->upid = NULL;
 	kfree(upid_ctx);
@@ -93,6 +93,7 @@ static struct uintr_upid_ctx *alloc_upid(void)
 
 	upid_ctx->upid = upid;
 	refcount_set(&upid_ctx->refs, 1);
+	upid_ctx->task = get_task_struct(current);
 
 	return upid_ctx;
 }
@@ -103,6 +104,77 @@ static void put_upid_ref(struct uintr_upid_ctx *upid_ctx)
 		free_upid(upid_ctx);
 }
 
+static struct uintr_upid_ctx *get_upid_ref(struct uintr_upid_ctx *upid_ctx)
+{
+	refcount_inc(&upid_ctx->refs);
+	return upid_ctx;
+}
+
+static void __clear_vector_from_upid(u64 uvec, struct uintr_upid *upid)
+{
+	clear_bit(uvec, (unsigned long *)&upid->puir);
+}
+
+static void __clear_vector_from_task(u64 uvec)
+{
+	struct task_struct *t = current;
+
+	pr_debug("recv: task=%d free vector %llu\n", t->pid, uvec);
+
+	if (!(BIT_ULL(uvec) & t->thread.ui_recv->uvec_mask))
+		return;
+
+	clear_bit(uvec, (unsigned long *)&t->thread.ui_recv->uvec_mask);
+
+	if (!t->thread.ui_recv->uvec_mask)
+		pr_debug("recv: task=%d unregistered all user vectors\n", t->pid);
+}
+
+/* Callback to clear the vector structures when a vector is unregistered. */
+static void receiver_clear_uvec(struct callback_head *head)
+{
+	struct uintr_receiver_info *r_info;
+	struct uintr_upid_ctx *upid_ctx;
+	struct task_struct *t = current;
+	u64 uvec;
+
+	r_info = container_of(head, struct uintr_receiver_info, twork);
+	uvec = r_info->uvec;
+	upid_ctx = r_info->upid_ctx;
+
+	/*
+	 * If a task has unregistered the interrupt handler the vector
+	 * structures would have already been cleared.
+	 */
+	if (is_uintr_receiver(t)) {
+		/*
+		 * The UPID context in the callback might differ from the one
+		 * on the task if the task unregistered its interrupt handler
+		 * and then registered itself again. The vector structures
+		 * related to the previous UPID would have already been cleared
+		 * in that case.
+		 */
+		if (t->thread.ui_recv->upid_ctx != upid_ctx) {
+			pr_debug("recv: task %d is now using a different UPID\n",
+				 t->pid);
+			goto out_free;
+		}
+
+		/*
+		 * If the vector has been recognized in the UIRR don't modify
+		 * it. We need to disable User Interrupts before modifying the
+		 * UIRR. It might be better to just let that interrupt get
+		 * delivered.
+		 */
+		__clear_vector_from_upid(uvec, upid_ctx->upid);
+		__clear_vector_from_task(uvec);
+	}
+
+out_free:
+	put_upid_ref(upid_ctx);
+	kfree(r_info);
+}
+
 int do_uintr_unregister_handler(void)
 {
 	struct task_struct *t = current;
@@ -239,6 +311,53 @@ int do_uintr_register_handler(u64 handler)
 	return 0;
 }
 
+void do_uintr_unregister_vector(struct uintr_receiver_info *r_info)
+{
+	int ret;
+
+	pr_debug("recv: Adding task work to clear vector %llu added for task=%d\n",
+		 r_info->uvec, r_info->upid_ctx->task->pid);
+
+	init_task_work(&r_info->twork, receiver_clear_uvec);
+	ret = task_work_add(r_info->upid_ctx->task, &r_info->twork, true);
+	if (ret) {
+		pr_debug("recv: Clear vector task=%d has already exited\n",
+			 r_info->upid_ctx->task->pid);
+		put_upid_ref(r_info->upid_ctx);
+		kfree(r_info);
+		return;
+	}
+}
+
+int do_uintr_register_vector(struct uintr_receiver_info *r_info)
+{
+	struct uintr_receiver *ui_recv;
+	struct task_struct *t = current;
+
+	/*
+	 * A vector should only be registered by a task that
+	 * has an interrupt handler registered.
+	 */
+	if (!is_uintr_receiver(t))
+		return -EINVAL;
+
+	if (r_info->uvec >= UINTR_MAX_UVEC_NR)
+		return -ENOSPC;
+
+	ui_recv = t->thread.ui_recv;
+
+	if (ui_recv->uvec_mask & BIT_ULL(r_info->uvec))
+		return -EBUSY;
+
+	ui_recv->uvec_mask |= BIT_ULL(r_info->uvec);
+	pr_debug("recv: task %d new uvec=%llu, new mask %llx\n",
+		 t->pid, r_info->uvec, ui_recv->uvec_mask);
+
+	r_info->upid_ctx = get_upid_ref(ui_recv->upid_ctx);
+
+	return 0;
+}
+
 /* Suppress notifications since this task is being context switched out */
 void switch_uintr_prepare(struct task_struct *prev)
 {
diff --git a/arch/x86/kernel/uintr_fd.c b/arch/x86/kernel/uintr_fd.c
index a1a9c105fdab..f0548bbac776 100644
--- a/arch/x86/kernel/uintr_fd.c
+++ b/arch/x86/kernel/uintr_fd.c
@@ -6,11 +6,105 @@
  */
 #define pr_fmt(fmt)	"uintr: " fmt
 
+#include <linux/anon_inodes.h>
+#include <linux/fdtable.h>
 #include <linux/sched.h>
 #include <linux/syscalls.h>
 
 #include <asm/uintr.h>
 
+struct uintrfd_ctx {
+	struct uintr_receiver_info *r_info;
+};
+
+#ifdef CONFIG_PROC_FS
+static void uintrfd_show_fdinfo(struct seq_file *m, struct file *file)
+{
+	struct uintrfd_ctx *uintrfd_ctx = file->private_data;
+
+	/* Check: Should we print the receiver and sender info here? */
+	seq_printf(m, "user_vector:%llu\n", uintrfd_ctx->r_info->uvec);
+}
+#endif
+
+static int uintrfd_release(struct inode *inode, struct file *file)
+{
+	struct uintrfd_ctx *uintrfd_ctx = file->private_data;
+
+	pr_debug("recv: Release uintrfd for r_task %d uvec %llu\n",
+		 uintrfd_ctx->r_info->upid_ctx->task->pid,
+		 uintrfd_ctx->r_info->uvec);
+
+	do_uintr_unregister_vector(uintrfd_ctx->r_info);
+	kfree(uintrfd_ctx);
+
+	return 0;
+}
+
+static const struct file_operations uintrfd_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= uintrfd_show_fdinfo,
+#endif
+	.release	= uintrfd_release,
+	.llseek		= noop_llseek,
+};
+
+/*
+ * sys_uintr_create_fd - Create a uintr_fd for the registered interrupt vector.
+ */
+SYSCALL_DEFINE2(uintr_create_fd, u64, vector, unsigned int, flags)
+{
+	struct uintrfd_ctx *uintrfd_ctx;
+	int uintrfd;
+	int ret;
+
+	if (!uintr_arch_enabled())
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	uintrfd_ctx = kzalloc(sizeof(*uintrfd_ctx), GFP_KERNEL);
+	if (!uintrfd_ctx)
+		return -ENOMEM;
+
+	uintrfd_ctx->r_info = kzalloc(sizeof(*uintrfd_ctx->r_info), GFP_KERNEL);
+	if (!uintrfd_ctx->r_info) {
+		ret = -ENOMEM;
+		goto out_free_ctx;
+	}
+
+	uintrfd_ctx->r_info->uvec = vector;
+	ret = do_uintr_register_vector(uintrfd_ctx->r_info);
+	if (ret) {
+		kfree(uintrfd_ctx->r_info);
+		goto out_free_ctx;
+	}
+
+	/* TODO: Get user input for flags - UFD_CLOEXEC */
+	/* Check: Do we need O_NONBLOCK? */
+	uintrfd = anon_inode_getfd("[uintrfd]", &uintrfd_fops, uintrfd_ctx,
+				   O_RDONLY | O_CLOEXEC | O_NONBLOCK);
+
+	if (uintrfd < 0) {
+		ret = uintrfd;
+		goto out_free_uvec;
+	}
+
+	pr_debug("recv: Alloc vector success uintrfd %d uvec %llu for task=%d\n",
+		 uintrfd, uintrfd_ctx->r_info->uvec, current->pid);
+
+	return uintrfd;
+
+out_free_uvec:
+	do_uintr_unregister_vector(uintrfd_ctx->r_info);
+out_free_ctx:
+	kfree(uintrfd_ctx);
+	pr_debug("recv: Alloc vector failed for task=%d ret %d\n",
+		 current->pid, ret);
+	return ret;
+}
+
 /*
  * sys_uintr_register_handler - setup user interrupt handler for receiver.
  */
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (8 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 09/13] x86/uintr: Introduce vector registration and uintr_fd syscall Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-23 12:28   ` Greg KH
  2021-09-24 10:54   ` Thomas Gleixner
  2021-09-13 20:01 ` [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall Sohil Mehta
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Add a registration syscall for a task to register itself as a user
interrupt sender using the uintr_fd generated by the receiver. A task
can register multiple uintr_fds. Each unique successful connection
creates a new entry in the User Interrupt Target Table (UITT).

Each entry in the UITT table is referred by the UITT index (uipi_index).
The uipi_index returned during the registration syscall lets a sender
generate a user IPI using the 'SENDUIPI <uipi_index>' instruction.

Also, add a sender unregister syscall to unregister a particular task
from the uintr_fd. Calling close on the uintr_fd will disconnect all
threads in a sender process from that FD.

Currently, the UITT size is arbitrarily chosen as 256 entries
corresponding to a 4KB page. Based on feedback and usage data this can
either be increased/decreased or made dynamic later.

Architecturally, the UITT table can be unique for each thread or shared
across threads of the same thread group. The current implementation
keeps the UITT as unique for the each thread. This makes the kernel
implementation relatively simple and only threads that use uintr get
setup with the related structures. However, this means that the
uipi_index for each thread would be inconsistent wrt to other threads.
(Executing 'SENDUIPI 2' on threads of the same process could generate
different user interrupts.)

Alternatively, the benefit of sharing the UITT table is that all threads
would see the same view of the UITT table. Also the kernel UITT memory
allocation would be more efficient if multiple threads connect to the
same uintr_fd. However, this would mean the kernel needs to keep the
UITT table size MISC_MSR[] in sync across these threads. Also the
UPID/UITT teardown flows might need additional consideration.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/processor.h |   2 +
 arch/x86/include/asm/uintr.h     |  15 ++
 arch/x86/kernel/process.c        |   1 +
 arch/x86/kernel/uintr_core.c     | 355 ++++++++++++++++++++++++++++++-
 arch/x86/kernel/uintr_fd.c       | 133 ++++++++++++
 5 files changed, 495 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index d229bfac8b4f..3482c3182e39 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -10,6 +10,7 @@ struct mm_struct;
 struct io_bitmap;
 struct vm86;
 struct uintr_receiver;
+struct uintr_sender;
 
 #include <asm/math_emu.h>
 #include <asm/segment.h>
@@ -533,6 +534,7 @@ struct thread_struct {
 #ifdef CONFIG_X86_USER_INTERRUPTS
 	/* User Interrupt state*/
 	struct uintr_receiver	*ui_recv;
+	struct uintr_sender	*ui_send;
 #endif
 
 	/* Floating point and extended processor state */
diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
index 1f00e2a63da4..ef3521dd7fb9 100644
--- a/arch/x86/include/asm/uintr.h
+++ b/arch/x86/include/asm/uintr.h
@@ -8,6 +8,7 @@ struct uintr_upid_ctx {
 	struct task_struct *task;	/* Receiver task */
 	struct uintr_upid *upid;
 	refcount_t refs;
+	bool receiver_active;		/* Flag for UPID being mapped to a receiver */
 };
 
 struct uintr_receiver_info {
@@ -16,12 +17,26 @@ struct uintr_receiver_info {
 	u64 uvec;				/* Vector number */
 };
 
+struct uintr_sender_info {
+	struct list_head node;
+	struct uintr_uitt_ctx *uitt_ctx;
+	struct task_struct *task;
+	struct uintr_upid_ctx *r_upid_ctx;	/* Receiver's UPID context */
+	struct callback_head twork;		/* Task work head */
+	unsigned int uitt_index;
+};
+
 bool uintr_arch_enabled(void);
 int do_uintr_register_handler(u64 handler);
 int do_uintr_unregister_handler(void);
 int do_uintr_register_vector(struct uintr_receiver_info *r_info);
 void do_uintr_unregister_vector(struct uintr_receiver_info *r_info);
 
+int do_uintr_register_sender(struct uintr_receiver_info *r_info,
+			     struct uintr_sender_info *s_info);
+void do_uintr_unregister_sender(struct uintr_receiver_info *r_info,
+				struct uintr_sender_info *s_info);
+
 void uintr_free(struct task_struct *task);
 
 /* TODO: Inline the context switch related functions */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 83677f76bd7b..9db33e467b30 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -92,6 +92,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 #ifdef CONFIG_X86_USER_INTERRUPTS
 	/* User Interrupt state is unique for each task */
 	dst->thread.ui_recv = NULL;
+	dst->thread.ui_send = NULL;
 #endif
 
 	return fpu_clone(dst);
diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
index 9dcb9f60e5bc..8f331c5fe0cf 100644
--- a/arch/x86/kernel/uintr_core.c
+++ b/arch/x86/kernel/uintr_core.c
@@ -21,6 +21,11 @@
 #include <asm/msr-index.h>
 #include <asm/uintr.h>
 
+/*
+ * Each UITT entry is 16 bytes in size.
+ * Current UITT table size is set as 4KB (256 * 16 bytes)
+ */
+#define UINTR_MAX_UITT_NR 256
 #define UINTR_MAX_UVEC_NR 64
 
 /* User Posted Interrupt Descriptor (UPID) */
@@ -44,6 +49,27 @@ struct uintr_receiver {
 	u64 uvec_mask;	/* track active vector per bit */
 };
 
+/* User Interrupt Target Table Entry (UITTE) */
+struct uintr_uitt_entry {
+	u8	valid;			/* bit 0: valid, bit 1-7: reserved */
+	u8	user_vec;
+	u8	reserved[6];
+	u64	target_upid_addr;
+} __packed __aligned(16);
+
+struct uintr_uitt_ctx {
+	struct uintr_uitt_entry *uitt;
+	/* Protect UITT */
+	spinlock_t uitt_lock;
+	refcount_t refs;
+};
+
+struct uintr_sender {
+	struct uintr_uitt_ctx *uitt_ctx;
+	/* track active uitt entries per bit */
+	u64 uitt_mask[BITS_TO_U64(UINTR_MAX_UITT_NR)];
+};
+
 inline bool uintr_arch_enabled(void)
 {
 	return static_cpu_has(X86_FEATURE_UINTR);
@@ -54,6 +80,36 @@ static inline bool is_uintr_receiver(struct task_struct *t)
 	return !!t->thread.ui_recv;
 }
 
+static inline bool is_uintr_sender(struct task_struct *t)
+{
+	return !!t->thread.ui_send;
+}
+
+static inline bool is_uintr_task(struct task_struct *t)
+{
+	return(is_uintr_receiver(t) || is_uintr_sender(t));
+}
+
+static inline bool is_uitt_empty(struct task_struct *t)
+{
+	return !!bitmap_empty((unsigned long *)t->thread.ui_send->uitt_mask,
+			      UINTR_MAX_UITT_NR);
+}
+
+/*
+ * No lock is needed to read the active flag. Writes only happen from
+ * r_info->task that owns the UPID. Everyone else would just read this flag.
+ *
+ * This only provides a static check. The receiver may become inactive right
+ * after this check. The primary reason to have this check is to prevent future
+ * senders from connecting with this UPID, since the receiver task has already
+ * made this UPID inactive.
+ */
+static bool uintr_is_receiver_active(struct uintr_receiver_info *r_info)
+{
+	return r_info->upid_ctx->receiver_active;
+}
+
 static inline u32 cpu_to_ndst(int cpu)
 {
 	u32 apicid = (u32)apic->cpu_present_to_apicid(cpu);
@@ -94,6 +150,7 @@ static struct uintr_upid_ctx *alloc_upid(void)
 	upid_ctx->upid = upid;
 	refcount_set(&upid_ctx->refs, 1);
 	upid_ctx->task = get_task_struct(current);
+	upid_ctx->receiver_active = true;
 
 	return upid_ctx;
 }
@@ -110,6 +167,64 @@ static struct uintr_upid_ctx *get_upid_ref(struct uintr_upid_ctx *upid_ctx)
 	return upid_ctx;
 }
 
+static void free_uitt(struct uintr_uitt_ctx *uitt_ctx)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&uitt_ctx->uitt_lock, flags);
+	kfree(uitt_ctx->uitt);
+	uitt_ctx->uitt = NULL;
+	spin_unlock_irqrestore(&uitt_ctx->uitt_lock, flags);
+
+	kfree(uitt_ctx);
+}
+
+/* TODO: Replace UITT allocation with KPTI compatible memory allocator */
+static struct uintr_uitt_ctx *alloc_uitt(void)
+{
+	struct uintr_uitt_ctx *uitt_ctx;
+	struct uintr_uitt_entry *uitt;
+
+	uitt_ctx = kzalloc(sizeof(*uitt_ctx), GFP_KERNEL);
+	if (!uitt_ctx)
+		return NULL;
+
+	uitt = kzalloc(sizeof(*uitt) * UINTR_MAX_UITT_NR, GFP_KERNEL);
+	if (!uitt) {
+		kfree(uitt_ctx);
+		return NULL;
+	}
+
+	uitt_ctx->uitt = uitt;
+	spin_lock_init(&uitt_ctx->uitt_lock);
+	refcount_set(&uitt_ctx->refs, 1);
+
+	return uitt_ctx;
+}
+
+static void put_uitt_ref(struct uintr_uitt_ctx *uitt_ctx)
+{
+	if (refcount_dec_and_test(&uitt_ctx->refs))
+		free_uitt(uitt_ctx);
+}
+
+static struct uintr_uitt_ctx *get_uitt_ref(struct uintr_uitt_ctx *uitt_ctx)
+{
+	refcount_inc(&uitt_ctx->refs);
+	return uitt_ctx;
+}
+
+static inline void mark_uitte_invalid(struct uintr_sender_info *s_info)
+{
+	struct uintr_uitt_entry *uitte;
+	unsigned long flags;
+
+	spin_lock_irqsave(&s_info->uitt_ctx->uitt_lock, flags);
+	uitte = &s_info->uitt_ctx->uitt[s_info->uitt_index];
+	uitte->valid = 0;
+	spin_unlock_irqrestore(&s_info->uitt_ctx->uitt_lock, flags);
+}
+
 static void __clear_vector_from_upid(u64 uvec, struct uintr_upid *upid)
 {
 	clear_bit(uvec, (unsigned long *)&upid->puir);
@@ -175,6 +290,210 @@ static void receiver_clear_uvec(struct callback_head *head)
 	kfree(r_info);
 }
 
+static void teardown_uitt(void)
+{
+	struct task_struct *t = current;
+	struct fpu *fpu = &t->thread.fpu;
+	u64 msr64;
+
+	put_uitt_ref(t->thread.ui_send->uitt_ctx);
+	kfree(t->thread.ui_send);
+	t->thread.ui_send = NULL;
+
+	fpregs_lock();
+
+	if (fpregs_state_valid(fpu, smp_processor_id())) {
+		/* Modify only the relevant bits of the MISC MSR */
+		rdmsrl(MSR_IA32_UINTR_MISC, msr64);
+		msr64 &= GENMASK_ULL(63, 32);
+		wrmsrl(MSR_IA32_UINTR_MISC, msr64);
+		wrmsrl(MSR_IA32_UINTR_TT, 0ULL);
+	} else {
+		struct uintr_state *p;
+
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
+		if (p) {
+			p->uitt_size = 0;
+			p->uitt_addr = 0;
+		}
+	}
+
+	fpregs_unlock();
+}
+
+static int init_uitt(void)
+{
+	struct task_struct *t = current;
+	struct fpu *fpu = &t->thread.fpu;
+	struct uintr_sender *ui_send;
+	u64 msr64;
+
+	ui_send = kzalloc(sizeof(*t->thread.ui_send), GFP_KERNEL);
+	if (!ui_send)
+		return -ENOMEM;
+
+	ui_send->uitt_ctx = alloc_uitt();
+	if (!ui_send->uitt_ctx) {
+		pr_debug("send: Alloc UITT failed for task=%d\n", t->pid);
+		kfree(ui_send);
+		return -ENOMEM;
+	}
+
+	fpregs_lock();
+
+	if (fpregs_state_valid(fpu, smp_processor_id())) {
+		wrmsrl(MSR_IA32_UINTR_TT, (u64)ui_send->uitt_ctx->uitt | 1);
+		/* Modify only the relevant bits of the MISC MSR */
+		rdmsrl(MSR_IA32_UINTR_MISC, msr64);
+		msr64 &= GENMASK_ULL(63, 32);
+		msr64 |= UINTR_MAX_UITT_NR;
+		wrmsrl(MSR_IA32_UINTR_MISC, msr64);
+	} else {
+		struct xregs_state *xsave;
+		struct uintr_state *p;
+
+		xsave = &fpu->state.xsave;
+		xsave->header.xfeatures |= XFEATURE_MASK_UINTR;
+		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
+		if (p) {
+			p->uitt_size = UINTR_MAX_UITT_NR;
+			p->uitt_addr = (u64)ui_send->uitt_ctx->uitt | 1;
+		}
+	}
+
+	fpregs_unlock();
+
+	pr_debug("send: Setup a new UITT=%px for task=%d with size %d\n",
+		 ui_send->uitt_ctx->uitt, t->pid, UINTR_MAX_UITT_NR * 16);
+
+	t->thread.ui_send = ui_send;
+
+	return 0;
+}
+
+static void __free_uitt_entry(unsigned int entry)
+{
+	struct task_struct *t = current;
+	unsigned long flags;
+
+	if (entry >= UINTR_MAX_UITT_NR)
+		return;
+
+	if (!is_uintr_sender(t))
+		return;
+
+	pr_debug("send: Freeing UITTE entry %d for task=%d\n", entry, t->pid);
+
+	spin_lock_irqsave(&t->thread.ui_send->uitt_ctx->uitt_lock, flags);
+	memset(&t->thread.ui_send->uitt_ctx->uitt[entry], 0,
+	       sizeof(struct uintr_uitt_entry));
+	spin_unlock_irqrestore(&t->thread.ui_send->uitt_ctx->uitt_lock, flags);
+
+	clear_bit(entry, (unsigned long *)t->thread.ui_send->uitt_mask);
+
+	if (is_uitt_empty(t)) {
+		pr_debug("send: UITT mask is empty. Dereference and teardown UITT\n");
+		teardown_uitt();
+	}
+}
+
+static void sender_free_uitte(struct callback_head *head)
+{
+	struct uintr_sender_info *s_info;
+
+	s_info = container_of(head, struct uintr_sender_info, twork);
+
+	__free_uitt_entry(s_info->uitt_index);
+	put_uitt_ref(s_info->uitt_ctx);
+	put_upid_ref(s_info->r_upid_ctx);
+	put_task_struct(s_info->task);
+	kfree(s_info);
+}
+
+void do_uintr_unregister_sender(struct uintr_receiver_info *r_info,
+				struct uintr_sender_info *s_info)
+{
+	int ret;
+
+	/*
+	 * To make sure any new senduipi result in a #GP fault.
+	 * The task work might take non-zero time to kick the process out.
+	 */
+	mark_uitte_invalid(s_info);
+
+	pr_debug("send: Adding Free UITTE %d task work for task=%d\n",
+		 s_info->uitt_index, s_info->task->pid);
+
+	init_task_work(&s_info->twork, sender_free_uitte);
+	ret = task_work_add(s_info->task, &s_info->twork, true);
+	if (ret) {
+		/*
+		 * Dereferencing the UITT and UPID here since the task has
+		 * exited.
+		 */
+		pr_debug("send: Free UITTE %d task=%d has already exited\n",
+			 s_info->uitt_index, s_info->task->pid);
+		put_upid_ref(s_info->r_upid_ctx);
+		put_uitt_ref(s_info->uitt_ctx);
+		put_task_struct(s_info->task);
+		kfree(s_info);
+		return;
+	}
+}
+
+int do_uintr_register_sender(struct uintr_receiver_info *r_info,
+			     struct uintr_sender_info *s_info)
+{
+	struct uintr_uitt_entry *uitte = NULL;
+	struct uintr_sender *ui_send;
+	struct task_struct *t = current;
+	unsigned long flags;
+	int entry;
+	int ret;
+
+	/*
+	 * Only a static check. Receiver could exit anytime after this check.
+	 * This check only prevents connections using uintr_fd after the
+	 * receiver has already exited/unregistered.
+	 */
+	if (!uintr_is_receiver_active(r_info))
+		return -ESHUTDOWN;
+
+	if (is_uintr_sender(t)) {
+		entry = find_first_zero_bit((unsigned long *)t->thread.ui_send->uitt_mask,
+					    UINTR_MAX_UITT_NR);
+		if (entry >= UINTR_MAX_UITT_NR)
+			return -ENOSPC;
+	} else {
+		BUILD_BUG_ON(UINTR_MAX_UITT_NR < 1);
+		entry = 0;
+		ret = init_uitt();
+		if (ret)
+			return ret;
+	}
+
+	ui_send = t->thread.ui_send;
+
+	set_bit(entry, (unsigned long *)ui_send->uitt_mask);
+
+	spin_lock_irqsave(&ui_send->uitt_ctx->uitt_lock, flags);
+	uitte = &ui_send->uitt_ctx->uitt[entry];
+	pr_debug("send: sender=%d receiver=%d UITTE entry %d address %px\n",
+		 current->pid, r_info->upid_ctx->task->pid, entry, uitte);
+
+	uitte->user_vec = r_info->uvec;
+	uitte->target_upid_addr = (u64)r_info->upid_ctx->upid;
+	uitte->valid = 1;
+	spin_unlock_irqrestore(&ui_send->uitt_ctx->uitt_lock, flags);
+
+	s_info->r_upid_ctx = get_upid_ref(r_info->upid_ctx);
+	s_info->uitt_ctx = get_uitt_ref(ui_send->uitt_ctx);
+	s_info->task = get_task_struct(current);
+	s_info->uitt_index = entry;
+
+	return 0;
+}
+
 int do_uintr_unregister_handler(void)
 {
 	struct task_struct *t = current;
@@ -222,6 +541,8 @@ int do_uintr_unregister_handler(void)
 	}
 
 	ui_recv = t->thread.ui_recv;
+	ui_recv->upid_ctx->receiver_active = false;
+
 	/*
 	 * Suppress notifications so that no further interrupts are generated
 	 * based on this UPID.
@@ -437,14 +758,14 @@ void switch_uintr_return(void)
  * This should only be called from exit_thread().
  * exit_thread() can happen in current context when the current thread is
  * exiting or it can happen for a new thread that is being created.
- * For new threads is_uintr_receiver() should fail.
+ * For new threads is_uintr_task() should fail.
  */
 void uintr_free(struct task_struct *t)
 {
 	struct uintr_receiver *ui_recv;
 	struct fpu *fpu;
 
-	if (!static_cpu_has(X86_FEATURE_UINTR) || !is_uintr_receiver(t))
+	if (!static_cpu_has(X86_FEATURE_UINTR) || !is_uintr_task(t))
 		return;
 
 	if (WARN_ON_ONCE(t != current))
@@ -456,6 +777,7 @@ void uintr_free(struct task_struct *t)
 
 	if (fpregs_state_valid(fpu, smp_processor_id())) {
 		wrmsrl(MSR_IA32_UINTR_MISC, 0ULL);
+		wrmsrl(MSR_IA32_UINTR_TT, 0ULL);
 		wrmsrl(MSR_IA32_UINTR_PD, 0ULL);
 		wrmsrl(MSR_IA32_UINTR_RR, 0ULL);
 		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 0ULL);
@@ -470,20 +792,31 @@ void uintr_free(struct task_struct *t)
 			p->upid_addr = 0;
 			p->stack_adjust = 0;
 			p->uinv = 0;
+			p->uitt_addr = 0;
+			p->uitt_size = 0;
 		}
 	}
 
 	/* Check: Can a thread be context switched while it is exiting? */
-	ui_recv = t->thread.ui_recv;
+	if (is_uintr_receiver(t)) {
+		ui_recv = t->thread.ui_recv;
 
-	/*
-	 * Suppress notifications so that no further interrupts are
-	 * generated based on this UPID.
-	 */
-	set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
-	put_upid_ref(ui_recv->upid_ctx);
-	kfree(ui_recv);
-	t->thread.ui_recv = NULL;
+		/*
+		 * Suppress notifications so that no further interrupts are
+		 * generated based on this UPID.
+		 */
+		set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
+		ui_recv->upid_ctx->receiver_active = false;
+		put_upid_ref(ui_recv->upid_ctx);
+		kfree(ui_recv);
+		t->thread.ui_recv = NULL;
+	}
 
 	fpregs_unlock();
+
+	if (is_uintr_sender(t)) {
+		put_uitt_ref(t->thread.ui_send->uitt_ctx);
+		kfree(t->thread.ui_send);
+		t->thread.ui_send = NULL;
+	}
 }
diff --git a/arch/x86/kernel/uintr_fd.c b/arch/x86/kernel/uintr_fd.c
index f0548bbac776..3c82c032c0b9 100644
--- a/arch/x86/kernel/uintr_fd.c
+++ b/arch/x86/kernel/uintr_fd.c
@@ -15,6 +15,9 @@
 
 struct uintrfd_ctx {
 	struct uintr_receiver_info *r_info;
+	/* Protect sender_list */
+	spinlock_t sender_lock;
+	struct list_head sender_list;
 };
 
 #ifdef CONFIG_PROC_FS
@@ -30,11 +33,20 @@ static void uintrfd_show_fdinfo(struct seq_file *m, struct file *file)
 static int uintrfd_release(struct inode *inode, struct file *file)
 {
 	struct uintrfd_ctx *uintrfd_ctx = file->private_data;
+	struct uintr_sender_info *s_info, *tmp;
+	unsigned long flags;
 
 	pr_debug("recv: Release uintrfd for r_task %d uvec %llu\n",
 		 uintrfd_ctx->r_info->upid_ctx->task->pid,
 		 uintrfd_ctx->r_info->uvec);
 
+	spin_lock_irqsave(&uintrfd_ctx->sender_lock, flags);
+	list_for_each_entry_safe(s_info, tmp, &uintrfd_ctx->sender_list, node) {
+		list_del(&s_info->node);
+		do_uintr_unregister_sender(uintrfd_ctx->r_info, s_info);
+	}
+	spin_unlock_irqrestore(&uintrfd_ctx->sender_lock, flags);
+
 	do_uintr_unregister_vector(uintrfd_ctx->r_info);
 	kfree(uintrfd_ctx);
 
@@ -81,6 +93,9 @@ SYSCALL_DEFINE2(uintr_create_fd, u64, vector, unsigned int, flags)
 		goto out_free_ctx;
 	}
 
+	INIT_LIST_HEAD(&uintrfd_ctx->sender_list);
+	spin_lock_init(&uintrfd_ctx->sender_lock);
+
 	/* TODO: Get user input for flags - UFD_CLOEXEC */
 	/* Check: Do we need O_NONBLOCK? */
 	uintrfd = anon_inode_getfd("[uintrfd]", &uintrfd_fops, uintrfd_ctx,
@@ -150,3 +165,121 @@ SYSCALL_DEFINE1(uintr_unregister_handler, unsigned int, flags)
 
 	return ret;
 }
+
+/*
+ * sys_uintr_register_sender - setup user inter-processor interrupt sender.
+ */
+SYSCALL_DEFINE2(uintr_register_sender, int, uintrfd, unsigned int, flags)
+{
+	struct uintr_sender_info *s_info;
+	struct uintrfd_ctx *uintrfd_ctx;
+	unsigned long lock_flags;
+	struct file *uintr_f;
+	struct fd f;
+	int ret = 0;
+
+	if (!uintr_arch_enabled())
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	f = fdget(uintrfd);
+	uintr_f = f.file;
+	if (!uintr_f)
+		return -EBADF;
+
+	if (uintr_f->f_op != &uintrfd_fops) {
+		ret = -EOPNOTSUPP;
+		goto out_fdput;
+	}
+
+	uintrfd_ctx = (struct uintrfd_ctx *)uintr_f->private_data;
+
+	spin_lock_irqsave(&uintrfd_ctx->sender_lock, lock_flags);
+	list_for_each_entry(s_info, &uintrfd_ctx->sender_list, node) {
+		if (s_info->task == current) {
+			ret = -EISCONN;
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&uintrfd_ctx->sender_lock, lock_flags);
+
+	if (ret)
+		goto out_fdput;
+
+	s_info = kzalloc(sizeof(*s_info), GFP_KERNEL);
+	if (!s_info) {
+		ret = -ENOMEM;
+		goto out_fdput;
+	}
+
+	ret = do_uintr_register_sender(uintrfd_ctx->r_info, s_info);
+	if (ret) {
+		kfree(s_info);
+		goto out_fdput;
+	}
+
+	spin_lock_irqsave(&uintrfd_ctx->sender_lock, lock_flags);
+	list_add(&s_info->node, &uintrfd_ctx->sender_list);
+	spin_unlock_irqrestore(&uintrfd_ctx->sender_lock, lock_flags);
+
+	ret = s_info->uitt_index;
+
+out_fdput:
+	pr_debug("send: register sender task=%d flags %d ret(uipi_id)=%d\n",
+		 current->pid, flags, ret);
+
+	fdput(f);
+	return ret;
+}
+
+/*
+ * sys_uintr_unregister_sender - Unregister user inter-processor interrupt sender.
+ */
+SYSCALL_DEFINE2(uintr_unregister_sender, int, uintrfd, unsigned int, flags)
+{
+	struct uintr_sender_info *s_info;
+	struct uintrfd_ctx *uintrfd_ctx;
+	struct file *uintr_f;
+	unsigned long lock_flags;
+	struct fd f;
+	int ret;
+
+	if (!uintr_arch_enabled())
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	f = fdget(uintrfd);
+	uintr_f = f.file;
+	if (!uintr_f)
+		return -EBADF;
+
+	if (uintr_f->f_op != &uintrfd_fops) {
+		ret = -EOPNOTSUPP;
+		goto out_fdput;
+	}
+
+	uintrfd_ctx = (struct uintrfd_ctx *)uintr_f->private_data;
+
+	ret = -EINVAL;
+	spin_lock_irqsave(&uintrfd_ctx->sender_lock, lock_flags);
+	list_for_each_entry(s_info, &uintrfd_ctx->sender_list, node) {
+		if (s_info->task == current) {
+			ret = 0;
+			list_del(&s_info->node);
+			do_uintr_unregister_sender(uintrfd_ctx->r_info, s_info);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&uintrfd_ctx->sender_lock, lock_flags);
+
+	pr_debug("send: unregister sender uintrfd %d for task=%d ret %d\n",
+		 uintrfd, current->pid, ret);
+
+out_fdput:
+	fdput(f);
+	return ret;
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (9 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-24 11:04   ` Thomas Gleixner
                     ` (2 more replies)
  2021-09-13 20:01 ` [RFC PATCH 12/13] x86/uintr: Wire up the user interrupt syscalls Sohil Mehta
                   ` (6 subsequent siblings)
  17 siblings, 3 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Add a new system call to allow applications to block in the kernel and
wait for user interrupts.

<The current implementation doesn't support waking up from other
blocking system calls like sleep(), read(), epoll(), etc.

uintr_wait() is a placeholder syscall while we decide on that
behaviour.>

When the application makes this syscall the notification vector is
switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
interrupt which is then used to wake up the process.

Currently, the task wait list is global one. To make the implementation
scalable there is a need to move to a distributed per-cpu wait list.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/include/asm/hardirq.h     |  1 +
 arch/x86/include/asm/idtentry.h    |  1 +
 arch/x86/include/asm/irq_vectors.h |  3 +-
 arch/x86/include/asm/uintr.h       | 22 +++++++
 arch/x86/kernel/idt.c              |  1 +
 arch/x86/kernel/irq.c              | 18 ++++++
 arch/x86/kernel/uintr_core.c       | 94 ++++++++++++++++++++++++------
 arch/x86/kernel/uintr_fd.c         | 15 +++++
 8 files changed, 136 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/hardirq.h b/arch/x86/include/asm/hardirq.h
index 279afc01f1ac..a4623fdb65a1 100644
--- a/arch/x86/include/asm/hardirq.h
+++ b/arch/x86/include/asm/hardirq.h
@@ -22,6 +22,7 @@ typedef struct {
 #endif
 #ifdef CONFIG_X86_USER_INTERRUPTS
 	unsigned int uintr_spurious_count;
+	unsigned int uintr_kernel_notifications;
 #endif
 	unsigned int x86_platform_ipis;	/* arch dependent */
 	unsigned int apic_perf_irqs;
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 5929a6f9eeee..0ac7ef592283 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -673,6 +673,7 @@ DECLARE_IDTENTRY_SYSVEC(POSTED_INTR_NESTED_VECTOR,	sysvec_kvm_posted_intr_nested
 
 #ifdef CONFIG_X86_USER_INTERRUPTS
 DECLARE_IDTENTRY_SYSVEC(UINTR_NOTIFICATION_VECTOR,	sysvec_uintr_spurious_interrupt);
+DECLARE_IDTENTRY_SYSVEC(UINTR_KERNEL_VECTOR,		sysvec_uintr_kernel_notification);
 #endif
 
 #if IS_ENABLED(CONFIG_HYPERV)
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index d26faa504931..1d289b3ee0da 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -106,8 +106,9 @@
 
 /* Vector for User interrupt notifications */
 #define UINTR_NOTIFICATION_VECTOR       0xec
+#define UINTR_KERNEL_VECTOR		0xeb
 
-#define LOCAL_TIMER_VECTOR		0xeb
+#define LOCAL_TIMER_VECTOR		0xea
 
 #define NR_VECTORS			 256
 
diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
index ef3521dd7fb9..64113ef523ca 100644
--- a/arch/x86/include/asm/uintr.h
+++ b/arch/x86/include/asm/uintr.h
@@ -4,11 +4,29 @@
 
 #ifdef CONFIG_X86_USER_INTERRUPTS
 
+/* User Posted Interrupt Descriptor (UPID) */
+struct uintr_upid {
+	struct {
+		u8 status;	/* bit 0: ON, bit 1: SN, bit 2-7: reserved */
+		u8 reserved1;	/* Reserved */
+		u8 nv;		/* Notification vector */
+		u8 reserved2;	/* Reserved */
+		u32 ndst;	/* Notification destination */
+	} nc __packed;		/* Notification control */
+	u64 puir;		/* Posted user interrupt requests */
+} __aligned(64);
+
+/* UPID Notification control status */
+#define UPID_ON		0x0	/* Outstanding notification */
+#define UPID_SN		0x1	/* Suppressed notification */
+
 struct uintr_upid_ctx {
+	struct list_head node;
 	struct task_struct *task;	/* Receiver task */
 	struct uintr_upid *upid;
 	refcount_t refs;
 	bool receiver_active;		/* Flag for UPID being mapped to a receiver */
+	bool waiting;
 };
 
 struct uintr_receiver_info {
@@ -43,11 +61,15 @@ void uintr_free(struct task_struct *task);
 void switch_uintr_prepare(struct task_struct *prev);
 void switch_uintr_return(void);
 
+int uintr_receiver_wait(void);
+void uintr_wake_up_process(void);
+
 #else /* !CONFIG_X86_USER_INTERRUPTS */
 
 static inline void uintr_free(struct task_struct *task) {}
 static inline void switch_uintr_prepare(struct task_struct *prev) {}
 static inline void switch_uintr_return(void) {}
+static inline void uintr_wake_up_process(void) {}
 
 #endif /* CONFIG_X86_USER_INTERRUPTS */
 
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index d8c45e0728f0..8d4fd7509523 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -149,6 +149,7 @@ static const __initconst struct idt_data apic_idts[] = {
 # endif
 #ifdef CONFIG_X86_USER_INTERRUPTS
 	INTG(UINTR_NOTIFICATION_VECTOR,		asm_sysvec_uintr_spurious_interrupt),
+	INTG(UINTR_KERNEL_VECTOR,		asm_sysvec_uintr_kernel_notification),
 #endif
 # ifdef CONFIG_IRQ_WORK
 	INTG(IRQ_WORK_VECTOR,			asm_sysvec_irq_work),
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index e3c35668c7c5..22349f5c301b 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -22,6 +22,7 @@
 #include <asm/desc.h>
 #include <asm/traps.h>
 #include <asm/thermal.h>
+#include <asm/uintr.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/irq_vectors.h>
@@ -187,6 +188,11 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", irq_stats(j)->uintr_spurious_count);
 	seq_puts(p, "  User-interrupt spurious event\n");
+
+	seq_printf(p, "%*s: ", prec, "UKN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", irq_stats(j)->uintr_kernel_notifications);
+	seq_puts(p, "  User-interrupt kernel notification event\n");
 #endif
 	return 0;
 }
@@ -356,6 +362,18 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_uintr_spurious_interrupt)
 	ack_APIC_irq();
 	inc_irq_stat(uintr_spurious_count);
 }
+
+/*
+ * Handler for UINTR_KERNEL_VECTOR.
+ */
+DEFINE_IDTENTRY_SYSVEC(sysvec_uintr_kernel_notification)
+{
+	/* TODO: Add entry-exit tracepoints */
+	ack_APIC_irq();
+	inc_irq_stat(uintr_kernel_notifications);
+
+	uintr_wake_up_process();
+}
 #endif
 
 
diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
index 8f331c5fe0cf..4e5545e6d903 100644
--- a/arch/x86/kernel/uintr_core.c
+++ b/arch/x86/kernel/uintr_core.c
@@ -28,22 +28,6 @@
 #define UINTR_MAX_UITT_NR 256
 #define UINTR_MAX_UVEC_NR 64
 
-/* User Posted Interrupt Descriptor (UPID) */
-struct uintr_upid {
-	struct {
-		u8 status;	/* bit 0: ON, bit 1: SN, bit 2-7: reserved */
-		u8 reserved1;	/* Reserved */
-		u8 nv;		/* Notification vector */
-		u8 reserved2;	/* Reserved */
-		u32 ndst;	/* Notification destination */
-	} nc __packed;		/* Notification control */
-	u64 puir;		/* Posted user interrupt requests */
-} __aligned(64);
-
-/* UPID Notification control status */
-#define UPID_ON		0x0	/* Outstanding notification */
-#define UPID_SN		0x1	/* Suppressed notification */
-
 struct uintr_receiver {
 	struct uintr_upid_ctx *upid_ctx;
 	u64 uvec_mask;	/* track active vector per bit */
@@ -70,6 +54,10 @@ struct uintr_sender {
 	u64 uitt_mask[BITS_TO_U64(UINTR_MAX_UITT_NR)];
 };
 
+/* TODO: To remove the global lock, move to a per-cpu wait list. */
+static DEFINE_SPINLOCK(uintr_wait_lock);
+static struct list_head uintr_wait_list = LIST_HEAD_INIT(uintr_wait_list);
+
 inline bool uintr_arch_enabled(void)
 {
 	return static_cpu_has(X86_FEATURE_UINTR);
@@ -80,6 +68,12 @@ static inline bool is_uintr_receiver(struct task_struct *t)
 	return !!t->thread.ui_recv;
 }
 
+/* Always make sure task is_uintr_receiver() before calling */
+static inline bool is_uintr_waiting(struct task_struct *t)
+{
+	return t->thread.ui_recv->upid_ctx->waiting;
+}
+
 static inline bool is_uintr_sender(struct task_struct *t)
 {
 	return !!t->thread.ui_send;
@@ -151,6 +145,7 @@ static struct uintr_upid_ctx *alloc_upid(void)
 	refcount_set(&upid_ctx->refs, 1);
 	upid_ctx->task = get_task_struct(current);
 	upid_ctx->receiver_active = true;
+	upid_ctx->waiting = false;
 
 	return upid_ctx;
 }
@@ -494,6 +489,68 @@ int do_uintr_register_sender(struct uintr_receiver_info *r_info,
 	return 0;
 }
 
+int uintr_receiver_wait(void)
+{
+	struct uintr_upid_ctx *upid_ctx;
+	unsigned long flags;
+
+	if (!is_uintr_receiver(current))
+		return -EOPNOTSUPP;
+
+	upid_ctx = current->thread.ui_recv->upid_ctx;
+	upid_ctx->upid->nc.nv = UINTR_KERNEL_VECTOR;
+	upid_ctx->waiting = true;
+	spin_lock_irqsave(&uintr_wait_lock, flags);
+	list_add(&upid_ctx->node, &uintr_wait_list);
+	spin_unlock_irqrestore(&uintr_wait_lock, flags);
+
+	set_current_state(TASK_INTERRUPTIBLE);
+	schedule();
+
+	return -EINTR;
+}
+
+/*
+ * Runs in interrupt context.
+ * Scan through all UPIDs to check if any interrupt is on going.
+ */
+void uintr_wake_up_process(void)
+{
+	struct uintr_upid_ctx *upid_ctx, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&uintr_wait_lock, flags);
+	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
+		if (test_bit(UPID_ON, (unsigned long *)&upid_ctx->upid->nc.status)) {
+			set_bit(UPID_SN, (unsigned long *)&upid_ctx->upid->nc.status);
+			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
+			upid_ctx->waiting = false;
+			wake_up_process(upid_ctx->task);
+			list_del(&upid_ctx->node);
+		}
+	}
+	spin_unlock_irqrestore(&uintr_wait_lock, flags);
+}
+
+/* Called when task is unregistering/exiting */
+static void uintr_remove_task_wait(struct task_struct *task)
+{
+	struct uintr_upid_ctx *upid_ctx, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&uintr_wait_lock, flags);
+	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
+		if (upid_ctx->task == task) {
+			pr_debug("wait: Removing task %d from wait\n",
+				 upid_ctx->task->pid);
+			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
+			upid_ctx->waiting = false;
+			list_del(&upid_ctx->node);
+		}
+	}
+	spin_unlock_irqrestore(&uintr_wait_lock, flags);
+}
+
 int do_uintr_unregister_handler(void)
 {
 	struct task_struct *t = current;
@@ -548,7 +605,7 @@ int do_uintr_unregister_handler(void)
 	 * based on this UPID.
 	 */
 	set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
-
+	uintr_remove_task_wait(t);
 	put_upid_ref(ui_recv->upid_ctx);
 	kfree(ui_recv);
 	t->thread.ui_recv = NULL;
@@ -684,7 +741,7 @@ void switch_uintr_prepare(struct task_struct *prev)
 {
 	struct uintr_upid *upid;
 
-	if (is_uintr_receiver(prev)) {
+	if (is_uintr_receiver(prev) && !is_uintr_waiting(prev)) {
 		upid = prev->thread.ui_recv->upid_ctx->upid;
 		set_bit(UPID_SN, (unsigned long *)&upid->nc.status);
 	}
@@ -806,6 +863,7 @@ void uintr_free(struct task_struct *t)
 		 * generated based on this UPID.
 		 */
 		set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
+		uintr_remove_task_wait(t);
 		ui_recv->upid_ctx->receiver_active = false;
 		put_upid_ref(ui_recv->upid_ctx);
 		kfree(ui_recv);
diff --git a/arch/x86/kernel/uintr_fd.c b/arch/x86/kernel/uintr_fd.c
index 3c82c032c0b9..a7e55d98c0c7 100644
--- a/arch/x86/kernel/uintr_fd.c
+++ b/arch/x86/kernel/uintr_fd.c
@@ -283,3 +283,18 @@ SYSCALL_DEFINE2(uintr_unregister_sender, int, uintrfd, unsigned int, flags)
 	fdput(f);
 	return ret;
 }
+
+/*
+ * sys_uintr_wait - Wait for a user interrupt
+ */
+SYSCALL_DEFINE1(uintr_wait, unsigned int, flags)
+{
+	if (!uintr_arch_enabled())
+		return -EOPNOTSUPP;
+
+	if (flags)
+		return -EINVAL;
+
+	/* TODO: Add a timeout option */
+	return uintr_receiver_wait();
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 12/13] x86/uintr: Wire up the user interrupt syscalls
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (10 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-13 20:01 ` [RFC PATCH 13/13] selftests/x86: Add basic tests for User IPI Sohil Mehta
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Wire up the user interrupt receiver and sender related syscalls for
x86_64.

For rest of the architectures the syscalls are not implemented.

<TODO: Reserve the syscall numbers for other architectures>

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  6 ++++++
 arch/x86/entry/syscalls/syscall_64.tbl |  6 ++++++
 include/linux/syscalls.h               |  8 ++++++++
 include/uapi/asm-generic/unistd.h      | 15 ++++++++++++++-
 kernel/sys_ni.c                        |  8 ++++++++
 scripts/checksyscalls.sh               |  6 ++++++
 6 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 960a021d543e..d0e97f1f1173 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -453,3 +453,9 @@
 446	i386	landlock_restrict_self	sys_landlock_restrict_self
 447	i386	memfd_secret		sys_memfd_secret
 448	i386	process_mrelease	sys_process_mrelease
+449	i386	uintr_register_handler	sys_uintr_register_handler
+450	i386	uintr_unregister_handler sys_uintr_unregister_handler
+451	i386	uintr_create_fd		sys_uintr_create_fd
+452	i386	uintr_register_sender	sys_uintr_register_sender
+453	i386	uintr_unregister_sender	sys_uintr_unregister_sender
+454	i386	uintr_wait		sys_uintr_wait
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 18b5500ea8bf..444af44e5947 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -370,6 +370,12 @@
 446	common	landlock_restrict_self	sys_landlock_restrict_self
 447	common	memfd_secret		sys_memfd_secret
 448	common	process_mrelease	sys_process_mrelease
+449	common	uintr_register_handler	sys_uintr_register_handler
+450	common	uintr_unregister_handler sys_uintr_unregister_handler
+451	common	uintr_create_fd		sys_uintr_create_fd
+452	common	uintr_register_sender	sys_uintr_register_sender
+453	common	uintr_unregister_sender	sys_uintr_unregister_sender
+454	common	uintr_wait		sys_uintr_wait
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 252243c7783d..f47f64c36d87 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1060,6 +1060,14 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
 /* arch/x86/kernel/ioport.c */
 asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
 
+/* arch/x86/kernel/uintr_fd.c */
+asmlinkage long sys_uintr_register_handler(u64 __user *handler, unsigned int flags);
+asmlinkage long sys_uintr_unregister_handler(unsigned int flags);
+asmlinkage long sys_uintr_create_fd(u64 vector, unsigned int flags);
+asmlinkage long sys_uintr_register_sender(int uintr_fd, unsigned int flags);
+asmlinkage long sys_uintr_unregister_sender(int uintr_fd, unsigned int flags);
+asmlinkage long sys_uintr_wait(unsigned int flags);
+
 /* pciconfig: alpha, arm, arm64, ia64, sparc */
 asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
 				unsigned long off, unsigned long len,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1c5fb86d455a..b9a8b344270a 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -880,8 +880,21 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
 #define __NR_process_mrelease 448
 __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 
+#define __NR_uintr_register_handler 449
+__SYSCALL(__NR_uintr_register_handler, sys_uintr_register_handler)
+#define __NR_uintr_unregister_handler 450
+__SYSCALL(__NR_uintr_unregister_handler, sys_uintr_unregister_handler)
+#define __NR_uintr_create_fd 451
+__SYSCALL(__NR_uintr_create_fd, sys_uintr_create_fd)
+#define __NR_uintr_register_sender 452
+__SYSCALL(__NR_uintr_register_sender, sys_uintr_register_sender)
+#define __NR_uintr_unregister_sender 453
+__SYSCALL(__NR_uintr_unregister_sender, sys_uintr_unregister_sender)
+#define __NR_uintr_wait 454
+__SYSCALL(__NR_uintr_wait, sys_uintr_wait)
+
 #undef __NR_syscalls
-#define __NR_syscalls 449
+#define __NR_syscalls 455
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index f43d89d92860..5d8b92ac197b 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -357,6 +357,14 @@ COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* user interrupts */
+COND_SYSCALL(uintr_register_handler);
+COND_SYSCALL(uintr_unregister_handler);
+COND_SYSCALL(uintr_create_fd);
+COND_SYSCALL(uintr_register_sender);
+COND_SYSCALL(uintr_unregister_sender);
+COND_SYSCALL(uintr_wait);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index fd9777f63f14..0969580d829c 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -204,6 +204,12 @@ cat << EOF
 #define __IGNORE__sysctl
 #define __IGNORE_arch_prctl
 #define __IGNORE_nfsservctl
+#define __IGNORE_uintr_register_handler
+#define __IGNORE_uintr_unregister_handler
+#define __IGNORE_uintr_create_fd
+#define __IGNORE_uintr_register_sender
+#define __IGNORE_uintr_unregister_sender
+#define __IGNORE_uintr_wait
 
 /* ... including the "new" 32-bit uid syscalls */
 #define __IGNORE_lchown32
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [RFC PATCH 13/13] selftests/x86: Add basic tests for User IPI
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (11 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 12/13] x86/uintr: Wire up the user interrupt syscalls Sohil Mehta
@ 2021-09-13 20:01 ` Sohil Mehta
  2021-09-13 20:27 ` [RFC PATCH 00/13] x86 User Interrupts support Dave Hansen
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-13 20:01 UTC (permalink / raw)
  To: x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Ashok Raj, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Dan Williams, Randy E Witt,
	Ravi V Shankar, Ramesh Thomas, linux-api, linux-arch,
	linux-kernel, linux-kselftest

Include 2 basic tests for receiving a User IPI:
1. Receiver is spinning in userspace.
2. Receiver is blocked in the kernel.

The selftests need gcc with 'muintr' support to compile.

GCC 11 (recently released) has support for this.

Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
---
 tools/testing/selftests/x86/Makefile |  10 ++
 tools/testing/selftests/x86/uintr.c  | 147 +++++++++++++++++++++++++++
 2 files changed, 157 insertions(+)
 create mode 100644 tools/testing/selftests/x86/uintr.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index b4142cd1c5c2..38588221b09e 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -9,6 +9,7 @@ UNAME_M := $(shell uname -m)
 CAN_BUILD_I386 := $(shell ./check_cc.sh $(CC) trivial_32bit_program.c -m32)
 CAN_BUILD_X86_64 := $(shell ./check_cc.sh $(CC) trivial_64bit_program.c)
 CAN_BUILD_WITH_NOPIE := $(shell ./check_cc.sh $(CC) trivial_program.c -no-pie)
+CAN_BUILD_UINTR := $(shell ./check_cc.sh $(CC) trivial_64bit_program.c -muintr)
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap_vdso \
 			check_initial_reg_state sigreturn iopl ioperm \
@@ -19,6 +20,11 @@ TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			vdso_restorer
 TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering \
 			corrupt_xstate_header
+
+ifeq ($(CAN_BUILD_UINTR),1)
+TARGETS_C_64BIT_ONLY := $(TARGETS_C_64BIT_ONLY) uintr
+endif
+
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
@@ -41,6 +47,10 @@ ifeq ($(CAN_BUILD_WITH_NOPIE),1)
 CFLAGS += -no-pie
 endif
 
+ifeq ($(CAN_BUILD_UINTR),1)
+CFLAGS += -muintr
+endif
+
 define gen-target-rule-32
 $(1) $(1)_32: $(OUTPUT)/$(1)_32
 .PHONY: $(1) $(1)_32
diff --git a/tools/testing/selftests/x86/uintr.c b/tools/testing/selftests/x86/uintr.c
new file mode 100644
index 000000000000..61a53526f2fa
--- /dev/null
+++ b/tools/testing/selftests/x86/uintr.c
@@ -0,0 +1,147 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2020, Intel Corporation.
+ *
+ * Sohil Mehta <sohil.mehta@intel.com>
+ */
+#define _GNU_SOURCE
+#include <syscall.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <x86gprintrin.h>
+#include <pthread.h>
+#include <stdlib.h>
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+#ifndef __NR_uintr_register_handler
+#define __NR_uintr_register_handler	449
+#define __NR_uintr_unregister_handler	450
+#define __NR_uintr_create_fd		451
+#define __NR_uintr_register_sender	452
+#define __NR_uintr_unregister_sender	453
+#define __NR_uintr_wait			454
+#endif
+
+#define uintr_register_handler(handler, flags)	syscall(__NR_uintr_register_handler, handler, flags)
+#define uintr_unregister_handler(flags)		syscall(__NR_uintr_unregister_handler, flags)
+#define uintr_create_fd(vector, flags)		syscall(__NR_uintr_create_fd, vector, flags)
+#define uintr_register_sender(fd, flags)	syscall(__NR_uintr_register_sender, fd, flags)
+#define uintr_unregister_sender(fd, flags)	syscall(__NR_uintr_unregister_sender, fd, flags)
+#define uintr_wait(flags)			syscall(__NR_uintr_wait, flags)
+
+unsigned long uintr_received;
+unsigned int uintr_fd;
+
+void __attribute__((interrupt))__attribute__((target("general-regs-only", "inline-all-stringops")))
+uintr_handler(struct __uintr_frame *ui_frame,
+	      unsigned long long vector)
+{
+	uintr_received = 1;
+}
+
+void receiver_setup_interrupt(void)
+{
+	int vector = 0;
+	int ret;
+
+	/* Register interrupt handler */
+	if (uintr_register_handler(uintr_handler, 0)) {
+		printf("[FAIL]\tInterrupt handler register error\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Create uintr_fd */
+	ret = uintr_create_fd(vector, 0);
+	if (ret < 0) {
+		printf("[FAIL]\tInterrupt vector registration error\n");
+		exit(EXIT_FAILURE);
+	}
+
+	uintr_fd = ret;
+}
+
+void *sender_thread(void *arg)
+{
+	long sleep_usec = (long)arg;
+	int uipi_index;
+
+	uipi_index = uintr_register_sender(uintr_fd, 0);
+	if (uipi_index < 0) {
+		printf("[FAIL]\tSender register error\n");
+		return NULL;
+	}
+
+	/* Sleep before sending IPI to allow the receiver to block in the kernel */
+	if (sleep_usec)
+		usleep(sleep_usec);
+
+	printf("\tother thread: sending IPI\n");
+	_senduipi(uipi_index);
+
+	uintr_unregister_sender(uintr_fd, 0);
+
+	return NULL;
+}
+
+static inline void cpu_relax(void)
+{
+	asm volatile("rep; nop" ::: "memory");
+}
+
+void test_base_ipi(void)
+{
+	pthread_t pt;
+
+	uintr_received = 0;
+	if (pthread_create(&pt, NULL, &sender_thread, NULL)) {
+		printf("[FAIL]\tError creating sender thread\n");
+		return;
+	}
+
+	printf("[RUN]\tSpin in userspace (waiting for interrupts)\n");
+	// Keep spinning until interrupt received
+	while (!uintr_received)
+		cpu_relax();
+
+	printf("[OK]\tUser interrupt received\n");
+}
+
+void test_blocking_ipi(void)
+{
+	pthread_t pt;
+	long sleep_usec;
+
+	uintr_received = 0;
+	sleep_usec = 1000;
+	if (pthread_create(&pt, NULL, &sender_thread, (void *)sleep_usec)) {
+		printf("[FAIL]\tError creating sender thread\n");
+		return;
+	}
+
+	printf("[RUN]\tBlock in the kernel (waiting for interrupts)\n");
+	uintr_wait(0);
+	if (uintr_received)
+		printf("[OK]\tUser interrupt received\n");
+	else
+		printf("[FAIL]\tUser interrupt not received\n");
+}
+
+int main(int argc, char *argv[])
+{
+	receiver_setup_interrupt();
+
+	/* Enable interrupts */
+	_stui();
+
+	test_base_ipi();
+
+	test_blocking_ipi();
+
+	close(uintr_fd);
+	uintr_unregister_handler(0);
+
+	exit(EXIT_SUCCESS);
+}
-- 
2.33.0


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (12 preceding siblings ...)
  2021-09-13 20:01 ` [RFC PATCH 13/13] selftests/x86: Add basic tests for User IPI Sohil Mehta
@ 2021-09-13 20:27 ` Dave Hansen
  2021-09-14 19:03   ` Mehta, Sohil
  2021-09-23 14:39 ` Jens Axboe
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 81+ messages in thread
From: Dave Hansen @ 2021-09-13 20:27 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Tony Luck, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/13/21 1:01 PM, Sohil Mehta wrote:
> User Interrupts (Uintr) is a hardware technology that enables delivering
> interrupts directly to user space.

Your problem in all of this is going to be convincing folks that this is
a problem worth solving.  I'd start this off with something
attention-grabbing.

Two things.  Good, snazzy writing doesn't repeat words.  You repeated
"interrupt" twice in that first sentence.  It also doesn't get my
attention.  Here's a more concise way of saying it, and also adding
something to get the reader's attention:

	User Interrupts directly deliver events to user space and are
	10x faster than the closest alternative.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-13 20:27 ` [RFC PATCH 00/13] x86 User Interrupts support Dave Hansen
@ 2021-09-14 19:03   ` Mehta, Sohil
  2021-09-23 12:19     ` Greg KH
  0 siblings, 1 reply; 81+ messages in thread
From: Mehta, Sohil @ 2021-09-14 19:03 UTC (permalink / raw)
  To: Hansen, Dave, x86, linux-kernel
  Cc: Luck, Tony, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Lutomirski, Andy, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj,
	Ashok, Jacob Pan, Kammela, Gayatri, Zeng, Guang, Williams, Dan J,
	Witt, Randy E, Shankar, Ravi V, Thomas, Ramesh, linux-api,
	linux-arch, linux-kselftest

Resending.. There were some email delivery issues.

On 9/13/2021 1:27 PM, Dave Hansen wrote:
>	User Interrupts directly deliver events to user space and are
>	10x faster than the closest alternative.

Thanks Dave. This is definitely more attention-grabbing than the
previous intro. I'll include this next time.

One thing to note, the 10x gain is only applicable for User IPIs.
For other source of User Interrupts (like kernel-to-user
notifications and other external sources), we don't have the data
yet.

I realized the User IPI data in the cover also needs some
clarification. The 10x gain is only seen when the receiver is
spinning in User space - waiting for interrupts.

If the receiver were to block (wait) in the kernel, the performance
would drop as expected. However, User IPI (blocked) would still be
10% faster than Eventfd and 40% faster than signals.

Here is the updated table:
+---------------------+-------------------------+
| IPC type            |   Relative Latency      |
|                     |(normalized to User IPI) |
+---------------------+-------------------------+
| User IPI            |                     1.0 |
| User IPI (blocked)  |                     8.9 |
| Signal              |                    14.8 |
| Eventfd             |                     9.7 |
| Pipe                |                    16.3 |
| Domain              |                    17.3 |
+---------------------+-------------------------+

--Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-14 19:03   ` Mehta, Sohil
@ 2021-09-23 12:19     ` Greg KH
  2021-09-23 14:09       ` Greg KH
                         ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Greg KH @ 2021-09-23 12:19 UTC (permalink / raw)
  To: Mehta, Sohil
  Cc: Hansen, Dave, x86, linux-kernel, Luck, Tony, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Lutomirski, Andy,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Raj, Ashok, Jacob Pan, Kammela,
	Gayatri, Zeng, Guang, Williams, Dan J, Witt, Randy E, Shankar,
	Ravi V, Thomas, Ramesh, linux-api, linux-arch, linux-kselftest

On Tue, Sep 14, 2021 at 07:03:36PM +0000, Mehta, Sohil wrote:
> Resending.. There were some email delivery issues.
> 
> On 9/13/2021 1:27 PM, Dave Hansen wrote:
> >	User Interrupts directly deliver events to user space and are
> >	10x faster than the closest alternative.
> 
> Thanks Dave. This is definitely more attention-grabbing than the
> previous intro. I'll include this next time.
> 
> One thing to note, the 10x gain is only applicable for User IPIs.
> For other source of User Interrupts (like kernel-to-user
> notifications and other external sources), we don't have the data
> yet.
> 
> I realized the User IPI data in the cover also needs some
> clarification. The 10x gain is only seen when the receiver is
> spinning in User space - waiting for interrupts.
> 
> If the receiver were to block (wait) in the kernel, the performance
> would drop as expected. However, User IPI (blocked) would still be
> 10% faster than Eventfd and 40% faster than signals.
> 
> Here is the updated table:
> +---------------------+-------------------------+
> | IPC type            |   Relative Latency      |
> |                     |(normalized to User IPI) |
> +---------------------+-------------------------+
> | User IPI            |                     1.0 |
> | User IPI (blocked)  |                     8.9 |
> | Signal              |                    14.8 |
> | Eventfd             |                     9.7 |
> | Pipe                |                    16.3 |
> | Domain              |                    17.3 |
> +---------------------+-------------------------+

Relative is just that, "relative".  If the real values are extremely
tiny, then relative is just "this goes a tiny tiny bit faster than what
you have today in eventfd", right?

So how about "absolute"?  What are we talking here?

And this is really only for the "one userspace task waking up another
userspace task" policies.  What real workload can actually use this?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-13 20:01 ` [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls Sohil Mehta
@ 2021-09-23 12:26   ` Greg KH
  2021-09-24  0:05     ` Thomas Gleixner
  2021-09-27 23:20     ` Sohil Mehta
  2021-09-23 23:52   ` Thomas Gleixner
  1 sibling, 2 replies; 81+ messages in thread
From: Greg KH @ 2021-09-23 12:26 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13, 2021 at 01:01:25PM -0700, Sohil Mehta wrote:
> Any application that wants to receive a user interrupt needs to register
> an interrupt handler with the kernel. Add a registration syscall that
> sets up the interrupt handler and the related kernel structures for
> the task that makes this syscall.
> 
> Only one interrupt handler per task can be registered with the
> kernel/hardware. Each task has its private interrupt vector space of 64
> vectors. The vector registration and the related FD management is
> covered later.
> 
> Also add an unregister syscall to let a task unregister the interrupt
> handler.
> 
> The UPID for each receiver task needs to be updated whenever a task gets
> context switched or it moves from one cpu to another. This will also be
> covered later. The system calls haven't been wired up yet so no real
> harm is done if we don't update the UPID right now.
> 
> <Code typically in the x86/kernel directory doesn't deal with file
> descriptor management. I have kept uintr_fd.c separate to make it easier
> to move it somewhere else if needed.>
> 
> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
> ---
>  arch/x86/include/asm/processor.h |   6 +
>  arch/x86/include/asm/uintr.h     |  13 ++
>  arch/x86/kernel/Makefile         |   1 +
>  arch/x86/kernel/uintr_core.c     | 240 +++++++++++++++++++++++++++++++
>  arch/x86/kernel/uintr_fd.c       |  58 ++++++++
>  5 files changed, 318 insertions(+)
>  create mode 100644 arch/x86/include/asm/uintr.h
>  create mode 100644 arch/x86/kernel/uintr_core.c
>  create mode 100644 arch/x86/kernel/uintr_fd.c
> 
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 9ad2acaaae9b..d229bfac8b4f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -9,6 +9,7 @@ struct task_struct;
>  struct mm_struct;
>  struct io_bitmap;
>  struct vm86;
> +struct uintr_receiver;
>  
>  #include <asm/math_emu.h>
>  #include <asm/segment.h>
> @@ -529,6 +530,11 @@ struct thread_struct {
>  	 */
>  	u32			pkru;
>  
> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +	/* User Interrupt state*/
> +	struct uintr_receiver	*ui_recv;
> +#endif
> +
>  	/* Floating point and extended processor state */
>  	struct fpu		fpu;
>  	/*
> diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
> new file mode 100644
> index 000000000000..4f35bd8bd4e0
> --- /dev/null
> +++ b/arch/x86/include/asm/uintr.h
> @@ -0,0 +1,13 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_X86_UINTR_H
> +#define _ASM_X86_UINTR_H
> +
> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +
> +bool uintr_arch_enabled(void);
> +int do_uintr_register_handler(u64 handler);
> +int do_uintr_unregister_handler(void);
> +
> +#endif /* CONFIG_X86_USER_INTERRUPTS */
> +
> +#endif /* _ASM_X86_UINTR_H */
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index 8f4e8fa6ed75..060ca9f23e23 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -140,6 +140,7 @@ obj-$(CONFIG_UPROBES)			+= uprobes.o
>  obj-$(CONFIG_PERF_EVENTS)		+= perf_regs.o
>  obj-$(CONFIG_TRACING)			+= tracepoint.o
>  obj-$(CONFIG_SCHED_MC_PRIO)		+= itmt.o
> +obj-$(CONFIG_X86_USER_INTERRUPTS)	+= uintr_fd.o uintr_core.o
>  obj-$(CONFIG_X86_UMIP)			+= umip.o
>  
>  obj-$(CONFIG_UNWINDER_ORC)		+= unwind_orc.o
> diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
> new file mode 100644
> index 000000000000..2c6042a6840a
> --- /dev/null
> +++ b/arch/x86/kernel/uintr_core.c
> @@ -0,0 +1,240 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2021, Intel Corporation.
> + *
> + * Sohil Mehta <sohil.mehta@intel.com>
> + * Jacob Pan <jacob.jun.pan@linux.intel.com>
> + */
> +#define pr_fmt(fmt)    "uintr: " fmt
> +
> +#include <linux/refcount.h>
> +#include <linux/sched.h>
> +#include <linux/sched/task.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include <asm/apic.h>
> +#include <asm/fpu/internal.h>
> +#include <asm/irq_vectors.h>
> +#include <asm/msr.h>
> +#include <asm/msr-index.h>
> +#include <asm/uintr.h>
> +
> +/* User Posted Interrupt Descriptor (UPID) */
> +struct uintr_upid {
> +	struct {
> +		u8 status;	/* bit 0: ON, bit 1: SN, bit 2-7: reserved */
> +		u8 reserved1;	/* Reserved */
> +		u8 nv;		/* Notification vector */
> +		u8 reserved2;	/* Reserved */

What are these "reserved" for?

> +		u32 ndst;	/* Notification destination */
> +	} nc __packed;		/* Notification control */
> +	u64 puir;		/* Posted user interrupt requests */
> +} __aligned(64);
> +
> +/* UPID Notification control status */
> +#define UPID_ON		0x0	/* Outstanding notification */
> +#define UPID_SN		0x1	/* Suppressed notification */
> +
> +struct uintr_upid_ctx {
> +	struct uintr_upid *upid;
> +	refcount_t refs;

Please use a kref for this and do not roll your own for no good reason.

> +/*
> + * sys_uintr_register_handler - setup user interrupt handler for receiver.
> + */
> +SYSCALL_DEFINE2(uintr_register_handler, u64 __user *, handler, unsigned int, flags)
> +{
> +	int ret;
> +
> +	if (!uintr_arch_enabled())
> +		return -EOPNOTSUPP;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	/* TODO: Validate the handler address */
> +	if (!handler)
> +		return -EFAULT;

Um, that's a pretty big "TODO" here.

How are you going to define what is, and what is not, an allowed
"handler"?

I'm sure the security people would like to get involved here, as well as
the auditing people.  Have you talked with them about their requirements
for this type of stuff?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls
  2021-09-13 20:01 ` [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls Sohil Mehta
@ 2021-09-23 12:28   ` Greg KH
  2021-09-28 18:01     ` Sohil Mehta
  2021-09-24 10:54   ` Thomas Gleixner
  1 sibling, 1 reply; 81+ messages in thread
From: Greg KH @ 2021-09-23 12:28 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13, 2021 at 01:01:29PM -0700, Sohil Mehta wrote:
> Add a registration syscall for a task to register itself as a user
> interrupt sender using the uintr_fd generated by the receiver. A task
> can register multiple uintr_fds. Each unique successful connection
> creates a new entry in the User Interrupt Target Table (UITT).
> 
> Each entry in the UITT table is referred by the UITT index (uipi_index).
> The uipi_index returned during the registration syscall lets a sender
> generate a user IPI using the 'SENDUIPI <uipi_index>' instruction.
> 
> Also, add a sender unregister syscall to unregister a particular task
> from the uintr_fd. Calling close on the uintr_fd will disconnect all
> threads in a sender process from that FD.
> 
> Currently, the UITT size is arbitrarily chosen as 256 entries
> corresponding to a 4KB page. Based on feedback and usage data this can
> either be increased/decreased or made dynamic later.
> 
> Architecturally, the UITT table can be unique for each thread or shared
> across threads of the same thread group. The current implementation
> keeps the UITT as unique for the each thread. This makes the kernel
> implementation relatively simple and only threads that use uintr get
> setup with the related structures. However, this means that the
> uipi_index for each thread would be inconsistent wrt to other threads.
> (Executing 'SENDUIPI 2' on threads of the same process could generate
> different user interrupts.)
> 
> Alternatively, the benefit of sharing the UITT table is that all threads
> would see the same view of the UITT table. Also the kernel UITT memory
> allocation would be more efficient if multiple threads connect to the
> same uintr_fd. However, this would mean the kernel needs to keep the
> UITT table size MISC_MSR[] in sync across these threads. Also the
> UPID/UITT teardown flows might need additional consideration.
> 
> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
> ---
>  arch/x86/include/asm/processor.h |   2 +
>  arch/x86/include/asm/uintr.h     |  15 ++
>  arch/x86/kernel/process.c        |   1 +
>  arch/x86/kernel/uintr_core.c     | 355 ++++++++++++++++++++++++++++++-
>  arch/x86/kernel/uintr_fd.c       | 133 ++++++++++++
>  5 files changed, 495 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index d229bfac8b4f..3482c3182e39 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -10,6 +10,7 @@ struct mm_struct;
>  struct io_bitmap;
>  struct vm86;
>  struct uintr_receiver;
> +struct uintr_sender;
>  
>  #include <asm/math_emu.h>
>  #include <asm/segment.h>
> @@ -533,6 +534,7 @@ struct thread_struct {
>  #ifdef CONFIG_X86_USER_INTERRUPTS
>  	/* User Interrupt state*/
>  	struct uintr_receiver	*ui_recv;
> +	struct uintr_sender	*ui_send;
>  #endif
>  
>  	/* Floating point and extended processor state */
> diff --git a/arch/x86/include/asm/uintr.h b/arch/x86/include/asm/uintr.h
> index 1f00e2a63da4..ef3521dd7fb9 100644
> --- a/arch/x86/include/asm/uintr.h
> +++ b/arch/x86/include/asm/uintr.h
> @@ -8,6 +8,7 @@ struct uintr_upid_ctx {
>  	struct task_struct *task;	/* Receiver task */
>  	struct uintr_upid *upid;
>  	refcount_t refs;
> +	bool receiver_active;		/* Flag for UPID being mapped to a receiver */
>  };
>  
>  struct uintr_receiver_info {
> @@ -16,12 +17,26 @@ struct uintr_receiver_info {
>  	u64 uvec;				/* Vector number */
>  };
>  
> +struct uintr_sender_info {
> +	struct list_head node;
> +	struct uintr_uitt_ctx *uitt_ctx;
> +	struct task_struct *task;
> +	struct uintr_upid_ctx *r_upid_ctx;	/* Receiver's UPID context */
> +	struct callback_head twork;		/* Task work head */
> +	unsigned int uitt_index;
> +};
> +
>  bool uintr_arch_enabled(void);
>  int do_uintr_register_handler(u64 handler);
>  int do_uintr_unregister_handler(void);
>  int do_uintr_register_vector(struct uintr_receiver_info *r_info);
>  void do_uintr_unregister_vector(struct uintr_receiver_info *r_info);
>  
> +int do_uintr_register_sender(struct uintr_receiver_info *r_info,
> +			     struct uintr_sender_info *s_info);
> +void do_uintr_unregister_sender(struct uintr_receiver_info *r_info,
> +				struct uintr_sender_info *s_info);
> +
>  void uintr_free(struct task_struct *task);
>  
>  /* TODO: Inline the context switch related functions */
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 83677f76bd7b..9db33e467b30 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -92,6 +92,7 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
>  #ifdef CONFIG_X86_USER_INTERRUPTS
>  	/* User Interrupt state is unique for each task */
>  	dst->thread.ui_recv = NULL;
> +	dst->thread.ui_send = NULL;
>  #endif
>  
>  	return fpu_clone(dst);
> diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
> index 9dcb9f60e5bc..8f331c5fe0cf 100644
> --- a/arch/x86/kernel/uintr_core.c
> +++ b/arch/x86/kernel/uintr_core.c
> @@ -21,6 +21,11 @@
>  #include <asm/msr-index.h>
>  #include <asm/uintr.h>
>  
> +/*
> + * Each UITT entry is 16 bytes in size.
> + * Current UITT table size is set as 4KB (256 * 16 bytes)
> + */
> +#define UINTR_MAX_UITT_NR 256
>  #define UINTR_MAX_UVEC_NR 64
>  
>  /* User Posted Interrupt Descriptor (UPID) */
> @@ -44,6 +49,27 @@ struct uintr_receiver {
>  	u64 uvec_mask;	/* track active vector per bit */
>  };
>  
> +/* User Interrupt Target Table Entry (UITTE) */
> +struct uintr_uitt_entry {
> +	u8	valid;			/* bit 0: valid, bit 1-7: reserved */

Do you check that the other bits are set to 0?

> +	u8	user_vec;
> +	u8	reserved[6];

What is this reserved for?

> +	u64	target_upid_addr;

If this is a pointer, why not say it is a pointer?

> +} __packed __aligned(16);
> +
> +struct uintr_uitt_ctx {
> +	struct uintr_uitt_entry *uitt;
> +	/* Protect UITT */
> +	spinlock_t uitt_lock;
> +	refcount_t refs;

Again, a kref please.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-23 12:19     ` Greg KH
@ 2021-09-23 14:09       ` Greg KH
  2021-09-23 14:46         ` Dave Hansen
  2021-09-23 23:24         ` Sohil Mehta
  2021-09-23 23:09       ` Sohil Mehta
  2021-09-24  0:17       ` Sohil Mehta
  2 siblings, 2 replies; 81+ messages in thread
From: Greg KH @ 2021-09-23 14:09 UTC (permalink / raw)
  To: Mehta, Sohil
  Cc: Hansen, Dave, x86, linux-kernel, Luck, Tony, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Lutomirski, Andy,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Raj, Ashok, Jacob Pan, Kammela,
	Gayatri, Zeng, Guang, Williams, Dan J, Witt, Randy E, Shankar,
	Ravi V, Thomas, Ramesh, linux-api, linux-arch, linux-kselftest

On Thu, Sep 23, 2021 at 02:19:05PM +0200, Greg KH wrote:
> On Tue, Sep 14, 2021 at 07:03:36PM +0000, Mehta, Sohil wrote:
> > Resending.. There were some email delivery issues.
> > 
> > On 9/13/2021 1:27 PM, Dave Hansen wrote:
> > >	User Interrupts directly deliver events to user space and are
> > >	10x faster than the closest alternative.
> > 
> > Thanks Dave. This is definitely more attention-grabbing than the
> > previous intro. I'll include this next time.
> > 
> > One thing to note, the 10x gain is only applicable for User IPIs.
> > For other source of User Interrupts (like kernel-to-user
> > notifications and other external sources), we don't have the data
> > yet.
> > 
> > I realized the User IPI data in the cover also needs some
> > clarification. The 10x gain is only seen when the receiver is
> > spinning in User space - waiting for interrupts.
> > 
> > If the receiver were to block (wait) in the kernel, the performance
> > would drop as expected. However, User IPI (blocked) would still be
> > 10% faster than Eventfd and 40% faster than signals.
> > 
> > Here is the updated table:
> > +---------------------+-------------------------+
> > | IPC type            |   Relative Latency      |
> > |                     |(normalized to User IPI) |
> > +---------------------+-------------------------+
> > | User IPI            |                     1.0 |
> > | User IPI (blocked)  |                     8.9 |
> > | Signal              |                    14.8 |
> > | Eventfd             |                     9.7 |
> > | Pipe                |                    16.3 |
> > | Domain              |                    17.3 |
> > +---------------------+-------------------------+
> 
> Relative is just that, "relative".  If the real values are extremely
> tiny, then relative is just "this goes a tiny tiny bit faster than what
> you have today in eventfd", right?
> 
> So how about "absolute"?  What are we talking here?
> 
> And this is really only for the "one userspace task waking up another
> userspace task" policies.  What real workload can actually use this?

Also, you forgot to list Binder in the above IPC type.

And you forgot to mention that this is tied to one specific CPU type
only.  Are syscalls allowed to be created that would only work on
obscure cpus like this one?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (13 preceding siblings ...)
  2021-09-13 20:27 ` [RFC PATCH 00/13] x86 User Interrupts support Dave Hansen
@ 2021-09-23 14:39 ` Jens Axboe
  2021-09-29  4:31 ` Andy Lutomirski
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 81+ messages in thread
From: Jens Axboe @ 2021-09-23 14:39 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On 9/13/21 2:01 PM, Sohil Mehta wrote:
> - Discuss potential use cases.
> We are starting to look at actual usages and libraries (like libevent[2] and
> liburing[3]) that can take advantage of this technology. Unfortunately, we
> don't have much to share on this right now. We need some help from the
> community to identify usages that can benefit from this. We would like to make
> sure the proposed APIs work for the eventual consumers.

One use case for liburing/io_uring would be to use it instead of eventfd
for notifications. I know some folks do use eventfd right now, though
it's not that common. But if we had support for something like this,
then you could use it to know when to reap events rather than sleep in
the kernel. Or at least to be notified when new events have been posted
to the cq ring.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-23 14:09       ` Greg KH
@ 2021-09-23 14:46         ` Dave Hansen
  2021-09-23 15:07           ` Greg KH
  2021-09-23 23:24         ` Sohil Mehta
  1 sibling, 1 reply; 81+ messages in thread
From: Dave Hansen @ 2021-09-23 14:46 UTC (permalink / raw)
  To: Greg KH, Mehta, Sohil
  Cc: x86, linux-kernel, Luck, Tony, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Lutomirski, Andy, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Raj, Ashok, Jacob Pan, Kammela, Gayatri, Zeng,
	Guang, Williams, Dan J, Witt, Randy E, Shankar, Ravi V, Thomas,
	Ramesh, linux-api, linux-arch, linux-kselftest

On 9/23/21 7:09 AM, Greg KH wrote:
> And you forgot to mention that this is tied to one specific CPU type
> only.  Are syscalls allowed to be created that would only work on
> obscure cpus like this one?

Well, you have to start somewhere.  For example, when memory protection
keys went in, we added three syscalls:

> 329     common  pkey_mprotect           sys_pkey_mprotect
> 330     common  pkey_alloc              sys_pkey_alloc
> 331     common  pkey_free               sys_pkey_free

At the point that I started posting these, you couldn't even buy a
system with this feature.  For a while, there was only one Intel Xeon
generation that had support.

But, if you build it, they will come.  Today, there is powerpc support
and our friends at AMD added support to their processors.  In addition,
protection keys are found across Intel's entire CPU line: from big
Xeons, down to the little Atoms you find in Chromebooks.

I encourage everyone submitting new hardware features to include
information about where their feature will show up to end users *and* to
say how widely it will be available.  I'd actually prefer if maintainers
rejected patches that didn't have this information.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-23 14:46         ` Dave Hansen
@ 2021-09-23 15:07           ` Greg KH
  0 siblings, 0 replies; 81+ messages in thread
From: Greg KH @ 2021-09-23 15:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Mehta, Sohil, x86, linux-kernel, Luck, Tony, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Lutomirski, Andy,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Raj, Ashok, Jacob Pan, Kammela,
	Gayatri, Zeng, Guang, Williams, Dan J, Witt, Randy E, Shankar,
	Ravi V, Thomas, Ramesh, linux-api, linux-arch, linux-kselftest

On Thu, Sep 23, 2021 at 07:46:43AM -0700, Dave Hansen wrote:
> I encourage everyone submitting new hardware features to include
> information about where their feature will show up to end users *and* to
> say how widely it will be available.  I'd actually prefer if maintainers
> rejected patches that didn't have this information.

Make sense.  So, what are the answers to these questions for this new
CPU feature?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support
  2021-09-13 20:01 ` [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support Sohil Mehta
@ 2021-09-23 22:24   ` Thomas Gleixner
  2021-09-24 19:59     ` Sohil Mehta
  2021-09-27 20:42     ` Sohil Mehta
  0 siblings, 2 replies; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-23 22:24 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> SENDUIPI is a special ring-3 instruction that makes a supervisor mode
> memory access to the UPID and UITT memory. Currently, KPTI needs to be
> off for User IPIs to work.  Processors that support user interrupts are
> not affected by Meltdown so the auto mode of KPTI will default to off.
>
> Users who want to force enable KPTI will need to wait for a later
> version of this patch series that is compatible with KPTI. We need to
> allocate the UPID and UITT structures from a special memory region that
> has supervisor access but it is mapped into userspace. The plan is to
> implement a mechanism similar to LDT.

Seriously?

> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>

This SOB chain is invalid. Ditto in several other patches.

>  
> +config X86_USER_INTERRUPTS
> +	bool "User Interrupts (UINTR)"
> +	depends on X86_LOCAL_APIC && X86_64

X86_64 does not work w/o LOCAL_APIC so this dependency is pointless.

> +	depends on CPU_SUP_INTEL
> +	help
> +	  User Interrupts are events that can be delivered directly to
> +	  userspace without a transition through the kernel. The interrupts
> +	  could be generated by another userspace application, kernel or a
> +	  device.
> +
> +	  Refer, Documentation/x86/user-interrupts.rst for details.

"Refer, Documentation..." is not a sentence.

>  
> +/* User Interrupt interface */
> +#define MSR_IA32_UINTR_RR		0x985
> +#define MSR_IA32_UINTR_HANDLER		0x986
> +#define MSR_IA32_UINTR_STACKADJUST	0x987
> +#define MSR_IA32_UINTR_MISC		0x988	/* 39:32-UINV, 31:0-UITTSZ */

Bah, these tail comments are crap. Please define proper masks/shift
constants for this instead of using magic numbers in the code.

> +static __always_inline void setup_uintr(struct cpuinfo_x86 *c)

This has to be always inline because it's performance critical or what?

> +{
> +	/* check the boot processor, plus compile options for UINTR. */

Sentences start with uppercase letters.

> +	if (!cpu_feature_enabled(X86_FEATURE_UINTR))
> +		goto disable_uintr;
> +
> +	/* checks the current processor's cpuid bits: */
> +	if (!cpu_has(c, X86_FEATURE_UINTR))
> +		goto disable_uintr;
> +
> +	/*
> +	 * User Interrupts currently doesn't support PTI. For processors that
> +	 * support User interrupts PTI in auto mode will default to off.  Need
> +	 * this check only for users who have force enabled PTI.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_PTI)) {
> +		pr_info_once("x86: User Interrupts (UINTR) not enabled. Please disable PTI using 'nopti' kernel parameter\n");

That message does not make sense. The admin has explicitly added 'pti'
to the kernel command line on a CPU which is not affected. So why would
he now have to add 'nopti' ?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state
  2021-09-13 20:01 ` [RFC PATCH 04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state Sohil Mehta
@ 2021-09-23 22:34   ` Thomas Gleixner
  2021-09-27 22:25     ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-23 22:34 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> Enable xstate supervisor support for User Interrupts by default.

What means enabled by default? It's enabled when available and not
disabled on the command line.

> The user interrupt state for a task consists of the MSR state and the
> User Interrupt Flag (UIF) value. XSAVES and XRSTORS handle saving and
> restoring both of these states.
>
> <The supervisor XSTATE code might be reworked based on issues reported
> in the past. The Uintr context switching code would also need rework and
> additional testing in that regard.>

What? Which issues were reported and if they have been reported then how
is the provided code correct?

> +/*
> + * State component 14 is supervisor state used for User Interrupts state.
> + * The size of this state is 48 bytes
> + */
> +struct uintr_state {
> +	u64 handler;
> +	u64 stack_adjust;
> +	u32 uitt_size;
> +	u8  uinv;
> +	u8  pad1;
> +	u8  pad2;
> +	u8  uif_pad3;		/* bit 7 - UIF, bits 6:0 - reserved */

Please do not use tail comments. Also what kind of name is uif_pad3?
Bitfields exist for a reason.

Aside of that please use tabs to seperate type and name.

> +	u64 upid_addr;
> +	u64 uirr;
> +	u64 uitt_addr;
> +} __packed;
> +

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector
  2021-09-13 20:01 ` [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector Sohil Mehta
@ 2021-09-23 23:07   ` Thomas Gleixner
  2021-09-25 13:30     ` Thomas Gleixner
  2021-09-27 19:26     ` Sohil Mehta
  0 siblings, 2 replies; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-23 23:07 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> A user interrupt notification vector is used on the receiver's cpu to
> identify an interrupt as a user interrupt (and not a kernel interrupt).
> Hardware uses the same notification vector to generate an IPI from a
> sender's cpu core when the SENDUIPI instruction is executed.
>
> Typically, the kernel shouldn't receive an interrupt with this vector.
> However, it is possible that the kernel might receive this vector.
>
> Scenario that can cause the spurious interrupt:
>
> Step	cpu 0 (receiver task)		cpu 1 (sender task)
> ----	---------------------		-------------------
> 1	task is running
> 2					executes SENDUIPI
> 3					IPI sent
> 4	context switched out
> 5	IPI delivered
> 	(kernel interrupt detected)
>
> A kernel interrupt can be detected, if a receiver task gets scheduled
> out after the SENDUIPI-based IPI was sent but before the IPI was
> delivered.

What happens if the SENDUIPI is issued when the target task is not on
the CPU? How is that any different from the above?

> The kernel doesn't need to do anything in this case other than receiving
> the interrupt and clearing the local APIC. The user interrupt is always
> stored in the receiver's UPID before the IPI is generated. When the
> receiver gets scheduled back the interrupt would be delivered based on
> its UPID.

So why on earth is that vector reaching the CPU at all?

> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +	seq_printf(p, "%*s: ", prec, "UIS");

No point in printing that when user interrupts are not available/enabled
on the system.

> +	for_each_online_cpu(j)
> +		seq_printf(p, "%10u ", irq_stats(j)->uintr_spurious_count);
> +	seq_puts(p, "  User-interrupt spurious event\n");
>  #endif
>  	return 0;
>  }
> @@ -325,6 +331,33 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>  }
>  #endif
>  
> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +/*
> + * Handler for UINTR_NOTIFICATION_VECTOR.
> + *
> + * The notification vector is used by the cpu to detect a User Interrupt. In
> + * the typical usage, the cpu would handle this interrupt and clear the local
> + * apic.
> + *
> + * However, it is possible that the kernel might receive this vector. This can
> + * happen if the receiver thread was running when the interrupt was sent but it
> + * got scheduled out before the interrupt was delivered. The kernel doesn't
> + * need to do anything other than clearing the local APIC. A pending user
> + * interrupt is always saved in the receiver's UPID which can be referenced
> + * when the receiver gets scheduled back.
> + *
> + * If the kernel receives a storm of these, it could mean an issue with the
> + * kernel's saving and restoring of the User Interrupt MSR state; Specifically,
> + * the notification vector bits in the IA32_UINTR_MISC_MSR.

Definitely well thought out hardware that.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-23 12:19     ` Greg KH
  2021-09-23 14:09       ` Greg KH
@ 2021-09-23 23:09       ` Sohil Mehta
  2021-09-24  0:17       ` Sohil Mehta
  2 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-23 23:09 UTC (permalink / raw)
  To: Greg KH
  Cc: Hansen, Dave, x86, linux-kernel, Luck, Tony, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Lutomirski, Andy,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Raj, Ashok, Jacob Pan, Kammela,
	Gayatri, Zeng, Guang, Williams, Dan J, Witt, Randy E, Shankar,
	Ravi V, Thomas, Ramesh, linux-api, linux-arch, linux-kselftest

On 9/23/2021 5:19 AM, Greg KH wrote:
> On Tue, Sep 14, 2021 at 07:03:36PM +0000, Mehta, Sohil wrote:
>
> Here is the updated table:
> +---------------------+-------------------------+
> | IPC type            |   Relative Latency      |
> |                     |(normalized to User IPI) |
> +---------------------+-------------------------+
> | User IPI            |                     1.0 |
> | User IPI (blocked)  |                     8.9 |
> | Signal              |                    14.8 |
> | Eventfd             |                     9.7 |
> | Pipe                |                    16.3 |
> | Domain              |                    17.3 |
> +---------------------+-------------------------+
> Relative is just that, "relative".  If the real values are extremely
> tiny, then relative is just "this goes a tiny tiny bit faster than what
> you have today in eventfd", right?
>
> So how about "absolute"?  What are we talking here?

Thanks Greg for reviewing the patches.

The reason I have not included absolute numbers is that on a 
pre-production platform it could be misleading. The data here is more of 
an approximation with the final performance expected to trend in this 
direction.

I have used the term "relative" only to signify that this is comparing 
User IPI with others.

Let's say, if eventfd took 9.7 usec on a system then User IPI (running) 
would take 1 usec. So it would still be a 9x improvement.

But, I agree with your point. This is only a micro-benchmark performance 
comparison. The overall gain in a real workload would depend on how it 
uses IPC.

+---------------------+------------------------------+
| IPC type            |       Example Latency        |
|                     |        (micro seconds)       |
+---------------------+------------------------------+
| User IPI (running)  |                     1.0 usec |
| User IPI (blocked)  |                     8.9 usec |
| Signal              |                    14.8 usec |
| Eventfd             |                     9.7 usec |
| Pipe                |                    16.3 usec |
| Domain              |                    17.3 usec |
+---------------------+------------------------------+


> And this is really only for the "one userspace task waking up another
> userspace task" policies.  What real workload can actually use this?

A User IPI sender could be registered to send IPIs to multiple targets. 
But, there is no broadcast mechanism, so it can only target one receiver 
everytime it executes the SENDUIPI instruction.

Thanks,

Sohil

> thanks,
>
> greg k-h



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-23 14:09       ` Greg KH
  2021-09-23 14:46         ` Dave Hansen
@ 2021-09-23 23:24         ` Sohil Mehta
  1 sibling, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-23 23:24 UTC (permalink / raw)
  To: Greg KH
  Cc: Hansen, Dave, x86, linux-kernel, Luck, Tony, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Lutomirski, Andy,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Raj, Ashok, Jacob Pan, Kammela,
	Gayatri, Zeng, Guang, Williams, Dan J, Witt, Randy E, Shankar,
	Ravi V, Thomas, Ramesh, linux-api, linux-arch, linux-kselftest

On 9/23/2021 7:09 AM, Greg KH wrote:
> Also, you forgot to list Binder in the above IPC type.
>
Thanks for pointing that out. In the LPC discussion today there was also 
a suggestion to compare this with Futex wake.

I'll include a comparison with Binder and Futex next time.

I used this IPC benchmark this time but it doesn't include Binder and Futex.

https://github.com/goldsborough/ipc-bench

Would you know if there is anything out there that is more comprehensive 
for benchmarking IPC?

Thanks,

Sohil




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-13 20:01 ` [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls Sohil Mehta
  2021-09-23 12:26   ` Greg KH
@ 2021-09-23 23:52   ` Thomas Gleixner
  2021-09-27 23:57     ` Sohil Mehta
  1 sibling, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-23 23:52 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> +/* User Posted Interrupt Descriptor (UPID) */
> +struct uintr_upid {
> +	struct {
> +		u8 status;	/* bit 0: ON, bit 1: SN, bit 2-7: reserved */
> +		u8 reserved1;	/* Reserved */
> +		u8 nv;		/* Notification vector */
> +		u8 reserved2;	/* Reserved */
> +		u32 ndst;	/* Notification destination */
> +	} nc __packed;		/* Notification control */
> +	u64 puir;		/* Posted user interrupt requests */
> +} __aligned(64);
> +
> +/* UPID Notification control status */
> +#define UPID_ON		0x0	/* Outstanding notification */
> +#define UPID_SN		0x1	/* Suppressed notification */

Come on. This are bits in upid.status, right? So why can't the comment
above these defines says so and why can't the names not reflect that?

> +struct uintr_upid_ctx {
> +	struct uintr_upid *upid;
> +	refcount_t refs;

Please use tabular format for struct members. 

> +};
> +
> +struct uintr_receiver {
> +	struct uintr_upid_ctx *upid_ctx;
> +};

So we need a struct to wrap a pointer to another struct. Why?

> +inline bool uintr_arch_enabled(void)

What's this arch_enabled indirection for? Is this used anywhere in
non-architecture code?

> +{
> +	return static_cpu_has(X86_FEATURE_UINTR);
> +}
> +
> +static inline bool is_uintr_receiver(struct task_struct *t)
> +{
> +	return !!t->thread.ui_recv;
> +}
> +
> +static inline u32 cpu_to_ndst(int cpu)
> +{
> +	u32 apicid = (u32)apic->cpu_present_to_apicid(cpu);
> +
> +	WARN_ON_ONCE(apicid == BAD_APICID);

Brilliant. If x2apic is not enabled then this case returns

> +	if (!x2apic_enabled())
> +		return (apicid << 8) & 0xFF00;

  (BAD_APICID << 8) & 0xFF00 == 0xFF ....

> +int do_uintr_unregister_handler(void)
> +{
> +	struct task_struct *t = current;
> +	struct fpu *fpu = &t->thread.fpu;
> +	struct uintr_receiver *ui_recv;
> +	u64 msr64;
> +
> +	if (!is_uintr_receiver(t))
> +		return -EINVAL;
> +
> +	pr_debug("recv: Unregister handler and clear MSRs for task=%d\n",
> +		 t->pid);
> +
> +	/*
> +	 * TODO: Evaluate usage of fpregs_lock() and get_xsave_addr(). Bugs
> +	 * have been reported recently for PASID and WRPKRU.

Again. Which bugs and why haven't they been evaluated before posting?

> +	 * UPID and ui_recv will be referenced during context switch. Need to
> +	 * disable preemption while modifying the MSRs, UPID and ui_recv thread
> +	 * struct.
> +	 */
> +	fpregs_lock();

And because you need to disable preemption you need to use
fpregs_lock(), right? That's not what fpregs_lock() is about.

> +	/* Clear only the receiver specific state. Sender related state is not modified */
> +	if (fpregs_state_valid(fpu, smp_processor_id())) {
> +		/* Modify only the relevant bits of the MISC MSR */
> +		rdmsrl(MSR_IA32_UINTR_MISC, msr64);
> +		msr64 &= ~GENMASK_ULL(39, 32);

This is exactly the crap which results from not defining stuff
properly. Random numbers in code which nobody can understand.

> +		wrmsrl(MSR_IA32_UINTR_MISC, msr64);
> +		wrmsrl(MSR_IA32_UINTR_PD, 0ULL);
> +		wrmsrl(MSR_IA32_UINTR_RR, 0ULL);
> +		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 0ULL);
> +		wrmsrl(MSR_IA32_UINTR_HANDLER, 0ULL);
> +	} else {
> +		struct uintr_state *p;
> +
> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
> +		if (p) {
> +			p->handler = 0;
> +			p->stack_adjust = 0;
> +			p->upid_addr = 0;
> +			p->uinv = 0;
> +			p->uirr = 0;
> +		}

So p == NULL is expected here?

> +	}
> +
> +	ui_recv = t->thread.ui_recv;
> +	/*
> +	 * Suppress notifications so that no further interrupts are generated
> +	 * based on this UPID.
> +	 */
> +	set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
> +
> +	put_upid_ref(ui_recv->upid_ctx);
> +	kfree(ui_recv);
> +	t->thread.ui_recv = NULL;

Why has this put/kfree stuff to be in the fpregs locked section?

> +	fpregs_unlock();
> +
> +	return 0;
> +}
> +
> +int do_uintr_register_handler(u64 handler)
> +{
> +	struct uintr_receiver *ui_recv;
> +	struct uintr_upid *upid;
> +	struct task_struct *t = current;
> +	struct fpu *fpu = &t->thread.fpu;
> +	u64 misc_msr;
> +	int cpu;
> +
> +	if (is_uintr_receiver(t))
> +		return -EBUSY;
> +
> +	ui_recv = kzalloc(sizeof(*ui_recv), GFP_KERNEL);
> +	if (!ui_recv)
> +		return -ENOMEM;
> +
> +	ui_recv->upid_ctx = alloc_upid();
> +	if (!ui_recv->upid_ctx) {
> +		kfree(ui_recv);
> +		pr_debug("recv: alloc upid failed for task=%d\n", t->pid);
> +		return -ENOMEM;
> +	}
> +
> +	/*
> +	 * TODO: Evaluate usage of fpregs_lock() and get_xsave_addr(). Bugs
> +	 * have been reported recently for PASID and WRPKRU.

Oh well.

> +	 * UPID and ui_recv will be referenced during context switch. Need to
> +	 * disable preemption while modifying the MSRs, UPID and ui_recv thread
> +	 * struct.

See above.

> +	 */
> +	fpregs_lock();
> +
> +	cpu = smp_processor_id();
> +	upid = ui_recv->upid_ctx->upid;
> +	upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
> +	upid->nc.ndst = cpu_to_ndst(cpu);
> +
> +	t->thread.ui_recv = ui_recv;
> +
> +	if (fpregs_state_valid(fpu, cpu)) {
> +		wrmsrl(MSR_IA32_UINTR_HANDLER, handler);
> +		wrmsrl(MSR_IA32_UINTR_PD, (u64)ui_recv->upid_ctx->upid);
> +
> +		/* Set value as size of ABI redzone */
> +		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 128);
> +
> +		/* Modify only the relevant bits of the MISC MSR */
> +		rdmsrl(MSR_IA32_UINTR_MISC, misc_msr);
> +		misc_msr |= (u64)UINTR_NOTIFICATION_VECTOR << 32;
> +		wrmsrl(MSR_IA32_UINTR_MISC, misc_msr);
> +	} else {
> +		struct xregs_state *xsave;
> +		struct uintr_state *p;
> +
> +		xsave = &fpu->state.xsave;
> +		xsave->header.xfeatures |= XFEATURE_MASK_UINTR;
> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
> +		if (p) {
> +			p->handler = handler;
> +			p->upid_addr = (u64)ui_recv->upid_ctx->upid;
> +			p->stack_adjust = 128;
> +			p->uinv = UINTR_NOTIFICATION_VECTOR;
> +		}

Again. How is p supposed to be NULL and if so, why is this silently
treating this as success?

> +	}
> +
> +	fpregs_unlock();

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-23 12:26   ` Greg KH
@ 2021-09-24  0:05     ` Thomas Gleixner
  2021-09-27 23:20     ` Sohil Mehta
  1 sibling, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-24  0:05 UTC (permalink / raw)
  To: Greg KH, Sohil Mehta
  Cc: x86, Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On Thu, Sep 23 2021 at 14:26, Greg KH wrote:
> On Mon, Sep 13, 2021 at 01:01:25PM -0700, Sohil Mehta wrote:
>> +SYSCALL_DEFINE2(uintr_register_handler, u64 __user *, handler, unsigned int, flags)
>> +{
>> +	int ret;
>> +
>> +	if (!uintr_arch_enabled())
>> +		return -EOPNOTSUPP;
>> +
>> +	if (flags)
>> +		return -EINVAL;
>> +
>> +	/* TODO: Validate the handler address */
>> +	if (!handler)
>> +		return -EFAULT;
>
> Um, that's a pretty big "TODO" here.
>
> How are you going to define what is, and what is not, an allowed
> "handler"?

The requirement is obviously that this is a valid user space address,
but that's so hard to validate that it needs to be done later.

At least the documentation claims that a non user space address should
result in a #GP on delivery. Whether that holds in all corner cases (see
the spurious handling muck) is a different question and might come back
to us later through a channel which we hate with a passion :)

> I'm sure the security people would like to get involved here, as well as
> the auditing people.  Have you talked with them about their requirements
> for this type of stuff?

The handler is strictly a user space address and user space is generally
allowed to shoot itself into the foot. If the address is bogus then this
will resolve into inaccessible, not-mapped or not exectuable space and
the application can keep the pieces.

Whether the hardware handles the resulting exception correctly is a
different question, but that can't be prevented by any sanity check on
the address at registration time.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-23 12:19     ` Greg KH
  2021-09-23 14:09       ` Greg KH
  2021-09-23 23:09       ` Sohil Mehta
@ 2021-09-24  0:17       ` Sohil Mehta
  2 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-24  0:17 UTC (permalink / raw)
  To: Greg KH
  Cc: Hansen, Dave, x86, linux-kernel, Luck, Tony, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H . Peter Anvin, Lutomirski, Andy,
	Jens Axboe, Christian Brauner, Peter Zijlstra, Shuah Khan,
	Arnd Bergmann, Jonathan Corbet, Raj, Ashok, Jacob Pan, Kammela,
	Gayatri, Zeng, Guang, Williams, Dan J, Witt, Randy E, Shankar,
	Ravi V, Thomas, Ramesh, linux-api, linux-arch, linux-kselftest

On 9/23/2021 5:19 AM, Greg KH wrote:

> What real workload can actually use this?
>
I missed replying to this.

User mode runtimes is one the usages that we think would benefit from 
User IPIs.

Also as Jens mentioned in another thread, this could help kernel to user 
notifications in io_uring (using User Interrupts instead of eventfd for 
signaling).

Libevent is another abstraction that we are evaluating.


Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 07/13] x86/process/64: Add uintr task context switch support
  2021-09-13 20:01 ` [RFC PATCH 07/13] x86/process/64: Add uintr task context switch support Sohil Mehta
@ 2021-09-24  0:41   ` Thomas Gleixner
  2021-09-28  0:30     ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-24  0:41 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:

> User interrupt state is saved and restored using xstate supervisor
> feature support. This includes the MSR state and the User Interrupt Flag
> (UIF) value.
>
> During context switch update the UPID for a uintr task to reflect the
> current state of the task; namely whether the task should receive
> interrupt notifications and which cpu the task is currently running on.
>
> XSAVES clears the notification vector (UINV) in the MISC MSR to prevent
> interrupts from being recognized in the UIRR MSR while the task is being
> context switched. The UINV is restored back when the kernel does an
> XRSTORS.
>
> However, this conflicts with the kernel's lazy restore optimization
> which skips an XRSTORS if the kernel is scheduling the same user task
> back and the underlying MSR state hasn't been modified. Special handling
> is needed for a uintr task in the context switch path to keep using this
> optimization.

And this special handling is?

Distinct void of content here.

>  /* Check that the stack and regs on entry from user mode are sane. */
>  static __always_inline void arch_check_user_regs(struct pt_regs *regs)
> @@ -57,6 +58,9 @@ static inline void arch_exit_to_user_mode_prepare(struct pt_regs *regs,
>  	if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
>  		switch_fpu_return();
>  
> +	if (static_cpu_has(X86_FEATURE_UINTR))
> +		switch_uintr_return();
> +

...

> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -95,6 +95,14 @@ EXPORT_SYMBOL(irq_fpu_usable);
>   * over the place.
>   *
>   * FXSAVE and all XSAVE variants preserve the FPU register state.
> + *
> + * When XSAVES is called with XFEATURE_UINTR enabled it
> + * saves the FPU state and clears the interrupt notification
> + * vector byte of the MISC_MSR [bits 39:32]. This is required
> + * to stop detecting additional User Interrupts after we
> + * have saved the FPU state. Before going back to userspace
> + * we would correct this and only program the byte that was

we would?

This simply has to be done before returning to user space no matter
what. And _we_ can't do that. Please do not impersonate code.

> + * cleared.
>   */
>  void save_fpregs_to_fpstate(struct fpu *fpu)
>  {
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index ec0d836a13b1..62b82137db9c 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -53,6 +53,7 @@
>  #include <asm/xen/hypervisor.h>
>  #include <asm/vdso.h>
>  #include <asm/resctrl.h>
> +#include <asm/uintr.h>
>  #include <asm/unistd.h>
>  #include <asm/fsgsbase.h>
>  #ifdef CONFIG_IA32_EMULATION
> @@ -565,6 +566,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>  	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
>  		     this_cpu_read(hardirq_stack_inuse));
>  
> +	if (static_cpu_has(X86_FEATURE_UINTR))

cpu_feature_enabled() please.

> +		switch_uintr_prepare(prev_p);
> +
>  	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
>  		switch_fpu_prepare(prev_fpu, cpu);
>  
> diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
> index 2c6042a6840a..7a29888050ad 100644
> --- a/arch/x86/kernel/uintr_core.c
> +++ b/arch/x86/kernel/uintr_core.c
> @@ -238,3 +238,78 @@ int do_uintr_register_handler(u64 handler)
>  
>  	return 0;
>  }
> +
> +/* Suppress notifications since this task is being context switched out */
> +void switch_uintr_prepare(struct task_struct *prev)
> +{
> +	struct uintr_upid *upid;
> +
> +	if (is_uintr_receiver(prev)) {
> +		upid = prev->thread.ui_recv->upid_ctx->upid;
> +		set_bit(UPID_SN, (unsigned long *)&upid->nc.status);

Please add a comment why this needs to be a locked instruction.

> +	}
> +}
> +
> +/*
> + * Do this right before we are going back to userspace after the FPU has been
> + * reloaded i.e. TIF_NEED_FPU_LOAD is clear.
> + * Called from arch_exit_to_user_mode_prepare() with interrupts disabled.
> + */
> +void switch_uintr_return(void)
> +{
> +	struct uintr_upid *upid;
> +	u64 misc_msr;
> +
> +	if (is_uintr_receiver(current)) {
> +		/*
> +		 * The XSAVES instruction clears the UINTR notification
> +		 * vector(UINV) in the UINT_MISC MSR when user context gets
> +		 * saved. Before going back to userspace we need to restore the
> +		 * notification vector. XRSTORS would automatically restore the
> +		 * notification but we can't be sure that XRSTORS will always
> +		 * be called when going back to userspace. Also if XSAVES gets
> +		 * called twice the UINV stored in the Xstate buffer will be
> +		 * overwritten. Threfore, before going back to userspace we
> +		 * always check if the UINV is set and reprogram if needed.
> +		 *
> +		 * Alternatively, we could combine this with
> +		 * switch_fpu_return() and program the MSR whenever we are
> +		 * skipping the XRSTORS. We need special precaution to make
> +		 * sure the UINV value in the XSTATE buffer doesn't get
> +		 * overwritten by calling XSAVES twice.
> +		 */
> +		WARN_ON_ONCE(test_thread_flag(TIF_NEED_FPU_LOAD));
> +
> +		/* Modify only the relevant bits of the MISC MSR */

I surely appreciate the well thought out hardware design which requires
yet another rdmsrl/wrmsrl pair here.

Of course this is invoked unconditionally when the CPU has
X86_FEATURE_UINTR:

> +	if (static_cpu_has(X86_FEATURE_UINTR))
> +		switch_uintr_return();

Why?

If the sequence is:

     syscall()
     do_stuff()
     return_to_user()

then what on earth has modified that MSR state? Nothing at all, but you
still run this code. What for?

> +		rdmsrl(MSR_IA32_UINTR_MISC, misc_msr);
> +		if (!(misc_msr & GENMASK_ULL(39, 32))) {

Hardcoded random numbers ...

> +			misc_msr |= (u64)UINTR_NOTIFICATION_VECTOR << 32;

Hardcoded numerical shift value...

> +			wrmsrl(MSR_IA32_UINTR_MISC, misc_msr);
> +		}
> +
> +		/*
> +		 * It is necessary to clear the SN bit after we set UINV and
> +		 * NDST to avoid incorrect interrupt routing.

Right, because if the task did not go through schedule() this state has not
been changed at all and therefore you need to clear SN just in case to
make sure that it hasn't been set by accident, right?

> +		 */
> +		upid = current->thread.ui_recv->upid_ctx->upid;
> +		upid->nc.ndst = cpu_to_ndst(smp_processor_id());
> +		clear_bit(UPID_SN, (unsigned long *)&upid->nc.status);
> +
> +		/*
> +		 * Interrupts might have accumulated in the UPID while the
> +		 * thread was preempted. In this case invoke the hardware
> +		 * detection sequence manually by sending a self IPI with UINV.
> +		 * Since UINV is set and SN is cleared, any new UINTR
> +		 * notifications due to the self IPI or otherwise would result
> +		 * in the hardware updating the UIRR directly.
> +		 * No real interrupt would be generated as a result of this.
> +		 *
> +		 * The alternative is to atomically read and clear the UPID and
> +		 * program the UIRR. In that case the kernel would need to
> +		 * carefully manage the race with the hardware if the UPID gets
> +		 * updated after the read.
> +		 */
> +		if (READ_ONCE(upid->puir))
> +			apic->send_IPI_self(UINTR_NOTIFICATION_VECTOR);

So sending an self IPI is more performant than doing it purely in
memory with some care? I seriously doubt that.

Oh well, I was under the impression that this is about performance and
not about adding as much overhead as possible.

But what do I know....

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 08/13] x86/process/64: Clean up uintr task fork and exit paths
  2021-09-13 20:01 ` [RFC PATCH 08/13] x86/process/64: Clean up uintr task fork and exit paths Sohil Mehta
@ 2021-09-24  1:02   ` Thomas Gleixner
  2021-09-28  1:23     ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-24  1:02 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:

> The user interrupt MSRs and the user interrupt state is task specific.
> During task fork and exit clear the task state, clear the MSRs and
> dereference the shared resources.
>
> Some of the memory resources like the UPID are referenced in the file
> descriptor and could be in use while the uintr_fd is still valid.
> Instead of freeing up  the UPID just dereference it.

Derefencing the UPID, i.e. accessing task->upid->foo helps in which way?

You want to drop the reference count I assume. Then please write that
so. 

> Eventually when every user releases the reference the memory resource
> will be freed up.

Yeah, eventually or not...

> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c

> @@ -260,6 +260,7 @@ int fpu_clone(struct task_struct *dst)
>  {
>  	struct fpu *src_fpu = &current->thread.fpu;
>  	struct fpu *dst_fpu = &dst->thread.fpu;
> +	struct uintr_state *uintr_state;
>  
>  	/* The new task's FPU state cannot be valid in the hardware. */
>  	dst_fpu->last_cpu = -1;
> @@ -284,6 +285,14 @@ int fpu_clone(struct task_struct *dst)
>  
>  	else
>  		save_fpregs_to_fpstate(dst_fpu);
> +
> +	/* UINTR state is not expected to be inherited (in the current design). */
> +	if (static_cpu_has(X86_FEATURE_UINTR)) {
> +		uintr_state = get_xsave_addr(&dst_fpu->state.xsave, XFEATURE_UINTR);
> +		if (uintr_state)
> +			memset(uintr_state, 0, sizeof(*uintr_state));
> +	}

1) If the FPU registers are up to date then this can be completely
   avoided by excluding the UINTR component from XSAVES

2) If the task never used that muck then UINTR is in init state and
   clearing that memory is a redunant exercise because it has been
   cleared already

So yes, this clearly is evidence how this is enhancing performance.

> +/*
> + * This should only be called from exit_thread().

Should? Would? Maybe or what?

> + * exit_thread() can happen in current context when the current thread is
> + * exiting or it can happen for a new thread that is being created.

A right that makes sense. If a new thread is created then it can call
exit_thread(), right?

> + * For new threads is_uintr_receiver() should fail.

Should fail?

> + */
> +void uintr_free(struct task_struct *t)
> +{
> +	struct uintr_receiver *ui_recv;
> +	struct fpu *fpu;
> +
> +	if (!static_cpu_has(X86_FEATURE_UINTR) || !is_uintr_receiver(t))
> +		return;
> +
> +	if (WARN_ON_ONCE(t != current))
> +		return;
> +
> +	fpu = &t->thread.fpu;
> +
> +	fpregs_lock();
> +
> +	if (fpregs_state_valid(fpu, smp_processor_id())) {
> +		wrmsrl(MSR_IA32_UINTR_MISC, 0ULL);
> +		wrmsrl(MSR_IA32_UINTR_PD, 0ULL);
> +		wrmsrl(MSR_IA32_UINTR_RR, 0ULL);
> +		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 0ULL);
> +		wrmsrl(MSR_IA32_UINTR_HANDLER, 0ULL);
> +	} else {
> +		struct uintr_state *p;
> +
> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
> +		if (p) {
> +			p->handler = 0;
> +			p->uirr = 0;
> +			p->upid_addr = 0;
> +			p->stack_adjust = 0;
> +			p->uinv = 0;
> +		}
> +	}
> +
> +	/* Check: Can a thread be context switched while it is exiting? */

This looks like a question which should be answered _before_ writing
such code.

> +	ui_recv = t->thread.ui_recv;
> +
> +	/*
> +	 * Suppress notifications so that no further interrupts are
> +	 * generated based on this UPID.
> +	 */
> +	set_bit(UPID_SN, (unsigned long *)&ui_recv->upid_ctx->upid->nc.status);
> +	put_upid_ref(ui_recv->upid_ctx);
> +	kfree(ui_recv);
> +	t->thread.ui_recv = NULL;

Again, why needs all this put/kfree muck be within the fpregs locked section?

> +	fpregs_unlock();
> +}

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 09/13] x86/uintr: Introduce vector registration and uintr_fd syscall
  2021-09-13 20:01 ` [RFC PATCH 09/13] x86/uintr: Introduce vector registration and uintr_fd syscall Sohil Mehta
@ 2021-09-24 10:33   ` Thomas Gleixner
  2021-09-28 20:40     ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-24 10:33 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>  static void free_upid(struct uintr_upid_ctx *upid_ctx)
>  {
> +	put_task_struct(upid_ctx->task);

That's current, right?

>  	kfree(upid_ctx->upid);
>  	upid_ctx->upid = NULL;
>  	kfree(upid_ctx);
> @@ -93,6 +93,7 @@ static struct uintr_upid_ctx *alloc_upid(void)
>  
>  	upid_ctx->upid = upid;
>  	refcount_set(&upid_ctx->refs, 1);
> +	upid_ctx->task = get_task_struct(current);

Current takes a refcount on it's own task struct when allocating upid,
and releases it at some point when freeing upid, right?

What for? Comments are overrated, except for comments describing
the obvious in the wrong way.

If this ever comes back in some form, then I pretty please want the life
time rules of this documented properly. The current state is
unreviewable.

>  	return upid_ctx;
>  }
> @@ -103,6 +104,77 @@ static void put_upid_ref(struct uintr_upid_ctx *upid_ctx)
>  		free_upid(upid_ctx);
>  }
>  
> +static struct uintr_upid_ctx *get_upid_ref(struct uintr_upid_ctx *upid_ctx)
> +{
> +	refcount_inc(&upid_ctx->refs);
> +	return upid_ctx;
> +}
> +
> +static void __clear_vector_from_upid(u64 uvec, struct uintr_upid *upid)
> +{
> +	clear_bit(uvec, (unsigned long *)&upid->puir);
> +}
> +
> +static void __clear_vector_from_task(u64 uvec)
> +{
> +	struct task_struct *t = current;
> +
> +	pr_debug("recv: task=%d free vector %llu\n", t->pid, uvec);
> +
> +	if (!(BIT_ULL(uvec) & t->thread.ui_recv->uvec_mask))
> +		return;
> +
> +	clear_bit(uvec, (unsigned long *)&t->thread.ui_recv->uvec_mask);
> +
> +	if (!t->thread.ui_recv->uvec_mask)
> +		pr_debug("recv: task=%d unregistered all user vectors\n", t->pid);

Once you are done debugging this complex function can you please turn it
into an unconditional clear_bit(...) at the call site?

> +/* Callback to clear the vector structures when a vector is unregistered. */
> +static void receiver_clear_uvec(struct callback_head *head)
> +{
> +	struct uintr_receiver_info *r_info;
> +	struct uintr_upid_ctx *upid_ctx;
> +	struct task_struct *t = current;
> +	u64 uvec;
> +
> +	r_info = container_of(head, struct uintr_receiver_info, twork);
> +	uvec = r_info->uvec;
> +	upid_ctx = r_info->upid_ctx;
> +
> +	/*
> +	 * If a task has unregistered the interrupt handler the vector
> +	 * structures would have already been cleared.

would ? No. They must have been cleared already, anything else is a bug.

> +	 */
> +	if (is_uintr_receiver(t)) {
> +		/*
> +		 * The UPID context in the callback might differ from the one
> +		 * on the task if the task unregistered its interrupt handler
> +		 * and then registered itself again. The vector structures
> +		 * related to the previous UPID would have already been cleared
> +		 * in that case.
> +		 */
> +		if (t->thread.ui_recv->upid_ctx != upid_ctx) {
> +			pr_debug("recv: task %d is now using a different UPID\n",
> +				 t->pid);
> +			goto out_free;
> +		}
> +
> +		/*
> +		 * If the vector has been recognized in the UIRR don't modify
> +		 * it. We need to disable User Interrupts before modifying the
> +		 * UIRR. It might be better to just let that interrupt get
> +		 * delivered.

Might be better? Please provide coherent explanations why this is correct.

> +		 */
> +		__clear_vector_from_upid(uvec, upid_ctx->upid);
> +		__clear_vector_from_task(uvec);
> +	}
> +
> +out_free:
> +	put_upid_ref(upid_ctx);
> +	kfree(r_info);
> +}
> +
>  int do_uintr_unregister_handler(void)
>  {
>  	struct task_struct *t = current;
> @@ -239,6 +311,53 @@ int do_uintr_register_handler(u64 handler)
>  	return 0;
>  }
>  
> +void do_uintr_unregister_vector(struct uintr_receiver_info *r_info)
> +{
> +	int ret;
> +
> +	pr_debug("recv: Adding task work to clear vector %llu added for task=%d\n",
> +		 r_info->uvec, r_info->upid_ctx->task->pid);
> +
> +	init_task_work(&r_info->twork, receiver_clear_uvec);

How is this safe? Reinitialization has to be serialized against other
usage. Again: Document the life time and serialization rules. Your
pr_debugs sprinkled all over the place are not a replacement for that.

> +	ret = task_work_add(r_info->upid_ctx->task, &r_info->twork, true);

Care to look at the type of the third argument of task_work_add()?

> +struct uintrfd_ctx {
> +	struct uintr_receiver_info *r_info;

Yet another wrapper struct? What for?

> +/*
> + * sys_uintr_create_fd - Create a uintr_fd for the registered interrupt vector.

So this creates a file descriptor for a vector which is already
allocated and then it calls do_uintr_register_vector() which allocates
the vector?

> + */
> +SYSCALL_DEFINE2(uintr_create_fd, u64, vector, unsigned int, flags)
> +{

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls
  2021-09-13 20:01 ` [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls Sohil Mehta
  2021-09-23 12:28   ` Greg KH
@ 2021-09-24 10:54   ` Thomas Gleixner
  1 sibling, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-24 10:54 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> +/*
> + * No lock is needed to read the active flag. Writes only happen from
> + * r_info->task that owns the UPID. Everyone else would just read this flag.
> + *
> + * This only provides a static check. The receiver may become inactive right
> + * after this check. The primary reason to have this check is to prevent future
> + * senders from connecting with this UPID, since the receiver task has already
> + * made this UPID inactive.

How is that not racy?

> +static void free_uitt(struct uintr_uitt_ctx *uitt_ctx)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&uitt_ctx->uitt_lock, flags);
> +	kfree(uitt_ctx->uitt);

Again. Please move kfree() outside of the lock held region. But aside of
that what is this lock protecting here?

> +	uitt_ctx->uitt = NULL;
> +	spin_unlock_irqrestore(&uitt_ctx->uitt_lock, flags);

If there is concurrency then the other task which is blocked on
uitt_lock will operate on uitt_ctx while the same is freed.

Again, this lacks any life time and serialization rules. Just sprinkling
locks all over the place does not make it magically correct.

> +	kfree(uitt_ctx);
> +}

> +static void put_uitt_ref(struct uintr_uitt_ctx *uitt_ctx)
> +{
> +	if (refcount_dec_and_test(&uitt_ctx->refs))
> +		free_uitt(uitt_ctx);
> +}


> +static struct uintr_uitt_ctx *get_uitt_ref(struct uintr_uitt_ctx *uitt_ctx)
> +{
> +	refcount_inc(&uitt_ctx->refs);
> +	return uitt_ctx;
> +}
> +
> +static inline void mark_uitte_invalid(struct uintr_sender_info *s_info)
> +{
> +	struct uintr_uitt_entry *uitte;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&s_info->uitt_ctx->uitt_lock, flags);
> +	uitte = &s_info->uitt_ctx->uitt[s_info->uitt_index];
> +	uitte->valid = 0;
> +	spin_unlock_irqrestore(&s_info->uitt_ctx->uitt_lock, flags);
> +}
> +
>  static void __clear_vector_from_upid(u64 uvec, struct uintr_upid *upid)
>  {
>  	clear_bit(uvec, (unsigned long *)&upid->puir);
> @@ -175,6 +290,210 @@ static void receiver_clear_uvec(struct callback_head *head)
>  	kfree(r_info);
>  }
>  
> +static void teardown_uitt(void)
> +{
> +	struct task_struct *t = current;
> +	struct fpu *fpu = &t->thread.fpu;
> +	u64 msr64;
> +
> +	put_uitt_ref(t->thread.ui_send->uitt_ctx);
> +	kfree(t->thread.ui_send);
> +	t->thread.ui_send = NULL;
> +
> +	fpregs_lock();
> +
> +	if (fpregs_state_valid(fpu, smp_processor_id())) {
> +		/* Modify only the relevant bits of the MISC MSR */
> +		rdmsrl(MSR_IA32_UINTR_MISC, msr64);
> +		msr64 &= GENMASK_ULL(63, 32);

More magic numbers.

> +		wrmsrl(MSR_IA32_UINTR_MISC, msr64);
> +		wrmsrl(MSR_IA32_UINTR_TT, 0ULL);

> +static void __free_uitt_entry(unsigned int entry)
> +{
> +	struct task_struct *t = current;
> +	unsigned long flags;
> +
> +	if (entry >= UINTR_MAX_UITT_NR)
> +		return;
> +
> +	if (!is_uintr_sender(t))
> +		return;
> +
> +	pr_debug("send: Freeing UITTE entry %d for task=%d\n", entry, t->pid);
> +
> +	spin_lock_irqsave(&t->thread.ui_send->uitt_ctx->uitt_lock, flags);
> +	memset(&t->thread.ui_send->uitt_ctx->uitt[entry], 0,
> +	       sizeof(struct uintr_uitt_entry));
> +	spin_unlock_irqrestore(&t->thread.ui_send->uitt_ctx->uitt_lock,
> flags);

What's the spinlock protecting here?

> +	clear_bit(entry, (unsigned long *)t->thread.ui_send->uitt_mask);
> +
> +	if (is_uitt_empty(t)) {
> +		pr_debug("send: UITT mask is empty. Dereference and teardown UITT\n");
> +		teardown_uitt();
> +	}
> +}

> +void do_uintr_unregister_sender(struct uintr_receiver_info *r_info,
> +				struct uintr_sender_info *s_info)
> +{
> +	int ret;
> +
> +	/*
> +	 * To make sure any new senduipi result in a #GP fault.
> +	 * The task work might take non-zero time to kick the process out.

-ENOPARSE

> +	 */
> +	mark_uitte_invalid(s_info);
> +
> +	pr_debug("send: Adding Free UITTE %d task work for task=%d\n",
> +		 s_info->uitt_index, s_info->task->pid);
> +
> +	init_task_work(&s_info->twork, sender_free_uitte);
> +	ret = task_work_add(s_info->task, &s_info->twork, true);
> +	if (ret) {
> +		/*
> +		 * Dereferencing the UITT and UPID here since the task has
> +		 * exited.
> +		 */
> +		pr_debug("send: Free UITTE %d task=%d has already exited\n",
> +			 s_info->uitt_index, s_info->task->pid);
> +		put_upid_ref(s_info->r_upid_ctx);
> +		put_uitt_ref(s_info->uitt_ctx);
> +		put_task_struct(s_info->task);
> +		kfree(s_info);
> +		return;
> +	}
> +}
> +
> +int do_uintr_register_sender(struct uintr_receiver_info *r_info,
> +			     struct uintr_sender_info *s_info)
> +{
> +	struct uintr_uitt_entry *uitte = NULL;
> +	struct uintr_sender *ui_send;
> +	struct task_struct *t = current;
> +	unsigned long flags;
> +	int entry;
> +	int ret;
> +
> +	/*
> +	 * Only a static check. Receiver could exit anytime after this check.
> +	 * This check only prevents connections using uintr_fd after the
> +	 * receiver has already exited/unregistered.
> +	 */
> +	if (!uintr_is_receiver_active(r_info))
> +		return -ESHUTDOWN;

How is this safe against a concurrent unregister/exit operation?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-13 20:01 ` [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall Sohil Mehta
@ 2021-09-24 11:04   ` Thomas Gleixner
  2021-09-25 12:08     ` Thomas Gleixner
  2021-09-28 23:08     ` Sohil Mehta
  2021-09-26 14:41   ` Thomas Gleixner
  2021-09-29  3:30   ` Andy Lutomirski
  2 siblings, 2 replies; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-24 11:04 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> Add a new system call to allow applications to block in the kernel and
> wait for user interrupts.
>
> <The current implementation doesn't support waking up from other
> blocking system calls like sleep(), read(), epoll(), etc.
>
> uintr_wait() is a placeholder syscall while we decide on that
> behaviour.>
>
> When the application makes this syscall the notification vector is
> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
> interrupt which is then used to wake up the process.
>
> Currently, the task wait list is global one. To make the implementation
> scalable there is a need to move to a distributed per-cpu wait list.

How are per cpu wait lists going to solve the problem?

> +
> +/*
> + * Handler for UINTR_KERNEL_VECTOR.
> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_uintr_kernel_notification)
> +{
> +	/* TODO: Add entry-exit tracepoints */
> +	ack_APIC_irq();
> +	inc_irq_stat(uintr_kernel_notifications);
> +
> +	uintr_wake_up_process();

So this interrupt happens for any of those notifications. How are they
differentiated? 
>  
> +int uintr_receiver_wait(void)
> +{
> +	struct uintr_upid_ctx *upid_ctx;
> +	unsigned long flags;
> +
> +	if (!is_uintr_receiver(current))
> +		return -EOPNOTSUPP;
> +
> +	upid_ctx = current->thread.ui_recv->upid_ctx;
> +	upid_ctx->upid->nc.nv = UINTR_KERNEL_VECTOR;
> +	upid_ctx->waiting = true;
> +	spin_lock_irqsave(&uintr_wait_lock, flags);
> +	list_add(&upid_ctx->node, &uintr_wait_list);
> +	spin_unlock_irqrestore(&uintr_wait_lock, flags);
> +
> +	set_current_state(TASK_INTERRUPTIBLE);

Because we have not enough properly implemented wait primitives you need
to open code one which is blantantly wrong vs. a concurrent wake up?

> +	schedule();

How is that correct vs. a spurious wakeup? What takes care that the
entry is removed from the list?

Again. We have proper wait primitives.

> +	return -EINTR;
> +}
> +
> +/*
> + * Runs in interrupt context.
> + * Scan through all UPIDs to check if any interrupt is on going.
> + */
> +void uintr_wake_up_process(void)
> +{
> +	struct uintr_upid_ctx *upid_ctx, *tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&uintr_wait_lock, flags);
> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
> +		if (test_bit(UPID_ON, (unsigned long*)&upid_ctx->upid->nc.status)) {
> +			set_bit(UPID_SN, (unsigned long *)&upid_ctx->upid->nc.status);
> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
> +			upid_ctx->waiting = false;
> +			wake_up_process(upid_ctx->task);
> +			list_del(&upid_ctx->node);

So any of these notification interrupts does a global mass wake up? How
does that make sense?

> +		}
> +	}
> +	spin_unlock_irqrestore(&uintr_wait_lock, flags);
> +}
> +
> +/* Called when task is unregistering/exiting */
> +static void uintr_remove_task_wait(struct task_struct *task)
> +{
> +	struct uintr_upid_ctx *upid_ctx, *tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&uintr_wait_lock, flags);
> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
> +		if (upid_ctx->task == task) {
> +			pr_debug("wait: Removing task %d from wait\n",
> +				 upid_ctx->task->pid);
> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
> +			upid_ctx->waiting = false;
> +			list_del(&upid_ctx->node);
> +		}

What? You have to do a global list walk to find the entry which you
added yourself?

Thanks,

        tglx
 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support
  2021-09-23 22:24   ` Thomas Gleixner
@ 2021-09-24 19:59     ` Sohil Mehta
  2021-09-27 20:42     ` Sohil Mehta
  1 sibling, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-24 19:59 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/23/2021 3:24 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
> This SOB chain is invalid. Ditto in several other patches.
>
>
Thank you Thomas for reviewing the patches! Really appreciate it.

I'll fix the SOB chain next time. I am planning to reply to rest of the 
comments over the next week.

Thanks,

Sohil



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-24 11:04   ` Thomas Gleixner
@ 2021-09-25 12:08     ` Thomas Gleixner
  2021-09-28 23:13       ` Sohil Mehta
  2021-09-28 23:08     ` Sohil Mehta
  1 sibling, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-25 12:08 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Fri, Sep 24 2021 at 13:04, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> +int uintr_receiver_wait(void)
>> +{
>> +	struct uintr_upid_ctx *upid_ctx;
>> +	unsigned long flags;
>> +
>> +	if (!is_uintr_receiver(current))
>> +		return -EOPNOTSUPP;
>> +
>> +	upid_ctx = current->thread.ui_recv->upid_ctx;
>> +	upid_ctx->upid->nc.nv = UINTR_KERNEL_VECTOR;
>> +	upid_ctx->waiting = true;
>> +	spin_lock_irqsave(&uintr_wait_lock, flags);
>> +	list_add(&upid_ctx->node, &uintr_wait_list);
>> +	spin_unlock_irqrestore(&uintr_wait_lock, flags);
>> +
>> +	set_current_state(TASK_INTERRUPTIBLE);
>
> Because we have not enough properly implemented wait primitives you need
> to open code one which is blantantly wrong vs. a concurrent wake up?
>
>> +	schedule();
>
> How is that correct vs. a spurious wakeup? What takes care that the
> entry is removed from the list?
>
> Again. We have proper wait primitives.

Aisde of that this is completely broken vs. CPU hotplug.

CPUX
  switchto(tsk)
    tsk->upid.ndst = apicid(smp_processor_id();

  ret_to_user()
  ...
  sys_uintr_wait()
    ...
    schedule()

After that CPU X is unplugged which means the task won't be woken up by
an user IPI which is issued after CPU X went down.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector
  2021-09-23 23:07   ` Thomas Gleixner
@ 2021-09-25 13:30     ` Thomas Gleixner
  2021-09-26 12:39       ` Thomas Gleixner
  2021-09-27 19:26     ` Sohil Mehta
  1 sibling, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-25 13:30 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> The kernel doesn't need to do anything in this case other than receiving
>> the interrupt and clearing the local APIC. The user interrupt is always
>> stored in the receiver's UPID before the IPI is generated. When the
>> receiver gets scheduled back the interrupt would be delivered based on
>> its UPID.
>
> So why on earth is that vector reaching the CPU at all?

Let's see how this works:

  task starts using UINTR.
    set UINTR_NOTIFACTION_VECTOR in MSR_IA32_UINTR_MISC
    
So from that point on the User-Interrupt Notification Identification
mechanism swallows the vector.

Where this stops working is not limited to context switch. The wreckage
comes from XSAVES:

 "After saving the user-interrupt state component, XSAVES clears
  UINV. (UINV is IA32_UINTR_MISC[39:32]; XSAVES does not modify the
  remainder of that MSR.)"

So the problem is _not_ context switch. The problem is XSAVES and that
can be issued even without a context switch.

The obvious question is: What is the value of clearing UINV?

Absolutely none. That notification vector cannot be used for anything
else, so why would the OS be interested to see it ever? This is about
user space interupts, right?

UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
be considered a hardware bug.

Thanks,

         tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector
  2021-09-25 13:30     ` Thomas Gleixner
@ 2021-09-26 12:39       ` Thomas Gleixner
  2021-09-27 19:07         ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-26 12:39 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote:
> On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
> The obvious question is: What is the value of clearing UINV?
>
> Absolutely none. That notification vector cannot be used for anything
> else, so why would the OS be interested to see it ever? This is about
> user space interupts, right?
>
> UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
> by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
> be considered a hardware bug.

After decoding the documentation (sigh) and staring at the implications of
keeping UINV armed, I can see the point vs. the UPID lifetime issue when
a task gets scheduled out and migrated to a different CPU.

Not the most pretty solution, but as there needs to be some invalidation
which needs to be undone on return to user space it probably does not
matter much. 

As the whole thing is tightly coupled to XSAVES/RSTORS we need to
integrate it into that machinery and not pretend that it's something
half independent.

That means we have to handle the setting of the SN bit in UPID whenever
XSTATE is saved either during context switch, when the kernel uses the
FPU or in other places (signals, fpu_clone ...). They all end up in
save_fpregs_to_fpstate() so that might be the place to look at.

While talking about that: fpu_clone() has to invalidate the UINTR state
in the clone's xstate after the memcpy() or xsaves() operation.

Also the restore portion on the way back to user space has to be coupled
more tightly:

arch_exit_to_user_mode_prepare()
{
        ...
        if (unlikely(ti_work & _TIF_UPID))
        	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
        if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
        	switch_fpu_return();
}

upid_set_ndst(upid)
{
	apicid = __this_cpu_read(x86_cpu_to_apicid);

        if (x2apic_enabled())
            upid->ndst.x2apic = apicid;
        else
            upid->ndst.apic = apicid;
}

uintr_restore_upid(bool xrstors_pending)
{
        clear_thread_flag(TIF_UPID);
        
	// Update destination
        upid_set_ndst(upid);

        // Do we need something stronger here?
        barrier();

        clear_bit(SN, upid->status);

        // Any SENDUIPI after this point sends to this CPU
           
        // Any bit which was set in upid->pir after SN was set
        // and/or UINV was cleared by XSAVES up to the point
        // where SN was cleared above is not reflected in UIRR.

	// As this runs with interrupts disabled the current state
        // of upid->pir can be read and used for restore. A SENDUIPI
        // which sets a bit in upid->pir after that read will send
        // the notification vector which is going to be handled once
        // the task reenables interrupts on return to user space.
        // If the SENDUIPI set the bit before the read then the
        // notification vector handling will just observe the same
        // PIR state.

        // Needs to be a locked access as there might be a
        // concurrent SENDUIPI modiying it.
        pir = read_locked(upid->pir);

        if (xrstors_pending)) {
        	// Update the saved xstate for xrstors
           	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;
                current->xstate.uintr.uirr = pir;
        } else {
                // Manually restore UIRR and UINV
                wrmsrl(IA32_UINTR_RR, pir);

	        misc.val64 = 0;
                misc.uittsz = current->uintr->uittsz;
                misc.uinv = UINTR_NOTIFICATION_VECTOR;
                wrmsrl(IA32_UINTR_MISC, misc.val64);
        }
}

That's how I deciphered the documentation and I don't think this is far
from reality, but I might be wrong as usual.

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-13 20:01 ` [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall Sohil Mehta
  2021-09-24 11:04   ` Thomas Gleixner
@ 2021-09-26 14:41   ` Thomas Gleixner
  2021-09-29  1:09     ` Sohil Mehta
  2021-09-29  3:30   ` Andy Lutomirski
  2 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-26 14:41 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Sohil Mehta, Tony Luck, Dave Hansen, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> Add a new system call to allow applications to block in the kernel and
> wait for user interrupts.
>
> <The current implementation doesn't support waking up from other
> blocking system calls like sleep(), read(), epoll(), etc.
>
> uintr_wait() is a placeholder syscall while we decide on that
> behaviour.>

Which behaviour? You cannot integrate this into [clock_]nanosleep() by
any means or wakeup something which is blocked in read(somefd) via
SENDUIPI.

What you can do is implement read() and poll() support for the
uintrfd. Anything else is just not going to fly.

Adding support for read/poll is pretty much a straight forward variant
of a correctly implemented wait()/wakeup() mechanism.

While poll()/read() support might be useful and poll() also provides a
timeout, having an explicit (timed) wait mechanism might be interesting.

But that brings me to an interesting question. There are two cases:

 1) The task installed a user space interrupt handler. Now it
    want's to play nice and yield the CPU while waiting.

    So it needs to reinstall the UINV vector on return to user and
    update UIRR, but that'd be covered by the existing mechanism. Fine.

 2) Task has no user space interrupt handler installed and just want's
    to use that wait mechanism.

    What is consuming the pending bit(s)? 

    If that's not a valid use case, then the wait has to check for that
    and reject the syscall with EINVAL.

    If it is valid, then how are the pending bits consumed and relayed to
    user space?

The same questions arise when you think about implementing poll/read
support simply because the regular poll/read semantics are:

  poll waits for the event and read consumes the event

which would be similar to #2 above, but with an installed user space
interrupt handler the return from the poll system call would consume the
event immediately (assumed that UIF is set).

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector
  2021-09-26 12:39       ` Thomas Gleixner
@ 2021-09-27 19:07         ` Sohil Mehta
  2021-09-28  8:11           ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-27 19:07 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/26/2021 5:39 AM, Thomas Gleixner wrote:
> On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote:
>> On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
>> The obvious question is: What is the value of clearing UINV?
>>
>> Absolutely none. That notification vector cannot be used for anything
>> else, so why would the OS be interested to see it ever? This is about
>> user space interupts, right?
>>
>> UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
>> by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
>> be considered a hardware bug.
> After decoding the documentation (sigh) and staring at the implications of
> keeping UINV armed, I can see the point vs. the UPID lifetime issue when
> a task gets scheduled out and migrated to a different CPU.


I think you got it right. Here is my understanding of this.

The User-interrupt notification processing moves all the pending 
interrupts from UPID.PIR to the UIRR.

As you mentioned below, XSTATE is saved due to several reasons which 
saves the UIRR into memory. UIRR should no longer be updated after it 
has been saved.

XSAVES clears UINV is to stop detecting additional interrupts in the 
UIRR after it has been saved.


> Not the most pretty solution, but as there needs to be some invalidation
> which needs to be undone on return to user space it probably does not
> matter much.
>
> As the whole thing is tightly coupled to XSAVES/RSTORS we need to
> integrate it into that machinery and not pretend that it's something
> half independent.


I agree. Thank you for pointing this out.

> That means we have to handle the setting of the SN bit in UPID whenever
> XSTATE is saved either during context switch, when the kernel uses the
> FPU or in other places (signals, fpu_clone ...). They all end up in
> save_fpregs_to_fpstate() so that might be the place to look at.

  Yes. The current code doesn't do this. SN bit should be set whenever 
UINTR XSTATE is saved.

> While talking about that: fpu_clone() has to invalidate the UINTR state
> in the clone's xstate after the memcpy() or xsaves() operation.
>
> Also the restore portion on the way back to user space has to be coupled
> more tightly:
>
> arch_exit_to_user_mode_prepare()
> {
>          ...
>          if (unlikely(ti_work & _TIF_UPID))
>          	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
>          if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
>          	switch_fpu_return();
> }

I am assuming _TIF_UPID would be set everytime SN is set and XSTATE is 
saved.

> upid_set_ndst(upid)
> {
> 	apicid = __this_cpu_read(x86_cpu_to_apicid);
>
>          if (x2apic_enabled())
>              upid->ndst.x2apic = apicid;
>          else
>              upid->ndst.apic = apicid;
> }
>
> uintr_restore_upid(bool xrstors_pending)
> {
>          clear_thread_flag(TIF_UPID);
>          
> 	// Update destination
>          upid_set_ndst(upid);
>
>          // Do we need something stronger here?
>          barrier();
>
>          clear_bit(SN, upid->status);
>
>          // Any SENDUIPI after this point sends to this CPU
>             
>          // Any bit which was set in upid->pir after SN was set
>          // and/or UINV was cleared by XSAVES up to the point
>          // where SN was cleared above is not reflected in UIRR.
>
> 	// As this runs with interrupts disabled the current state
>          // of upid->pir can be read and used for restore. A SENDUIPI
>          // which sets a bit in upid->pir after that read will send
>          // the notification vector which is going to be handled once
>          // the task reenables interrupts on return to user space.
>          // If the SENDUIPI set the bit before the read then the
>          // notification vector handling will just observe the same
>          // PIR state.
>
>          // Needs to be a locked access as there might be a
>          // concurrent SENDUIPI modiying it.
>          pir = read_locked(upid->pir);
>
>          if (xrstors_pending)) {
>          	// Update the saved xstate for xrstors
>             	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;

XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we 
need this again. Is it because it could have been overwritten by calling 
XSAVES twice?


>                  current->xstate.uintr.uirr = pir;

I believe PIR should be ORed. There could be some bits already set in 
the UIRR.

Also, shouldn't UPID->PIR be cleared? If not, we would detect these 
interrupts all over again during the next ring transition.

>          } else {
>                  // Manually restore UIRR and UINV
>                  wrmsrl(IA32_UINTR_RR, pir);
I believe read-modify-write here as well.
> 	        misc.val64 = 0;
>                  misc.uittsz = current->uintr->uittsz;
>                  misc.uinv = UINTR_NOTIFICATION_VECTOR;
>                  wrmsrl(IA32_UINTR_MISC, misc.val64);

Thanks! This helps reduce the additional MSR read.

>          }
> }
>
> That's how I deciphered the documentation and I don't think this is far
> from reality, but I might be wrong as usual.
>
> Hmm?

Thank you for the simplification. This is very helpful.

Sohil



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector
  2021-09-23 23:07   ` Thomas Gleixner
  2021-09-25 13:30     ` Thomas Gleixner
@ 2021-09-27 19:26     ` Sohil Mehta
  1 sibling, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-27 19:26 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/23/2021 4:07 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> A user interrupt notification vector is used on the receiver's cpu to
>> identify an interrupt as a user interrupt (and not a kernel interrupt).
>> Hardware uses the same notification vector to generate an IPI from a
>> sender's cpu core when the SENDUIPI instruction is executed.
>>
>> Typically, the kernel shouldn't receive an interrupt with this vector.
>> However, it is possible that the kernel might receive this vector.
>>
>> Scenario that can cause the spurious interrupt:
>>
>> Step	cpu 0 (receiver task)		cpu 1 (sender task)
>> ----	---------------------		-------------------
>> 1	task is running
>> 2					executes SENDUIPI
>> 3					IPI sent
>> 4	context switched out
>> 5	IPI delivered
>> 	(kernel interrupt detected)
>>
>> A kernel interrupt can be detected, if a receiver task gets scheduled
>> out after the SENDUIPI-based IPI was sent but before the IPI was
>> delivered.
> What happens if the SENDUIPI is issued when the target task is not on
> the CPU? How is that any different from the above?


This didn't get covered in the other thread. Thought, I would clarify 
this a bit more.

A notification IPI is sent from the CPU that executes SENDUIPI if the 
target task is running (SN is 0).

If the target task is not running SN bit in the UPID is set, which 
prevents any notification interrupts from being generated.

However, it is possible that SN is 0 when SENDUIPI was executed which 
generates the notification IPI. But when the IPI arrives on receiver 
CPU, SN has been set, the task state has been saved and UINV has been 
cleared.

A kernel interrupt is detected in this case. I have a sample that demos 
this. I'll fix the current code and then send out the results.


>> The kernel doesn't need to do anything in this case other than receiving
>> the interrupt and clearing the local APIC. The user interrupt is always
>> stored in the receiver's UPID before the IPI is generated. When the
>> receiver gets scheduled back the interrupt would be delivered based on
>> its UPID.
> So why on earth is that vector reaching the CPU at all?

You covered this in the other thread.

>> +#ifdef CONFIG_X86_USER_INTERRUPTS
>> +	seq_printf(p, "%*s: ", prec, "UIS");
> No point in printing that when user interrupts are not available/enabled
> on the system.
>
Will fix this.

Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support
  2021-09-23 22:24   ` Thomas Gleixner
  2021-09-24 19:59     ` Sohil Mehta
@ 2021-09-27 20:42     ` Sohil Mehta
  1 sibling, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-27 20:42 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/23/2021 3:24 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> SENDUIPI is a special ring-3 instruction that makes a supervisor mode
>> memory access to the UPID and UITT memory. Currently, KPTI needs to be
>> off for User IPIs to work.  Processors that support user interrupts are
>> not affected by Meltdown so the auto mode of KPTI will default to off.
>>
>> Users who want to force enable KPTI will need to wait for a later
>> version of this patch series that is compatible with KPTI. We need to
>> allocate the UPID and UITT structures from a special memory region that
>> has supervisor access but it is mapped into userspace. The plan is to
>> implement a mechanism similar to LDT.
> Seriously?

Are questioning why we should add KPTI support if the hardware is not 
affected by Meltdown?

or

Why use an LDT like mechanism to do this?

I have listed this as one of the opens in the cover letter as well. I am 
not sure if users who force enable PTI would really care about User 
Interrupts.

Any input here would be helpful.

>
>> +	if (!cpu_feature_enabled(X86_FEATURE_UINTR))
>> +		goto disable_uintr;
>> +
>> +	/* checks the current processor's cpuid bits: */
>> +	if (!cpu_has(c, X86_FEATURE_UINTR))
>> +		goto disable_uintr;
>> +
>> +	/*
>> +	 * User Interrupts currently doesn't support PTI. For processors that
>> +	 * support User interrupts PTI in auto mode will default to off.  Need
>> +	 * this check only for users who have force enabled PTI.
>> +	 */
>> +	if (boot_cpu_has(X86_FEATURE_PTI)) {
>> +		pr_info_once("x86: User Interrupts (UINTR) not enabled. Please disable PTI using 'nopti' kernel parameter\n");
> That message does not make sense. The admin has explicitly added 'pti'
> to the kernel command line on a CPU which is not affected. So why would
> he now have to add 'nopti' ?

Yup. I'll fix this and other issues in this patch.

I thought the user should know why UINTR has been disabled. In 
hindsight, this would have been better covered in the sample Readme or 
something similar.


Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state
  2021-09-23 22:34   ` Thomas Gleixner
@ 2021-09-27 22:25     ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-27 22:25 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/23/2021 3:34 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> Enable xstate supervisor support for User Interrupts by default.
> What means enabled by default? It's enabled when available and not
> disabled on the command line.

I'll remove it.

>> The user interrupt state for a task consists of the MSR state and the
>> User Interrupt Flag (UIF) value. XSAVES and XRSTORS handle saving and
>> restoring both of these states.
>>
>> <The supervisor XSTATE code might be reworked based on issues reported
>> in the past. The Uintr context switching code would also need rework and
>> additional testing in that regard.>
> What? Which issues were reported and if they have been reported then how
> is the provided code correct?


I apologize for causing this confusion. This comment was added a few 
months back when the PKRU and FPU code was being reworked. This comment 
is no longer valid.

>> +/*
>> + * State component 14 is supervisor state used for User Interrupts state.
>> + * The size of this state is 48 bytes
>> + */
>> +struct uintr_state {
>> +	u64 handler;
>> +	u64 stack_adjust;
>> +	u32 uitt_size;
>> +	u8  uinv;
>> +	u8  pad1;
>> +	u8  pad2;
>> +	u8  uif_pad3;		/* bit 7 - UIF, bits 6:0 - reserved */
> Please do not use tail comments. Also what kind of name is uif_pad3?
> Bitfields exist for a reason.


An internal version of this used bitfields. I was suggested that use of 
bitfields is not recommended for x86 code.

The name uif_pad3 is really ugly though. I'll change it to a bitfield 
next time.


Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-23 12:26   ` Greg KH
  2021-09-24  0:05     ` Thomas Gleixner
@ 2021-09-27 23:20     ` Sohil Mehta
  2021-09-28  4:39       ` Greg KH
  1 sibling, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-27 23:20 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On 9/23/2021 5:26 AM, Greg KH wrote:
> On Mon, Sep 13, 2021 at 01:01:25PM -0700, Sohil Mehta wrote:
>> +
>> +/* User Posted Interrupt Descriptor (UPID) */
>> +struct uintr_upid {
>> +	struct {
>> +		u8 status;	/* bit 0: ON, bit 1: SN, bit 2-7: reserved */
>> +		u8 reserved1;	/* Reserved */
>> +		u8 nv;		/* Notification vector */
>> +		u8 reserved2;	/* Reserved */
> What are these "reserved" for?


The UPID is an architectural data structure defined by the hardware. The 
reserved fields are defined by the hardware (likely to keep the 
structure size as 16 bytes).


>
>> +struct uintr_upid_ctx {
>> +	struct uintr_upid *upid;
>> +	refcount_t refs;
> Please use a kref for this and do not roll your own for no good reason.

Sure. Will do.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-23 23:52   ` Thomas Gleixner
@ 2021-09-27 23:57     ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-27 23:57 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/23/2021 4:52 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>
> +/* UPID Notification control status */
> +#define UPID_ON		0x0	/* Outstanding notification */
> +#define UPID_SN		0x1	/* Suppressed notification */
> Come on. This are bits in upid.status, right? So why can't the comment
> above these defines says so and why can't the names not reflect that?
I'll fix this.
>> +struct uintr_upid_ctx {
>> +	struct uintr_upid *upid;
>> +	refcount_t refs;
> Please use tabular format for struct members.
Will do.
>> +};
>> +
>> +struct uintr_receiver {
>> +	struct uintr_upid_ctx *upid_ctx;
>> +};
> So we need a struct to wrap a pointer to another struct. Why?

The struct will have more members added later.  Should the wrapper be 
created then?

I didn't want to add members that are not used in this patch.

>> +inline bool uintr_arch_enabled(void)
> What's this arch_enabled indirection for? Is this used anywhere in
> non-architecture code?


I'll remove this indirection.

It is a remnant of some older code that had uintr_fd managed outside of 
the x86 code.

>> +{
>> +	return static_cpu_has(X86_FEATURE_UINTR);
>> +}
>> +
>> +static inline bool is_uintr_receiver(struct task_struct *t)
>> +{
>> +	return !!t->thread.ui_recv;
>> +}
>> +
>> +static inline u32 cpu_to_ndst(int cpu)
>> +{
>> +	u32 apicid = (u32)apic->cpu_present_to_apicid(cpu);
>> +
>> +	WARN_ON_ONCE(apicid == BAD_APICID);
> Brilliant. If x2apic is not enabled then this case returns


I'll fix this.

>> +	if (!x2apic_enabled())
>> +		return (apicid << 8) & 0xFF00;
>    (BAD_APICID << 8) & 0xFF00 == 0xFF ....
>
>> +int do_uintr_unregister_handler(void)
>> +{
>> +	struct task_struct *t = current;
>> +	struct fpu *fpu = &t->thread.fpu;
>> +	struct uintr_receiver *ui_recv;
>> +	u64 msr64;
>> +
>> +	if (!is_uintr_receiver(t))
>> +		return -EINVAL;
>> +
>> +	pr_debug("recv: Unregister handler and clear MSRs for task=%d\n",
>> +		 t->pid);
>> +
>> +	/*
>> +	 * TODO: Evaluate usage of fpregs_lock() and get_xsave_addr(). Bugs
>> +	 * have been reported recently for PASID and WRPKRU.
> Again. Which bugs and why haven't they been evaluated before posting?
I apologize again. This comment is no longer valid.
>> +	 * UPID and ui_recv will be referenced during context switch. Need to
>> +	 * disable preemption while modifying the MSRs, UPID and ui_recv thread
>> +	 * struct.
>> +	 */
>> +	fpregs_lock();
> And because you need to disable preemption you need to use
> fpregs_lock(), right? That's not what fpregs_lock() is about.
>
Got it. I'll evaluate the use of fpregs_lock() at all places.
>> +		wrmsrl(MSR_IA32_UINTR_MISC, msr64);
>> +		wrmsrl(MSR_IA32_UINTR_PD, 0ULL);
>> +		wrmsrl(MSR_IA32_UINTR_RR, 0ULL);
>> +		wrmsrl(MSR_IA32_UINTR_STACKADJUST, 0ULL);
>> +		wrmsrl(MSR_IA32_UINTR_HANDLER, 0ULL);
>> +	} else {
>> +		struct uintr_state *p;
>> +
>> +		p = get_xsave_addr(&fpu->state.xsave, XFEATURE_UINTR);
>> +		if (p) {
>> +			p->handler = 0;
>> +			p->stack_adjust = 0;
>> +			p->upid_addr = 0;
>> +			p->uinv = 0;
>> +			p->uirr = 0;
>> +		}
> So p == NULL is expected here?
I'll fix this and other usages of get_xsave_addr().

Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 07/13] x86/process/64: Add uintr task context switch support
  2021-09-24  0:41   ` Thomas Gleixner
@ 2021-09-28  0:30     ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-28  0:30 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/23/2021 5:41 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>
>> User interrupt state is saved and restored using xstate supervisor
>> feature support. This includes the MSR state and the User Interrupt Flag
>> (UIF) value.
>>
>> During context switch update the UPID for a uintr task to reflect the
>> current state of the task; namely whether the task should receive
>> interrupt notifications and which cpu the task is currently running on.
>>
>> XSAVES clears the notification vector (UINV) in the MISC MSR to prevent
>> interrupts from being recognized in the UIRR MSR while the task is being
>> context switched. The UINV is restored back when the kernel does an
>> XRSTORS.
>>
>> However, this conflicts with the kernel's lazy restore optimization
>> which skips an XRSTORS if the kernel is scheduling the same user task
>> back and the underlying MSR state hasn't been modified. Special handling
>> is needed for a uintr task in the context switch path to keep using this
>> optimization.
> And this special handling is?


By special handling I meant programming the MSR when XRSTORS doesn't 
happen on return to userspace. The pseudo code you provided in patch 5 
comments handles this well.


>> + * cleared.
>>    */
>>   void save_fpregs_to_fpstate(struct fpu *fpu)
>>   {
>> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
>> index ec0d836a13b1..62b82137db9c 100644
>> --- a/arch/x86/kernel/process_64.c
>> +++ b/arch/x86/kernel/process_64.c
>> @@ -53,6 +53,7 @@
>>   #include <asm/xen/hypervisor.h>
>>   #include <asm/vdso.h>
>>   #include <asm/resctrl.h>
>> +#include <asm/uintr.h>
>>   #include <asm/unistd.h>
>>   #include <asm/fsgsbase.h>
>>   #ifdef CONFIG_IA32_EMULATION
>> @@ -565,6 +566,9 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>>   	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
>>   		     this_cpu_read(hardirq_stack_inuse));
>>   
>> +	if (static_cpu_has(X86_FEATURE_UINTR))
> cpu_feature_enabled() please.


I'll fix this and the other issues that you mentioned.

>> +		switch_uintr_prepare(prev_p);
>> +
>>   	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
>>   		switch_fpu_prepare(prev_fpu, cpu);
>>   
>> diff --git a/arch/x86/kernel/uintr_core.c b/arch/x86/kernel/uintr_core.c
>> index 2c6042a6840a..7a29888050ad 100644
>> --- a/arch/x86/kernel/uintr_core.c
>> +++ b/arch/x86/kernel/uintr_core.c
>> @@ -238,3 +238,78 @@ int do_uintr_register_handler(u64 handler)
>>   
>>   	return 0;
>>   }
>> +
>> +/* Suppress notifications since this task is being context switched out */
>> +void switch_uintr_prepare(struct task_struct *prev)
>> +{
>> +	struct uintr_upid *upid;
>> +
>> +	if (is_uintr_receiver(prev)) {
>> +		upid = prev->thread.ui_recv->upid_ctx->upid;
>> +		set_bit(UPID_SN, (unsigned long *)&upid->nc.status);
> Please add a comment why this needs to be a locked instruction.
>
>
Ok, will do.  The SN bit could be read concurrently on another CPU 
executing SENDUIPI.


> Of course this is invoked unconditionally when the CPU has
> X86_FEATURE_UINTR:
>
>> +	if (static_cpu_has(X86_FEATURE_UINTR))
>> +		switch_uintr_return();
> Why?
>
> If the sequence is:
>
>       syscall()
>       do_stuff()
>       return_to_user()
>
> then what on earth has modified that MSR state? Nothing at all, but you
> still run this code. What for?
>
>
The pseudo code in patch 5 covers this. I'll fix the code based on that.

Thanks,

Sohil



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 08/13] x86/process/64: Clean up uintr task fork and exit paths
  2021-09-24  1:02   ` Thomas Gleixner
@ 2021-09-28  1:23     ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-28  1:23 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/23/2021 6:02 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>
>> The user interrupt MSRs and the user interrupt state is task specific.
>> During task fork and exit clear the task state, clear the MSRs and
>> dereference the shared resources.
>>
>> Some of the memory resources like the UPID are referenced in the file
>> descriptor and could be in use while the uintr_fd is still valid.
>> Instead of freeing up  the UPID just dereference it.
> Derefencing the UPID, i.e. accessing task->upid->foo helps in which way?
>
> You want to drop the reference count I assume. Then please write that
> so.


Ah! Not sure how I associated dereference to dropping the reference. 
Will update this.

>
>> @@ -260,6 +260,7 @@ int fpu_clone(struct task_struct *dst)
>>   {
>>   	struct fpu *src_fpu = &current->thread.fpu;
>>   	struct fpu *dst_fpu = &dst->thread.fpu;
>> +	struct uintr_state *uintr_state;
>>   
>>   	/* The new task's FPU state cannot be valid in the hardware. */
>>   	dst_fpu->last_cpu = -1;
>> @@ -284,6 +285,14 @@ int fpu_clone(struct task_struct *dst)
>>   
>>   	else
>>   		save_fpregs_to_fpstate(dst_fpu);
>> +
>> +	/* UINTR state is not expected to be inherited (in the current design). */
>> +	if (static_cpu_has(X86_FEATURE_UINTR)) {
>> +		uintr_state = get_xsave_addr(&dst_fpu->state.xsave, XFEATURE_UINTR);
>> +		if (uintr_state)
>> +			memset(uintr_state, 0, sizeof(*uintr_state));
>> +	}
> 1) If the FPU registers are up to date then this can be completely
>     avoided by excluding the UINTR component from XSAVES

You mentioned this in the other thread that the UINTR state must be 
invalidated during fpu_clone().

I am not sure if understand all the nuances here. Your suggestion seems 
valid to me. I'll have to think more about this.

> 2) If the task never used that muck then UINTR is in init state and
>     clearing that memory is a redunant exercise because it has been
>     cleared already

Yes. I'll add a check for that.

>> + * exit_thread() can happen in current context when the current thread is
>> + * exiting or it can happen for a new thread that is being created.
> A right that makes sense. If a new thread is created then it can call
> exit_thread(), right?


What I meant here is that exit_thread() can also be called during 
copy_process() if it runs into an issue.

bad_fork_cleanup_thread:

     exit_thread();

In this case is_uintr_receiver() will fail. I'll update the comments to 
reflect that.

>> + * For new threads is_uintr_receiver() should fail.
> Should fail?

Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-27 23:20     ` Sohil Mehta
@ 2021-09-28  4:39       ` Greg KH
  2021-09-28 16:47         ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Greg KH @ 2021-09-28  4:39 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Mon, Sep 27, 2021 at 04:20:25PM -0700, Sohil Mehta wrote:
> On 9/23/2021 5:26 AM, Greg KH wrote:
> > On Mon, Sep 13, 2021 at 01:01:25PM -0700, Sohil Mehta wrote:
> > > +
> > > +/* User Posted Interrupt Descriptor (UPID) */
> > > +struct uintr_upid {
> > > +	struct {
> > > +		u8 status;	/* bit 0: ON, bit 1: SN, bit 2-7: reserved */
> > > +		u8 reserved1;	/* Reserved */
> > > +		u8 nv;		/* Notification vector */
> > > +		u8 reserved2;	/* Reserved */
> > What are these "reserved" for?
> 
> 
> The UPID is an architectural data structure defined by the hardware. The
> reserved fields are defined by the hardware (likely to keep the structure
> size as 16 bytes).

Then those values must be set to 0, right?  I think I missed the part of
the code that set them, hopefully it's somewhere...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector
  2021-09-27 19:07         ` Sohil Mehta
@ 2021-09-28  8:11           ` Thomas Gleixner
  0 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-28  8:11 UTC (permalink / raw)
  To: Sohil Mehta, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

Sohil,

On Mon, Sep 27 2021 at 12:07, Sohil Mehta wrote:
> On 9/26/2021 5:39 AM, Thomas Gleixner wrote:
>
> The User-interrupt notification processing moves all the pending 
> interrupts from UPID.PIR to the UIRR.

Indeed that makes sense. Should have thought about that myself.

>> Also the restore portion on the way back to user space has to be coupled
>> more tightly:
>>
>> arch_exit_to_user_mode_prepare()
>> {
>>          ...
>>          if (unlikely(ti_work & _TIF_UPID))
>>          	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
>>          if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
>>          	switch_fpu_return();
>> }
>
> I am assuming _TIF_UPID would be set everytime SN is set and XSTATE is 
> saved.

Yes.

>> upid_set_ndst(upid)
>> {
>> 	apicid = __this_cpu_read(x86_cpu_to_apicid);
>>
>>          if (x2apic_enabled())
>>              upid->ndst.x2apic = apicid;
>>          else
>>              upid->ndst.apic = apicid;
>> }
>>
>> uintr_restore_upid(bool xrstors_pending)
>> {
>>          clear_thread_flag(TIF_UPID);
>>          
>> 	// Update destination
>>          upid_set_ndst(upid);
>>
>>          // Do we need something stronger here?
>>          barrier();
>>
>>          clear_bit(SN, upid->status);
>>
>>          // Any SENDUIPI after this point sends to this CPU
>>             
>>          // Any bit which was set in upid->pir after SN was set
>>          // and/or UINV was cleared by XSAVES up to the point
>>          // where SN was cleared above is not reflected in UIRR.
>>
>> 	// As this runs with interrupts disabled the current state
>>          // of upid->pir can be read and used for restore. A SENDUIPI
>>          // which sets a bit in upid->pir after that read will send
>>          // the notification vector which is going to be handled once
>>          // the task reenables interrupts on return to user space.
>>          // If the SENDUIPI set the bit before the read then the
>>          // notification vector handling will just observe the same
>>          // PIR state.
>>
>>          // Needs to be a locked access as there might be a
>>          // concurrent SENDUIPI modiying it.
>>          pir = read_locked(upid->pir);
>>
>>          if (xrstors_pending)) {
>>          	// Update the saved xstate for xrstors
>>             	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;
>
> XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we 
> need this again. Is it because it could have been overwritten by calling 
> XSAVES twice?

Yes that can happen AFAICT. I haven't done a deep analysis, but this
needs to looked at.

>>                  current->xstate.uintr.uirr = pir;
>
> I believe PIR should be ORed. There could be some bits already set in 
> the UIRR.
>
> Also, shouldn't UPID->PIR be cleared? If not, we would detect these 
> interrupts all over again during the next ring transition.

Right. So that PIR read above needs to be a locked cmpxchg().

>>          } else {
>>                  // Manually restore UIRR and UINV
>>                  wrmsrl(IA32_UINTR_RR, pir);
> I believe read-modify-write here as well.

Sigh, yes.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls
  2021-09-28  4:39       ` Greg KH
@ 2021-09-28 16:47         ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-28 16:47 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On 9/27/2021 9:39 PM, Greg KH wrote:
>
> Then those values must be set to 0, right?  I think I missed the part of
> the code that set them, hopefully it's somewhere...

Yes. The kzalloc() as part of alloc_upid() does it.

Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls
  2021-09-23 12:28   ` Greg KH
@ 2021-09-28 18:01     ` Sohil Mehta
  2021-09-29  7:04       ` Greg KH
  0 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-28 18:01 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On 9/23/2021 5:28 AM, Greg KH wrote:
> On Mon, Sep 13, 2021 at 01:01:29PM -0700, Sohil Mehta wrote:
>> +/* User Interrupt Target Table Entry (UITTE) */
>> +struct uintr_uitt_entry {
>> +	u8	valid;			/* bit 0: valid, bit 1-7: reserved */
> Do you check that the other bits are set to 0?

I don't have a check but kzalloc() in alloc_uitt() should set it to 0.

>> +	u8	user_vec;
>> +	u8	reserved[6];
> What is this reserved for?

This is hardware defined structure as well. I should probably mention 
this it in the comment above.

>> +	u64	target_upid_addr;
> If this is a pointer, why not say it is a pointer?

I used a u64 to get the size and alignment of this structure as required 
by the hardware. I wasn't sure if using a struct upid * would complicate 
that.

Also this field is never used as a pointer by the kernel. It is only 
used to program an entry that is read by the hardware.

Is this reasonable or would you still prefer a pointer?


>> +} __packed __aligned(16);
>> +
>> +struct uintr_uitt_ctx {
>> +	struct uintr_uitt_entry *uitt;
>> +	/* Protect UITT */
>> +	spinlock_t uitt_lock;
>> +	refcount_t refs;
> Again, a kref please.

Will do.

Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 09/13] x86/uintr: Introduce vector registration and uintr_fd syscall
  2021-09-24 10:33   ` Thomas Gleixner
@ 2021-09-28 20:40     ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-28 20:40 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/24/2021 3:33 AM, Thomas Gleixner wrote:
> If this ever comes back in some form, then I pretty please want the life
> time rules of this documented properly.

I'll document the life time rules for the UPID, vector and UITT next 
time. I realize now that they are quite convoluted in the current 
implementation.

I'll also fix the concurrency and serialization issues that you 
mentioned in this patch and the next one.


>
>> +	ret = task_work_add(r_info->upid_ctx->task, &r_info->twork, true);
> Care to look at the type of the third argument of task_work_add()?

Ah! I didn't realize the third argument changed a long time back.


>
>> +/*
>> + * sys_uintr_create_fd - Create a uintr_fd for the registered interrupt vector.
> So this creates a file descriptor for a vector which is already
> allocated and then it calls do_uintr_register_vector() which allocates
> the vector?

The syscall comment is misleading. Will fix it.

Vector allocation happens in userspace. The application is only expected 
to register the vector.

This syscall only creates an FD abstraction for the vector that is 
*being* registered.

>> + */
>> +SYSCALL_DEFINE2(uintr_create_fd, u64, vector, unsigned int, flags)
>> +{

Thanks,

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-24 11:04   ` Thomas Gleixner
  2021-09-25 12:08     ` Thomas Gleixner
@ 2021-09-28 23:08     ` Sohil Mehta
  1 sibling, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-28 23:08 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/24/2021 4:04 AM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> Currently, the task wait list is global one. To make the implementation
>> scalable there is a need to move to a distributed per-cpu wait list.
> How are per cpu wait lists going to solve the problem?


Currently, the global wait list can be concurrently accessed by multiple 
cpus. If we have per-cpu wait lists then the UPID scanning only needs to 
happen on the local cpu's wait list.

After an application calls uintr_wait(), the notification interrupt will 
be delivered only to the cpu where the task blocked. In this case, we 
can reduce the UPID search list and probably get rid of the global 
spinlock as well.

Though, I am not sure how much impact this would have vs. the problem of 
scanning the entire wait list.

>> +
>> +/*
>> + * Handler for UINTR_KERNEL_VECTOR.
>> + */
>> +DEFINE_IDTENTRY_SYSVEC(sysvec_uintr_kernel_notification)
>> +{
>> +	/* TODO: Add entry-exit tracepoints */
>> +	ack_APIC_irq();
>> +	inc_irq_stat(uintr_kernel_notifications);
>> +
>> +	uintr_wake_up_process();
> So this interrupt happens for any of those notifications. How are they
> differentiated?


Unfortunately, there is no help from the hardware here to identify the 
intended target.

When a task blocks we:
* switch the UINV to a kernel NV.
* leave SN as 0
* leave UPID.NDST to the current cpu
* add the task to a wait list

When the notification interrupt arrives:
* Scan the entire wait list to check if the ON bit is set for any UPID 
(very inefficient)
* Set SN to 1 for that task.
* Change the UINV to user NV.
* Remove the task from the list and make it runnable.

We could end up detecting multiple tasks that have the ON bit set. The 
notification interrupt for any task that has ON set is expected to 
arrive soon anyway. So no harm done here.

The main issue here is we would end up scanning the entire list for 
every interrupt. Not sure if there any way we could optimize this?


> Again. We have proper wait primitives.

I'll use proper wait primitives next time.
>> +	return -EINTR;
>> +}
>> +
>> +/*
>> + * Runs in interrupt context.
>> + * Scan through all UPIDs to check if any interrupt is on going.
>> + */
>> +void uintr_wake_up_process(void)
>> +{
>> +	struct uintr_upid_ctx *upid_ctx, *tmp;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&uintr_wait_lock, flags);
>> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
>> +		if (test_bit(UPID_ON, (unsigned long*)&upid_ctx->upid->nc.status)) {
>> +			set_bit(UPID_SN, (unsigned long *)&upid_ctx->upid->nc.status);
>> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
>> +			upid_ctx->waiting = false;
>> +			wake_up_process(upid_ctx->task);
>> +			list_del(&upid_ctx->node);
> So any of these notification interrupts does a global mass wake up? How
> does that make sense?


The wake up happens only for the tasks that have a pending interrupt. 
They are going to be woken up soon anyways.

>> +/* Called when task is unregistering/exiting */
>> +static void uintr_remove_task_wait(struct task_struct *task)
>> +{
>> +	struct uintr_upid_ctx *upid_ctx, *tmp;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&uintr_wait_lock, flags);
>> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
>> +		if (upid_ctx->task == task) {
>> +			pr_debug("wait: Removing task %d from wait\n",
>> +				 upid_ctx->task->pid);
>> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
>> +			upid_ctx->waiting = false;
>> +			list_del(&upid_ctx->node);
>> +		}
> What? You have to do a global list walk to find the entry which you
> added yourself?

Duh! I could have gotten the upid_ctx from the task_struct itself. Will 
fix this.

Thanks,

Sohil



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-25 12:08     ` Thomas Gleixner
@ 2021-09-28 23:13       ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-28 23:13 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/25/2021 5:08 AM, Thomas Gleixner wrote:
> Aisde of that this is completely broken vs. CPU hotplug.
>
Thank you for pointing this out. I hadn't even considered CPU hotplug.

Thanks,
Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-26 14:41   ` Thomas Gleixner
@ 2021-09-29  1:09     ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-29  1:09 UTC (permalink / raw)
  To: Thomas Gleixner, x86
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Andy Lutomirski, Jens Axboe, Christian Brauner,
	Peter Zijlstra, Shuah Khan, Arnd Bergmann, Jonathan Corbet,
	Ashok Raj, Jacob Pan, Gayatri Kammela, Zeng Guang, Dan Williams,
	Randy E Witt, Ravi V Shankar, Ramesh Thomas, linux-api,
	linux-arch, linux-kernel, linux-kselftest

On 9/26/2021 7:41 AM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> Add a new system call to allow applications to block in the kernel and
>> wait for user interrupts.
>>
>> <The current implementation doesn't support waking up from other
>> blocking system calls like sleep(), read(), epoll(), etc.
>>
>> uintr_wait() is a placeholder syscall while we decide on that
>> behaviour.>
> Which behaviour? You cannot integrate this into [clock_]nanosleep() by
> any means or wakeup something which is blocked in read(somefd) via
> SENDUIPI.

That is the (wishful) desire.

The idea is to have a behavior similar to signals for all or a subset of 
system calls. i.e. return an EINTR by interrupting the blocked syscall 
and possibly have a SA_RESTART type of mechanism.

Can we use the existing signal infrastructure to generate a temporary 
in-kernel signal upon detection of an pending user interrupt? The 
temporary signal doesn't need to be delivered to application but it 
would just be a mechanism to interrupt the blocked syscall.

I don't know anything about the signaling subsystem nor have I tried 
prototyping this. So, all this might be completely baseless.


> What you can do is implement read() and poll() support for the
> uintrfd. Anything else is just not going to fly.
>
> Adding support for read/poll is pretty much a straight forward variant
> of a correctly implemented wait()/wakeup() mechanism.

I tried doing this but I ran into a couple of issues.

1) uintrfd is mapped to a single vector (out of 64). But there is no 
easy hardware mechanism to wait for specific vectors. Waiting for one 
vector might mean waiting for all.

2) The scope of uintrfd is process wide. Also, it would be shared with 
senders. But the wait/wake mechanism is specific to the task that 
created the fd and has a UPID allocated.
As you mentioned below, relaying the pending interrupt information of 
another task would be very tricky.


> While poll()/read() support might be useful and poll() also provides a
> timeout, having an explicit (timed) wait mechanism might be interesting.

I prototyped uintr_wait() with the same intention to have an explicit 
timed yield mechanism. There is very little ambiguity about who is 
waiting for what and how we would deliver the interrupts.


> But that brings me to an interesting question. There are two cases:
>
>   1) The task installed a user space interrupt handler. Now it
>      want's to play nice and yield the CPU while waiting.
>
>      So it needs to reinstall the UINV vector on return to user and
>      update UIRR, but that'd be covered by the existing mechanism. Fine.
>
>   2) Task has no user space interrupt handler installed and just want's
>      to use that wait mechanism.
>
>      What is consuming the pending bit(s)?
>
>      If that's not a valid use case, then the wait has to check for that
>      and reject the syscall with EINVAL.

Yeah. I feel this is not a valid use case. But I am no application 
developer. I will try to seek more opinions here.


>      If it is valid, then how are the pending bits consumed and relayed to
>      user space?

This is very tricky. Because a task that owns the UPID might be 
consuming interrupts while the kernel tries to relay the pending 
interrupt information to another task.


> The same questions arise when you think about implementing poll/read
> support simply because the regular poll/read semantics are:
>
>    poll waits for the event and read consumes the event
> which would be similar to #2 above, but with an installed user space
> interrupt handler the return from the poll system call would consume the
> event immediately (assumed that UIF is set).
>

Yup. There is no read data associated with uintrfd. This might be 
confusing for the application.

Overall, I feel signal handler semantics fit better with User interrupts 
handlers. But as you mentioned there might be no easy way to achieve that.

Thanks again for providing your input on this.

Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-13 20:01 ` [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall Sohil Mehta
  2021-09-24 11:04   ` Thomas Gleixner
  2021-09-26 14:41   ` Thomas Gleixner
@ 2021-09-29  3:30   ` Andy Lutomirski
  2021-09-29  4:56     ` Sohil Mehta
  2 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2021-09-29  3:30 UTC (permalink / raw)
  To: Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
> Add a new system call to allow applications to block in the kernel and
> wait for user interrupts.
>

...

>
> When the application makes this syscall the notification vector is
> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
> interrupt which is then used to wake up the process.

Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt.  Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic.  Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.

Am I missing something?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (14 preceding siblings ...)
  2021-09-23 14:39 ` Jens Axboe
@ 2021-09-29  4:31 ` Andy Lutomirski
  2021-09-30 16:30   ` Stefan Hajnoczi
  2021-09-30 16:26 ` Stefan Hajnoczi
  2021-10-01  8:19 ` Pavel Machek
  17 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2021-09-29  4:31 UTC (permalink / raw)
  To: Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
> User Interrupts Introduction
> ============================
>
> User Interrupts (Uintr) is a hardware technology that enables delivering
> interrupts directly to user space.
>
> Today, virtually all communication across privilege boundaries happens by going
> through the kernel. These include signals, pipes, remote procedure calls and
> hardware interrupt based notifications. User interrupts provide the foundation
> for more efficient (low latency and low CPU utilization) versions of these
> common operations by avoiding transitions through the kernel.
>

...

I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:

Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.

(I can imagine some benefit to a hypothetical improved SENDUIPI with idential user semantics but that supported a proper interaction with the scheduler and blocking syscalls.  But that's not what's documented in the ISE...)

--Andy


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-29  3:30   ` Andy Lutomirski
@ 2021-09-29  4:56     ` Sohil Mehta
  2021-09-30 18:08       ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Sohil Mehta @ 2021-09-29  4:56 UTC (permalink / raw)
  To: Andy Lutomirski, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On 9/28/2021 8:30 PM, Andy Lutomirski wrote:
> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
>> Add a new system call to allow applications to block in the kernel and
>> wait for user interrupts.
>>
> ...
>
>> When the application makes this syscall the notification vector is
>> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
>> interrupt which is then used to wake up the process.
> Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt.  Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic.  Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.
>
> Am I missing something?


The current kernel implementation reserves 2 notification vectors (NV) 
for the 2 states of a thread (running vs blocked).

NV-1 – used only for tasks that are running. (results in a user 
interrupt or a spurious kernel interrupt)

NV-2 – used only for a tasks that are blocked in the kernel. (always 
results in a kernel interrupt)

The UPID.UINV bits are switched between NV-1 and NV-2 based on the state 
of the task.

However, NV-1 is also programmed in the running task's MISC_MSR UINV 
bits. This is what tells the ucode that the notification vector received 
is for the user instead of the kernel.

NV-2 is never programmed in the MISC_MSR of a task. When NV-2 arrives on 
any cpu there is never a possibility of it being detected as a User 
Interrupt. It will always be delivered to the kernel.

Does this help clarify the above?


I just realized, we need to be careful when the notification vectors are 
switched in the UPID. Any pending vectors detected after the switch 
should abort the blocking call. The current code is wrong in a lot of 
places where it touches the UPID.

Thanks,
Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls
  2021-09-28 18:01     ` Sohil Mehta
@ 2021-09-29  7:04       ` Greg KH
  2021-09-29 14:27         ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Greg KH @ 2021-09-29  7:04 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On Tue, Sep 28, 2021 at 11:01:54AM -0700, Sohil Mehta wrote:
> On 9/23/2021 5:28 AM, Greg KH wrote:
> > On Mon, Sep 13, 2021 at 01:01:29PM -0700, Sohil Mehta wrote:
> > > +/* User Interrupt Target Table Entry (UITTE) */
> > > +struct uintr_uitt_entry {
> > > +	u8	valid;			/* bit 0: valid, bit 1-7: reserved */
> > Do you check that the other bits are set to 0?
> 
> I don't have a check but kzalloc() in alloc_uitt() should set it to 0.
> 
> > > +	u8	user_vec;
> > > +	u8	reserved[6];
> > What is this reserved for?
> 
> This is hardware defined structure as well. I should probably mention this
> it in the comment above.
> 
> > > +	u64	target_upid_addr;
> > If this is a pointer, why not say it is a pointer?
> 
> I used a u64 to get the size and alignment of this structure as required by
> the hardware. I wasn't sure if using a struct upid * would complicate that.
> 
> Also this field is never used as a pointer by the kernel. It is only used to
> program an entry that is read by the hardware.
> 
> Is this reasonable or would you still prefer a pointer?

Ok, just document it really well that this is NOT a real address used by
the kernel.  As it is, that's not obvious at all.

And if this crosses the user/kernel boundry, it needs to be __u64 right?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls
  2021-09-29  7:04       ` Greg KH
@ 2021-09-29 14:27         ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-29 14:27 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On 9/29/2021 12:04 AM, Greg KH wrote:
> On Tue, Sep 28, 2021 at 11:01:54AM -0700, Sohil Mehta wrote:
>>
>> Is this reasonable or would you still prefer a pointer?
> Ok, just document it really well that this is NOT a real address used by
> the kernel.  As it is, that's not obvious at all.


Thanks. I'll do that.

>
> And if this crosses the user/kernel boundry, it needs to be __u64 right?


This one doesn't cross the user/kernel boundary. The kernel programs a 
value in this struct for the hardware to consume.

But there might be other places where I have missed that. I'll fix those.

Thanks,
Sohil




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (15 preceding siblings ...)
  2021-09-29  4:31 ` Andy Lutomirski
@ 2021-09-30 16:26 ` Stefan Hajnoczi
  2021-10-01  0:40   ` Sohil Mehta
  2021-10-01  8:19 ` Pavel Machek
  17 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-09-30 16:26 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 4771 bytes --]

On Mon, Sep 13, 2021 at 01:01:19PM -0700, Sohil Mehta wrote:
> User Interrupts Introduction
> ============================
> 
> User Interrupts (Uintr) is a hardware technology that enables delivering
> interrupts directly to user space.
> 
> Today, virtually all communication across privilege boundaries happens by going
> through the kernel. These include signals, pipes, remote procedure calls and
> hardware interrupt based notifications. User interrupts provide the foundation
> for more efficient (low latency and low CPU utilization) versions of these
> common operations by avoiding transitions through the kernel.
> 
> In the User Interrupts hardware architecture, a receiver is always expected to
> be a user space task. However, a user interrupt can be sent by another user
> space task, kernel or an external source (like a device).
> 
> In addition to the general infrastructure to receive user interrupts, this
> series introduces a single source: interrupts from another user task.  These
> are referred to as User IPIs.
> 
> The first implementation of User IPIs will be in the Intel processor code-named
> Sapphire Rapids. Refer Chapter 11 of the Intel Architecture instruction set
> extensions for details of the hardware architecture [1].
> 
> Series-reviewed-by: Tony Luck <tony.luck@intel.com>
> 
> Main goals of this RFC
> ======================
> - Introduce this upcoming technology to the community.
> This cover letter includes a hardware architecture summary along with the
> software architecture and kernel design choices. This post is a bit long as a
> result. Hopefully, it helps answer more questions than it creates :) I am also
> planning to talk about User Interrupts next week at the LPC Kernel summit.
> 
> - Discuss potential use cases.
> We are starting to look at actual usages and libraries (like libevent[2] and
> liburing[3]) that can take advantage of this technology. Unfortunately, we
> don't have much to share on this right now. We need some help from the
> community to identify usages that can benefit from this. We would like to make
> sure the proposed APIs work for the eventual consumers.
> 
> - Get early feedback on the software architecture.
> We are hoping to get some feedback on the direction of overall software
> architecture - starting with User IPI, extending it for kernel-to-user
> interrupt notifications and external interrupts in the future. 
> 
> - Discuss some of the main architecture opens.
> There is lot of work that still needs to happen to enable this technology. We
> are looking for some input on future patches that would be of interest. Here
> are some of the big opens that we are looking to resolve.
> * Should Uintr interrupt all blocking system calls like sleep(), read(),
>   poll(), etc? If so, should we implement an SA_RESTART type of mechanism
>   similar to signals? - Refer Blocking for interrupts section below.
> 
> * Should the User Interrupt Target table (UITT) be shared between threads of a
>   multi-threaded application or maybe even across processes? - Refer Sharing
>   the UITT section below.
> 
> Why care about this? - Micro benchmark performance
> ==================================================
> There is a ~9x or higher performance improvement using User IPI over other IPC
> mechanisms for event signaling.
> 
> Below is the average normalized latency for a 1M ping-pong IPC notifications
> with message size=1.
> 
> +------------+-------------------------+
> | IPC type   |   Relative Latency      |
> |            |(normalized to User IPI) |
> +------------+-------------------------+
> | User IPI   |                     1.0 |
> | Signal     |                    14.8 |
> | Eventfd    |                     9.7 |

Is this the bi-directional eventfd benchmark?
https://github.com/intel/uintr-ipc-bench/blob/linux-rfc-v1/source/eventfd/eventfd-bi.c

Two things stand out:

1. The server and client threads are racing on the same eventfd.
   Eventfds aren't bi-directional! The eventfd_wait() function has code
   to write the value back, which is a waste of CPU cycles and hinders
   progress. I've never seen eventfd used this way in real applications.
   Can you use two separate eventfds?

2. The fd is in blocking mode and the task may be descheduled, so we're
   measuring eventfd read/write latency plus scheduler/context-switch
   latency. A fairer comparison against user interrupts would be to busy
   wait on a non-blocking fd so the scheduler/context-switch latency is
   mostly avoided. After all, the uintrfd-bi.c benchmark does this in
   uintrfd_wait():

     // Keep spinning until the interrupt is received
     while (!uintr_received[token]);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-29  4:31 ` Andy Lutomirski
@ 2021-09-30 16:30   ` Stefan Hajnoczi
  2021-09-30 17:24     ` Sohil Mehta
  0 siblings, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-09-30 16:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Sohil Mehta, the arch/x86 maintainers, Tony Luck, Dave Hansen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jens Axboe, Christian Brauner, Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 1594 bytes --]

On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
> > User Interrupts Introduction
> > ============================
> >
> > User Interrupts (Uintr) is a hardware technology that enables delivering
> > interrupts directly to user space.
> >
> > Today, virtually all communication across privilege boundaries happens by going
> > through the kernel. These include signals, pipes, remote procedure calls and
> > hardware interrupt based notifications. User interrupts provide the foundation
> > for more efficient (low latency and low CPU utilization) versions of these
> > common operations by avoiding transitions through the kernel.
> >
> 
> ...
> 
> I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:
> 
> Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.

I was wondering the same thing. One thing came to mind:

An application that wants to be *interrupted* from what it's doing
rather than waiting until the next polling point. For example,
applications that are CPU-intensive and have green threads. I can't name
a real application like this though :P.

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-30 16:30   ` Stefan Hajnoczi
@ 2021-09-30 17:24     ` Sohil Mehta
  2021-09-30 17:26       ` Andy Lutomirski
  2021-10-01 16:35       ` Stefan Hajnoczi
  0 siblings, 2 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-09-30 17:24 UTC (permalink / raw)
  To: Stefan Hajnoczi, Andy Lutomirski
  Cc: the arch/x86 maintainers, Tony Luck, Dave Hansen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jens Axboe, Christian Brauner, Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest


On 9/30/2021 9:30 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:
>>
>> I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:
>>
>> Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.
> I was wondering the same thing. One thing came to mind:
>
> An application that wants to be *interrupted* from what it's doing
> rather than waiting until the next polling point. For example,
> applications that are CPU-intensive and have green threads. I can't name
> a real application like this though :P.

Thank you Stefan and Andy for giving this some thought.

We are consolidating the information internally on where and how exactly 
we expect to see benefits with real workloads for the various sources of 
User Interrupts. It will take a few days to get back on this one.


> (I can imagine some benefit to a hypothetical improved SENDUIPI with idential user semantics but that supported a proper interaction with the scheduler and blocking syscalls.  But that's not what's documented in the ISE...)

Andy, can you please provide some more context/details on this? Is this 
regarding the blocking syscalls discussion (in patch 11) or something else?


Thanks,
Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-30 17:24     ` Sohil Mehta
@ 2021-09-30 17:26       ` Andy Lutomirski
  2021-10-01 16:35       ` Stefan Hajnoczi
  1 sibling, 0 replies; 81+ messages in thread
From: Andy Lutomirski @ 2021-09-30 17:26 UTC (permalink / raw)
  To: Sohil Mehta, Stefan Hajnoczi
  Cc: the arch/x86 maintainers, Tony Luck, Dave Hansen,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Jens Axboe, Christian Brauner, Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest



On Thu, Sep 30, 2021, at 10:24 AM, Sohil Mehta wrote:
> On 9/30/2021 9:30 AM, Stefan Hajnoczi wrote:
>> On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:
>>>
>>> I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:
>>>
>>> Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.
>> I was wondering the same thing. One thing came to mind:
>>
>> An application that wants to be *interrupted* from what it's doing
>> rather than waiting until the next polling point. For example,
>> applications that are CPU-intensive and have green threads. I can't name
>> a real application like this though :P.
>
> Thank you Stefan and Andy for giving this some thought.
>
> We are consolidating the information internally on where and how exactly 
> we expect to see benefits with real workloads for the various sources of 
> User Interrupts. It will take a few days to get back on this one.

Thanks!

>
>
>> (I can imagine some benefit to a hypothetical improved SENDUIPI with idential user semantics but that supported a proper interaction with the scheduler and blocking syscalls.  But that's not what's documented in the ISE...)
>
> Andy, can you please provide some more context/details on this? Is this 
> regarding the blocking syscalls discussion (in patch 11) or something else?
>

Yes, and I'll follow up there.  I hereby upgrade my opinion of SENDUIPI wakeups to "probably doable but maybe not in a nice way."

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-29  4:56     ` Sohil Mehta
@ 2021-09-30 18:08       ` Andy Lutomirski
  2021-09-30 19:29         ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2021-09-30 18:08 UTC (permalink / raw)
  To: Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:
> On 9/28/2021 8:30 PM, Andy Lutomirski wrote:
>> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
>>> Add a new system call to allow applications to block in the kernel and
>>> wait for user interrupts.
>>>
>> ...
>>
>>> When the application makes this syscall the notification vector is
>>> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
>>> interrupt which is then used to wake up the process.
>> Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt.  Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic.  Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.
>>
>> Am I missing something?
>
>
> The current kernel implementation reserves 2 notification vectors (NV) 
> for the 2 states of a thread (running vs blocked).
>
> NV-1 – used only for tasks that are running. (results in a user 
> interrupt or a spurious kernel interrupt)
>
> NV-2 – used only for a tasks that are blocked in the kernel. (always 
> results in a kernel interrupt)
>
> The UPID.UINV bits are switched between NV-1 and NV-2 based on the state 
> of the task.

Aha, cute.  So NV-1 is only sent if the target is directly paying attention and, assuming all the atomics are done right, NV-2 will be sent for tasks that are asleep.

Logically, I think these are the possible states for a receiving task:

1. Running.  SENDUIPI will actually deliver the event directly (or not if uintr is masked).  If the task just stopped running and the atomics are right, then the schedule-out code can, I think, notice.

2. Not running, but either runnable or not currently waiting for uintr (e.g. blocked in an unrelated syscall).  This is straightforward -- no IPI or other action is needed other than setting the uintr-pending bit.

3. Blocked and waiting for uintr.  For this to work right, anyone trying to send with SENDUIPI (or maybe a vdso or similar clever wrapper around it) needs to result in either a fault or an IPI so the kernel can process the wakeup.

(Note that, depending on how fancy we get with file descriptors and polling, we need to watch out for the running-and-also-waiting-for-kernel-notification state.  That one will never work right.)

3 is the nasty case, and your patch makes it work with this NV-2 trick.  The trick is a bit gross for a couple reasons.  First, it conveys no useful information to the kernel except that an unknown task did SENDUIPI and maybe that the target was most recently on a given CPU.  So a big list search is needed.  Also, it hits an essentially arbitrary and possibly completely innocent victim CPU and task, and people doing any sort of task isolation workload will strongly dislike this.  For some of those users, "strongly" may mean "treat system as completely failed, fail over to something else and call expensive tech support."  So we can't do that.

I think we have three choices:

Use a fancy wrapper around SENDUIPI.  This is probably a bad idea.

Treat the NV-2 as a real interrupt and honor affinity settings.  This will be annoying and slow, I think, if it's even workable at all.

Handle this case with faults instead of interrupts.  We could set a reserved bit in UPID so that SENDUIPI results in #GP, decode it, and process it.  This puts the onus on the actual task causing trouble, which is nice, and it lets us find the UPID and target directly instead of walking all of them.  I don't know how well it would play with hypothetical future hardware-initiated uintrs, though.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-30 18:08       ` Andy Lutomirski
@ 2021-09-30 19:29         ` Thomas Gleixner
  2021-09-30 22:01           ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-09-30 19:29 UTC (permalink / raw)
  To: Andy Lutomirski, Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Thu, Sep 30 2021 at 11:08, Andy Lutomirski wrote:
> On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:
> I think we have three choices:
>
> Use a fancy wrapper around SENDUIPI.  This is probably a bad idea.
>
> Treat the NV-2 as a real interrupt and honor affinity settings.  This
> will be annoying and slow, I think, if it's even workable at all.

We can make it a real interrupt in form of a per CPU interrupt, but
affinity settings are not really feasible because the affinity is in the
UPID.ndst field. So, yes we can target it to some CPU, but that's racy.

> Handle this case with faults instead of interrupts.  We could set a
> reserved bit in UPID so that SENDUIPI results in #GP, decode it, and
> process it.  This puts the onus on the actual task causing trouble,
> which is nice, and it lets us find the UPID and target directly
> instead of walking all of them.  I don't know how well it would play
> with hypothetical future hardware-initiated uintrs, though.

I thought about that as well and dismissed it due to the hardware
initiated ones but thinking more about it, those need some translation
unit (e.g. irq remapping) anyway, so it might be doable to catch those
as well. So we could just ignore them for now and go for the #GP trick
and deal with the device initiated ones later when they come around :)

But even with that we still need to keep track of the armed ones per CPU
so we can handle CPU hotunplug correctly. Sigh...

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-30 19:29         ` Thomas Gleixner
@ 2021-09-30 22:01           ` Andy Lutomirski
  2021-10-01  0:01             ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2021-09-30 22:01 UTC (permalink / raw)
  To: Thomas Gleixner, Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest



On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 11:08, Andy Lutomirski wrote:
>> On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:
>> I think we have three choices:
>>
>> Use a fancy wrapper around SENDUIPI.  This is probably a bad idea.
>>
>> Treat the NV-2 as a real interrupt and honor affinity settings.  This
>> will be annoying and slow, I think, if it's even workable at all.
>
> We can make it a real interrupt in form of a per CPU interrupt, but
> affinity settings are not really feasible because the affinity is in the
> UPID.ndst field. So, yes we can target it to some CPU, but that's racy.
>
>> Handle this case with faults instead of interrupts.  We could set a
>> reserved bit in UPID so that SENDUIPI results in #GP, decode it, and
>> process it.  This puts the onus on the actual task causing trouble,
>> which is nice, and it lets us find the UPID and target directly
>> instead of walking all of them.  I don't know how well it would play
>> with hypothetical future hardware-initiated uintrs, though.
>
> I thought about that as well and dismissed it due to the hardware
> initiated ones but thinking more about it, those need some translation
> unit (e.g. irq remapping) anyway, so it might be doable to catch those
> as well. So we could just ignore them for now and go for the #GP trick
> and deal with the device initiated ones later when they come around :)

Sounds good to me. In the long run, if Intel wants device initiated fancy interrupts to work well, they need a new design.

>
> But even with that we still need to keep track of the armed ones per CPU
> so we can handle CPU hotunplug correctly. Sigh...

I don’t think any real work is needed. We will only ever have armed UPIDs (with notification interrupts enabled) for running tasks, and hot-unplugged CPUs don’t have running tasks.  We do need a way to drain pending IPIs before we offline a CPU, but that’s a separate problem and may be unsolvable for all I know. Is there a magic APIC operation to wait until all initiated IPIs targeting the local CPU arrive?  I guess we can also just mask the notification vector so that it won’t crash us if we get a stale IPI after going offline.

>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-09-30 22:01           ` Andy Lutomirski
@ 2021-10-01  0:01             ` Thomas Gleixner
  2021-10-01  4:41               ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-10-01  0:01 UTC (permalink / raw)
  To: Andy Lutomirski, Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Thu, Sep 30 2021 at 15:01, Andy Lutomirski wrote:
> On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:
>>
>> But even with that we still need to keep track of the armed ones per CPU
>> so we can handle CPU hotunplug correctly. Sigh...
>
> I don’t think any real work is needed. We will only ever have armed
> UPIDs (with notification interrupts enabled) for running tasks, and
> hot-unplugged CPUs don’t have running tasks.

That's not the problem. The problem is the wait for uintr case where the
task is obviously not running:

CPU 1
     upid = T1->upid;
     upid->vector = UINTR_WAIT_VECTOR;
     upid->ndst = local_apic_id();
     ...
     do {
         ....
         schedule();
     }

CPU 0
    unplug CPU 1

    SENDUPI(index)
        // Hardware does:
        tblentry = &ttable[index];
        upid = tblentry->upid;
        upid->pir |= tblentry->uv;
        send_IPI(upid->vector, upid->ndst);

So SENDUPI will send the IPI to the APIC ID provided by T1->upid.ndst
which points to the offlined CPU 1 and therefore is obviously going to
/dev/null. IOW, lost wakeup...

> We do need a way to drain pending IPIs before we offline a CPU, but
> that’s a separate problem and may be unsolvable for all I know. Is
> there a magic APIC operation to wait until all initiated IPIs
> targeting the local CPU arrive?  I guess we can also just mask the
> notification vector so that it won’t crash us if we get a stale IPI
> after going offline.

All of this is solved already otherwise CPU hot unplug would explode in
your face every time. The software IPI send side is carefully
synchronized vs. hotplug (at least in theory). May I ask you politely to
make yourself familiar with all that before touting "We do need..." based
on random assumptions?

The above SENDUIPI vs. CPU hotplug scenario is the same problem as we
have with regular device interrupts which are targeted at an outgoing
CPU. We have magic mechanisms in place to handle that to the extent
possible, but due to the insanity of X86 interrupt handling mechanics
that still leaves a very tiny hole which might cause a lost and
subsequently stale interrupt. Nothing we can fix in software.

So on CPU offline the hotplug code walks through all device interrupts
and checks whether they are targeted at the outgoing CPU. If so they are
rerouted to an online CPU with lots of care to make the possible race
window as small as it gets. That's nowadays only a problem on systems
where interrupt remapping is not available or disabled via commandline.

For tasks which just have the user interrupt armed there is no problem
because SENDUPI modifies UPID->PIR which is reevaluated when the task
which got migrated to an online CPU is going back to user space.

The uintr_wait() syscall creates the very same problem as we have with
device interrupts. Which means we need to make that wait thing:

     upid = T1->upid;
     upid->vector = UINTR_WAIT_VECTOR;
     upid->ndst = local_apic_id();
     list_add(this_cpu_ptr(pcp_uintrs), upid->pcp_uintr);
     ...
     do {
         ....
         schedule();
     }
     list_del_init(upid->pcp_uintr);

and the hotplug code do:

    for_each_entry_safe(upid, this_cpu_ptr(pcp_uintrs), ...) {
       list_del(upid->pcp_uintr);
       upid->ndst = apic_id_of_random_online_cpu();
       if (do_magic_checks_whether_ipi_is_pending())
         send_ipi(upid->vector, upid->ndst);
    }

See?

We could map that to the interrupt subsystem by creating a virtual
interrupt domain for this, but that would make uintr_wait() look like
this:

     irq = uintr_alloc_irq();
     request_irq(irq, ......);
     upid = T1->upid;
     upid->vector = UINTR_WAIT_VECTOR;
     upid->ndst = local_apic_id();
     list_add(this_cpu_ptr(pcp_uintrs), upid->pcp_uintr);
     ...
     do {
         ....
         schedule();
     }
     list_del_init(upid->pcp_uintr);
     free_irq(irq);

But the benefit of that is dubious as it creates overhead on both sides
of the sleep and the only real purpose of the irq request would be to
handle CPU hotunplug without the above per CPU list mechanics.

Welcome to my wonderful world!

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-30 16:26 ` Stefan Hajnoczi
@ 2021-10-01  0:40   ` Sohil Mehta
  0 siblings, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-10-01  0:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

On 9/30/2021 9:26 AM, Stefan Hajnoczi wrote:
> On Mon, Sep 13, 2021 at 01:01:19PM -0700, Sohil Mehta wrote:
>> +------------+-------------------------+
>> | IPC type   |   Relative Latency      |
>> |            |(normalized to User IPI) |
>> +------------+-------------------------+
>> | User IPI   |                     1.0 |
>> | Signal     |                    14.8 |
>> | Eventfd    |                     9.7 |
> Is this the bi-directional eventfd benchmark?
> https://github.com/intel/uintr-ipc-bench/blob/linux-rfc-v1/source/eventfd/eventfd-bi.c

Yes. I have left it unmodified from the original source. But, I should 
have looked at it more closely.

> Two things stand out:
>
> 1. The server and client threads are racing on the same eventfd.
>     Eventfds aren't bi-directional! The eventfd_wait() function has code
>     to write the value back, which is a waste of CPU cycles and hinders
>     progress. I've never seen eventfd used this way in real applications.
>     Can you use two separate eventfds?

Sure. I can do that.


> 2. The fd is in blocking mode and the task may be descheduled, so we're
>     measuring eventfd read/write latency plus scheduler/context-switch
>     latency. A fairer comparison against user interrupts would be to busy
>     wait on a non-blocking fd so the scheduler/context-switch latency is
>     mostly avoided. After all, the uintrfd-bi.c benchmark does this in
>     uintrfd_wait():
>
>       // Keep spinning until the interrupt is received
>       while (!uintr_received[token]);

That makes sense. I'll give this a try and send out the updated results.

Thanks,
Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-10-01  0:01             ` Thomas Gleixner
@ 2021-10-01  4:41               ` Andy Lutomirski
  2021-10-01  9:56                 ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2021-10-01  4:41 UTC (permalink / raw)
  To: Thomas Gleixner, Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest



On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 15:01, Andy Lutomirski wrote:
>> On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:
>>>
>>> But even with that we still need to keep track of the armed ones per CPU
>>> so we can handle CPU hotunplug correctly. Sigh...
>>
>> I don’t think any real work is needed. We will only ever have armed
>> UPIDs (with notification interrupts enabled) for running tasks, and
>> hot-unplugged CPUs don’t have running tasks.
>
> That's not the problem. The problem is the wait for uintr case where the
> task is obviously not running:
>
> CPU 1
>      upid = T1->upid;
>      upid->vector = UINTR_WAIT_VECTOR;
>      upid->ndst = local_apic_id();
>      ...
>      do {
>          ....
>          schedule();
>      }
>
> CPU 0
>     unplug CPU 1
>
>     SENDUPI(index)
>         // Hardware does:
>         tblentry = &ttable[index];
>         upid = tblentry->upid;
>         upid->pir |= tblentry->uv;
>         send_IPI(upid->vector, upid->ndst);
>
> So SENDUPI will send the IPI to the APIC ID provided by T1->upid.ndst
> which points to the offlined CPU 1 and therefore is obviously going to
> /dev/null. IOW, lost wakeup...

Yes, but I don't think this is how we should structure this.

CPU 1
 upid->vector = UINV;
 upid->ndst = local_apic_id()
 exit to usermode;
 return from usermode;
 ...

 schedule();
 fpu__save_crap [see below]:
   if (this task is waiting for a uintr) {
     upid->resv0 = 1;  /* arm #GP */
   } else {
     upid->sn = 1;
   }


>
>> We do need a way to drain pending IPIs before we offline a CPU, but
>> that’s a separate problem and may be unsolvable for all I know. Is
>> there a magic APIC operation to wait until all initiated IPIs
>> targeting the local CPU arrive?  I guess we can also just mask the
>> notification vector so that it won’t crash us if we get a stale IPI
>> after going offline.
>
> All of this is solved already otherwise CPU hot unplug would explode in
> your face every time. The software IPI send side is carefully
> synchronized vs. hotplug (at least in theory). May I ask you politely to
> make yourself familiar with all that before touting "We do need..." based
> on random assumptions?

I'm aware that the software send IPI side is synchronized against hotplug.  But SENDUIPI is not unless we're going to have the CPU offline code IPI every other CPU to make sure that their SENDUIPIs have completed -- we don't control the SENDUIPI code.

After reading the ISE docs again, I think it might be possible to use the ON bit to synchronize.  In the schedule-out path, if we discover that ON = 1, then there is an IPI in flight to us.  In theory, we could wait for it, although actually doing so could be a mess.  That's why I'm asking whether there's a way to tell the APIC to literally wait for all IPIs that are *already sent* to be delivered.

>
> The above SENDUIPI vs. CPU hotplug scenario is the same problem as we
> have with regular device interrupts which are targeted at an outgoing
> CPU. We have magic mechanisms in place to handle that to the extent
> possible, but due to the insanity of X86 interrupt handling mechanics
> that still leaves a very tiny hole which might cause a lost and
> subsequently stale interrupt. Nothing we can fix in software.
>
> So on CPU offline the hotplug code walks through all device interrupts
> and checks whether they are targeted at the outgoing CPU. If so they are
> rerouted to an online CPU with lots of care to make the possible race
> window as small as it gets. That's nowadays only a problem on systems
> where interrupt remapping is not available or disabled via commandline.
>
> For tasks which just have the user interrupt armed there is no problem
> because SENDUPI modifies UPID->PIR which is reevaluated when the task
> which got migrated to an online CPU is going back to user space.
>
> The uintr_wait() syscall creates the very same problem as we have with
> device interrupts. Which means we need to make that wait thing:
>
>      upid = T1->upid;
>      upid->vector = UINTR_WAIT_VECTOR;

This is exactly what I'm suggesting we *don't* do.  Instead we set a reserved bit, we decode SENDUIPI in the #GP handler, and we emulate, in-kernel, the notification process for non-running tasks.

Now that I read the docs some more, I'm seriously concerned about this XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If we actually use this, then the whole last_cpu "preserve the state in registers" optimization goes out the window.  So does anything that happens to assume that merely saving the state doesn't destroy it on respectable modern CPUs  XRSTORS will #GP if you XRSTORS twice, which makes me nervous and would need a serious audit of our XRSTORS paths.

This is gross.

--Andy

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
                   ` (16 preceding siblings ...)
  2021-09-30 16:26 ` Stefan Hajnoczi
@ 2021-10-01  8:19 ` Pavel Machek
  17 siblings, 0 replies; 81+ messages in thread
From: Pavel Machek @ 2021-10-01  8:19 UTC (permalink / raw)
  To: Sohil Mehta
  Cc: x86, Tony Luck, Dave Hansen, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H . Peter Anvin, Andy Lutomirski, Jens Axboe,
	Christian Brauner, Peter Zijlstra, Shuah Khan, Arnd Bergmann,
	Jonathan Corbet, Ashok Raj, Jacob Pan, Gayatri Kammela,
	Zeng Guang, Dan Williams, Randy E Witt, Ravi V Shankar,
	Ramesh Thomas, linux-api, linux-arch, linux-kernel,
	linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 1067 bytes --]

Hi!

> Instructions
> ------------
> senduipi <index> - send a user IPI to a target task based on the UITT index.
> 
> clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
> 
> stui - Unmask user interrupts by setting UIF.
> 
> testui - Test current value of UIF.
> 
> uiret - return from a user interrupt handler.

Are other CPU vendors allowed to implement compatible instructions?

If not, we should probably have VDSO entries so kernel can abstract
differences between CPUs.

> Untrusted processes
> -------------------
> The current implementation expects only trusted and cooperating processes to
> communicate using user interrupts. Coordination is expected between processes
> for a connection teardown. In situations where coordination doesn't happen
> (say, due to abrupt process exit), the kernel would end up keeping shared
> resources (like UPID) allocated to avoid faults.

Keeping resources allocated after process exit is a no-no.

Best regards,
								Pavel
-- 
http://www.livejournal.com/~pavelmachek

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-10-01  4:41               ` Andy Lutomirski
@ 2021-10-01  9:56                 ` Thomas Gleixner
  2021-10-01 15:13                   ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2021-10-01  9:56 UTC (permalink / raw)
  To: Andy Lutomirski, Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>> All of this is solved already otherwise CPU hot unplug would explode in
>> your face every time. The software IPI send side is carefully
>> synchronized vs. hotplug (at least in theory). May I ask you politely to
>> make yourself familiar with all that before touting "We do need..." based
>> on random assumptions?
>
> I'm aware that the software send IPI side is synchronized against
> hotplug.  But SENDUIPI is not unless we're going to have the CPU
> offline code IPI every other CPU to make sure that their SENDUIPIs
> have completed -- we don't control the SENDUIPI code.

That's correct, but on CPU hot unplug _all_ running tasks have been
migrated to an online CPU _before_ the APIC is turned off. So they all
went through schedule() which set the UPID->SN bit. That's obviously
racy, but that has to be handled in exit to user mode anyway because
that's not different from any other migration or preemption. So that's
_not_ a problem at all.

The problem only exists if we can't do the #GP trick for tasks which are
sitting in uintr_wait(). Because then we _have_ to be careful vs. a
concurrent SENDUPI. But that'd be not any different from the problem
vs. device interrupts which we have today.

If we can use #GP then there is no problem at all and we avoid all the
nasty stuff vs. hotplug and avoid the list walk etc.

> After reading the ISE docs again, I think it might be possible to use
> the ON bit to synchronize.  In the schedule-out path, if we discover
> that ON = 1, then there is an IPI in flight to us.  In theory, we
> could wait for it, although actually doing so could be a mess.  That's
> why I'm asking whether there's a way to tell the APIC to literally
> wait for all IPIs that are *already sent* to be delivered.

You could busy poll with interrupts enabled, but that does not solve
anything. What guarantees that after APIC.IRR is clear no further IPI is
sent? Nothing at all. But again, that's not any different from device
interrupts and we already handle that today:

      cpu down()
      ...
      disable interrupts();
      for_each_interrupt_affine_to_cpu(irq) {
      	change_affinity_to_online_cpu(irq, new_target_cpu);
        // Did device send to the old vector?
        if (APIC.IRR & vector_bit(old_vector))
           send_IPI(new_target_cpu, new_vector);
      }

So for uintr_wait() w/o #GP we'd need to do:

      for_each_waiter_on_cpu(w) {
           move_waiter_to_new_target_cpu_wait_list(w);
           w->ndest = new_target_cpu;
           if (w->ON)
              send_IPI(new_target_cpu, UIWAIT_VECTOR);
      }

>> The uintr_wait() syscall creates the very same problem as we have with
>> device interrupts. Which means we need to make that wait thing:
>>
>>      upid = T1->upid;
>>      upid->vector = UINTR_WAIT_VECTOR;
>
> This is exactly what I'm suggesting we *don't* do.  Instead we set a
> reserved bit, we decode SENDUIPI in the #GP handler, and we emulate,
> in-kernel, the notification process for non-running tasks.

Yes, under the assumption that we can use #GP without breaking device
delivery.

> Now that I read the docs some more, I'm seriously concerned about this
> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If
> we actually use this, then the whole last_cpu "preserve the state in
> registers" optimization goes out the window.  So does anything that
> happens to assume that merely saving the state doesn't destroy it on
> respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which
> makes me nervous and would need a serious audit of our XRSTORS paths.

I have no idea what you are fantasizing about. You can XRSTORS five
times in a row as long as your XSTATE memory image is correct.

If you don't want to use XSAVES to manage UINTR then you have to manualy
fiddle with the MSRs and UIF in schedule() and return to user space.

Also keeping UINV alive when scheduling out creates a life time issue
vs. UPID:

CPU 0   CPU 1                   CPU2
        T1 -> schedule         // UPID is live in UINTR MSRs
        do_stuff_in_kernel()
        local_irq_disable();
                                SENDUIPI(T1 -> CPU1)
pull T1
T1 exits
free UPID

        local_irq_enable();
        ucode handles UINV -> UAF

Clearing UINV prevents the ucode from handling the IPI and fiddling with
UPID. The CPU will forward the IPI vector to the kernel which acks it
and does nothing else, i.e. it's a spurious interrupt.

Coming back to state preserving. All what needs to be done for a
situation where the rest of the XSTATE is live in the registers, i.e.
the T -> kthread -> T scheduling scenario, is to restore UINV on exit to
user mode and handle UPID.PIR which might contain newly set bits which
are obviously not yet in UPID.IRR. That can be done by MSR fiddling or
by issuing an self IPI on the UINV vector which will be handled in ucode
on the first user space instruction after return.

When the FPU has to be restored then the state has to be updated in the
XSAVE memory image before doing XRSTORS.

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-10-01  9:56                 ` Thomas Gleixner
@ 2021-10-01 15:13                   ` Andy Lutomirski
  2021-10-01 18:04                     ` Sohil Mehta
  2021-10-01 21:29                     ` Thomas Gleixner
  0 siblings, 2 replies; 81+ messages in thread
From: Andy Lutomirski @ 2021-10-01 15:13 UTC (permalink / raw)
  To: Thomas Gleixner, Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Fri, Oct 1, 2021, at 2:56 AM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
>> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>
>> Now that I read the docs some more, I'm seriously concerned about this
>> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If
>> we actually use this, then the whole last_cpu "preserve the state in
>> registers" optimization goes out the window.  So does anything that
>> happens to assume that merely saving the state doesn't destroy it on
>> respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which
>> makes me nervous and would need a serious audit of our XRSTORS paths.
>
> I have no idea what you are fantasizing about. You can XRSTORS five
> times in a row as long as your XSTATE memory image is correct.

I'm just reading TFM, which is some kind of dystopian fantasy.

11.8.2.4 XRSTORS

Before restoring the user-interrupt state component, XRSTORS verifies that UINV is 0. If it is not, XRSTORS
causes a general-protection fault (#GP) before loading any part of the user-interrupt state component. (UINV
is IA32_UINTR_MISC[39:32]; XRSTORS does not check the contents of the remainder of that MSR.)

So if UINV is set in the memory image and you XRSTORS five times in a row, the first one will work assuming UINV was zero.  The second one will #GP.  And:

11.8.2.3 XSAVES
After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
XSAVES does not modify the remainder of that MSR.)

So if we're running a UPID-enabled user task and we switch to a kernel thread, we do XSAVES and UINV is cleared.  Then we switch back to the same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC) and UINV is still clear.

And we had better clear UINV when running a kernel thread because the UPID might get freed or the kernel thread might do some CPL3 shenanigans (via EFI, perhaps? I don't know if any firmwares actually do this).

So all this seems to put UINV into the "independent" category of feature along with LBR.  And the 512-byte wastes from extra copies of the legacy area and the loss of the XMODIFIED optimization will just be collateral damage.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-09-30 17:24     ` Sohil Mehta
  2021-09-30 17:26       ` Andy Lutomirski
@ 2021-10-01 16:35       ` Stefan Hajnoczi
  2021-10-01 16:41         ` Richard Henderson
  1 sibling, 1 reply; 81+ messages in thread
From: Stefan Hajnoczi @ 2021-10-01 16:35 UTC (permalink / raw)
  To: Sohil Mehta, Peter Maydell, Alex Bennée, Richard Henderson
  Cc: Andy Lutomirski, the arch/x86 maintainers, Tony Luck,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2018 bytes --]

On Thu, Sep 30, 2021 at 10:24:24AM -0700, Sohil Mehta wrote:
> 
> On 9/30/2021 9:30 AM, Stefan Hajnoczi wrote:
> > On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:
> > > 
> > > I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:
> > > 
> > > Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.
> > I was wondering the same thing. One thing came to mind:
> > 
> > An application that wants to be *interrupted* from what it's doing
> > rather than waiting until the next polling point. For example,
> > applications that are CPU-intensive and have green threads. I can't name
> > a real application like this though :P.
> 
> Thank you Stefan and Andy for giving this some thought.
> 
> We are consolidating the information internally on where and how exactly we
> expect to see benefits with real workloads for the various sources of User
> Interrupts. It will take a few days to get back on this one.

One possible use case came to mind in QEMU's TCG just-in-time compiler:

QEMU's TCG threads execute translated code. There are events that
require interrupting these threads. Today a check is performed at the
start of every translated block. Most of the time the check is false and
it's a waste of CPU.

User interrupts can eliminate the need for checks by interrupting TCG
threads when events occur.

I don't know whether this will improve performance or how feasible it is
to implement, but I've added people who might have ideas. (For a summary
of user interrupts, see
https://lwn.net/SubscriberLink/871113/60652640e11fc5df/.)

Stefan

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 00/13] x86 User Interrupts support
  2021-10-01 16:35       ` Stefan Hajnoczi
@ 2021-10-01 16:41         ` Richard Henderson
  0 siblings, 0 replies; 81+ messages in thread
From: Richard Henderson @ 2021-10-01 16:41 UTC (permalink / raw)
  To: Stefan Hajnoczi, Sohil Mehta, Peter Maydell, Alex Bennée
  Cc: Andy Lutomirski, the arch/x86 maintainers, Tony Luck,
	Dave Hansen, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest, qemu-devel

On 10/1/21 12:35 PM, Stefan Hajnoczi wrote:
> QEMU's TCG threads execute translated code. There are events that
> require interrupting these threads. Today a check is performed at the
> start of every translated block. Most of the time the check is false and
> it's a waste of CPU.
> 
> User interrupts can eliminate the need for checks by interrupting TCG
> threads when events occur.

We used to use interrupts, and stopped because we need to wait until the guest is in a 
stable state.  The guest is always in a stable state at the beginning of each TB.

See 378df4b2375.


r~

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-10-01 15:13                   ` Andy Lutomirski
@ 2021-10-01 18:04                     ` Sohil Mehta
  2021-10-01 21:29                     ` Thomas Gleixner
  1 sibling, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-10-01 18:04 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On 10/1/2021 8:13 AM, Andy Lutomirski wrote:
>
> I'm just reading TFM, which is some kind of dystopian fantasy.
>
> 11.8.2.4 XRSTORS
>
> Before restoring the user-interrupt state component, XRSTORS verifies that UINV is 0. If it is not, XRSTORS
> causes a general-protection fault (#GP) before loading any part of the user-interrupt state component. (UINV
> is IA32_UINTR_MISC[39:32]; XRSTORS does not check the contents of the remainder of that MSR.)
>
> So if UINV is set in the memory image and you XRSTORS five times in a row, the first one will work assuming UINV was zero.  The second one will #GP.  And:
>
> 11.8.2.3 XSAVES
> After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
> XSAVES does not modify the remainder of that MSR.)
>
> So if we're running a UPID-enabled user task and we switch to a kernel thread, we do XSAVES and UINV is cleared.  Then we switch back to the same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC) and UINV is still clear.

Andy,

I am still catching up with the rest of the discussion but I wanted to 
provide some input here.

Have you had a chance to look at the discussion on this topic in patch 5?
https://lore.kernel.org/lkml/87bl4fcxz8.ffs@tglx/
The pseudo code Thomas provided and my comments on the same cover the 
above situation.

The UINV bits in the IA32_UINTR_MISC act as an on/off switch for 
detecting user interrupts (i.e. moving them from UPID.PIR to UIRR). When 
XSAVES saves UIRR into memory we want the switch to atomically turn off 
to stop detecting additional interrupts. When we restore the state back 
the hardware wants to be sure the switch is off before writing to UIRR. 
If not, the UIRR state could potentially be overwritten.

That's how I understand the XSAVES/XRSTORS behavior. I can confirm with 
the hardware architects if you want more details here.

Regarding the #GP trick proposal, I am planning to get some feedback 
from the hardware folks to see if any potential issues could arise.

I am on a pre-planned break next week. I apologize (in advance) for the 
delay in responding.

Thanks,
Sohil



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-10-01 15:13                   ` Andy Lutomirski
  2021-10-01 18:04                     ` Sohil Mehta
@ 2021-10-01 21:29                     ` Thomas Gleixner
  2021-10-01 23:00                       ` Sohil Mehta
  2021-10-01 23:04                       ` Andy Lutomirski
  1 sibling, 2 replies; 81+ messages in thread
From: Thomas Gleixner @ 2021-10-01 21:29 UTC (permalink / raw)
  To: Andy Lutomirski, Sohil Mehta, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On Fri, Oct 01 2021 at 08:13, Andy Lutomirski wrote:

> On Fri, Oct 1, 2021, at 2:56 AM, Thomas Gleixner wrote:
>> On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
>>> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>>
>>> Now that I read the docs some more, I'm seriously concerned about this
>>> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If
>>> we actually use this, then the whole last_cpu "preserve the state in
>>> registers" optimization goes out the window.  So does anything that
>>> happens to assume that merely saving the state doesn't destroy it on
>>> respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which
>>> makes me nervous and would need a serious audit of our XRSTORS paths.
>>
>> I have no idea what you are fantasizing about. You can XRSTORS five
>> times in a row as long as your XSTATE memory image is correct.
>
> I'm just reading TFM, which is some kind of dystopian fantasy.
>
> 11.8.2.4 XRSTORS
>
> Before restoring the user-interrupt state component, XRSTORS verifies
> that UINV is 0. If it is not, XRSTORS causes a general-protection
> fault (#GP) before loading any part of the user-interrupt state
> component. (UINV is IA32_UINTR_MISC[39:32]; XRSTORS does not check the
> contents of the remainder of that MSR.)

Duh. I was staring at the SDM and searching for a hint. Stupid me!

> So if UINV is set in the memory image and you XRSTORS five times in a
> row, the first one will work assuming UINV was zero.  The second one
> will #GP.

Yes. I can see what you mean now :)

> 11.8.2.3 XSAVES
> After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
> XSAVES does not modify the remainder of that MSR.)
>
> So if we're running a UPID-enabled user task and we switch to a kernel
> thread, we do XSAVES and UINV is cleared.  Then we switch back to the
> same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC)
> and UINV is still clear.

Yes, that has to be mopped up on the way to user space.

> And we had better clear UINV when running a kernel thread because the
> UPID might get freed or the kernel thread might do some CPL3
> shenanigans (via EFI, perhaps? I don't know if any firmwares actually
> do this).

Right. That's what happens already with the current pile.

> So all this seems to put UINV into the "independent" category of
> feature along with LBR.  And the 512-byte wastes from extra copies of
> the legacy area and the loss of the XMODIFIED optimization will just
> be collateral damage.

So we'd end up with two XSAVES on context switch. We can simply do:

        XSAVES();
        fpu.state.xtsate.uintr.uinv = 0;

which allows to do as many XRSTORS in a row as we want. Only the final
one on the way to user space will have to restore the real vector if the
register state is not valid:

       if (fpu_state_valid()) {
            if (needs_uinv(current)
               wrmsrl(UINV, vector);
       } else {
            if (needs_uinv(current)
               fpu.state.xtsate.uintr.uinv = vector;
            XRSTORS();
       }

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-10-01 21:29                     ` Thomas Gleixner
@ 2021-10-01 23:00                       ` Sohil Mehta
  2021-10-01 23:04                       ` Andy Lutomirski
  1 sibling, 0 replies; 81+ messages in thread
From: Sohil Mehta @ 2021-10-01 23:00 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski, the arch/x86 maintainers
  Cc: Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest

On 10/1/2021 2:29 PM, Thomas Gleixner wrote:
> So we'd end up with two XSAVES on context switch. We can simply do:
>          XSAVES();
>          fpu.state.xtsate.uintr.uinv = 0;


I am a bit confused. Do we need to set UINV to 0 explicitly?

If XSAVES gets called twice during context switch then the UINV in the 
XSTATE buffer automatically gets set to 0. Since XSAVES saves the 
current UINV value in the MISC_MSR which was already set to 0 by the 
previous XSAVES.

Though, this probably happens due to pure luck than intentional design :)

> which allows to do as many XRSTORS in a row as we want. Only the final
> one on the way to user space will have to restore the real vector if the
> register state is not valid:
>
>         if (fpu_state_valid()) {
>              if (needs_uinv(current)
>                 wrmsrl(UINV, vector);
>         } else {
>              if (needs_uinv(current)
>                 fpu.state.xtsate.uintr.uinv = vector;
>              XRSTORS();
>         }

I might have missed some subtle difference. Has this logic changed from 
what you previously suggested for arch_exit_to_user_mode_prepare()?

        if (xrstors_pending)) {
             // Update the saved xstate for xrstors
             // Unconditionally update the UINV since it could have been 
overwritten by calling XSAVES twice.
                current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;
                 current->xstate.uintr.uirr |= pir;
         } else {
                 // Manually restore UIRR and UINV
                 rdmsrl(IA32_UINTR_RR, uirr);
                 wrmsrl(IA32_UINTR_RR, uirr | pir);

             misc.val64 = 0;
                 misc.uittsz = current->uintr->uittsz;
                 misc.uinv = UINTR_NOTIFICATION_VECTOR;
                 wrmsrl(IA32_UINTR_MISC, misc.val64);
         }

> Hmm?


The one case I can see this failing is if there was another XRSTORS 
after the "final" restore in arch_exit_to_user_mode_prepare()? I think 
that is not possible but I am not an expert on this. Did I misunderstand 
something?

Thanks,
Sohil


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall
  2021-10-01 21:29                     ` Thomas Gleixner
  2021-10-01 23:00                       ` Sohil Mehta
@ 2021-10-01 23:04                       ` Andy Lutomirski
  1 sibling, 0 replies; 81+ messages in thread
From: Andy Lutomirski @ 2021-10-01 23:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Sohil Mehta, the arch/x86 maintainers,
	Tony Luck, Dave Hansen, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Jens Axboe, Christian Brauner,
	Peter Zijlstra (Intel),
	Shuah Khan, Arnd Bergmann, Jonathan Corbet, Raj Ashok, Jacob Pan,
	Gayatri Kammela, Zeng Guang, Williams, Dan J, Randy E Witt,
	Shankar, Ravi V, Ramesh Thomas, Linux API, linux-arch,
	Linux Kernel Mailing List, linux-kselftest



> On Oct 1, 2021, at 2:29 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Fri, Oct 01 2021 at 08:13, Andy Lutomirski wrote:
> 
>>> On Fri, Oct 1, 2021, at 2:56 AM, Thomas Gleixner wrote:
>>> On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
>>>>> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>>> 
>>>> Now that I read the docs some more, I'm seriously concerned about this
>>>> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If
>>>> we actually use this, then the whole last_cpu "preserve the state in
>>>> registers" optimization goes out the window.  So does anything that
>>>> happens to assume that merely saving the state doesn't destroy it on
>>>> respectable modern CPUs XRSTORS will #GP if you XRSTORS twice, which
>>>> makes me nervous and would need a serious audit of our XRSTORS paths.
>>> 
>>> I have no idea what you are fantasizing about. You can XRSTORS five
>>> times in a row as long as your XSTATE memory image is correct.
>> 
>> I'm just reading TFM, which is some kind of dystopian fantasy.
>> 
>> 11.8.2.4 XRSTORS
>> 
>> Before restoring the user-interrupt state component, XRSTORS verifies
>> that UINV is 0. If it is not, XRSTORS causes a general-protection
>> fault (#GP) before loading any part of the user-interrupt state
>> component. (UINV is IA32_UINTR_MISC[39:32]; XRSTORS does not check the
>> contents of the remainder of that MSR.)
> 
> Duh. I was staring at the SDM and searching for a hint. Stupid me!
> 
>> So if UINV is set in the memory image and you XRSTORS five times in a
>> row, the first one will work assuming UINV was zero.  The second one
>> will #GP.
> 
> Yes. I can see what you mean now :)
> 
>> 11.8.2.3 XSAVES
>> After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
>> XSAVES does not modify the remainder of that MSR.)
>> 
>> So if we're running a UPID-enabled user task and we switch to a kernel
>> thread, we do XSAVES and UINV is cleared.  Then we switch back to the
>> same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC)
>> and UINV is still clear.
> 
> Yes, that has to be mopped up on the way to user space.
> 
>> And we had better clear UINV when running a kernel thread because the
>> UPID might get freed or the kernel thread might do some CPL3
>> shenanigans (via EFI, perhaps? I don't know if any firmwares actually
>> do this).
> 
> Right. That's what happens already with the current pile.
> 
>> So all this seems to put UINV into the "independent" category of
>> feature along with LBR.  And the 512-byte wastes from extra copies of
>> the legacy area and the loss of the XMODIFIED optimization will just
>> be collateral damage.
> 
> So we'd end up with two XSAVES on context switch. We can simply do:
> 
>        XSAVES();
>        fpu.state.xtsate.uintr.uinv = 0;

Could work. As long as UINV is armed, RR can change at any time (maybe just when IF=1? The manual is unclear).  But the first XSAVES disarms UINV, so maybe this won’t confuse any callers.

> 
> which allows to do as many XRSTORS in a row as we want. Only the final
> one on the way to user space will have to restore the real vector if the
> register state is not valid:
> 
>       if (fpu_state_valid()) {
>            if (needs_uinv(current)
>               wrmsrl(UINV, vector);
>       } else {
>            if (needs_uinv(current)
>               fpu.state.xtsate.uintr.uinv = vector;
>            XRSTORS();
>       }
> 
> Hmm?

I like it better than anything else I’ve seen.

> 
> Thanks,
> 
>        tglx 

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2021-10-01 23:04 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-13 20:01 [RFC PATCH 00/13] x86 User Interrupts support Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 01/13] x86/uintr/man-page: Include man pages draft for reference Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 02/13] Documentation/x86: Add documentation for User Interrupts Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 03/13] x86/cpu: Enumerate User Interrupts support Sohil Mehta
2021-09-23 22:24   ` Thomas Gleixner
2021-09-24 19:59     ` Sohil Mehta
2021-09-27 20:42     ` Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 04/13] x86/fpu/xstate: Enumerate User Interrupts supervisor state Sohil Mehta
2021-09-23 22:34   ` Thomas Gleixner
2021-09-27 22:25     ` Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 05/13] x86/irq: Reserve a user IPI notification vector Sohil Mehta
2021-09-23 23:07   ` Thomas Gleixner
2021-09-25 13:30     ` Thomas Gleixner
2021-09-26 12:39       ` Thomas Gleixner
2021-09-27 19:07         ` Sohil Mehta
2021-09-28  8:11           ` Thomas Gleixner
2021-09-27 19:26     ` Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 06/13] x86/uintr: Introduce uintr receiver syscalls Sohil Mehta
2021-09-23 12:26   ` Greg KH
2021-09-24  0:05     ` Thomas Gleixner
2021-09-27 23:20     ` Sohil Mehta
2021-09-28  4:39       ` Greg KH
2021-09-28 16:47         ` Sohil Mehta
2021-09-23 23:52   ` Thomas Gleixner
2021-09-27 23:57     ` Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 07/13] x86/process/64: Add uintr task context switch support Sohil Mehta
2021-09-24  0:41   ` Thomas Gleixner
2021-09-28  0:30     ` Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 08/13] x86/process/64: Clean up uintr task fork and exit paths Sohil Mehta
2021-09-24  1:02   ` Thomas Gleixner
2021-09-28  1:23     ` Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 09/13] x86/uintr: Introduce vector registration and uintr_fd syscall Sohil Mehta
2021-09-24 10:33   ` Thomas Gleixner
2021-09-28 20:40     ` Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 10/13] x86/uintr: Introduce user IPI sender syscalls Sohil Mehta
2021-09-23 12:28   ` Greg KH
2021-09-28 18:01     ` Sohil Mehta
2021-09-29  7:04       ` Greg KH
2021-09-29 14:27         ` Sohil Mehta
2021-09-24 10:54   ` Thomas Gleixner
2021-09-13 20:01 ` [RFC PATCH 11/13] x86/uintr: Introduce uintr_wait() syscall Sohil Mehta
2021-09-24 11:04   ` Thomas Gleixner
2021-09-25 12:08     ` Thomas Gleixner
2021-09-28 23:13       ` Sohil Mehta
2021-09-28 23:08     ` Sohil Mehta
2021-09-26 14:41   ` Thomas Gleixner
2021-09-29  1:09     ` Sohil Mehta
2021-09-29  3:30   ` Andy Lutomirski
2021-09-29  4:56     ` Sohil Mehta
2021-09-30 18:08       ` Andy Lutomirski
2021-09-30 19:29         ` Thomas Gleixner
2021-09-30 22:01           ` Andy Lutomirski
2021-10-01  0:01             ` Thomas Gleixner
2021-10-01  4:41               ` Andy Lutomirski
2021-10-01  9:56                 ` Thomas Gleixner
2021-10-01 15:13                   ` Andy Lutomirski
2021-10-01 18:04                     ` Sohil Mehta
2021-10-01 21:29                     ` Thomas Gleixner
2021-10-01 23:00                       ` Sohil Mehta
2021-10-01 23:04                       ` Andy Lutomirski
2021-09-13 20:01 ` [RFC PATCH 12/13] x86/uintr: Wire up the user interrupt syscalls Sohil Mehta
2021-09-13 20:01 ` [RFC PATCH 13/13] selftests/x86: Add basic tests for User IPI Sohil Mehta
2021-09-13 20:27 ` [RFC PATCH 00/13] x86 User Interrupts support Dave Hansen
2021-09-14 19:03   ` Mehta, Sohil
2021-09-23 12:19     ` Greg KH
2021-09-23 14:09       ` Greg KH
2021-09-23 14:46         ` Dave Hansen
2021-09-23 15:07           ` Greg KH
2021-09-23 23:24         ` Sohil Mehta
2021-09-23 23:09       ` Sohil Mehta
2021-09-24  0:17       ` Sohil Mehta
2021-09-23 14:39 ` Jens Axboe
2021-09-29  4:31 ` Andy Lutomirski
2021-09-30 16:30   ` Stefan Hajnoczi
2021-09-30 17:24     ` Sohil Mehta
2021-09-30 17:26       ` Andy Lutomirski
2021-10-01 16:35       ` Stefan Hajnoczi
2021-10-01 16:41         ` Richard Henderson
2021-09-30 16:26 ` Stefan Hajnoczi
2021-10-01  0:40   ` Sohil Mehta
2021-10-01  8:19 ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).