Re: [RFC] tentative prctl task isolation interface

From: Marcelo Tosatti <mtosatti@redhat.com>
To: Alex Belits <abelits@marvell.com>
Cc: Christoph Lameter <cl@linux.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"pauld@redhat.com" <pauld@redhat.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"frederic@kernel.org" <frederic@kernel.org>,
	"willy@infradead.org" <willy@infradead.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Nitesh Narayan Lal <nitesh@redhat.com>
Subject: Re: [RFC] tentative prctl task isolation interface
Date: Thu, 21 Jan 2021 13:20:59 -0300	[thread overview]
Message-ID: <20210121162059.GA18719@fuller.cnet> (raw)
In-Reply-To: <20210121155141.GA11373@fuller.cnet>

Adding Nitesh to CC.

On Thu, Jan 21, 2021 at 12:51:41PM -0300, Marcelo Tosatti wrote:
> Hi Alex,
> 
> On Fri, Jan 15, 2021 at 10:35:14AM -0800, Alex Belits wrote:
> > On 1/15/21 05:24, Christoph Lameter wrote:
> > 
> > > ----------------------------------------------------------------------
> > > On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> > > 
> > > > > How does one do a oneshot flush of OS activities?
> > > > 
> > > >          ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
> > > >          if (ret == -1) {
> > > >                  perror("prctl PR_TASK_ISOLATION_REQUEST");
> > > >                  exit(0);
> > > >          }
> > > > 
> > > > > 
> > > > > I.e. I have a polling loop over numerous shared and I/o devices in user
> > > > > space and I want to make sure that the system is quite before I enter the
> > > > > loop.
> > > > 
> > > > You could configure things in two ways: with syscalls allowed or not.
> > > 
> > > Well syscalls that do not cause deferred processing like getting the time
> > > or determining the current cpu should be ok to use.
> > 
> > Some of those syscalls go through vdso, and don't enter the kernel --
> > nothing specific is necessary to allow them, and it would be pointless and
> > difficult to prevent them.
> > 
> > For syscalls that enter the kernel, it's often difficult to predict, if they
> > will or won't cause deferred processing, so I am afraid, it won't be
> > possible to provide a "safe" class of syscalls for this purpose and not end
> > up with something minimal like reading /sys and /proc. Right now isolation
> > only "allows" syscalls that exit isolation.
> 
> Christoph wrote:
> 
> "> Features that I think may be needed:
> > 
> > F_ISOL_QUIESCE                -> quiet down now but allow all OS activities. OS
> >                       activites reset flag
> > 
> > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> >                       require such actions in the future.
> > 
> > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> >                               services require delayed processing etc
> >                               but continue while resetting the flag.
> "
> 
> It seems the only difference between HARD and WARN (lets call it SOFT) 
> would be whether a notification is sent to userspace.
> 
> The definition 
> 
> "F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
>                        require such actions in the future."
> 
> fails in the static_key_enable case: Alex's idea is to queue the i-cache
> flush if the remote task/cpu is in isolated mode (and perform the flush 
> when entering the kernel).
> 
> So even if userspace uses syscalls that do not require delayed
> processing, there are events which are out of control of the
> application and might require it.
> 
> So lets assume the application performs a number of syscalls on a
> given time critical codepath. 
> 
> Either the system is configured so that 
> the number/frequency of static_key_enable's is limited, or the cost of
> i-cache flushes must be accounted on that critical codepath.
> 
> Anyway, trying to improve Christoph's definition:
> 
> F_ISOL_QUIESCE                -> flush any pending operations that might cause
> 				 the CPU to be interrupted (ex: free's
> 				 per-CPU queues, sync MM statistics
> 				 counters, etc).
> 
> F_ISOL_ISOLATE		      -> inform the kernel that userspace is
> 				 entering isolated mode (see description
> 				 below on "ISOLATION MODES").
> 
> F_ISOL_UNISOLATE              -> inform the kernel that userspace is
> 				 leaving isolated mode.
> 
> F_ISOL_NOTIFY		      -> notification mode of isolation breakage
> 				 modes.
> 
> 
> Isolation modes:
> ---------------
> 
> There are two main types of isolation modes: 
> 
> - SOFT mode: does not prevent activities which might generate interruptions
> (such as CPU hotplug).
> 
> - HARD mode: prevents all blockable activities that might generate interruptions.
> Administrators can override this via /sys.
> 
> Notifications:
> -------------
> 
> Notification mode of isolation breakage can be configured as follows:
> 
> - None (default): No notification is performed by the kernel on isolation
>   breakage.
> 
> - Syslog: Isolation breakage is reported to syslog. 
> 
> (new modes can be added, for example signals).
> 
> A new feature can be added to disallow syscalls (by default syscalls
> are enabled, with reporting of pending activities that might cause
> an interruption in a VDSO).
> 
> How about that?
> 
> > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> >                       require such actions in the future.
> > 
> > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> >                               services require delayed processing etc
> >                               but continue while resetting the flag.
> 
> 
> 
> > It may be possible to set up a filter by the system (allowing few safe
> > things like reading /proc) and let the user expand it by adding combinations
> > of syscall / file descriptor. If some device is known to process operations
> > safely, user can open it and mark file descriptor as allowed, say, for
> > reading.
> 
> Makes sense.
> 
> > > And I already said that I want the system to quiet down and allow system
> > > calls. Some indication that deferred actions have occurred may be useful
> > > by f.e. resetting the flag.
> 
> Do you think reporting activities that add overhead (the i-cache flush
> in mind) to syscalls separately in the VDSO?
> 
> > I think, it should be possible to process a syscall, and if any deferred
> > action occurred, exit isolation on return to userspace. 
> 
> On the interface we are creating:
> 
> 	ret = syscall()...
> 	if (vdso.pending_activity) {
> 		prctl(PR_TASK_ISOLATION_REQUEST, F_ISOL_UNISOLATE, 0, 0);
> 		...
> 	}
> 
> Why would it be necessary to exit isolation on return to userspace
> again?
> 
> > Then there is a
> > question, how userspace should be notified about isolation being lost.
> > Normally this happens with a signal, but that is useful if we want syscall
> > to fail with EINTR, not to succeed. Make sure that signal arrives after
> > successful syscall return but before deferred action to happen? Sounds
> > convoluted. Maybe reflecting isolation status in vdso and having the user
> > check it there will be a good solution.
> 
> Why can't userspace enable/disable isolation mode (and the kernel only
> reports it) ?
> 
> I fail to see why the order of the events "isolated mode disablement"
> and "return to userspace" is critical.
> 
> > When I worked on my implementation I have encountered both a problem of
> > interaction with the rest of system from isolated task (at least simple
> > things as logging) and a problem of handling enter/exit from isolation on a
> > system when it's possible for a task to be interrupted early after entering
> > isolation due to various events that were still in progress on other CPUs.
> > 
> > I ended up implementing a manager/helper task that talks to tasks over a
> > socket (when they are not isolated) and over ring buffers in shared memory
> > (when they are isolated). While the current implementation is rather
> > limited, the intention is to delegate to it everything that isolated task
> > either can't do at all (like, writing logs) or that it would be cumbersome
> > to implement (like monitoring the state of task, determining presence of
> > deferred work after the task returned to userspace), etc.
> 
> Interesting. Are you considering opensourcing such library? Seems like a
> generic problem.
> 
> > It would be great if the complexity and amount of functionality of that
> > manager/helper task can be reduced, however I believe that having such a
> > task is a legitimate way of implementing things that otherwise would require
> > additional functionality in kernel.
> > 
> > > 
> > > > 1) Add a new isolation feature ISOL_F_BLOCK_SYSCALLS (to block certain
> > > > syscalls) along with ISOL_F_SETUP_NOTIF (to notify upon isolation
> > > > breaking):
> > > 
> > > Well come up with a use case for that .... I know mine. What you propose
> > > could be  useful for debugging for me but I would prefer the quiet down
> > > approach where I determine when I use some syscalls or not and will deal
> > > with the consequences.
> > 
> > For my purposes breaking isolation on syscalls and notifications about
> > isolation breaking is a very useful approach -- this is why I kept it
> > exactly as it was in the original implementation by Chris Metcalf.
> > 
> > In applications that I intend to use isolation for, the primary concern is
> > consistent time for running code in userspace, so syscalls should be only
> > issued when the task is specifically not in isolated mode. If the program
> > issues a syscall by mistake (and that may happen when some libraries are
> > used, or thread synchronization primitives are kept from non-isolated
> > version of the program, even though isolated tasks are not supposed to use
> > those), it means not only that deferred work may cause delay in the future,
> > but also that there is an additional time to be spent in kernel. This should
> > be immediately visible to the developer, and the best way to do it is by
> > breaking isolation on syscall immediately.
> 
> I guess you can do that by hooking a BPF program to cpu->is_isolated ==
> true (for development) and syscall entry.
> 
> > > > 
> > > > > Features that I think may be needed:
> > > > > 
> > > > > F_ISOL_QUIESCE		-> quiet down now but allow all OS activities. OS
> > > > > 			activites reset flag
> > > > > 
> > > > > F_ISOL_BAREMETAL_HARD	-> No OS interruptions. Fault on syscalls that
> > > > > 			require such actions in the future.
> > > > 
> > > > Question: why BAREMETAL ?
> > > 
> > > To disinguish it from "Realtime". We want the processor for ourselves
> > > without anything else running on it.
> > > 
> > > > Two comments:
> > > > 
> > > > 1) HARD mode could also block activities from different CPUs that can
> > > > interrupt this isolated CPU (for example CPU hotplug, or increasing
> > > > per-CPU trace buffer size).
> > > 
> > > Blocking? The app should fail if any deferred actions are triggered as a
> > > result of syscalls. It would give a warning with _WARN
> > 
> > There are many supposedly innocent things, nowhere at the scale of CPU
> > hotplug, that happen in a system and result in synchronization implemented
> > as an IPI to every online CPU. We should consider them to be an ordinary
> > occurrence, so there is a choice:
> > 
> > 1. Ignore them completely and allow them in isolated mode. This will delay
> > userspace with no indication and no isolation breaking.
> > 
> > 2. Allow them, and notify userspace afterwards (through vdso or through
> > userspace helper/manager over shared memory). This may be useful in those
> > rare situations when the consequences of delay can be mitigated afterwards.
> > 
> > 3. Make them break isolation, with userspace being notified normally (ex:
> > with a signal in the current implementation). I guess, can be used if
> > somehow most of the causes will be eliminated.
> > 
> > 4. Prevent them from reaching the target CPU and make sure that whatever
> > synchronization they are intended to cause, will happen when intended target
> > CPU will enter to kernel later. Since we may have to synchronize things like
> > code modification, some of this synchronization has to happen very early on
> > kernel entry.
> > 
> > I am most interested in (4), so this is what was implemented in my version
> > of the patch (and currently I am trying to achieve completeness and, if
> > possible, elegance of the implementation).
> 
> Agree. (3) will be necessary as intermediate step. The proposed
> improvement to Christoph's reply, in this thread, separates notification 
> and syscall blockage. 
> 
> > I guess, if we want to add more controls, we can allow the user to choose
> > either of those four options, or of a subset of them. In my opinion, if (4)
> > will be available, and the only additional cost will be time for
> > synchronization spent in breaking isolation procedure, there is not much
> > need in the other three. Without (4) I don't think, the goal of providing
> > consistent, interruption-free environment is achieved at all, so not
> > implementing it would be very bad.
> 
> Agree.
> 
> > > > 2) For a type of application it is the case that certain interruptions
> > > > can be tolerated, as long as they do not cross certain thresholds.
> > > > For example, one loses the flexibility to read/write MSRs
> > > > on the isolated CPUs (including performance counters,
> > > > RDT/MBM type MSRs, frequency/power statistics) by
> > > > forcing a "no interruptions" mode.
> > > 
> > > Does reading these really cause deferred actions by the OS? AFAICT you
> > > could map these into memory as well as read them without OS activities.
> > 
> > Access to those is hardware/architecture-specific, and in many cases,
> > indeed, there is no need to issue a syscall at all.
> > 
> > However for many applications the model with a helper task performing
> > interactions with OS on a different core and exchanging data over shared
> > memory may be sufficient, and it will also provide clear separation between
> > operations that do require consistent timing and those that don't.
> 
> I see.
> 
> > > "Interruptions that can be tolerated".... Well that is the wild west of
> > > "realtime" where you can define how much of a time slice is "real" and how
> > > much can be use by other processes. I do not think that any of that should
> > > come into this API.
> > > 
> > 
> > To be honest, I have no idea, what can and can not be tolerated by
> > applications other than what I am familiar with. Applications that I know,
> > require no interruptions at all, so I want to implement that. I assume,
> > someone already uses existing CPU isolation for the purpose of providing
> > "nearly interrupt-less" environment.
> > 
> > I can imaging something like a task of controlling a large slow-updating LED
> > display by bit-banging a strictly timed long serial message representing a
> > frame or frame update. If interrupted, it may, depending on the protocol,
> > corrupt the state of a single LED or fail to update until the end of the
> > screen, but the next start of message will reset the state, and everything
> > will work until the next interrupt. Maybe there are more realistic or useful
> > examples.
> 
> Agree that "no interruptions" as a goal makes most sense. 
> 
> Can "whitelist" certain interruptions if necessary (to handle the MSR
> read case), if user desires.
>