Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode

From: Alex Belits <abelits@marvell.com>
To: "tglx@linutronix.de" <tglx@linutronix.de>,
	"peterz@infradead.org" <peterz@infradead.org>
Cc: "mingo@kernel.org" <mingo@kernel.org>,
	Prasun Kapoor <pkapoor@marvell.com>,
	"linux-api@vger.kernel.org" <linux-api@vger.kernel.org>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"frederic@kernel.org" <frederic@kernel.org>,
	"catalin.marinas@arm.com" <catalin.marinas@arm.com>,
	"will@kernel.org" <will@kernel.org>,
	"linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: [EXT] Re: [PATCH v4 00/13] "Task_isolation" mode
Date: Fri, 24 Jul 2020 03:00:03 +0000	[thread overview]
Message-ID: <851ee54e8317cd186338a76a045f738476144fcc.camel@marvell.com> (raw)
In-Reply-To: <877dutx5xj.fsf@nanos.tec.linutronix.de>

On Thu, 2020-07-23 at 23:44 +0200, Thomas Gleixner wrote:
> External Email
> 
> -------------------------------------------------------------------
> ---
> Alex Belits <abelits@marvell.com> writes:
> > On Thu, 2020-07-23 at 17:49 +0200, Peter Zijlstra wrote:
> > > 'What does noinstr mean? and why do we have it" -- don't dare
> > > touch
> > > the
> > > entry code until you can answer that.
> > 
> > noinstr disables instrumentation, so there would not be calls and
> > dependencies on other parts of the kernel when it's not yet safe to
> > call them. Relevant functions already have it, and I add an inline
> > call
> > to perform flags update and synchronization. Unless something else
> > is
> > involved, those operations are safe, so I am not adding anything
> > that
> > can break those.
> 
> Sure.
> 
>  1) That inline function can be put out of line by the compiler and
>     placed into the regular text section which makes it subject to
>     instrumentation
> 
>  2) That inline function invokes local_irq_save() which is subject to
>     instrumentation _before_ the entry state for the instrumentation
>     mechanisms is established.
> 
>  3) That inline function invokes sync_core() before important state
> has
>     been established, which is especially interesting in NMI like
>     exceptions.
> 
> As you clearly documented why all of the above is safe and does not
> cause any problems, it's just me and Peter being silly, right?
> 
> Try again.

I don't think, accusations and mockery are really necessary here.

I am trying to do the right thing here. In particular, I am trying to
port the code that was developed on platforms that have not yet
implemented those useful instrumentation safety features of x86 arch
support. For most of the development time I had to figure out, where
the synchronization can be safely inserted into kernel entry code on
three platforms and tens of interrupt controller drivers, with some of
those presenting unusual exceptions (forgive me the pun) from platform-
wide conventions. I really appreciate the work you did cleaning up
kernel entry procedures, my 5.6 version of this patch had to follow a
much more complex and I would say, convoluted entry handling on x86,
and now I don't have to do that, thanks to you.

Unfortunately, most of my mental effort recently had to be spent on
three things:

1. (small): finding a way to safely enable events and synchronize state
on kernel entry, so it will not have a race condition between
isolation-breaking kernel entry and an event that was disabled while
the task was isolated.

2. (big): trying to derive any useful rules applicable to kernel entry
in various architectures, finding that there is very little consistency
across architectures, and whatever exists, can be broken by interrupt
controller drivers that don't all follow the same rules as the rest of
the platform.

3. (medium): introducing calls to synchronization on all kernel entry
procedures, in places where it is guaranteed to not normally yet have
done any calls to parts of the kernel that may be affected by "stale"
state, and do it in a manner as consistent and generalized as possible.

The current state of kernel entry handling on arm and arm64
architectures has significant differences from x86 and from each other.
There is also a matter of interrupt controllers. As can be seen in
interrupt controller-specific patch, I had to accommodate some variety
of custom interrupt entry code. What can not be seen, is that I had to
check that all other interrupt controller drivers and architecture-
specific entry procedures, and find that they _do_ follow some
understandable rules -- unfortunately architecture-specific and not
documented in any manner.

I have no valid reasons for complaining about it. I could not expect
that authors of all kernel entry procedures would have any
foreknowledge that someone at some point may have a reason to establish
any kind of synchronization point for CPU cores. And this is why I had
to do my research by manually drawing call trees and sequences,
separately for every entry on every supported architecture, and across
two or three versions of kernel, as those were changing along the way.

The result of this may be not a "design" per se, but an understanding
of how things are implemented, and what rules are being followed, so I
could add my code in a manner consistent with what is done, and
document the whole thing. Then there will be some written rules to
check for, when anything of this kind will be necessary again (say,
with TLB, but considering how much now is done in userspace, possibly
to accommodate more exotic CPU features that may have state messed up
by userspace). I am afraid, this task, kernel entry documentation,
would take me some time, and I did not want to delay my task isolation
patch for this reason.

As I followed whatever rules I have determined to be applicable, I have
produced code that introduces hooks in multiple seemingly unrelated to
each other places. Whenever there was a, forgive me the pun, exception
to those rules, another special case had to be handled.

So no, I did not just add entry hooks randomly, and your accusations of
having no design are unwarranted. My patches reflect what is already in
code and in its design, I have added one simple rule that entry hook
runs at the point when no dependency on something that requires
synchronization, exists yet. The entry hook is small, you have already
seen all of it while listing things that are not compatible with
noinst. Its mechanism and purpose are explained in general description
of task isolation. I don't think, I can provide a better explanation.

I have to apologize for not taking into account all your carefully
built instrumentation safety support. That was one thing I have missed.
However at this point the only way for me to find out that I am wrong
about it, and my code does not comply with expectations defined by
advanced state of x86 architecture development, was to present whatever
I could do right, based on experience with other platforms. I don't
think, this warrants such hostility.

Another issue that you have asked me to defend is the existence and
scope of task isolation itself. I have provided long explanation in
changelog and previous discussions of this patch, and before me so did
Chris Metcalf and occasionally Yuri Norov. While I understand that this
is an unusual feature and by its nature it affects kernel in multiple
places, it does not deserve to be called a "mess" and other synonyms of
"mess". It's an attempt to introduce a feature that turns Linux
userspace into superior replacement of RTOS. Considering current state
of CPU and SoC development, it is becoming very difficult even for
vendor-specific RTOS to keep up with advanced hardware features. Linux
keeps up with them just fine, however it lacks the ability to truly
leave the CPU core alone, to run the performance-critical and latency-
critical part of a task in a manner that RTOS user would expect. Very
close but not yet. Task isolation provides the last step for this RTOS
replacement. It is implemented in a manner that allows the user to
combine Linux resource handling and initialization with RTOS isolation
and latency. The decision about page faults is a part of this design,
as well as many other decisions implemented in this code. Many may
disagree with either those decisions, or the validity of a goal, some
may even argue that it's a bad thing to provide a reason to stop RTOS
development (I think, this is a good thing but that's not the point).

However most definitely this is not a "mess", and it I do not believe
that I have to defend the validity of this direction of development, or
be accused of general incompetence every time someone finds a
frustrating mistake in my code. As I said, I am trying to do the right
thing, and want to bring my code not only to the state where x86
support is on par with other platforms (that is, working when
instrumentation is disabled), but also make it fully compliant with
current requirements of x86 platform.

-- 
Alex