All of lore.kernel.org
 help / color / mirror / Atom feed
From: Marcelo Tosatti <mtosatti@redhat.com>
To: Thomas Gleixner <tglx@linutronix.de>,
	Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Christoph Lameter <cl@linux.com>,
	Matthew Wilcox <willy@infradead.org>,
	linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	Alex Belits <abelits@marvell.com>, Phil Auld <pauld@redhat.com>,
	Frederic Weisbecker <frederic@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] mm: introduce sysctl file to flush per-cpu vmstat statistics
Date: Wed, 2 Dec 2020 15:38:42 -0300	[thread overview]
Message-ID: <20201202183842.GC80090@fuller.cnet> (raw)
In-Reply-To: <87h7p4dwus.fsf@nanos.tec.linutronix.de>

On Wed, Dec 02, 2020 at 04:57:31PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 09:31, Christoph Lameter wrote:
> > On Fri, 27 Nov 2020, Marcelo Tosatti wrote:
> >
> >> Decided to switch to prctl interface, and then it starts
> >> to become similar to "task mode isolation" patchset API.
> >
> > Right I think that was a good approach.
> 
> prctl() is the right thing to do.
> 
> >> In addition to quiescing pending activities on the CPU, it would
> >> also be useful to assign a per-task attribute (which is then assigned
> >> to a per-CPU attribute), indicating whether that CPU is running
> >> an isolated task or not.
> >
> > Sounds good but what would this do? Give a warning like the isolation
> > patchset?
> 
> This all needs a lot more thought about the overall picture. We already
> have too many knobs and ad hoc hooks which fiddle with isolation.

Well this would mainly add "task->is_isolated" and therefore (when given
task is scheduled on CPU), "cpu->is_running_isolated_task". One can 
then use that information from the individual places in the kernel
to make decisions.

> The current CPU isolation is a best effort approach and I agree that for
> more strict isolation modes we need to be able to enforce that and hunt
> down offenders and think about them one by one.

What we see in practice, for vRAN scenarios is the following: There are 
per-application and even "per-installation" latency requirements (lets say,
considering a packet travelling a number of forwarding hosts). Depending on:

 * The end to end desired packet latency.

 * The number of links, and latency of each link.

One can give each forwarding node an "allowed interruption sum per fixed period". 

Or, for 4G/5G vRAN application requirement (due to HARQ response times,
see http://dl.ifip.org/db/conf/icin/icin2016/1570230300.pdf), so far that 
is 100us per 1ms, or 50us per 1ms (on the safe side of the 100us) 
going down to 20us per 1ms.

If the kernel can keep:

1) How long each interruption took, that happened in a given period.
2) If the sum of those interruptions crosses a given threshold.

It would be useful to get traces of those interruptions 
(tracing this manually, on the field, _after_ reports of
excessive latency on the application is quite difficult 
to do).

A scheme which parses trace entries in userspace would 
only trigger after processing the trace entries 
(and finding out about the over-threshold interruption).

So it would be:

	1) Asynchronous.
	2) Application would not be informed.

But it seems that a "0us tolerance with synchronous notifications" 
(task mode patchset) is also a valid use-case.

> >> To be called before real time loop, one would have:
> 
> Can we please agree in the first place, that "real time" is absolutely
> the wrong term here?
> 
> It's about running undisturbed CPU bound computations whatever nature
> they are. It does not matter whether that loop does busy polling ala
> DPDK, whether it runs a huge math computation on a data set or
> whatever people come up with.

Agree. Should say "interruption length".

> 
> >> 	prctl(PR_SET_TASK_ISOLATION, ISOLATION_ENABLE) [1]
> >> 	real time loop
> >> 	prctl(PR_SET_TASK_ISOLATION, ISOLATION_DISABLE)
> >>
> >> (with the attribute also being cleared on task exit).
> >>
> >> The general description would be:
> >>
> >> "Set task isolated mode for a given task, returning an error
> >> if the task is not pinned to a single CPU.
> 
> Plus returning an error if the task has no permissions to request
> this. This should not be an unprivileged prctl ever.
> 
> >> In this mode, the kernel will avoid interruptions to isolated
> >> CPUs when possible."
> >>
> >> Any objections against such an interface ?
> >
> > Maybe do both like in the isolation patchset?
> 
> We really want to define the scopes first. And here you go:
> 
> > Often code can tolerate a few interruptions (in some code branches
> > regular syscalls may be needed) but one wants the thread to be
> > as quiet as possible.
> 
> So you say some code can tolerate a few interrupts, then comes Alex and
> says 'no disturbance' at all.

Some setups can, some don't. 

> The point is that all of this shares the mechanisms to quiesce certain
> parts of the kernel so this wants to build common infrastructure and the
> prctl(ISOLATION, MODE) mode argument defines the scope of isolation
> which the task asks for and the infrastructure decides whether it can be
> granted and if so orchestrates the operation and provides a common
> infrastructure for instrumentation, violation monitoring etc.
> 
> We really need to stop to look at particular workloads and defining
> adhoc solutions tailored to their particular itch if we don't want to
> end up with an uncoordinated and unmaintainable zoo of interfaces, hooks
> and knobs.

Agree.

> Just looking at the problem at hand as an example. NOHZ already issues
> quiet_vmstat(), but it does not cancel already scheduled work. Now
> Marcelo wants a new mechanism which is supposed to cancel the work and
> then Alex want's to prevent it from being rescheduled. If that's not
> properly coordinated this goes down the drain very fast.
> 
> So can we please come up with a central place to handle this prctl()
> with a future proof argument list so the various isolation needs can be
> expressed as required?
> 
> That allows Marcelo to start tackling the vmstat side and Alex can
> utilize that and build the other parts into it piece by piece.
> 
> Thanks,
> 
>         tglx

Thanks Thomas.

I think need to discuss/think on the reporting via tracing
infrastructure too... Maybe Arnaldo has some ideas.

Alex how is this (in general) ? 
Should be able to enable the signalling with it, if i am not mistaken.

Will post a version with the vmstat sync shortly.

Index: linux-2.6-vmstat2/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6-vmstat2.orig/include/uapi/linux/prctl.h
+++ linux-2.6-vmstat2/include/uapi/linux/prctl.h
@@ -247,4 +247,9 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+/* Task isolation control */
+#define PR_TASK_ISOLATION_FEATURES	59
+#define PR_TASK_ISOLATION_GET		60
+#define PR_TASK_ISOLATION_SET		61
+
 #endif /* _LINUX_PRCTL_H */
Index: linux-2.6-vmstat2/kernel/sys.c
===================================================================
--- linux-2.6-vmstat2.orig/kernel/sys.c
+++ linux-2.6-vmstat2/kernel/sys.c
@@ -58,6 +58,7 @@
 #include <linux/sched/coredump.h>
 #include <linux/sched/task.h>
 #include <linux/sched/cputime.h>
+#include <linux/isolation.h>
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
@@ -2530,6 +2531,49 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
 
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_TASK_ISOLATION_FEATURES:
+		struct isolation_features ifeat;
+
+		memset(&ifeat, 0, sizeof(ifeat));
+
+		error = prctl_task_isolation_features(ifeat);
+		if (copy_to_user((char __user *)arg2, &ifeat, sizeof(ifeat)))
+			return -EFAULT;
+		break;
+
+	case PR_TASK_ISOLATION_SET:
+		struct isolation_control *icontrol;
+
+		icontrol = kzalloc(sizeof(struct isolation_control), GFP_KERNEL);
+		if (!icontrol)
+			return -ENOMEM;
+
+		if (copy_from_user(icontrol, (char __user *)arg2,
+				   sizeof(struct isolation_control))) {
+			kfree(icontrol);
+			return -EFAULT;
+		}
+
+		error = prctl_task_isolation_set(icontrol);
+		kfree(icontrol);
+		break;
+	case PR_TASK_ISOLATION_GET:
+		struct isolation_control *icontrol;
+
+		icontrol = kzalloc(sizeof(struct isolation_control), GFP_KERNEL);
+		if (!icontrol)
+			return -ENOMEM;
+
+		error = prctl_task_isolation_get(icontrol);
+
+		if (copy_to_user(icontrol, (char __user *)arg2,
+				   sizeof(struct isolation_control))) {
+			kfree(icontrol);
+			return -EFAULT;
+		}
+
+		kfree(icontrol);
+		break;
 	default:
 		error = -EINVAL;
 		break;
Index: linux-2.6-vmstat2/Documentation/userspace-api/task_isolation.rst
===================================================================
--- /dev/null
+++ linux-2.6-vmstat2/Documentation/userspace-api/task_isolation.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================
+Task isolation CPU interface
+============================
+
+The kernel might perform a number of activities in the background,
+on a given CPU, in the form of workqueues or interrupts.
+
+This interface allows userspace to indicate to the kernel when
+its running latency critical code. This allows the system to
+take preventive measures to avoid deferred actions and
+create a OS noise free environment for the application.
+
+Usage
+=====
+``PR_TASK_ISOLATION_FEATURES``:
+        Returns the supported features. Features are defined
+        at include/uapi/linux/isolation.h.
+
+        Usage::
+
+                prctl(PR_TASK_ISOLATION_FEATURES, ifeat, 0, 0, 0);
+
+        The 'ifeat' argument is a pointer to a struct isolation_features:
+
+                struct isolation_features {
+                        __u32   flags;
+                        __u32   pad[3];
+                };
+
+        Where flags contains bit sets for the features the kernel supports.
+
+        FIXME: describe error codes
+
+``PR_TASK_ISOLATION_SET``:
+        Enables task isolation mode. The type of task isolation is specified
+        via
+
+        struct isolation_control {
+                __u32   flags;
+                __u32   pad[15];
+        };
+
+``PR_TASK_ISOLATION_GET``:
+        Retrieves the currently configured task isolation mode and
+        parameters.
+
+
+(OUTDATED) After initialization, and before the latency sensitive loop, application can call:
+
+(OUTDATED)           prctl(PR_QUIESCE_CPU, 0, 0, 0, 0);
+
+Example
+=======
+
+The ``samples/quiesce_cpu/`` directory contains a sample
+application.
+
Index: linux-2.6-vmstat2/include/uapi/linux/isolation.h
===================================================================
--- /dev/null
+++ linux-2.6-vmstat2/include/uapi/linux/isolation.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/* For PR_TASK_ISOLATION_FEATURES */
+struct isolation_features {
+	__u32	flags;
+	__u32	pad[3];
+};
+
+/* For PR_TASK_ISOLATION_GET / PR_TASK_ISOLATION_SET */
+struct isolation_control {
+	__u32	flags;
+	__u32	pad[15];
+};
+
+/*
+ * Quiesce pending background activity, such as
+ * flushing per-CPU VM statistics.
+ */
+
+#define TASK_ISOLATION_QUIESCE_CPU 0x1
+
Index: linux-2.6-vmstat2/kernel/isolation.c
===================================================================
--- /dev/null
+++ linux-2.6-vmstat2/kernel/isolation.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Implementation of task isolation.
+ *
+ * Authors:
+ *   Chris Metcalf <cmetcalf@mellanox.com>
+ *   Alex Belits <abelits@marvell.com>
+ *   Yuri Norov <ynorov@marvell.com>
+ *   Marcelo Tosatti <mtosatti@redhat.com>
+ */
+
+#include <linux/sched.h>
+#include <linux/isolation.h>
+
+int prctl_task_isolation_features(struct isolation_features *ifeat)
+{
+	ifeat->flags = TASK_ISOLATION_QUIESCE_CPU;
+
+	return 0;
+}
+
+int prctl_task_isolation_set(struct isolation_control *control)
+{
+	return 0;
+}
+
+int prctl_task_isolation_get(struct isolation_control *control)
+{
+	return 0;
+}



  parent reply	other threads:[~2020-12-02 18:39 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-17 16:28 [PATCH] mm: introduce sysctl file to flush per-cpu vmstat statistics Marcelo Tosatti
2020-11-17 18:03 ` Matthew Wilcox
2020-11-17 19:06   ` Christopher Lameter
2020-11-17 19:09     ` Matthew Wilcox
2020-11-20 18:04       ` Christopher Lameter
2020-11-17 20:23     ` Marcelo Tosatti
2020-11-20 18:02       ` Marcelo Tosatti
2020-11-20 18:20       ` Christopher Lameter
2020-11-23 18:02         ` Marcelo Tosatti
2020-11-24 17:12         ` Vlastimil Babka
2020-11-24 19:52           ` Marcelo Tosatti
2020-11-27 15:48         ` Marcelo Tosatti
2020-11-28  3:49           ` [EXT] " Alex Belits
2020-11-30 18:18             ` Marcelo Tosatti
2020-11-30 18:29               ` Marcelo Tosatti
2020-12-03 22:47                 ` Alex Belits
2020-12-03 22:21               ` Alex Belits
2020-11-30  9:31           ` Christoph Lameter
2020-12-02 12:43             ` Marcelo Tosatti
2020-12-02 15:57             ` Thomas Gleixner
2020-12-02 17:43               ` Christoph Lameter
2020-12-03  3:17                 ` Thomas Gleixner
2020-12-07  8:08                   ` Christoph Lameter
2020-12-07 16:09                     ` Thomas Gleixner
2020-12-07 19:01                       ` Thomas Gleixner
2020-12-02 18:38               ` Marcelo Tosatti [this message]
2020-12-04  0:20                 ` Frederic Weisbecker
2020-12-04 13:31                   ` Marcelo Tosatti
2020-12-04  1:43               ` [EXT] " Alex Belits
2021-01-13 12:15                 ` [RFC] tentative prctl task isolation interface Marcelo Tosatti
2021-01-14  9:22                   ` Christoph Lameter
2021-01-14 19:34                     ` Marcelo Tosatti
2021-01-15 13:24                       ` Christoph Lameter
2021-01-15 18:35                         ` Alex Belits
2021-01-21 15:51                           ` Marcelo Tosatti
2021-01-21 16:20                             ` Marcelo Tosatti
2021-01-22 13:05                               ` Marcelo Tosatti
2021-02-01 10:48                             ` Christoph Lameter
2021-02-01 12:47                               ` Alex Belits
2021-02-01 18:20                               ` Marcelo Tosatti
2021-01-18 15:18                         ` Marcelo Tosatti
2020-11-24  5:02 ` [mm] e655d17ffa: BUG:using_smp_processor_id()in_preemptible kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201202183842.GC80090@fuller.cnet \
    --to=mtosatti@redhat.com \
    --cc=abelits@marvell.com \
    --cc=acme@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=cl@linux.com \
    --cc=frederic@kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.