* [PATCH RFC] sched: deferred set priority (dprio)
@ 2014-07-25 19:45 Sergey Oboguev
  2014-07-25 20:12 ` Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Sergey Oboguev @ 2014-07-25 19:45 UTC (permalink / raw)
  To: linux-kernel

[This is a repost of the message from a few days ago, with the patch file
inline instead of referenced by a URL.]

This patch is intended to improve support for fine-grained parallel
applications that may need to change the priority of their threads at a very
high rate, hundreds or even thousands of times per scheduling timeslice.

These are typically applications that must execute short or very short
lock-holding critical or otherwise time-urgent sections of code at a very high
frequency and need to protect these sections with "set priority" system calls:
one call to elevate the current thread's priority before entering the critical
or time-urgent section, followed by another call to downgrade it when the
section completes. Because these sections are entered and left at such a high
frequency, the cost of the "set priority" system calls can rise to a
noticeable fraction of an application's overall CPU time. The proposed
"deferred set priority" facility largely eliminates the cost of these system
calls.
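
For illustration, the conventional pattern costs two "set priority" system
calls per section, roughly as sketched below (a sketch only: it assumes a
sched_setattr(2) wrapper, which glibc does not provide, so real code would
invoke the system call via syscall(2)):

    struct sched_attr hi = { .size = sizeof(hi),
                             .sched_policy = SCHED_FIFO,
                             .sched_priority = 50 };
    struct sched_attr lo = { .size = sizeof(lo),
                             .sched_policy = SCHED_OTHER };

    sched_setattr(0, &hi, 0);   /* system call: elevate before the section */
    /* ... short lock-holding critical or time-urgent section ... */
    sched_setattr(0, &lo, 0);   /* system call: downgrade afterwards */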

Instead of executing a system call to elevate its thread priority, an
application simply writes the desired priority level to a designated memory
location in userspace. When the kernel attempts to preempt the thread, it
first checks the contents of this location; if the application has posted a
request to change its priority there, the kernel executes the request and
alters the priority of the thread being preempted before making the
rescheduling decision, which is then based on the new thread priority level.
This implements the priority protection of the critical or time-urgent
section that the application asked for. In the predominant number of cases,
however, the application completes the critical section before the end of the
current timeslice and cancels or alters the request held in the userspace
area. Thus the vast majority of an application's priority-change requests are
handled, mutually cancelled, or coalesced within userspace, at very low
overhead and without incurring the cost of a system call, while safe
preemption control is still maintained. The cost of an actual kernel-level
"set priority" operation is incurred only if the application is actually
preempted while inside the critical section, i.e. typically at most once per
scheduling timeslice, instead of the hundreds or thousands of "set priority"
system calls in the same timeslice.

One of the intended purposes of this facility (though not its sole purpose) is
to provide a lightweight mechanism for priority protection of lock-holding
critical sections that is an adequate match for lightweight locking primitives
such as futex, with both featuring a fast path that completes within
userspace.
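
With hypothetical wrapper names (dprio_set() posting the userspace request as
sketched above, futex_lock()/futex_unlock() being an ordinary futex-based
lock), the intended pairing is:

    dprio_set(ELEVATED);        /* userspace write, deferred        */
    futex_lock(&lock);          /* fast path: atomic op, no syscall */
    /* ... critical section ... */
    futex_unlock(&lock);        /* fast path: atomic op, no syscall */
    dprio_set(BASE);            /* usually just cancels the request */

In the common case neither the lock nor the priority change ever leaves
userspace.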

A more detailed description can be found at:
https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt

The patch is currently based on linux-3.15.2.

User-level library implementing userspace-side boilerplate code:
https://github.com/oboguev/dprio/tree/master/src/userlib

Test set:
https://github.com/oboguev/dprio/tree/master/src/test

The patch is enabled with CONFIG_DEFERRED_SETPRIO.
There are also a few other config settings: a setting for the dprio debug
code, a setting that controls the initial value of the authorization list
restricting the use of the facility by user or group id, and a setting that
improves the determinism of the rescheduling latency when a dprio request is
pending under low-memory conditions. Please see dprio.txt for details.
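
As an example of the authorization list (using the format described in
include/linux/authlist.h below), writing the string "uid:1000 gid:27 nobody"
to /proc/sys/kernel/dprio_authlist would authorize uid 1000 and members of
gid 27 while denying everyone else; the list must terminate with an
"everybody" or "nobody" entry.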

Comments would be appreciated.

Thanks,
Sergey

Signed-off-by: Sergey Oboguev <oboguev@yahoo.com>
---
 fs/exec.c                  |    8 +
 fs/proc/Makefile           |    1 +
 fs/proc/authlist.c         |  493 +++++++++++++++++++++++++++++
 include/linux/authlist.h   |  114 +++++++
 include/linux/dprio.h      |  130 ++++++++
 include/linux/init_task.h  |   17 +
 include/linux/sched.h      |   15 +
 include/uapi/linux/prctl.h |    2 +
 init/Kconfig               |    2 +
 kernel/Kconfig.dprio       |   81 +++++
 kernel/exit.c              |    6 +
 kernel/fork.c              |   87 +++++-
 kernel/sched/Makefile      |    1 +
 kernel/sched/core.c        |  200 +++++++++++-
 kernel/sched/dprio.c       |  734 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sys.c               |    6 +
 kernel/sysctl.c            |   11 +
 17 files changed, 1897 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 238b7aa..9f5b649 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -56,6 +56,7 @@
 #include <linux/pipe_fs_i.h>
 #include <linux/oom.h>
 #include <linux/compat.h>
+#include <linux/dprio.h>

 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
@@ -1433,6 +1434,7 @@ static int do_execve_common(struct filename *filename,
        struct file *file;
        struct files_struct *displaced;
        int retval;
+       struct dprio_saved_context dprio_context;

        if (IS_ERR(filename))
                return PTR_ERR(filename);
@@ -1483,6 +1485,9 @@ static int do_execve_common(struct filename *filename,
        if (retval)
                goto out_unmark;

+       dprio_handle_request();
+       dprio_save_reset_context(&dprio_context);
+
        bprm->argc = count(argv, MAX_ARG_STRINGS);
        if ((retval = bprm->argc) < 0)
                goto out;
@@ -1521,6 +1526,7 @@ static int do_execve_common(struct filename *filename,
        putname(filename);
        if (displaced)
                put_files_struct(displaced);
+       dprio_free_context(&dprio_context);
        return retval;

 out:
@@ -1529,6 +1535,8 @@ out:
                mmput(bprm->mm);
        }

+       dprio_restore_context(&dprio_context);
+
 out_unmark:
        current->fs->in_exec = 0;
        current->in_execve = 0;
diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 239493e..7d55986 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -29,3 +29,4 @@ proc-$(CONFIG_PROC_KCORE)     += kcore.o
 proc-$(CONFIG_PROC_VMCORE)     += vmcore.o
 proc-$(CONFIG_PRINTK)  += kmsg.o
 proc-$(CONFIG_PROC_PAGE_MONITOR)       += page.o
+proc-$(CONFIG_DEFERRED_SETPRIO) += authlist.o
diff --git a/fs/proc/authlist.c b/fs/proc/authlist.c
new file mode 100644
index 0000000..b6f1fbe
--- /dev/null
+++ b/fs/proc/authlist.c
@@ -0,0 +1,493 @@
+/*
+ * fs/proc/authlist.c
+ *
+ * Authorization list.
+ *
+ * Started by (C) 2014 Sergey Oboguev <oboguev@yahoo.com>
+ *
+ * This code is licensed under the GPL version 2 or later.
+ * For details see linux-kernel-base/COPYING.
+ */
+
+#include <linux/types.h>
+#include <linux/ctype.h>
+#include <linux/unistd.h>
+#include <linux/stddef.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/compiler.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/cred.h>
+#include <linux/sysctl.h>
+#include <linux/authlist.h>
+
+#define error_out(rc)  do { error = (rc);  goto out; } while (0)
+
+static const char tag_uid[] = "uid";
+static const char tag_nouid[] = "nouid";
+static const char tag_gid[] = "gid";
+static const char tag_nogid[] = "nogid";
+static const char tag_everybody[] = "everybody";
+static const char tag_nobody[] = "nobody";
+
+static inline bool is_ws(char c)
+{
+       return c == ' ' || c == '\t' || c == '\r' || c == '\n';
+}
+
+static inline bool is_ws_eos(char c)
+{
+       return is_ws(c) || c == '\0';
+}
+
+/* count whitespace-separated entries in the descriptor string */
+static int count_entries(const char *desc)
+{
+       const char *p = desc;
+       int nentries = 0;
+
+       for (;;) {
+               /* skip leading whitespace */
+               while (is_ws(*p))
+                       p++;
+
+               /* reached the end of the string? */
+               if (*p == '\0')
+                       break;
+
+               /* detected non-ws section */
+               nentries++;
+
+               /* skip non-ws section */
+               while (!is_ws_eos(*p))
+                       p++;
+       }
+
+       return nentries;
+}
+
+static inline bool istag(const char **ep, const char *tag)
+{
+       int len = strlen(tag);
+       const char *p = *ep;
+
+       if (0 == strncmp(p, tag, len)) {
+               if (is_ws_eos(p[len])) {
+                       *ep += len;
+                       return true;
+               }
+       }
+
+       return false;
+}
+
+static inline bool istag_col(const char **ep, const char *tag)
+{
+       int len = strlen(tag);
+       const char *p = *ep;
+
+       if (0 == strncmp(p, tag, len) && p[len] == ':') {
+               *ep += len + 1;
+               return true;
+       }
+
+       return false;
+}
+
+static int parse_id(const char **ep, struct authlist_entry *entry,
+                   enum authlist_kind kind)
+{
+       struct user_namespace *ns = current_user_ns();
+       /* decimal representation of 32-bit number fits in 10 chars */
+       char sval[11];
+       const char *p = *ep;
+       char *xp = sval;
+       int error;
+       uid_t uid;
+       gid_t gid;
+
+       while (isdigit(*p)) {
+               if (xp - sval >= sizeof(sval) - 1)
+                       return -EINVAL;
+               *xp++ = *p++;
+       }
+       *xp = '\0';
+       if (!sval[0] || !is_ws_eos(*p))
+               return -EINVAL;
+
+       switch (kind) {
+       case AUTHLIST_KIND_UID:
+       case AUTHLIST_KIND_NOUID:
+               error = kstrtouint(sval, 10, &uid);
+               if (error)
+                       return error;
+               entry->kuid = make_kuid(ns, uid);
+               if (!uid_valid(entry->kuid))
+                       return -EINVAL;
+               break;
+
+       case AUTHLIST_KIND_GID:
+       case AUTHLIST_KIND_NOGID:
+               error = kstrtouint(sval, 10, &gid);
+               if (error)
+                       return error;
+               entry->kgid = make_kgid(ns, gid);
+               if (!gid_valid(entry->kgid))
+                       return -EINVAL;
+               break;
+
+       default:
+               return -EINVAL;
+       }
+
+       entry->kind = kind;
+       *ep = p;
+
+       return 0;
+}
+
+static int parse_entry(const char **ep, struct authlist_entry *entry)
+{
+       if (istag(ep, tag_everybody))
+               entry->kind = AUTHLIST_KIND_EVERYBODY;
+       else if (istag(ep, tag_nobody))
+               entry->kind = AUTHLIST_KIND_NOBODY;
+       else if (istag_col(ep, tag_uid))
+               return parse_id(ep, entry, AUTHLIST_KIND_UID);
+       else if (istag_col(ep, tag_nouid))
+               return parse_id(ep, entry, AUTHLIST_KIND_NOUID);
+       else if (istag_col(ep, tag_gid))
+               return parse_id(ep, entry, AUTHLIST_KIND_GID);
+       else if (istag_col(ep, tag_nogid))
+               return parse_id(ep, entry, AUTHLIST_KIND_NOGID);
+       else
+               return -EINVAL;
+
+       return 0;
+}
+
+/*
+ * Import authlist from the userspace
+ */
+static int write_authlist(struct authlist *authlist, void __user *buffer,
+                         size_t *lenp, loff_t *ppos)
+{
+       struct authlist_entry *entries = NULL, *old_entries;
+       char *memblk = NULL;
+       int error = 0;
+       int nentries;
+       int ne;
+       int terminal = -1;
+       const char *p;
+
+       /* ensure atomic transfer */
+       if (*ppos != 0)
+               return -EINVAL;
+
+       if (*lenp > AUTHLIST_LENGTH_LIMIT)
+               return -EINVAL;
+
+       memblk = kmalloc(*lenp + 1, GFP_KERNEL);
+       if (memblk == NULL)
+               return -ENOMEM;
+
+       if (copy_from_user(memblk, buffer, *lenp))
+               error_out(-EFAULT);
+
+       memblk[*lenp] = '\0';
+
+       nentries = count_entries(memblk);
+       if (nentries == 0)
+               error_out(-EINVAL);
+
+       entries = kmalloc(sizeof(struct authlist_entry) * nentries, GFP_KERNEL);
+       if (entries == NULL)
+               error_out(-ENOMEM);
+
+       for (p = memblk, ne = 0;; ne++) {
+               /* skip leading whitespace */
+               while (is_ws(*p))
+                       p++;
+
+               /* reached the end of the string? */
+               if (*p == '\0')
+                       break;
+
+               error = parse_entry(&p, entries + ne);
+               if (error)
+                       goto out;
+
+               switch (entries[ne].kind) {
+               case AUTHLIST_KIND_EVERYBODY:
+               case AUTHLIST_KIND_NOBODY:
+                       if (terminal != -1)
+                               error_out(-EINVAL);
+                       terminal = ne;
+                       break;
+
+               default:
+                       break;
+               }
+       }
+
+       /*
+        * Last entry must be everybody/nobody.
+        * Intermediate entry cannot be everybody/nobody.
+        */
+       if (terminal != nentries - 1)
+               error_out(-EINVAL);
+
+       down_write(&authlist->rws);
+       old_entries = authlist->entries;
+       authlist->nentries = nentries;
+       authlist->entries = entries;
+       up_write(&authlist->rws);
+
+       kfree(old_entries);
+       entries = NULL;
+
+       *ppos += *lenp;
+
+out:
+
+       kfree(memblk);
+       kfree(entries);
+
+       return error;
+}
+
+/*
+ * Export authlist to the userspace
+ */
+static int read_authlist(struct authlist *authlist,
+                        void __user *buffer,
+                        size_t *lenp, loff_t *ppos)
+{
+       struct user_namespace *ns = current_user_ns();
+       char *memblk = NULL;
+       char *vp = NULL;
+       int error = 0;
+       int len;
+       uid_t uid;
+       gid_t gid;
+
+       down_read(&authlist->rws);
+
+       if (authlist->nentries == 0) {
+               switch (authlist->initial_value) {
+               case AUTHLIST_KIND_EVERYBODY:
+                       vp = (char *) tag_everybody;
+                       break;
+
+               case AUTHLIST_KIND_NOBODY:
+               default:
+                       vp = (char *) tag_nobody;
+                       break;
+               }
+       } else {
+               struct authlist_entry *entry;
+               /* worst case: <space>nouid:4294967295 */
+               size_t maxentrysize = 1 + 6 + 1 + 10;
+               size_t alloc_size = maxentrysize * authlist->nentries + 1;
+               int ne;
+
+               memblk = kmalloc(alloc_size, GFP_KERNEL);
+               if (memblk == NULL) {
+                       up_read(&authlist->rws);
+                       return -ENOMEM;
+               }
+
+               vp = memblk;
+               *vp = '\0';
+               entry = authlist->entries;
+               for (ne = 0;  ne < authlist->nentries;  ne++, entry++) {
+                       vp += strlen(vp);
+                       if (ne != 0)
+                               *vp++ = ' ';
+                       switch (entry->kind) {
+                       case AUTHLIST_KIND_UID:
+                               uid = from_kuid(ns, entry->kuid);
+                               if (uid == (uid_t) -1) {
+                                       error = -EIDRM;
+                                       break;
+                               }
+                               sprintf(vp, "%s:%u", tag_uid, (unsigned) uid);
+                               break;
+
+                       case AUTHLIST_KIND_NOUID:
+                               uid = from_kuid(ns, entry->kuid);
+                               if (uid == (uid_t) -1) {
+                                       error = -EIDRM;
+                                       break;
+                               }
+                               sprintf(vp, "%s:%u", tag_nouid, (unsigned) uid);
+                               break;
+
+                       case AUTHLIST_KIND_GID:
+                               gid = from_kgid(ns, entry->kgid);
+                               if (gid == (gid_t) -1) {
+                                       error = -EIDRM;
+                                       break;
+                               }
+                               sprintf(vp, "%s:%u", tag_gid, (unsigned) gid);
+                               break;
+
+                       case AUTHLIST_KIND_NOGID:
+                               gid = from_kgid(ns, entry->kgid);
+                               if (gid == (gid_t) -1) {
+                                       error = -EIDRM;
+                                       break;
+                               }
+                               sprintf(vp, "%s:%u", tag_nogid, (unsigned) gid);
+                               break;
+
+                       case AUTHLIST_KIND_EVERYBODY:
+                               strcpy(vp, tag_everybody);
+                               break;
+
+                       case AUTHLIST_KIND_NOBODY:
+                               strcpy(vp, tag_nobody);
+                               break;
+                       }
+
+                       if (unlikely(error != 0)) {
+                               up_read(&authlist->rws);
+                               kfree(memblk);
+                               return error;
+                       }
+               }
+
+               vp = memblk;
+       }
+
+       up_read(&authlist->rws);
+
+       len = strlen(vp);
+
+       /* ensure atomic transfer */
+       if (*ppos != 0) {
+               if (*ppos == len + 1) {
+                       *lenp = 0;
+                       goto out;
+               }
+               error_out(-EINVAL);
+       }
+
+       if (len + 2 > *lenp)
+               error_out(-ETOOSMALL);
+
+       if (likely(len) && copy_to_user(buffer, vp, len))
+               error_out(-EFAULT);
+
+       if (copy_to_user(buffer + len, "\n", 2))
+               error_out(-EFAULT);
+
+       *lenp = len + 1;
+       *ppos += len + 1;
+
+out:
+
+       kfree(memblk);
+
+       return error;
+}
+
+/*
+ * proc_doauthlist - read or write authorization list
+ * @table: the sysctl table
+ * @write: true if this is a write to the sysctl file
+ * @buffer: the user buffer
+ * @lenp: the size of the user buffer
+ * @ppos: file position
+ *
+ * Reads/writes an authorization list as a string from/to the user buffer.
+ *
+ * On struct authlist -> userspace string read, if the user buffer provided
+ * is not large enough to hold the string atomically, an error will be
+ * returned. The copied string will include '\n' and is NUL-terminated.
+ *
+ * On userspace string -> struct authlist write, if the user buffer does not
+ * contain a valid string form of an authorization list transferred
+ * atomically, or if the descriptor is malformed, an error will be returned.
+ *
+ * Returns 0 on success.
+ */
+int proc_doauthlist(struct ctl_table *table, int write,
+                   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+       struct authlist *authlist = (struct authlist *) table->data;
+
+       if (write)
+               return write_authlist(authlist, buffer, lenp, ppos);
+       else
+               return read_authlist(authlist, buffer, lenp, ppos);
+}
+
+static bool in_egroup(const struct cred *cred, kgid_t kgid)
+{
+       if (gid_eq(cred->egid, kgid))
+               return true;
+
+       return groups_search(cred->group_info, kgid);
+}
+
+/*
+ * Check if @authlist permits the caller with @cred credentials to perform the
+ * operation guarded by the @authlist.
+ */
+int authlist_check_permission(struct authlist *authlist,
+                             const struct cred *cred)
+{
+       struct authlist_entry *entry;
+       int ne, error = 0;
+
+       down_read(&authlist->rws);
+
+       if (authlist->nentries == 0) {
+               if (authlist->initial_value == AUTHLIST_KIND_EVERYBODY)
+                       error_out(0);
+               error_out(-EPERM);
+       }
+
+       entry = authlist->entries;
+
+       for (ne = 0;  ne < authlist->nentries;  ne++, entry++) {
+               switch (entry->kind) {
+               case AUTHLIST_KIND_UID:
+                       if (uid_eq(entry->kuid, cred->euid))
+                               error_out(0);
+                       break;
+
+               case AUTHLIST_KIND_NOUID:
+                       if (uid_eq(entry->kuid, cred->euid))
+                               error_out(-EPERM);
+                       break;
+
+               case AUTHLIST_KIND_GID:
+                       if (in_egroup(cred, entry->kgid))
+                               error_out(0);
+                       break;
+
+               case AUTHLIST_KIND_NOGID:
+                       if (in_egroup(cred, entry->kgid))
+                               error_out(-EPERM);
+                       break;
+
+               case AUTHLIST_KIND_EVERYBODY:
+                       error_out(0);
+
+               case AUTHLIST_KIND_NOBODY:
+                       error_out(-EPERM);
+               }
+       }
+
+out:
+
+       up_read(&authlist->rws);
+
+       return error;
+}
+
diff --git a/include/linux/authlist.h b/include/linux/authlist.h
new file mode 100644
index 0000000..f270644
--- /dev/null
+++ b/include/linux/authlist.h
@@ -0,0 +1,114 @@
+/*
+ * include/linux/authlist.h
+ *
+ * Authorization list.
+ *
+ * Started by (C) 2014 Sergey Oboguev <oboguev@yahoo.com>
+ *
+ * This code is licensed under the GPL version 2 or later.
+ * For details see linux-kernel-base/COPYING.
+ */
+
+#ifndef _LINUX_AUTHLIST_H
+#define _LINUX_AUTHLIST_H
+
+#include <linux/uidgid.h>
+#include <linux/rwsem.h>
+
+/*
+ * String representation of authorization list is a sequence of
+ * whitespace-separated entries in the format
+ *
+ *     uid:<numeric-uid>
+ *     gid:<numeric-gid>
+ *     nouid:<numeric-uid>
+ *     nogid:<numeric-gid>
+ *     everybody
+ *     nobody
+ *
+ * For instance:
+ *
+ *     uid:47  uid:100  gid:12  nobody
+ * or
+ *
+ *     nogid:300  everybody
+ *
+ * Terminal entry must be either "nobody" or "everybody".
+ */
+
+/*
+ * Define types of entries in the list.
+ *
+ * AUTHLIST_KIND_EVERYBODY or AUTHLIST_KIND_NOBODY must be the
+ * terminal entry.
+ */
+enum authlist_kind {
+       AUTHLIST_KIND_UID = 0,          /* allow UID */
+       AUTHLIST_KIND_GID,              /* allow GID */
+       AUTHLIST_KIND_NOUID,            /* disallow UID */
+       AUTHLIST_KIND_NOGID,            /* disallow GID */
+       AUTHLIST_KIND_EVERYBODY,        /* allow everybody */
+       AUTHLIST_KIND_NOBODY            /* disallow everybody */
+};
+
+struct authlist_entry {
+       enum authlist_kind kind;
+       union {
+               kuid_t  kuid;
+               kgid_t  kgid;
+       };
+};
+
+/*
+ * @rws                        rw semaphore to synchronize access to the structure
+ *
+ * @initial_value      used only if @nentries is 0, can be either
+ *                     AUTHLIST_KIND_EVERYBODY or AUTHLIST_KIND_NOBODY
+ *
+ * @nentries           count of entries, 0 means use @initial_value
+ *
+ * @entries            array of authlist_entry structures,
+ *                     size of the array is given by @nentries
+ */
+struct authlist {
+       struct rw_semaphore rws;
+       enum authlist_kind initial_value;
+       int nentries;
+       struct authlist_entry *entries;
+};
+
+
+#define AUTHLIST_INITIALIZER(name, _initial_value)     \
+{                                                      \
+       .rws = __RWSEM_INITIALIZER(name.rws),           \
+       .initial_value = (_initial_value),              \
+       .nentries = 0,                                  \
+       .entries = NULL                                 \
+}
+
+/*
+ * Maximum authlist string length limit.
+ *
+ * Imposed to prevent malicious attempts to cause excessive memory allocation
+ * by using insanely long authlist strings.
+ */
+#define AUTHLIST_LENGTH_LIMIT  (1024 * 32)
+
+
+/*
+ * sysctl routine to read-in the authlist from the userspace
+ * and write it out to the userspace
+ */
+int proc_doauthlist(struct ctl_table *table, int write,
+                   void __user *buffer, size_t *lenp, loff_t *ppos);
+
+
+/*
+ * Check if @authlist permits the caller with credentials @cred to perform
+ * the operation guarded by the @authlist.
+ */
+int authlist_check_permission(struct authlist *authlist,
+                             const struct cred *cred);
+
+#endif /* _LINUX_AUTHLIST_H */
+
diff --git a/include/linux/dprio.h b/include/linux/dprio.h
new file mode 100644
index 0000000..1118fdf
--- /dev/null
+++ b/include/linux/dprio.h
@@ -0,0 +1,130 @@
+/*
+ * include/linux/dprio.h
+ *
+ * Deferred set priority.
+ *
+ * Started by (C) 2014 Sergey Oboguev <oboguev@yahoo.com>
+ *
+ * This code is licensed under the GPL version 2 or later.
+ * For details see linux-kernel-base/COPYING.
+ */
+
+#ifndef _LINUX_DPRIO_H
+#define _LINUX_DPRIO_H
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/authlist.h>
+
+#ifdef CONFIG_DEFERRED_SETPRIO
+
+/*
+ * @mask contains bit-flags indicating which policies have been pre-approved.
+ * Other fields are valid only if the corresponding bit is set in the @mask.
+ */
+static __always_inline void __dprio_info_assumptions(void)
+{
+       /* SCHED_xxx is used as a bit index in @mask */
+       BUILD_BUG_ON(SCHED_NORMAL > 31);
+       BUILD_BUG_ON(SCHED_FIFO > 31);
+       BUILD_BUG_ON(SCHED_RR > 31);
+       BUILD_BUG_ON(SCHED_BATCH > 31);
+       BUILD_BUG_ON(SCHED_IDLE > 31);
+}
+struct dprio_info {
+       unsigned mask;
+       s32 normal_sched_nice;
+       s32 batch_sched_nice;
+       u32 fifo_sched_priority;
+       u32 rr_sched_priority;
+       bool capable_sys_nice;
+};
+
+/*
+ * Called by dup_task_struct to reset non-inherited fields
+ */
+static __always_inline void set_task_in_dprio(struct task_struct *tsk,
+                                             bool in_dprio)
+{
+#ifdef CONFIG_DEBUG_DEFERRED_SETPRIO
+       tsk->in_dprio = in_dprio;
+#endif
+}
+
+static inline void dprio_dup_task_struct(struct task_struct *tsk)
+{
+       /* reset deferred setprio fields not inherited from the parent */
+       tsk->dprio_ku_area_pp = NULL;
+       tsk->dprio_info = NULL;
+       set_task_in_dprio(tsk, false);
+}
+
+void dprio_detach(struct task_struct *tsk);
+void dprio_handle_request(void);
+bool dprio_check_for_request(struct task_struct *prev);
+long dprio_prctl(int option, unsigned long a2, unsigned long a3,
+                unsigned long a4, unsigned long a5);
+
+struct dprio_saved_context {
+       struct dprio_ku_area __user * __user *dprio_ku_area_pp;
+       struct dprio_info *dprio_info;
+};
+
+static inline void dprio_save_reset_context(struct dprio_saved_context *saved)
+{
+       saved->dprio_ku_area_pp = current->dprio_ku_area_pp;
+       saved->dprio_info = current->dprio_info;
+
+       if (unlikely(saved->dprio_ku_area_pp)) {
+               preempt_disable();
+               current->dprio_ku_area_pp = NULL;
+               current->dprio_info = NULL;
+               preempt_enable();
+       }
+}
+
+static inline void dprio_restore_context(struct dprio_saved_context *saved)
+{
+       if (unlikely(saved->dprio_ku_area_pp)) {
+               preempt_disable();
+               current->dprio_ku_area_pp = saved->dprio_ku_area_pp;
+               current->dprio_info = saved->dprio_info;
+               preempt_enable();
+       }
+}
+
+static inline void dprio_free_context(struct dprio_saved_context *saved)
+{
+       if (unlikely(saved->dprio_info))
+               kfree(saved->dprio_info);
+}
+
+#ifdef CONFIG_DEFERRED_SETPRIO_ALLOW_EVERYBODY
+  #define DPRIO_AUTHLIST_INITIAL_VALUE  AUTHLIST_KIND_EVERYBODY
+#else
+  #define DPRIO_AUTHLIST_INITIAL_VALUE  AUTHLIST_KIND_NOBODY
+#endif
+
+extern struct authlist dprio_authlist;
+
+int dprio_check_permission(void);
+
+#else /* ndef CONFIG_DEFERRED_SETPRIO */
+
+static inline void set_task_in_dprio(struct task_struct *tsk, bool in_dprio) {}
+static inline void dprio_dup_task_struct(struct task_struct *tsk) {}
+static inline void dprio_detach(struct task_struct *tsk) {}
+static inline void dprio_handle_request(void) {}
+
+struct dprio_saved_context {
+       char dummy[0];          /* suppress compiler warning */
+};
+
+static inline void dprio_save_reset_context(struct dprio_saved_context *saved) {}
+static inline void dprio_restore_context(struct dprio_saved_context *saved) {}
+static inline void dprio_free_context(struct dprio_saved_context *saved) {}
+
+#endif /* CONFIG_DEFERRED_SETPRIO */
+
+#endif /* _LINUX_DPRIO_H */
+
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6df7f9f..bdc6767 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -164,6 +164,22 @@ extern struct task_group root_task_group;
 # define INIT_RT_MUTEXES(tsk)
 #endif

+#ifdef CONFIG_DEBUG_DEFERRED_SETPRIO
+# define INIT_DEFERRED_SETPRIO_DEBUG                                   \
+       .in_dprio = false,
+#else
+# define INIT_DEFERRED_SETPRIO_DEBUG
+#endif
+
+#ifdef CONFIG_DEFERRED_SETPRIO
+# define INIT_DEFERRED_SETPRIO                                         \
+       .dprio_ku_area_pp = NULL,                                       \
+       .dprio_info = NULL,                                             \
+       INIT_DEFERRED_SETPRIO_DEBUG
+#else
+# define INIT_DEFERRED_SETPRIO
+#endif
+
 /*
  *  INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -234,6 +250,7 @@ extern struct task_group root_task_group;
        INIT_CPUSET_SEQ(tsk)                                            \
        INIT_RT_MUTEXES(tsk)                                            \
        INIT_VTIME(tsk)                                                 \
+       INIT_DEFERRED_SETPRIO                                           \
 }


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 221b2bd..eacf48f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1610,6 +1610,16 @@ struct task_struct {
        unsigned int    sequential_io;
        unsigned int    sequential_io_avg;
 #endif
+#ifdef CONFIG_DEFERRED_SETPRIO
+       struct dprio_ku_area __user * __user *dprio_ku_area_pp;
+       struct dprio_info *dprio_info;
+#endif
+#ifdef CONFIG_PUT_TASK_TIMEBOUND
+       struct work_struct put_task_work;
+#endif
+#ifdef CONFIG_DEBUG_DEFERRED_SETPRIO
+       bool in_dprio;
+#endif
 };

 /* Future-safe accessor for struct task_struct's cpus_allowed. */
@@ -2150,6 +2160,11 @@ extern int sched_setscheduler_nocheck(struct task_struct *, int,
                                      const struct sched_param *);
 extern int sched_setattr(struct task_struct *,
                         const struct sched_attr *);
+extern int sched_setattr_precheck(struct task_struct *p,
+                                 const struct sched_attr *attr);
+extern int sched_setattr_prechecked(struct task_struct *p,
+                                   const struct sched_attr *attr,
+                                   bool merge_reset_on_fork);
 extern struct task_struct *idle_task(int cpu);
 /**
  * is_idle_task - is the specified task an idle task?
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 58afc04..3513db5 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -152,4 +152,6 @@
 #define PR_SET_THP_DISABLE     41
 #define PR_GET_THP_DISABLE     42

+#define PR_SET_DEFERRED_SETPRIO        43
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..fe20a45 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1886,3 +1886,5 @@ config ASN1
          functions to call on what tags.

 source "kernel/Kconfig.locks"
+source "kernel/Kconfig.dprio"
+
diff --git a/kernel/Kconfig.dprio b/kernel/Kconfig.dprio
new file mode 100644
index 0000000..2c83cf0
--- /dev/null
+++ b/kernel/Kconfig.dprio
@@ -0,0 +1,81 @@
+menuconfig DEFERRED_SETPRIO
+       bool "Enable deferred setting of task priority"
+       default n
+       help
+         Enabling this option allows authorized applications to use
+         PR_SET_DEFERRED_SETPRIO request in prctl system call.
+
+         Applications that change task priority with very high frequency can
+         benefit from using this facility as long as they are specifically
+         implemented to use prctl(PR_SET_DEFERRED_SETPRIO). If the system is
+         not intended to run such applications, there is no benefit in
+         selecting this option.
+
+         The downside of selecting this option is slightly increased latency
+         in task switching, only in the case when a deferred set priority
+         request by the previous task is pending at task switch time. The
+         added delay in the task context switch in this case is on the order
+         of 1 usec (the typical time for executing a deferred sched_setattr
+         system call), which normally is not significant, but may be a
+         consideration in a system intended for hard real-time use.
+
+         If unsure, say N.
+
+if DEFERRED_SETPRIO
+
+config PUT_TASK_TIMEBOUND
+       bool "Deterministic task switch latency when deferred set task priority is used"
+       depends on DEFERRED_SETPRIO && RT_MUTEXES
+       default n
+       help
+         Enabling this option ensures deterministic, time-bound task switch
+         latency in the case when a deferred set priority request is pending
+         at task rescheduling and switch time and the processing of this
+         request causes an adjustment of the priority inheritance chain
+         under very low memory conditions (depleted atomic pool).
+
+         Select Y if building the kernel for a hard real-time system that
+         requires determinism in task switch latency. Select N for a
+         general-purpose desktop or server system.
+
+         This option has a memory cost of about 20-40 bytes per running task
+         in the system.
+
+config DEBUG_DEFERRED_SETPRIO
+       bool "Enable debugging code for deferred task priority setting"
+       depends on DEFERRED_SETPRIO
+       default n
+       help
+         Enable debugging code for DEFERRED_SETPRIO.
+
+         If unsure, say N.
+
+choice
+       prompt "Default authorization for deferred set priority"
+       depends on DEFERRED_SETPRIO
+       default DEFERRED_SETPRIO_ALLOW_NOBODY
+       help
+         Select whether users on the system are allowed by default to use the
+         deferred set priority facility. This setting defines the initial
+         value for the authorization list (as "everybody" or "nobody") that
+         can be altered dynamically via /proc/sys/kernel/dprio_authlist.
+
+config DEFERRED_SETPRIO_ALLOW_EVERYBODY
+       bool "Allow everybody to use the deferred set priority by default"
+       help
+         Allow by default every user on the system to use the deferred set
+         priority facility. The authorization list is initialized to "everybody"
+         at system startup time but can be altered later dynamically via
+         /proc/sys/kernel/dprio_authlist.
+
+config DEFERRED_SETPRIO_ALLOW_NOBODY
+       bool "Do not allow anybody to use the deferred set priority by default"
+       help
+         By default, disallow every user on the system except the superuser
+         to use the deferred set priority facility. The authorization list
+         is initialized to "nobody" at system startup time but can be
+         altered later dynamically
+         via /proc/sys/kernel/dprio_authlist.
+
+endchoice
+
+endif # DEFERRED_SETPRIO
diff --git a/kernel/exit.c b/kernel/exit.c
index 6ed6a1d..ae9191f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -53,6 +53,7 @@
 #include <linux/oom.h>
 #include <linux/writeback.h>
 #include <linux/shm.h>
+#include <linux/dprio.h>

 #include <asm/uaccess.h>
 #include <asm/unistd.h>
@@ -719,6 +720,11 @@ void do_exit(long code)

        ptrace_event(PTRACE_EVENT_EXIT, code);

+       /*
+        * No more deferred priority changes applied in __schedule for this task
+        */
+       dprio_detach(tsk);
+
        validate_creds_for_do_exit(tsk);

        /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 54a8d26..28a2d61 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -74,6 +74,7 @@
 #include <linux/uprobes.h>
 #include <linux/aio.h>
 #include <linux/compiler.h>
+#include <linux/dprio.h>

 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -234,7 +235,7 @@ static inline void put_signal_struct(struct signal_struct *sig)
                free_signal_struct(sig);
 }

-void __put_task_struct(struct task_struct *tsk)
+static inline void __do_put_task_struct(struct task_struct *tsk)
 {
        WARN_ON(!tsk->exit_state);
        WARN_ON(atomic_read(&tsk->usage));
@@ -249,6 +250,83 @@ void __put_task_struct(struct task_struct *tsk)
        if (!profile_handoff_task(tsk))
                free_task(tsk);
 }
+
+#ifdef CONFIG_PUT_TASK_TIMEBOUND
+/*
+ * If timebound, use a preallocated struct work_struct that is always
+ * guaranteed to be available, even if the atomic kmalloc pool is depleted.
+ */
+static inline struct work_struct *alloc_put_task_work(struct task_struct *tsk)
+{
+       return &tsk->put_task_work;
+}
+
+static inline void free_put_task_work(struct work_struct *work)
+{
+}
+
+static inline struct task_struct *put_task_work_tsk(struct work_struct *work)
+{
+       return container_of(work, struct task_struct, put_task_work);
+}
+#else
+struct put_task_work {
+       struct work_struct work;
+       struct task_struct *tsk;
+};
+
+static inline struct work_struct *alloc_put_task_work(struct task_struct *tsk)
+{
+       struct put_task_work *dwork =
+               kmalloc(sizeof(*dwork), GFP_NOWAIT | __GFP_NOWARN);
+       if (unlikely(!dwork))
+               return NULL;
+       dwork->tsk = tsk;
+       return &dwork->work;
+}
+
+static inline void free_put_task_work(struct work_struct *work)
+{
+       struct put_task_work *dwork =
+               container_of(work, struct put_task_work, work);
+       kfree(dwork);
+}
+
+static inline struct task_struct *put_task_work_tsk(struct work_struct *work)
+{
+       struct put_task_work *dwork =
+               container_of(work, struct put_task_work, work);
+       return dwork->tsk;
+}
+#endif
+
+#ifdef CONFIG_DEFERRED_SETPRIO
+static void __put_task_struct_work(struct work_struct *work)
+{
+       __do_put_task_struct(put_task_work_tsk(work));
+       free_put_task_work(work);
+}
+#endif
+
+void __put_task_struct(struct task_struct *tsk)
+{
+#ifdef CONFIG_DEFERRED_SETPRIO
+       /*
+        * When called from inside of __schedule(), try to defer processing
+        * to a worker thread, in order to minimize the scheduling latency
+        * and make it deterministic.
+        */
+       if (unlikely(preempt_count() & PREEMPT_ACTIVE)) {
+               struct work_struct *work = alloc_put_task_work(tsk);
+               if (likely(work)) {
+                       INIT_WORK(work, __put_task_struct_work);
+                       schedule_work(work);
+                       return;
+               }
+       }
+#endif
+       __do_put_task_struct(tsk);
+}
 EXPORT_SYMBOL_GPL(__put_task_struct);

 void __init __weak arch_task_cache_init(void) { }
@@ -314,6 +392,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
        if (err)
                goto free_ti;

+       dprio_dup_task_struct(tsk);
+
        tsk->stack = ti;

        setup_thread_stack(tsk, orig);
@@ -1581,6 +1661,11 @@ long do_fork(unsigned long clone_flags,
        long nr;

        /*
+        * Process pending "deferred set priority" request.
+        */
+       dprio_handle_request();
+
+       /*
         * Determine whether and which event to report to ptracer.  When
         * called from kernel_thread or CLONE_UNTRACED is explicitly
         * requested, no event is reported; otherwise, report if the event
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index ab32b7b..a93d07c 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
 obj-$(CONFIG_SCHEDSTATS) += stats.o
 obj-$(CONFIG_SCHED_DEBUG) += debug.o
 obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_DEFERRED_SETPRIO) += dprio.o
\ No newline at end of file
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 084d17f..f4c0d3c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -74,6 +74,7 @@
 #include <linux/binfmts.h>
 #include <linux/context_tracking.h>
 #include <linux/compiler.h>
+#include <linux/dprio.h>

 #include <asm/switch_to.h>
 #include <asm/tlb.h>
@@ -2615,6 +2616,111 @@ again:
        BUG(); /* the idle class will always have a runnable task */
 }

+#ifdef CONFIG_DEFERRED_SETPRIO
+
+/*
+ * __schedule should never be reentered recursively while it is handling
+ * a deferred change priority request in dprio_set_schedattr, i.e. while
+ * @prev->in_dprio is true.
+ *
+ * To prevent reentrancy, dprio_handle_request(...) keeps the preemption
+ * disable count non-zero and also sets the PREEMPT_ACTIVE flag.
+ */
+static __always_inline bool dprio_sched_recursion(struct task_struct *prev)
+{
+#ifdef CONFIG_DEBUG_DEFERRED_SETPRIO
+       if (unlikely(prev->in_dprio)) {
+               WARN_ONCE(1, KERN_ERR "BUG: dprio recursion in __schedule\n");
+
+               prev->state = TASK_RUNNING;
+               clear_tsk_need_resched(prev);
+               clear_preempt_need_resched();
+               sched_preempt_enable_no_resched();
+
+               return true;
+       }
+#endif /* CONFIG_DEBUG_DEFERRED_SETPRIO */
+
+       return false;
+}
+
+/*
+ * Check if deferred change priority request from the userland is pending
+ * and if so, handle it.
+ *
+ *     Academically speaking, it would be desirable (instead of calling
+ *     dprio_set_schedattr *before* pick_next_task) to call it *after*
+ *     pick_next_task and only if (next != prev). However in practice this
+ *     would save at most one sched_setattr call per task scheduling interval
+ *     (only for the tasks that use dprio), and then only sometimes: when
+ *     a dprio request is pending at rescheduling time and the task actually
+ *     gets preempted by another task. At typical values of Linux scheduling
+ *     parameters and the cost of a sched_setattr call, this translates to an
+ *     additional possible saving for dprio tasks that is well under 0.1%,
+ *     and probably much lower.
+ *
+ *     Nevertheless if dprio_set_schedattr were ever to be moved after the call
+ *     to pick_next_task, existing class schedulers would need to be revised
+ *     to support, in addition to the call sequence
+ *
+ *       [pick_next_task] [context_switch]
+ *
+ *     also the sequence
+ *
+ *       [pick_next_task] [unlock rq] [...] [lock rq] [pick_next_task]
+ *       [context_switch]
+ *
+ *     where [...] may include a bunch of intervening class scheduler method
+ *     calls on the local CPU and other CPUs, since we'd be giving up the
+ *     rq lock.
+ *     This would require splitting pick_next_task into "prepare" and
+ *     "commit/abort" phases.
+ */
+static __always_inline void dprio_sched_handle_request(struct task_struct *prev)
+{
+       if (unlikely(prev->dprio_ku_area_pp != NULL) &&
+           unlikely(dprio_check_for_request(prev))) {
+               int sv_pc;
+
+               /*
+                * Do not attempt to process "deferred set priority" request for
+                * TASK_DEAD, STOPPED, TRACED and other states where it won't be
+                * appropriate.
+                */
+               switch (prev->state) {
+               case TASK_RUNNING:
+               case TASK_INTERRUPTIBLE:
+               case TASK_UNINTERRUPTIBLE:
+                       break;
+               default:
+                       return;
+               }
+
+               sv_pc = preempt_count();
+               if (!(sv_pc & PREEMPT_ACTIVE))
+                       __preempt_count_add(PREEMPT_ACTIVE);
+               set_task_in_dprio(prev, true);
+               /*
+                * Keep preemption disabled to avoid __schedule() recursion.
+                * In addition PREEMPT_ACTIVE notifies dprio_handle_request()
+                * and routines that may be called from inside of it, such as
+                * __put_task_struct(), of the calling context.
+                */
+               dprio_handle_request();
+
+               set_task_in_dprio(prev, false);
+               if (!(sv_pc & PREEMPT_ACTIVE))
+                       __preempt_count_sub(PREEMPT_ACTIVE);
+       }
+}
+#else  /* !defined CONFIG_DEFERRED_SETPRIO */
+
+static __always_inline bool dprio_sched_recursion(struct task_struct *prev)
+       { return false; }
+
+static __always_inline void dprio_sched_handle_request(struct task_struct *prev)
+       {}
+
+#endif  /* CONFIG_DEFERRED_SETPRIO */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -2668,6 +2774,10 @@ need_resched:

        schedule_debug(prev);

+       if (dprio_sched_recursion(prev))
+               return;
+       dprio_sched_handle_request(prev);
+
        if (sched_feat(HRTICK))
                hrtick_clear(rq);

@@ -3247,9 +3357,31 @@ static bool check_same_owner(struct task_struct *p)
        return match;
 }

+/*
+ * Flags for _sched_setscheduler and __sched_setscheduler:
+ *
+ *     SCHEDOP_KERNEL          on behalf of the kernel
+ *     SCHEDOP_USER            on behalf of the userspace
+ *
+ *     SCHEDOP_PRECHECK_ONLY   precheck security only, do not
+ *                             actually change priority
+ *     SCHEDOP_PRECHECKED      security has been prechecked
+ *
+ *     SCHEDOP_MERGE_RESET_ON_FORK  use logical "or" of
+ *                             attr->sched_flags & SCHED_FLAG_RESET_ON_FORK
+ *                             and p->sched_reset_on_fork
+ *
+ * SCHEDOP_KERNEL and SCHEDOP_USER are mutually exclusive.
+ */
+#define SCHEDOP_KERNEL                 (1 << 0)
+#define SCHEDOP_USER                   (1 << 1)
+#define SCHEDOP_PRECHECK_ONLY          (1 << 2)
+#define SCHEDOP_PRECHECKED             (1 << 3)
+#define SCHEDOP_MERGE_RESET_ON_FORK    (1 << 4)
+
 static int __sched_setscheduler(struct task_struct *p,
                                const struct sched_attr *attr,
-                               bool user)
+                               int opflags)
 {
        int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
                      MAX_RT_PRIO - 1 - attr->sched_priority;
@@ -3259,16 +3391,28 @@ static int __sched_setscheduler(struct task_struct *p,
        const struct sched_class *prev_class;
        struct rq *rq;
        int reset_on_fork;
+       bool check_security;

        /* may grab non-irq protected spin_locks */
        BUG_ON(in_interrupt());
+
+       check_security = (opflags & SCHEDOP_USER) &&
+                        !(opflags & SCHEDOP_PRECHECKED);
+
 recheck:
        /* double check policy once rq lock held */
        if (policy < 0) {
+               /*
+                * TODO: There appears to be a bug in the original 3.15.2 code here.
+                *
+                * (a) Does not check if user supplied attr->sched_policy = -1.
+                * (b) Will lose SCHED_FLAG_RESET_ON_FORK on the locked pass.
+                */
                reset_on_fork = p->sched_reset_on_fork;
                policy = oldpolicy = p->policy;
        } else {
                reset_on_fork = !!(attr->sched_flags & SCHED_FLAG_RESET_ON_FORK);
+               if (opflags & SCHEDOP_MERGE_RESET_ON_FORK)
+                       reset_on_fork |= p->sched_reset_on_fork;

                if (policy != SCHED_DEADLINE &&
                                policy != SCHED_FIFO && policy != SCHED_RR &&
@@ -3295,7 +3439,7 @@ recheck:
        /*
         * Allow unprivileged RT tasks to decrease priority:
         */
-       if (user && !capable(CAP_SYS_NICE)) {
+       if (check_security && !capable(CAP_SYS_NICE)) {
                if (fair_policy(policy)) {
                        if (attr->sched_nice < task_nice(p) &&
                            !can_nice(p, attr->sched_nice))
@@ -3343,7 +3487,7 @@ recheck:
                        return -EPERM;
        }

-       if (user) {
+       if (check_security) {
                retval = security_task_setscheduler(p);
                if (retval)
                        return retval;
@@ -3378,13 +3522,17 @@ recheck:
                if (dl_policy(policy))
                        goto change;

-               p->sched_reset_on_fork = reset_on_fork;
+               if (!(opflags & SCHEDOP_PRECHECK_ONLY)) {
+                       if (opflags & SCHEDOP_MERGE_RESET_ON_FORK)
+                               reset_on_fork |= p->sched_reset_on_fork;
+                       p->sched_reset_on_fork = reset_on_fork;
+               }
                task_rq_unlock(rq, p, &flags);
                return 0;
        }
 change:

-       if (user) {
+       if (opflags & SCHEDOP_USER) {
 #ifdef CONFIG_RT_GROUP_SCHED
                /*
                 * Do not allow realtime tasks into groups that have no runtime
@@ -3432,6 +3580,13 @@ change:
                return -EBUSY;
        }

+       if (opflags & SCHEDOP_PRECHECK_ONLY) {
+               task_rq_unlock(rq, p, &flags);
+               return 0;
+       }
+
+       if (opflags & SCHEDOP_MERGE_RESET_ON_FORK)
+               reset_on_fork |= p->sched_reset_on_fork;
        p->sched_reset_on_fork = reset_on_fork;
        oldprio = p->prio;

@@ -3479,7 +3634,7 @@ change:
 }

 static int _sched_setscheduler(struct task_struct *p, int policy,
-                              const struct sched_param *param, bool check)
+                              const struct sched_param *param, int opflags)
 {
        struct sched_attr attr = {
                .sched_policy   = policy,
@@ -3496,7 +3651,7 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
                attr.sched_policy = policy;
        }

-       return __sched_setscheduler(p, &attr, check);
+       return __sched_setscheduler(p, &attr, opflags);
 }
 /**
 * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
@@ -3511,16 +3666,41 @@ static int _sched_setscheduler(struct task_struct *p, int policy,
 int sched_setscheduler(struct task_struct *p, int policy,
                       const struct sched_param *param)
 {
-       return _sched_setscheduler(p, policy, param, true);
+       return _sched_setscheduler(p, policy, param, SCHEDOP_USER);
 }
 EXPORT_SYMBOL_GPL(sched_setscheduler);

 int sched_setattr(struct task_struct *p, const struct sched_attr *attr)
 {
-       return __sched_setscheduler(p, attr, true);
+       return __sched_setscheduler(p, attr, SCHEDOP_USER);
 }
 EXPORT_SYMBOL_GPL(sched_setattr);

+/*
+ * Check for the security context required to execute sched_setattr,
+ * but do not actually change the task's scheduling properties.
+ */
+int sched_setattr_precheck(struct task_struct *p, const struct sched_attr *attr)
+{
+       return __sched_setscheduler(p, attr, SCHEDOP_USER |
+                                            SCHEDOP_PRECHECK_ONLY);
+}
+EXPORT_SYMBOL_GPL(sched_setattr_precheck);
+
+/*
+ * Execute sched_setattr bypassing security checks.
+ */
+int sched_setattr_prechecked(struct task_struct *p,
+                            const struct sched_attr *attr,
+                            bool merge_reset_on_fork)
+{
+       int exflags = merge_reset_on_fork ? SCHEDOP_MERGE_RESET_ON_FORK : 0;
+       return __sched_setscheduler(p, attr, SCHEDOP_USER |
+                                            SCHEDOP_PRECHECKED |
+                                            exflags);
+}
+EXPORT_SYMBOL_GPL(sched_setattr_prechecked);
+
 /**
 * sched_setscheduler_nocheck - change the scheduling policy and/or RT priority of a thread from kernelspace.
  * @p: the task in question.
@@ -3537,7 +3717,7 @@ EXPORT_SYMBOL_GPL(sched_setattr);
 int sched_setscheduler_nocheck(struct task_struct *p, int policy,
                               const struct sched_param *param)
 {
-       return _sched_setscheduler(p, policy, param, false);
+       return _sched_setscheduler(p, policy, param, SCHEDOP_KERNEL);
 }

 static int
diff --git a/kernel/sched/dprio.c b/kernel/sched/dprio.c
new file mode 100644
index 0000000..31c6b37
--- /dev/null
+++ b/kernel/sched/dprio.c
@@ -0,0 +1,734 @@
+/*
+ * kernel/sched/dprio.c
+ *
+ * Deferred set priority.
+ *
+ * Started by (C) 2014 Sergey Oboguev <oboguev@yahoo.com>
+ *
+ * This code is licensed under the GPL version 2 or later.
+ * For details see linux-kernel-base/COPYING.
+ */
+
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <linux/stddef.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/dprio.h>
+#include <linux/slab.h>
+#include <linux/compiler.h>
+#include <linux/uaccess.h>
+#include <linux/capability.h>
+#include <linux/prctl.h>
+#include <linux/init.h>
+
+struct authlist dprio_authlist =
+       AUTHLIST_INITIALIZER(dprio_authlist, DPRIO_AUTHLIST_INITIAL_VALUE);
+
+/*
+ * Userspace-kernel dprio protocol is as follows:
+ *
+ * Userspace:
+ *
+ *     Select and fill-in dprio_ku_area:
+ *         Set @resp = DPRIO_RESP_NONE.
+ *         Set @sched_attr.
+ *
+ *     Set @cmd to point to the dprio_ku_area.
+ *
+ *     @cmd is a u64 variable previously designated in the call
+ *     prctl(PR_SET_DEFERRED_SETPRIO, & @cmd, ...)
+ *
+ * Kernel:
+ *
+ *     1) On task preemption attempt or at other processing point,
+ *        such as fork or exec, read @cmd.
+ *        If cannot (e.g. @cmd inaccessible incl. page swapped out), quit.
+ *        Note: will reattempt again on next preemption cycle.
+ *
+ *     2) If read-in value of @cmd is 0, do nothing. Quit.
+ *
+ *     3) Set @resp = DPRIO_RESP_UNKNOWN.
+ *        If cannot (e.g. inaccessible), quit.
+ *
+ *     4) Set @cmd = NULL.
+ *        If cannot (e.g. inaccessible), quit.
+ *        Note that in this case request handling will be reattempted on next
+ *        thread preemption cycle. Thus @resp value of DPRIO_RESP_UNKNOWN may
+ *        be transient and overwritten with DPRIO_RESP_OK or DPRIO_RESP_ERROR
+ *        if @cmd is not reset to 0 by the kernel (or to 0 or to the address
+ *        of another dprio_ku_area by the userspace).
+ *
+ *     5) Read @sched_attr.
+ *        If cannot (e.g. inaccessible), quit.
+ *
+ *     6) Try to change task scheduling attributes in accordance with read-in
+ *        value of @sched_attr.
+ *
+ *     7) If successful, set @resp = DPRIO_RESP_OK and Quit.
+ *
+ *     8) If unsuccessful, set @error to an appropriate errno-style value.
+ *        If cannot (e.g. @error inaccessible), quit.
+ *        Set @resp = DPRIO_RESP_ERROR.
+ *        If cannot (e.g. @resp inaccessible), quit.
+ *
+ * Explanation of possible @resp codes:
+ *
+ * DPRIO_RESP_NONE
+ *
+ *     Request has not been processed yet.
+ *
+ * DPRIO_RESP_OK
+ *
+ *     Request has been successfully processed.
+ *
+ * DPRIO_RESP_ERROR
+ *
+ *     Request has failed, @error has errno-style error code.
+ *
+ * DPRIO_RESP_UNKNOWN
+ *
+ *     Request processing has been attempted, but the outcome is unknown.
+ *     Request might have been successful or failed.
+ *     Current os-level thread priority becomes unknown.
+ *
+ *     @error field may be invalid.
+ *
+ *     This code is written to @resp at the start of request processing,
+ *     then @resp is changed to OK or ERR at the end of request processing
+ *     if dprio_ku_area and @cmd stay accessible for write.
+ *
+ *     This status code is never left visible to the userspace code in the
+ *     current thread if dprio_ku_area and @cmd are locked in memory and remain
+ *     properly accessible for read and write during request processing.
+ *
+ *     This status code might happen (i.e. stay visible to userspace code
+ *     in the current thread) if access to dprio_ku_area or @cmd is lost
+ *     during request processing, for example the page that contains the area
+ *     gets swapped out or the area is otherwise not fully accessible for
+ *     reading and writing.
+ *
+ *     If @resp has the value DPRIO_RESP_UNKNOWN and @cmd is still pointing
+ *     to the dprio_ku_area containing this @resp, it is possible for the
+ *     request to be reprocessed again at the next context switch and @resp
+ *     to change to DPRIO_RESP_OK or DPRIO_RESP_ERROR. To ensure @resp does
+ *     not change under your feet, change @cmd to either NULL or the address
+ *     of another dprio_ku_area distinct from the one containing this @resp.
+ */
+enum {
+       DPRIO_RESP_NONE     = 0,
+       DPRIO_RESP_OK       = 1,
+       DPRIO_RESP_ERROR    = 2,
+       DPRIO_RESP_UNKNOWN  = 3
+};
+
+struct dprio_ku_area {
+       /*
+        * Size of struct sched_attr may change in future definitions
+        * of the structure, therefore @sched_attr should come after
+        * @resp and @error in order to maintain compatibility
+        * between userland and a kernel built with different versions
+        * of the struct sched_attr definition.
+        *
+        * Userland code should use volatile and/or compiler barriers
+        * to ensure the protocol.
+        */
+       /*volatile*/ u32 resp;          /* DPRIO_RESP_xxx */
+       /*volatile*/ u32 error;         /* one of errno values */
+       /*volatile*/ struct sched_attr sched_attr;
+};
+
+/*
+ * Returns 0 on success.
+ */
+static inline int __copyin(void *dst, const void __user *src,
+                          unsigned size, bool atomic)
+{
+       int ret;
+
+       /* Use barrier() to sequence userspace-kernel dprio protocol */
+       barrier();
+       if (atomic) {
+               pagefault_disable();
+               ret = __copy_from_user_inatomic(dst, src, size);
+               pagefault_enable();
+       } else {
+               ret = copy_from_user(dst, src, size);
+       }
+       barrier();
+
+       return ret;
+}
+
+/*
+ * Returns 0 on success.
+ */
+static inline int __copyout(void __user *dst, const void *src,
+                           unsigned size, bool atomic)
+{
+       int ret;
+
+       /* Use barrier() to sequence userspace-kernel dprio protocol */
+       barrier();
+       if (atomic) {
+               pagefault_disable();
+               ret = __copy_to_user_inatomic(dst, src, size);
+               pagefault_enable();
+       } else {
+               ret = copy_to_user(dst, src, size);
+       }
+       barrier();
+
+       return ret;
+}
+
+#define __copyin_var(x, uptr, atomic)  \
+       __copyin(&(x), (uptr), sizeof(x), (atomic))
+
+#define __copyout_var(x, uptr, atomic) \
+       __copyout((uptr), &(x), sizeof(x), (atomic))
+
+
+/*
+ * Mimics sched_copy_attr()
+ */
+#define CHUNK_SIZE 32u
+static int dprio_copyin_sched_attr(struct sched_attr __user *uattr,
+                                  struct sched_attr *attr,
+                                  bool atomic)
+{
+       u32 size;
+
+       if (!access_ok(VERIFY_READ, uattr, SCHED_ATTR_SIZE_VER0))
+               return -EFAULT;
+
+       /*
+        * zero the full structure, so that a short copy will be nice.
+        */
+       memset(attr, 0, sizeof(*attr));
+
+       if (__copyin_var(size, &uattr->size, atomic))
+               return -EFAULT;
+
+       if (size > PAGE_SIZE)   /* silly large */
+               return -E2BIG;
+
+       if (!size)              /* abi compat */
+               size = SCHED_ATTR_SIZE_VER0;
+
+       if (size < SCHED_ATTR_SIZE_VER0)
+               return -E2BIG;
+
+       /*
+        * If we're handed a bigger struct than we know of,
+        * ensure all the unknown bits are 0 - i.e. new
+        * user-space does not rely on any kernel feature
+        * extensions we don't know about yet.
+        */
+       if (size > sizeof(*attr)) {
+               unsigned char __user *addr;
+               unsigned char __user *end;
+               unsigned char val[CHUNK_SIZE];
+               unsigned k, chunk_size;
+
+               addr = (char __user *)uattr + sizeof(*attr);
+               end  = (char __user *)uattr + size;
+
+               for (; addr < end; addr += chunk_size) {
+                       chunk_size = min((unsigned) (end - addr), CHUNK_SIZE);
+                       if (__copyin(val, addr, chunk_size, atomic))
+                               return -EFAULT;
+                       for (k = 0;  k < chunk_size; k++) {
+                               if (val[k])
+                                       return -E2BIG;
+                       }
+               }
+               size = sizeof(*attr);
+       }
+
+       if (__copyin(attr, uattr, size, atomic))
+               return -EFAULT;
+
+       attr->size = size;
+
+       /*
+        * XXX: do we want to be lenient like existing syscalls; or do we want
+        * to be strict and return an error on out-of-bounds values?
+        * See also other uses of clamp(..., MIN_NICE, MAX_NICE) below.
+        */
+       attr->sched_nice = clamp(attr->sched_nice, MIN_NICE, MAX_NICE);
+
+       return 0;
+}
+
+
+/*
+ * Detach the task from userland deferred setprio request area and deallocate
+ * all resources for the connection. Called from:
+ *
+ *   - prctl(PR_SET_DEFERRED_SETPRIO) with area argument passed as NULL
+ *     to terminate previous connection
+ *
+ *   - prctl(PR_SET_DEFERRED_SETPRIO) with new non-NULL area argument
+ *     setting new connection. Previous connection is terminated before
+ *     establishing a new one
+ *
+ *   - when the task is terminated in do_exit()
+ */
+void dprio_detach(struct task_struct *tsk)
+{
+       preempt_disable();
+
+       tsk->dprio_ku_area_pp = NULL;
+
+       if (unlikely(tsk->dprio_info)) {
+               kfree(tsk->dprio_info);
+               tsk->dprio_info = NULL;
+       }
+
+       preempt_enable();
+}
+
+/*
+ * Pre-process sched_attr just read from the userspace, whether during precheck
+ * or during dprio request execution, to impose uniform interpretation of
+ * structure format and values.
+ */
+static void uniform_attr(struct sched_attr *attr)
+{
+       /* accommodate legacy hack */
+       if (attr->sched_policy & SCHED_RESET_ON_FORK) {
+               attr->sched_flags |= SCHED_FLAG_RESET_ON_FORK;
+               attr->sched_policy &= ~SCHED_RESET_ON_FORK;
+       }
+
+       if (attr->sched_policy == SCHED_IDLE)
+               attr->sched_nice = MAX_NICE;
+}
+
+/*
+ * Precheck whether current process is authorized to set its scheduling
+ * properties to @uattr. If yes, make record in @info and return 0.
+ * If not, return error.
+ */
+static int precheck(struct dprio_info *info, struct sched_attr __user *uattr)
+{
+       struct sched_attr attr;
+       u32 policy;
+       unsigned mask;
+       int error;
+
+       error = dprio_copyin_sched_attr(uattr, &attr, false);
+       if (error)
+               return error;
+
+       uniform_attr(&attr);
+
+       policy = attr.sched_policy;
+       mask = 1 << policy;
+
+       switch (policy) {
+       case SCHED_NORMAL:
+               attr.sched_nice = clamp(attr.sched_nice, MIN_NICE, MAX_NICE);
+               if ((info->mask & mask) &&
+                   attr.sched_nice >= info->normal_sched_nice)
+                       break;
+               error = sched_setattr_precheck(current, &attr);
+               if (error == 0) {
+                       info->normal_sched_nice = attr.sched_nice;
+                       info->mask |= mask;
+               }
+               break;
+
+       case SCHED_BATCH:
+               attr.sched_nice = clamp(attr.sched_nice, MIN_NICE, MAX_NICE);
+               if ((info->mask & mask) &&
+                   attr.sched_nice >= info->batch_sched_nice)
+                       break;
+               error = sched_setattr_precheck(current, &attr);
+               if (error == 0) {
+                       info->batch_sched_nice = attr.sched_nice;
+                       info->mask |= mask;
+               }
+               break;
+
+       case SCHED_FIFO:
+               if ((info->mask & mask) &&
+                   attr.sched_priority <= info->fifo_sched_priority)
+                       break;
+               error = sched_setattr_precheck(current, &attr);
+               if (error == 0) {
+                       info->fifo_sched_priority = attr.sched_priority;
+                       info->mask |= mask;
+               }
+               break;
+
+       case SCHED_RR:
+               if ((info->mask & mask) &&
+                   attr.sched_priority <= info->rr_sched_priority)
+                       break;
+               error = sched_setattr_precheck(current, &attr);
+               if (error == 0) {
+                       info->rr_sched_priority = attr.sched_priority;
+                       info->mask |= mask;
+               }
+               break;
+
+       case SCHED_IDLE:
+               if (info->mask & mask)
+                       break;
+               error = sched_setattr_precheck(current, &attr);
+               if (error == 0)
+                       info->mask |= mask;
+               break;
+
+       case SCHED_DEADLINE:
+               /*
+                * DL is not a meaningful policy for deferred set
+                * priority
+                */
+       default:
+               error = -EINVAL;
+               break;
+       }
+
+       return error;
+}
+
+/*
+ * Implements prctl(PR_SET_DEFERRED_SETPRIO).
+ *
+ * To set PR_SET_DEFERRED_SETPRIO:
+ *
+ *     a2 = address of u64 variable in the userspace that holds the pointer
+ *          to dprio_ku_area or NULL
+ *
+ *     a3 = address of userspace array of pointers to sched_attr entries
+ *          to preapprove for subsequent pre-checked use by deferred set
+ *          priority requests
+ *
+ *     a4 = count of entries in a3 or 0
+ *
+ *     a5 = 0
+ *
+ * To reset PR_SET_DEFERRED_SETPRIO:
+ *
+ *     a2 = 0
+ *     a3 = 0
+ *     a4 = 0
+ *     a5 = 0
+ *
+ * Thus valid calls are:
+ *
+ *     struct sched_attr **sched_attrs_pp;
+ *     prctl(PR_SET_DEFERRED_SETPRIO, dprio_ku_area_pp,
+ *           sched_attrs_pp, nattrs, 0)
+ *
+ *     prctl(PR_SET_DEFERRED_SETPRIO, NULL, NULL, 0, 0)
+ *
+ */
+long dprio_prctl(int option, unsigned long a2, unsigned long a3,
+                unsigned long a4, unsigned long a5)
+{
+       struct dprio_ku_area __user * __user *ku_area_pp;
+       struct dprio_ku_area __user *ku_area_p;
+       struct dprio_info *info = NULL;
+       unsigned long ne, nentries;
+       struct sched_attr __user * __user *uattr_pp;
+       struct sched_attr __user *uattr_p;
+       bool atomic = false;
+       long error = 0;
+
+       if (option != PR_SET_DEFERRED_SETPRIO)
+               return -EINVAL;
+
+       ku_area_pp = (struct dprio_ku_area __user * __user *) a2;
+
+       /*
+        * Handle reset operation for PR_SET_DEFERRED_SETPRIO
+        */
+       if (ku_area_pp == NULL) {
+               if (a3 | a4 | a5)
+                       return -EINVAL;
+               dprio_handle_request();
+               dprio_detach(current);
+               return 0;
+       }
+
+       /*
+        * Handle set operation for PR_SET_DEFERRED_SETPRIO
+        */
+       uattr_pp = (struct sched_attr __user * __user *) a3;
+       nentries = a4;
+       if (a5)
+               return -EINVAL;
+
+       /* sanity check to avoid long spinning in the kernel */
+       if (nentries > 4096) {
+               error = -EINVAL;
+               goto out;
+       }
+
+       /* Check alignment */
+       if ((unsigned long) ku_area_pp % sizeof(u64))
+               return -EINVAL;
+
+       /* check *ku_area_pp is readable and writeable */
+       if (__copyin_var(ku_area_p, ku_area_pp, atomic) ||
+           __copyout_var(ku_area_p, ku_area_pp, atomic))
+               return -EFAULT;
+
+       error = dprio_check_permission();
+       if (error)
+               return error;
+
+       info = kmalloc(sizeof(*info), GFP_KERNEL);
+       if (info == NULL)
+               return -ENOMEM;
+       info->mask = 0;
+       /*
+        * XXX:
+        *
+        * We may trigger a false recording of PF_SUPERPRIV here by requesting
+        * CAP_SYS_NICE capability we may not actually use later, however
+        * since we cannot modify current->flags during dprio_handle_request()
+        * when called from __schedule(), the alternatives would be either
+        * possibly missing the recording of PF_SUPERPRIV, or (better) splitting
+        * PF_SUPERPRIV from current->flags and moving it to a variable with
+        * atomic access protocol.
+        */
+       info->capable_sys_nice = capable(CAP_SYS_NICE);
+
+       /*
+        * We prevalidate maximum requested priority levels at the time of
+        * prctl set-up instead of validating priority change requests during
+        * their actual processing in __schedule and do_fork in order to:
+        *
+        *    - reduce latency during request processing in __schedule()
+        *
+        *    - avoid blocking in the security code when setprio processing
+        *      is performed in __schedule()
+        *
+        *    - avoid EINTR or ERESTARTSYS etc. that may be returned by
+        *      the security code during setprio request processing
+        */
+       for (ne = 0;  ne < nentries;  ne++) {
+               cond_resched();
+               if (__copyin_var(uattr_p, uattr_pp + ne, atomic)) {
+                       error = -EFAULT;
+                       goto out;
+               }
+               error = precheck(info, uattr_p);
+               if (error)
+                       goto out;
+       }
+
+       /*
+        * If there was a previous active dprio ku area, try to process
+        * any pending request in it and detach from it.
+        */
+       dprio_handle_request();
+       dprio_detach(current);
+
+       preempt_disable();
+       current->dprio_ku_area_pp = ku_area_pp;
+       current->dprio_info = info;
+       preempt_enable();
+
+out:
+       if (error && info)
+               kfree(info);
+
+       return error;
+}
+
+/*
+ * Check if "deferred set priority" request from the userland is pending.
+ * Returns @true if request has been detected, @false if not.
+ *
+ * If the page pointed to by dprio_ku_area_pp is not currently accessible
+ * (e.g. not valid or paged out), return @false.
+ */
+bool dprio_check_for_request(struct task_struct *prev)
+{
+       struct dprio_ku_area __user *ku_area_p;
+       bool atomic = true;
+
+#ifdef CONFIG_DEBUG_DEFERRED_SETPRIO
+       /*
+        * We are only called if prev->dprio_ku_area_pp != NULL,
+        * thus prev cannot be a kernel thread
+        */
+       if (unlikely(prev->active_mm != prev->mm)) {
+               WARN_ONCE(1, KERN_ERR "BUG: dprio: address space not mapped\n");
+               return false;
+       }
+#endif /* CONFIG_DEBUG_DEFERRED_SETPRIO */
+
+       if (__copyin_var(ku_area_p, prev->dprio_ku_area_pp, atomic))
+               return false;
+
+       return ku_area_p != NULL;
+}
+
+/*
+ * Handle pending "deferred set priority" request from the userland.
+ */
+void dprio_handle_request(void)
+{
+       struct dprio_ku_area __user *ku;
+       struct dprio_ku_area __user *ku_null;
+       struct sched_attr attr;
+       bool atomic;
+       u32 resp, error;
+       int ierror = 0;
+       unsigned long rlim_rtprio;
+       long rlim_nice;
+       struct dprio_info *info;
+
+       /* attached to ku area? */
+       if (current->dprio_ku_area_pp == NULL)
+               return;
+
+       /* called from __schedule? */
+       atomic = preempt_count() != 0;
+
+       /* fetch ku request area address from the userspace */
+       if (__copyin_var(ku, current->dprio_ku_area_pp, atomic))
+               return;
+
+       /* check if request is pending */
+       if (unlikely(ku == NULL))
+               return;
+
+       /* remark to the userspace:
+          request processing has been started/attempted */
+       resp = DPRIO_RESP_UNKNOWN;
+       if (__copyout_var(resp, &ku->resp, atomic))
+               return;
+
+       /* reset pending request */
+       ku_null = NULL;
+       if (__copyout_var(ku_null, current->dprio_ku_area_pp, atomic))
+               return;
+
+       /* fetch request parameters from the userspace */
+       if (dprio_copyin_sched_attr(&ku->sched_attr, &attr, atomic))
+               return;
+
+       /* impose uniform interpretation of sched_attr */
+       uniform_attr(&attr);
+
+       if (attr.sched_flags & ~SCHED_FLAG_RESET_ON_FORK) {
+               ierror = -EINVAL;
+               goto out;
+       }
+
+       /*
+        * check if request has been pre-authorized
+        */
+       info = current->dprio_info;
+       switch (attr.sched_policy) {
+       case SCHED_NORMAL:
+               if (!(info->mask & (1 << SCHED_NORMAL)) ||
+                   attr.sched_nice < info->normal_sched_nice)
+                       ierror = -EPERM;
+               /*
+                * check whether RLIMIT_NICE has been reduced
+                * by setrlimit or prlimit
+                */
+               if (ierror == 0 && !info->capable_sys_nice) {
+                       rlim_nice = 20 - task_rlimit(current, RLIMIT_NICE);
+                       if (attr.sched_nice < rlim_nice)
+                               ierror = -EPERM;
+               }
+               break;
+
+       case SCHED_BATCH:
+               if (!(info->mask & (1 << SCHED_BATCH)) ||
+                   attr.sched_nice < info->batch_sched_nice)
+                       ierror = -EPERM;
+               /*
+                * check whether RLIMIT_NICE has been reduced
+                * by setrlimit or prlimit
+                */
+               if (ierror == 0 && !info->capable_sys_nice) {
+                       rlim_nice = 20 - task_rlimit(current, RLIMIT_NICE);
+                       if (attr.sched_nice < rlim_nice)
+                               ierror = -EPERM;
+               }
+               break;
+
+       case SCHED_FIFO:
+               if (!(info->mask & (1 << SCHED_FIFO)) ||
+                   attr.sched_priority > info->fifo_sched_priority)
+                       ierror = -EPERM;
+               /*
+                * check whether RLIMIT_RTPRIO has been reduced
+                * by setrlimit or prlimit
+                */
+               if (ierror == 0 && !info->capable_sys_nice) {
+                       rlim_rtprio = task_rlimit(current, RLIMIT_RTPRIO);
+                       if (rlim_rtprio == 0 ||
+                           attr.sched_priority > rlim_rtprio)
+                               ierror = -EPERM;
+               }
+               break;
+
+       case SCHED_RR:
+               if (!(info->mask & (1 << SCHED_RR)) ||
+                   attr.sched_priority > info->rr_sched_priority)
+                       ierror = -EPERM;
+               /*
+                * check whether RLIMIT_RTPRIO has been reduced
+                * by setrlimit or prlimit
+                */
+               if (ierror == 0 && !info->capable_sys_nice) {
+                       rlim_rtprio = task_rlimit(current, RLIMIT_RTPRIO);
+                       if (rlim_rtprio == 0 ||
+                           attr.sched_priority > rlim_rtprio)
+                               ierror = -EPERM;
+               }
+               break;
+
+       case SCHED_IDLE:
+               if (!(info->mask & (1 << SCHED_IDLE)))
+                       ierror = -EPERM;
+               break;
+
+       default:
+               ierror = -EINVAL;
+               break;
+       }
+
+       /* execute the request */
+       if (ierror == 0)
+               ierror = sched_setattr_prechecked(current, &attr, true);
+
+out:
+       if (ierror) {
+               error = (u32) -ierror;
+               resp = DPRIO_RESP_ERROR;
+               if (0 == __copyout_var(error, &ku->error, atomic))
+                       __copyout_var(resp, &ku->resp, atomic);
+       } else {
+               resp = DPRIO_RESP_OK;
+               __copyout_var(resp, &ku->resp, atomic);
+       }
+}
+
+/*
+ * Verify if the current task is authorized to use
+ * prctl(PR_SET_DEFERRED_SETPRIO).
+ */
+int dprio_check_permission(void)
+{
+       const struct cred *cred = current_cred();
+       int error = authlist_check_permission(&dprio_authlist, cred);
+
+       if (error != 0 && uid_eq(cred->euid, GLOBAL_ROOT_UID)) {
+               current->flags |= PF_SUPERPRIV;
+               error = 0;
+       }
+
+       return error;
+}
+
diff --git a/kernel/sys.c b/kernel/sys.c
index fba0f29..5b2ccc1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -52,6 +52,7 @@
 #include <linux/rcupdate.h>
 #include <linux/uidgid.h>
 #include <linux/cred.h>
+#include <linux/dprio.h>

 #include <linux/kmsg_dump.h>
 /* Move somewhere else to avoid recompiling? */
@@ -2011,6 +2012,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
                        me->mm->def_flags &= ~VM_NOHUGEPAGE;
                up_write(&me->mm->mmap_sem);
                break;
+#ifdef CONFIG_DEFERRED_SETPRIO
+       case PR_SET_DEFERRED_SETPRIO:
+               error = dprio_prctl(option, arg2, arg3, arg4, arg5);
+               break;
+#endif
        default:
                error = -EINVAL;
                break;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 74f5b58..03bcc36 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -63,6 +63,8 @@
 #include <linux/binfmts.h>
 #include <linux/sched/sysctl.h>
 #include <linux/kexec.h>
+#include <linux/authlist.h>
+#include <linux/dprio.h>

 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -430,6 +432,15 @@ static struct ctl_table kern_table[] = {
                .extra2         = &one,
        },
 #endif
+#ifdef CONFIG_DEFERRED_SETPRIO
+       {
+               .procname       = "dprio_authlist",
+               .data           = &dprio_authlist,
+               .maxlen         = 0,
+               .mode           = 0600,
+               .proc_handler   = proc_doauthlist,
+       },
+#endif
 #ifdef CONFIG_CFS_BANDWIDTH
        {
                .procname       = "sched_cfs_bandwidth_slice_us",
--


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-25 19:45 [PATCH RFC] sched: deferred set priority (dprio) Sergey Oboguev
@ 2014-07-25 20:12 ` Andy Lutomirski
  2014-07-26  7:56   ` Sergey Oboguev
  2014-07-26  8:58 ` Mike Galbraith
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Andy Lutomirski @ 2014-07-25 20:12 UTC (permalink / raw)
  To: Sergey Oboguev, linux-kernel

On 07/25/2014 12:45 PM, Sergey Oboguev wrote:
> [This is a repost of the message from few day ago, with patch file
> inline instead of being pointed by the URL.]
> 
> This patch is intended to improve the support for fine-grain parallel
> applications that may sometimes need to change the priority of their threads at
> a very high rate, hundreds or even thousands of times per scheduling timeslice.

What is the authlist stuff for?  It looks complicated and it seems like
it should be unnecessary.

What's with the save/restore code?  What's it saving and restoring?

--Andy


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-25 20:12 ` Andy Lutomirski
@ 2014-07-26  7:56   ` Sergey Oboguev
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Oboguev @ 2014-07-26  7:56 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: linux-kernel

On Fri, Jul 25, 2014 at 1:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On 07/25/2014 12:45 PM, Sergey Oboguev wrote:
>> [This is a repost of the message from few day ago, with patch file
>> inline instead of being pointed by the URL.]
>>
>> This patch is intended to improve the support for fine-grain parallel
>> applications that may sometimes need to change the priority of their threads at
>> a very high rate, hundreds or even thousands of times per scheduling timeslice.
>
> What is the authlist stuff for?  It looks complicated and it seems like
> it should be unnecessary.

The authlist controls who is allowed to use the DPRIO facility.

DPRIO works by making a check at __schedule() time for whether a deferred set
priority request is pending, and if so, then performing an equivalent of
sched_setattr(), minus security checks, before the rescheduling.
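
For illustration, here is a minimal sketch of the userspace side of this
protocol (dprio_ku_area and the DPRIO_RESP_* codes are from the patch; the
helper names dprio_cmd, enter_urgent and leave_urgent, and the fallback
path, are assumptions of the sketch, not part of the patch):

    #include <linux/types.h>

    /* compiler barrier, per the patch's note that userland should use
       volatile and/or compiler barriers to ensure the protocol */
    #define barrier() __asm__ __volatile__("" ::: "memory")

    static __u64 dprio_cmd;           /* registered earlier via
                                         prctl(PR_SET_DEFERRED_SETPRIO,
                                               &dprio_cmd, ...) */
    static struct dprio_ku_area area;

    static void enter_urgent(const struct sched_attr *attr)
    {
            area.resp = DPRIO_RESP_NONE;    /* mark "not processed yet" */
            area.sched_attr = *attr;        /* desired elevated priority */
            barrier();                      /* publish area before cmd */
            dprio_cmd = (__u64)(unsigned long)&area;   /* post request */
    }

    static void leave_urgent(void)
    {
            dprio_cmd = 0;                  /* retract the request */
            barrier();
            if (area.resp != DPRIO_RESP_NONE) {
                    /* the kernel did act on the request while the section
                       was running; only in this rare case is an actual
                       sched_setattr(2) call needed to downgrade */
            }
    }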

This introduces additional latency at task switch time when a deferred set
priority request is pending -- albeit normally a very small latency, but a
non-zero one, and a malicious user can also manoeuvre to increase it. There
are two parts to this latency. One is a more or less constant part on the
order of 1 usec; that's basic __sched_setscheduler() code. The other part is
due to possible priority inheritance chain adjustment in rt_mutex_adjust_pi()
and depends on the length of the chain. A malicious user might conceivably
construct a very long chain, long enough for processing of this chain at the
time of __schedule() to cause an objectionable latency. On systems where this
might be a concern, the administrator may therefore want to restrict the use
of DPRIO to legitimate trusted applications (or rather users of those).

If the common feeling is that such a safeguard is overly paranoid, I would be
happy to drop it, but I feel some security control option may be desirable
for the mentioned reason.

Rethinking it now though, perhaps a simpler alternative could be adding a
capability for DPRIO plus a system-wide setting as to whether this capability
is required for the use of DPRIO or DPRIO is allowed for everyone on the
system. The benefit of this approach might also be that the administrator can
use setcap on trusted executable files installed in the system to grant them
this capability.

> What's with the save/restore code?  What's it saving and restoring?

It is used inside execve.

There are two DPRIO-related elements stored in task_struct (plus one more for
the debug code only).

One is the address of a location in the userspace that is used for
the userspace <-> kernel communication.

Another element stored in task_struct is the pointer to the per-task DPRIO
information block kept inside the kernel. This block holds pre-authorized
priorities.

(The address of the userspace location is stored in task_struct itself,
rather than in the DPRIO info block, to avoid extra indirection in the
frequent path.)

Successful execve must result in shutting down DPRIO for the task, i.e.
resetting these two pointers to NULL (this must be performed before the new
executable image is loaded, otherwise a corruption of the new image's memory
can result) and also in the deallocation of the DPRIO information block. If
execve fails however and control returns to the original image, the DPRIO
settings should be retained.

Before calling the image loader, execve (i.e. do_execve_common) invokes
dprio_save_reset_context() to save these two pointers in an on-stack backup
structure and to reset their values in task_struct to NULL. If the new image
loads fine, the DPRIO information block is deallocated by dprio_free_context()
and control is passed on to the new image, with the pointers in task_struct
already reset to NULL. If image loading fails, the error recovery path invokes
dprio_restore_context() to restore the pointers from the backup structure back
into task_struct.
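
Schematically, the flow described above looks like this (a sketch, not the
literal patch code; load_new_image, the struct name and the argument shapes
are assumptions):

    struct dprio_saved_context saved;
    int retval;

    dprio_save_reset_context(&saved, current);  /* back up both pointers,
                                                   reset them to NULL before
                                                   the new image is loaded */
    retval = load_new_image(bprm);              /* stand-in for the loader */
    if (retval >= 0)
            dprio_free_context(&saved);         /* success: free the DPRIO
                                                   info block, new image
                                                   starts detached */
    else
            dprio_restore_context(&saved, current); /* failure: old image
                                                       resumes with its DPRIO
                                                       state intact */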

- Sergey


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-25 19:45 [PATCH RFC] sched: deferred set priority (dprio) Sergey Oboguev
  2014-07-25 20:12 ` Andy Lutomirski
@ 2014-07-26  8:58 ` Mike Galbraith
  2014-07-26 18:30   ` Sergey Oboguev
  2014-07-28  1:19 ` Andi Kleen
  2014-07-30 13:02 ` Pavel Machek
  3 siblings, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2014-07-26  8:58 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote: 
> [This is a repost of the message from few day ago, with patch file
> inline instead of being pointed by the URL.]
> 
> This patch is intended to improve the support for fine-grain parallel
> applications that may sometimes need to change the priority of their threads at
> a very high rate, hundreds or even thousands of times per scheduling timeslice.
> 
> These are typically applications that have to execute short or very short
> lock-holding critical or otherwise time-urgent sections of code at a very high
> frequency and need to protect these sections with "set priority" system calls,
> one "set priority" call to elevate current thread priority before entering the
> critical or time-urgent section, followed by another call to downgrade thread
> priority at the completion of the section. Due to the high frequency of
> entering and leaving critical or time-urgent sections, the cost of these "set
> priority" system calls may raise to a noticeable part of an application's
> overall expended CPU time. Proposed "deferred set priority" facility allows to
> largely eliminate the cost of these system calls.

So you essentially want to ship preempt_disable() off to userspace?

(smiles wickedly, adds CCs)

-Mike

> Instead of executing a system call to elevate its thread priority, an
> application simply writes its desired priority level to a designated memory
> location in the userspace. When the kernel attempts to preempt the thread...



* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-26  8:58 ` Mike Galbraith
@ 2014-07-26 18:30   ` Sergey Oboguev
  2014-07-27  4:02     ` Mike Galbraith
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-07-26 18:30 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith
<umgwanakikbuti@gmail.com> wrote:
> On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote:
>> [This is a repost of the message from few day ago, with patch file
>> inline instead of being pointed by the URL.]
>>
>> This patch is intended to improve the support for fine-grain parallel
>> applications that may sometimes need to change the priority of their threads at
>> a very high rate, hundreds or even thousands of times per scheduling timeslice.
>>
>> These are typically applications that have to execute short or very short
>> lock-holding critical or otherwise time-urgent sections of code at a very high
>> frequency and need to protect these sections with "set priority" system calls,
>> one "set priority" call to elevate current thread priority before entering the
>> critical or time-urgent section, followed by another call to downgrade thread
>> priority at the completion of the section. Due to the high frequency of
>> entering and leaving critical or time-urgent sections, the cost of these "set
>> priority" system calls may raise to a noticeable part of an application's
>> overall expended CPU time. Proposed "deferred set priority" facility allows to
>> largely eliminate the cost of these system calls.
>
> So you essentially want to ship preempt_disable() off to userspace?
>

Only to the extent preemption control is already exported to the userspace and
a task is already authorized to control its preemption by its RLIMIT_RTPRIO,
RLIMIT_NICE and capable(CAP_SYS_NICE).

DPRIO does not amplify a task's capability to elevate its priority and block
other tasks, it just reduces the computational cost of frequent
sched_setattr(2) calls.
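
For contrast, the conventional pattern whose per-section cost DPRIO
amortizes is roughly the following (a sketch: sched_setattr(2) has no glibc
wrapper, so a raw syscall wrapper is shown, and a struct sched_attr
definition matching the kernel ABI is assumed to be available; the policy
and priority values are arbitrary):

    #include <sched.h>              /* SCHED_FIFO, SCHED_OTHER */
    #include <sys/syscall.h>
    #include <unistd.h>

    static int sched_setattr(pid_t pid, struct sched_attr *attr,
                             unsigned int flags)
    {
            return syscall(SYS_sched_setattr, pid, attr, flags);
    }

    void run_critical_section(void)
    {
            struct sched_attr hi = { .size = sizeof(hi),
                                     .sched_policy = SCHED_FIFO,
                                     .sched_priority = 50 };
            struct sched_attr lo = { .size = sizeof(lo),
                                     .sched_policy = SCHED_OTHER };

            sched_setattr(0, &hi, 0);     /* syscall #1: elevate */
            /* ... short lock-holding critical section ... */
            sched_setattr(0, &lo, 0);     /* syscall #2: downgrade */
    }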

>
> -Mike
>
>> Instead of executing a system call to elevate its thread priority, an
>> application simply writes its desired priority level to a designated memory
>> location in the userspace. When the kernel attempts to preempt the thread...

- Sergey


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-26 18:30   ` Sergey Oboguev
@ 2014-07-27  4:02     ` Mike Galbraith
  2014-07-27  9:09       ` Sergey Oboguev
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2014-07-27  4:02 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, 2014-07-26 at 11:30 -0700, Sergey Oboguev wrote: 
> On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith
> <umgwanakikbuti@gmail.com> wrote:
> > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote:
> >> [This is a repost of the message from few day ago, with patch file
> >> inline instead of being pointed by the URL.]
> >>
> >> This patch is intended to improve the support for fine-grain parallel
> >> applications that may sometimes need to change the priority of their threads at
> >> a very high rate, hundreds or even thousands of times per scheduling timeslice.
> >>
> >> These are typically applications that have to execute short or very short
> >> lock-holding critical or otherwise time-urgent sections of code at a very high
> >> frequency and need to protect these sections with "set priority" system calls,
> >> one "set priority" call to elevate current thread priority before entering the
> >> critical or time-urgent section, followed by another call to downgrade thread
> >> priority at the completion of the section. Due to the high frequency of
> >> entering and leaving critical or time-urgent sections, the cost of these "set
> >> priority" system calls may raise to a noticeable part of an application's
> >> overall expended CPU time. Proposed "deferred set priority" facility allows to
> >> largely eliminate the cost of these system calls.
> >
> > So you essentially want to ship preempt_disable() off to userspace?
> >
> 
> Only to the extent preemption control is already exported to the userspace and
> a task is already authorized to control its preemption by its RLIMIT_RTPRIO,
> RLIMIT_NICE and capable(CAP_SYS_NICE).
> 
> DPRIO does not amplify a task's capability to elevate its priority and block
> other tasks, it just reduces the computational cost of frequent
> sched_setattr(2) calls.

Exactly.  You are abusing realtime, and you are not the only guy out
there doing that.  What you want is control over a userspace critical
section, and you are willing to do whatever is necessary to get that.  I
think your code is a really good example of how far people are willing
to go, but I hope this goes nowhere beyond getting people to think about
what you and others want.

I would say cut to the chase, if what you want/need is a privileged
userspace lock, make one, and put _all_ of the ugliness inside it.
Forget about this "Hello Mr. kernel, here's what I would have done to
get what I want if I weren't such a cheap bastard, if you think about
preempting me, pretend I actually did that instead" business.  Forget
about all that RLIMIT_RTPRIO and access list stuff too, either you're
privileged or you're not, it's not like multiple users could coexist
peacefully anyway.  Maybe you could make a flavor of futex that makes
the owner non-preemptible, checks upon release or such.

Note: if you do touch futex.c, you'll definitely want to document that
you eliminated every last remote possibility of breaking anything, and
donning Nomex underwear before posting would not be a bad idea ;-)

-Mike



* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-27  4:02     ` Mike Galbraith
@ 2014-07-27  9:09       ` Sergey Oboguev
  2014-07-27 10:29         ` Mike Galbraith
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-07-27  9:09 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Sat, Jul 26, 2014 at 9:02 PM, Mike Galbraith
<umgwanakikbuti@gmail.com> wrote:
> On Sat, 2014-07-26 at 11:30 -0700, Sergey Oboguev wrote:
>> On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith
>> <umgwanakikbuti@gmail.com> wrote:
>> > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote:
>> >> [This is a repost of the message from few day ago, with patch file
>> >> inline instead of being pointed by the URL.]
>> >>
>> >> This patch is intended to improve the support for fine-grain parallel
>> >> applications that may sometimes need to change the priority of their threads at
>> >> a very high rate, hundreds or even thousands of times per scheduling timeslice.
>> >>
>> >> These are typically applications that have to execute short or very short
>> >> lock-holding critical or otherwise time-urgent sections of code at a very high
>> >> frequency and need to protect these sections with "set priority" system calls,
>> >> one "set priority" call to elevate current thread priority before entering the
>> >> critical or time-urgent section, followed by another call to downgrade thread
>> >> priority at the completion of the section. Due to the high frequency of
>> >> entering and leaving critical or time-urgent sections, the cost of these "set
>> >> priority" system calls may raise to a noticeable part of an application's
>> >> overall expended CPU time. Proposed "deferred set priority" facility allows to
>> >> largely eliminate the cost of these system calls.
>> >
>> > So you essentially want to ship preempt_disable() off to userspace?
>> >
>>
>> Only to the extent preemption control is already exported to the userspace and
>> a task is already authorized to control its preemption by its RLIMIT_RTPRIO,
>> RLIMIT_NICE and capable(CAP_SYS_NICE).
>>
>> DPRIO does not amplify a taks's capability to elevate its priority and block
>> other tasks, it just reduces the computational cost of frequest
>> sched_setattr(2) calls.

> You are abusing realtime

I am unsure why you would label priority ceiling for locks and priority
protection for other forms of time-urgent sections as an "abuse".

It would appear you start from the presumption that the sole valid purpose
for ranging task priorities is hard real-time applications such as plant
process control etc., but that is not a valid or provable presumption; it is
rather an article of faith -- a faith, as you acknowledge, that a lot of
developers do not share. A rational argument against this faith is that
there are no all-fitting, satisfactory and practical alternative solutions
to the problems being solved with those tools; that is the key reason why
they are used. The issue then distills to a more basic question of whether
this faith should be imposed on the dissenting application developers, and
whether Linux should provide a mechanism or a policy.

As for DPRIO specifically, while it may somewhat encourage the use of priority
ceiling and priority protection, it does not provide an additional basic
mechanism beyond one already exported by the kernel (i.e. "set priority"), it
just makes this pre-existing basic mechanism cheaper to use in certain use
cases.

> if what you want/need is a privileged userspace lock

The problem is not reducible to locks. Applications also have time-urgent
critical sections that arise from wait and interaction chains not expressible
via locking notation.

> you could make a flavor of futex that makes the owner non-preemptible

Lock owner should definitely be preemptible by more time-urgent tasks.

> it's not like multiple users could coexist peacefully anyway

It depends. Common sense suggests not to run an air traffic control system on
the same machine as an airline CRM database system, but perhaps one might
co-host CRM and ERP database instances on the same machine.

Indeed, applications that are installed with the rights granting them access
to an elevated priority are generally those that are important for the purpose
of the system they are deployed on. The machine they are installed on may
either be dedicated to running this particular application, or it may be used
for running a set of primary-importance applications that can coexist.

As an obvious rule of thumb, applications using elevated priorities for the
sake of deterministic response time should not be combined "on equal footing"
with non-deterministic applications using elevated priorities for the reasons of
better overall system throughput and responsiveness. If they are ever combined
at all, the former category should use priority levels above the latter. It
may however often be possible -- as far as priority use is concerned -- to
combine multiple applications of the latter (non-deterministic) category, as
long as their critical sections combined take less than the total CPU time.

If applications are fundamentally incompatible by their aggregate demand for
resources exceeding available system resources, be it CPU or memory resources,
then of course they cannot be successfully combined.

Undoubtedly one can easily construct a mix of applications that are not
compatible with each other (as the airline example mentioned earlier
exemplifies) or overcommit the system beyond the acceptable service
level terms, but that is self-evident, so what point would it prove?

As far as DPRIO is concerned, it just gives back some CPU time that otherwise
would have been expended essentially wastefully, and thus adds some margin
to available system resources, no less, no more.

The purpose of DPRIO is not to instruct system owners what applications they
should or should not combine, these decisions are completely independent of
DPRIO and the latter is irrelevant for these decisions.

Nor to instruct application developers as to how they should structure their
applications -- these decisions are normally driven by factors of much greater
magnitude and force than petty factors such as available system calls.

Its only purpose is to let a developer make an application somewhat more
performant once the decision on the structure has been made, or even forced on
the developer a priori as the only fitting solution by the sheer nature of the
task being solved.

> getting people to think about what you and others want

It's not like any of this is really very new.
The thinking on these matters has been going on since the 1980s.

- Sergey


P.S. As a related non-technical consideration from the real world...
I have a friend who makes living as a scalability expert for one of two
companies in Russia that provide Oracle support. Oracle installations
in Russia are typically high-end, larger than installations in comparable
industry sectors in the U.S., and some of the largest Oracle installations
in the world are in Russia (for unhealthy economic reasons unfortunately).
They are deployed and serviced by the company my friend works for, and
once upon a time we have been going with him over various scalability
issues and stories. Their customers generally prefer Solaris or AIX,
rather than Linux or Windows. There is a multitude of reasons for this,
of course. But one technical reason on the list (I would not exagerate
its importance, it's a long list, and then there are business factors
that matter even more, but it is on the list) is that Solaris and AIX provide
a form of preemption control for critical sections that translates to a
better performance and cheaper cost per transaction, let us say may be 3-5%
better at high load, which in turn translates to ROI better by may be 2%.
People who make business decisions may not understand system calls, but
they do understand ROI. The question then is, is it favorable for Linux
to have "minus" on such lists?


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-27  9:09       ` Sergey Oboguev
@ 2014-07-27 10:29         ` Mike Galbraith
  0 siblings, 0 replies; 34+ messages in thread
From: Mike Galbraith @ 2014-07-27 10:29 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: linux-kernel, Peter Zijlstra, Ingo Molnar

On Sun, 2014-07-27 at 02:09 -0700, Sergey Oboguev wrote: 
> On Sat, Jul 26, 2014 at 9:02 PM, Mike Galbraith
> <umgwanakikbuti@gmail.com> wrote:
> > On Sat, 2014-07-26 at 11:30 -0700, Sergey Oboguev wrote:
> >> On Sat, Jul 26, 2014 at 1:58 AM, Mike Galbraith
> >> <umgwanakikbuti@gmail.com> wrote:
> >> > On Fri, 2014-07-25 at 12:45 -0700, Sergey Oboguev wrote:
> >> >> [This is a repost of the message from few day ago, with patch file
> >> >> inline instead of being pointed by the URL.]
> >> >>
> >> >> This patch is intended to improve the support for fine-grain parallel
> >> >> applications that may sometimes need to change the priority of their threads at
> >> >> a very high rate, hundreds or even thousands of times per scheduling timeslice.
> >> >>
> >> >> These are typically applications that have to execute short or very short
> >> >> lock-holding critical or otherwise time-urgent sections of code at a very high
> >> >> frequency and need to protect these sections with "set priority" system calls,
> >> >> one "set priority" call to elevate current thread priority before entering the
> >> >> critical or time-urgent section, followed by another call to downgrade thread
> >> >> priority at the completion of the section. Due to the high frequency of
> >> >> entering and leaving critical or time-urgent sections, the cost of these "set
> >> >> priority" system calls may raise to a noticeable part of an application's
> >> >> overall expended CPU time. Proposed "deferred set priority" facility allows to
> >> >> largely eliminate the cost of these system calls.
> >> >
> >> > So you essentially want to ship preempt_disable() off to userspace?
> >> >
> >>
> >> Only to the extent preemption control is already exported to the userspace and
> >> a task is already authorized to control its preemption by its RLIMIT_RTPRIO,
> >> RLIMIT_NICE and capable(CAP_SYS_NICE).
> >>
> >> DPRIO does not amplify a task's capability to elevate its priority and block
> >> other tasks, it just reduces the computational cost of frequent
> >> sched_setattr(2) calls.
> 
> > You are abusing realtime
> 
> I am unsure why you would label priority ceiling for locks and priority
> protection for other forms of time-urgent sections as an "abuse".

Ok, maybe "abuse" is too strong.  I know there are reasons why people do
what they do, even when it may look silly to me.  I didn't like what I
saw in case you couldn't tell, but lucky you, you're not selling it to
me, you're selling it to maintainers.  I CCd them, so having voiced my
opinion, I'll shut up and listen.

-Mike



* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-25 19:45 [PATCH RFC] sched: deferred set priority (dprio) Sergey Oboguev
  2014-07-25 20:12 ` Andy Lutomirski
  2014-07-26  8:58 ` Mike Galbraith
@ 2014-07-28  1:19 ` Andi Kleen
  2014-07-28  4:16   ` Sergey Oboguev
  2014-07-28  7:24   ` Mike Galbraith
  2014-07-30 13:02 ` Pavel Machek
  3 siblings, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2014-07-28  1:19 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: linux-kernel, khalid.aziz

Sergey Oboguev <oboguev.public@gmail.com> writes:

> [This is a repost of the message from few day ago, with patch file
> inline instead of being pointed by the URL.]

Have you checked out the preemption control that was posted some time
ago? It did essentially the same thing, but somewhat simpler than your 
patch.

http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html

Yes I agree with you that lock preemption is a serious issue that needs solving.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-28  1:19 ` Andi Kleen
@ 2014-07-28  4:16   ` Sergey Oboguev
  2014-07-28  7:24   ` Mike Galbraith
  1 sibling, 0 replies; 34+ messages in thread
From: Sergey Oboguev @ 2014-07-28  4:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, khalid.aziz

On Sun, Jul 27, 2014 at 6:19 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> [This is a repost of the message from few day ago, with patch file
>> inline instead of being pointed by the URL.]
>
> Have you checked out the preemption control that was posted some time
> ago? It did essentially the same thing, but somewhat simpler than your
> patch.
>
> http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html

Yes, I have seen this discussion. The patch suggested by Khalid implements a
solution very much resembling Solaris/AIX schedctl. Schedctl is less generic
and powerful than dprio. I compared dprio vs. schedctl in the write-up
https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt

To quote from there,

[--- Quote ---]

The Solaris schedctl [...]
does not provide a way to associate a priority with the resource
whose lock is being held (or, more generally, with thread application-specific
logical state; see the footnote below). An application is likely to have a
range of locks with different criticality levels and different needs for
holder protection [*]. For some locks, holder preemption may be tolerated
somewhat, while other locks are highly critical, furthermore for some lock
holders preemption by a high-priority thread is acceptable but not a preemption
by a low-priority thread. The Solaris/AIX schedctl does not provide a
capability for priority ranging relative to the context of the whole
application and other processes in the system.

    [*] We refer just to locks here for simplicity, but the need of a thread
        for preemption control does not reduce to locks held alone, and may
        result from other intra-application state conditions, such as executing
        a time-urgent fragment of code in response to a high-priority event
        (that may potentially be blocking for other threads) or other code
        paths that can lead to wait chains unless completed promptly.

Second, in some cases application may need to perform time-urgent processing
without knowing in advance how long it will take. In the majority of cases the
processing may be very short (a fraction of a scheduling timeslice), but
occasionally may take much longer (such as a fraction of a second). Since
schedctl  would not be effective in the latter case, an application would have
to resort to system calls for thread priority control in all cases [*], even
in the majority of "short processing" cases, with all the overhead of this
approach.

    [*] Or introduce extra complexity, most likely very cumbersome, by trying
        to gauge and monitor the accumulated duration of the processing, with
        the intention to transition from schedctl to thread priority elevation
        once a threshold has been reached.

[--- End of quote ---]

Even so, I felt somewhat puzzled by the response to Khalid's
delay-preempt patch.
While some arguments put forth against it were certainly valid in their own
right, their focus somehow seemed to be that the solution won't interoperate
well with all the conceivable setups and application mixes, won't solve all
the concurrency issues, and worst of all won't slice bread either. Whereas my
perception (perhaps incorrect) was that this patch was not meant to solve a
whole range of problems or to be a feature enabled by default in a generic
system, but rather to be a specialized feature configurable in special-purpose
systems (e.g. database servers; Khalid was doing it for Oracle, and his JVM
use case I believe is also in this context) dedicated to running a
primary-importance application that utilizes this mechanism, and meant to
solve a very particular problem of this specific category of system
deployment cases. It appeared to me that the participants in the
delay-preempt patch discussion might have had different ideas of the implied
use scope of the suggested feature, and this might have influenced the
direction of the discussion.

- Sergey


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-28  1:19 ` Andi Kleen
  2014-07-28  4:16   ` Sergey Oboguev
@ 2014-07-28  7:24   ` Mike Galbraith
  2014-08-03  0:43     ` Sergey Oboguev
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2014-07-28  7:24 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Sergey Oboguev, linux-kernel, khalid.aziz

On Sun, 2014-07-27 at 18:19 -0700, Andi Kleen wrote: 
> Sergey Oboguev <oboguev.public@gmail.com> writes:
> 
> > [This is a repost of the message from few day ago, with patch file
> > inline instead of being pointed by the URL.]
> 
> Have you checked out the preemption control that was posted some time
> ago? It did essentially the same thing, but somewhat simpler than your 
> patch.
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html
> 
> Yes I agree with you that lock preemption is a serious issue that needs solving.

Yeah, it's a problem, and well known.

One mitigation mechanism that exists in the stock kernel today is the
LAST_BUDDY scheduler feature.  That took pgsql benchmarks from "shite"
to "shiny", and specifically targeted this issue.

Another is SCHED_BATCH, which can solve a lot of the lock problem by
eliminating wakeup preemption within an application.  One could also
create an extended batch class which is not only immune from other
SCHED_BATCH and/or SCHED_IDLE tasks, but from all SCHED_NORMAL wakeup
preemption.  Trouble is that killing wakeup preemption precludes very
fast very light tasks competing with hogs for CPU time.  If your load
depends upon these performing well, you have a problem.
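
For reference, a thread opts into SCHED_BATCH through the standard API;
a minimal sketch (go_batch is an assumed name):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    void go_batch(void)
    {
            /* SCHED_BATCH requires a static priority of 0 */
            struct sched_param sp = { .sched_priority = 0 };

            if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1)
                    perror("sched_setscheduler(SCHED_BATCH)");
    }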

Mechanism #3 is use of realtime scheduler classes.  This one isn't
really a mitigation mechanism, it's more like donning a super suit.

So three mechanisms exist, the third being supremely effective, but high
frequency usage is expensive, ergo huge patch.

The lock holder preemption problem being identical to the problem RT
faces with kernel locks...

A lazy preempt implementation ala RT wouldn't have the SCHED_BATCH
problem, but would have a problem in that, should critical sections not
be as tiny as they should be, every time you dodge preemption you're
fighting the fair engine and may pay heavily in terms of scheduling
latency.  Not a big hairy deal, if it hurts, don't do that.  Bigger
issue is that you have to pop into the kernel on lock acquisition and
release to avoid jabbering with the kernel via some public phone.
Popping into the kernel, if say some futex were victimized, also erases
the "f" in futex, and restricting cost to consumer won't be any easier.

The difference wrt cost acceptability is that the RT issue is not a
corner case, it's core issue resulting from the nature of the RT beast
itself, so the feature not being free is less annoying.  A corner case
fix OTOH should not impact the general case at all.

Whatever the outcome, I hope it'll be tiny. 1886 lines ain't tiny.

-Mike



* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-25 19:45 [PATCH RFC] sched: deferred set priority (dprio) Sergey Oboguev
                   ` (2 preceding siblings ...)
  2014-07-28  1:19 ` Andi Kleen
@ 2014-07-30 13:02 ` Pavel Machek
  2014-08-03  0:47   ` Sergey Oboguev
  3 siblings, 1 reply; 34+ messages in thread
From: Pavel Machek @ 2014-07-30 13:02 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: linux-kernel

Hi!

> One of the intended purposes of this facility (but its not sole purpose) is to
> render a lightweight mechanism for priority protection of lock-holding critical
> sections that would be an adequate match for lightweight locking primitives
> such as futex, with both featuring a fast path completing within the
> userspace.

Do we get a manpage describing the interface...?

Would it make sense to make set_priority a "vsyscall" so it is fast
enough, and delayed_set_priority does not need to be exposed to
userspace?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-28  7:24   ` Mike Galbraith
@ 2014-08-03  0:43     ` Sergey Oboguev
  2014-08-03  9:56       ` Mike Galbraith
  2014-08-03 17:30       ` Andi Kleen
  0 siblings, 2 replies; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-03  0:43 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Mon, Jul 28, 2014 at 12:24 AM, Mike Galbraith
<umgwanakikbuti@gmail.com> wrote:
> On Sun, 2014-07-27 at 18:19 -0700, Andi Kleen wrote:
>> Sergey Oboguev <oboguev.public@gmail.com> writes:
>>
>> > [This is a repost of the message from few day ago, with patch file
>> > inline instead of being pointed by the URL.]
>>
>> Have you checked out the preemption control that was posted some time
>> ago? It did essentially the same thing, but somewhat simpler than your
>> patch.
>>
>> http://lkml.iu.edu/hypermail/linux/kernel/1403.0/00780.html
>>
>> Yes I agree with you that lock preemption is a serious issue that needs solving.
>
> Yeah, it's a problem, and well known.
>
> One mitigation mechanism that exists in the stock kernel today is the
> LAST_BUDDY scheduler feature.  That took pgsql benchmarks from "shite"
> to "shiny", and specifically targeted this issue.
>
> Another is SCHED_BATCH, which can solve a lot of the lock problem by
> eliminating wakeup preemption within an application.  One could also
> create an extended batch class which is not only immune from other
> SCHED_BATCH and/or SCHED_IDLE tasks, but from all SCHED_NORMAL wakeup
> preemption.  Trouble is that killing wakeup preemption precludes very
> fast very light tasks competing with hogs for CPU time.  If your load
> depends upon these performing well, you have a problem.
>
> Mechanism #3 is use of realtime scheduler classes.  This one isn't
> really a mitigation mechanism, it's more like donning a super suit.
>
> So three mechanisms exist, the third being supremely effective, but high
> frequency usage is expensive, ergo huge patch.
>
> The lock holder preemption problem being identical to the problem RT
> faces with kernel locks...
>
> A lazy preempt implementation ala RT wouldn't have the SCHED_BATCH
> problem, but would have a problem in that should critical sections not
> be as tiny as they should be, every time you dodge preemption you're
> fighting the fair engine, may pay heavily in terms of scheduling
> latency.  Not a big hairy deal, if it hurts, don't do that.  Bigger
> issue is that you have to pop into the kernel on lock acquisition and
> release to avoid jabbering with the kernel via some public phone.
> Popping into the kernel, if say some futex were victimized, also erases
> the "f" in futex, and restricting cost to consumer won't be any easier.
>
> The difference wrt cost acceptability is that the RT issue is not a
> corner case, it's a core issue resulting from the nature of the RT beast
> itself, so the feature not being free is less annoying.  A corner case
> fix OTOH should not impact the general case at all.


When reasoning about concurrency management it may be helpful to keep in mind
the fundamental perspective that the problem space and solution space in this
area are fragmented -- your message itself exemplifies this, and it applies
across the board to all other solution techniques as well. There is no
all-unifying solution that works in all use cases and for all purposes.

This applies even to seemingly well-defined problems such as lock holder
preemption avoidance.

One of the divisions on a broad scale is between cases where the
wait/dependency chain can be made explicit at run time (e.g. the application
or system can tell that thread A is waiting on thread B, which is waiting on
thread C) and cases where the dependency cannot be explicitly expressed and
information on it is lacking.

In the former case some form of priority inheritance or proxy execution might
often (but not always) work well.

In the latter case PI/PE cannot be applied. This case occurs when a component
needs to implement a time-urgent section acting on behalf of (e.g. in response
to an event in) another component in the application, and the latter is not
(or often cannot practically be) instrumented with code that expresses its
inner dependencies, so the information about the dependencies is lacking.

(One example might be a virtual machine running a guest operating system that
is not paravirtualized or can be paravirtualized only to a limited extent. The
VM might guess that preemption of a VCPU thread that is processing events such
as IPI interrupts, clock interrupts or certain device interrupts is likely to
cause overall performance degradation due to other VCPUs spinning for an IPI
response or spinning waiting for a spinlock; and though the general kinds of
these dependencies may be foreseen, actual dependency chains between VCPUs
cannot be established at run time.)

In cases when dependency information is lacking, priority protection remains
the only effective recourse.

Furthermore, even in cases when dependency information is available, PI/PE per
se is not always a satisfactory or sufficient solution. Consider for instance
the classic case of a userspace spinlock or hybrid spin-then-block lock that
is highly contended and often requested by the threads, where a running thread
spends a notable part of a timeslice holding the lock (not necessarily
grabbing and holding it once: the thread may grab and release it many times
during a timeslice, but the total holding time is a notable fraction of a
timeslice -- notable being anywhere from a fraction of a percent and up). The
probability of a thread being preempted while holding the lock may thus be
significant. In a simple classic scheme the waiters will then continue to spin
[*], and subsequently some of them start entering a blocking wait after a
timeout, at this point releasing the CPU(s) and letting the lock holder
continue; but by that time the incurred cost already involves several context
switches, scheduler calculation cycles and the waste due to spinning... and
all the while this is happening, new contenders arrive and start spinning,
wasting CPU resources. What was supposed to be a short critical section turns
into something else entirely. Yes, eventually a scheme based on PI/PE or some
form of yield_to will push the holder through, but by the time this happens a
high cost has already been paid.

    [*] Unless the kernel provides a facility that will write "holder preempted"
        flag into the lock structure when the thread is being preempted so the
        spinners can spot this flag and transition to blocking wait. I am not
        aware of any OS that actually provides such a facility, but even if it
        were available, it would only partially reduce the overall incurred
        cost described above.

Furthermore, if the preempted lock holder and the task that was willing to
yield were executing on different CPUs (as is likely), then with per-CPU
scheduling queues (as opposed to a single global queue) it may be quite a
while -- around the queue rebalancing interval -- before the preempted holder
gets a chance to run and release the lock (all the while arriving waiters are
spinning), and on top of that the cost of cross-CPU task migration may have to
be paid.
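
To make the scenario concrete, below is a minimal sketch of such a hybrid
spin-then-block lock in the usual futex style. The names and the spin
threshold are arbitrary; every step taken after the spin limit is exactly the
post-preemption cost described above:

    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    #define SPIN_LIMIT 1000     /* arbitrary; tuning it is the hard part */

    /* lock states: 0 = free, 1 = taken, 2 = taken with sleepers present */

    static void lock_acquire(atomic_int *lock)
    {
        int spins = 0, c;

        for (;;) {
            c = 0;
            if (atomic_compare_exchange_strong(lock, &c, 1))
                return;             /* uncontended fast path */
            if (++spins < SPIN_LIMIT)
                continue;           /* spinning is cheap while the holder
                                       runs, pure waste if it was preempted */
            /* give up, mark "sleepers present" and block in the kernel */
            c = atomic_exchange(lock, 2);
            if (c == 0)
                return;             /* the holder released it meanwhile */
            syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
            spins = 0;
        }
    }

    static void lock_release(atomic_int *lock)
    {
        if (atomic_exchange(lock, 0) == 2)  /* anyone asleep? */
            syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }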

The wisdom of these observations is that the problem and solution space is
fragmented and there is no "holy grail" solution that would cover all use
cases. Solutions are specific to use cases and their priorities.

Given this, the best an OS can do is to provide a range of solutions --
instruments -- and let an application developer (and/or system owner) pick
those that are most fitting for the given application's purposes.

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-30 13:02 ` Pavel Machek
@ 2014-08-03  0:47   ` Sergey Oboguev
  2014-08-03  8:30     ` Pavel Machek
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-03  0:47 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

On Wed, Jul 30, 2014 at 6:02 AM, Pavel Machek <pavel@ucw.cz> wrote:
> Hi!
>
>> One of the intended purposes of this facility (but its not sole purpose) is to
>> render a lightweight mechanism for priority protection of lock-holding critical
>> sections that would be an adequate match for lightweight locking primitives
>> such as futex, with both featuring a fast path completing within the
>> userspace.

> Do we get a manpage describing the interface...?

At this point it is just an RFC, and the only available write-up is an article
(the URL is in the original message).

There has been no word from the maintainers yet on whether the proposal
appears to be a "go" or "no-go" in general.

Assuming the response is favorable, final polishing of the patch can be done
(such as perhaps replacing or augmenting the authlist with a capability), and
a man page can be added at that point as well.

> Would it make sense to make set_priority a "vsyscall" so it is fast enough,
> and delayed_set_priority does not need to be exposed to userspace?

Regular "set priority" cannot be wrapped around "deferred set priority".

"Deferred set priority" acts only on current thread.

Even more importantly, its use also relies on a thread caching the
application's current knowledge of the thread priority in userspace; if the
thread priority has been changed from outside the application (or even by
another thread within the same application), this knowledge becomes invalid,
and the application is then responsible for performing whatever recovery
action is appropriate.

Thus DPRIO is not a replacement for the fully-functional "set priority", but
rather a specialized tool for certain use cases.

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-03  0:47   ` Sergey Oboguev
@ 2014-08-03  8:30     ` Pavel Machek
  2014-08-05 23:03       ` Sergey Oboguev
  0 siblings, 1 reply; 34+ messages in thread
From: Pavel Machek @ 2014-08-03  8:30 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: linux-kernel

On Sat 2014-08-02 17:47:52, Sergey Oboguev wrote:
> On Wed, Jul 30, 2014 at 6:02 AM, Pavel Machek <pavel@ucw.cz> wrote:
> > Hi!
> >
> >> One of the intended purposes of this facility (but its not sole purpose) is to
> >> render a lightweight mechanism for priority protection of lock-holding critical
> >> sections that would be an adequate match for lightweight locking primitives
> >> such as futex, with both featuring a fast path completing within the
> >> userspace.
> 
> > Do we get a manpage describing the interface...?
> 
> At this point it is just an RFC, and the only available write-up is an article
> (URL is in the original message).
> 
> There has been no word from the maintainers yet whether the proposal appears
> to be a "go" or "no-go" in general.

It appears to be "no-go" to me.

> Regular "set priority" cannot be wrapped around "deferred set priority".

Umm. Why not?

int getpriority(int which, int who);
int setpriority(int which, int who, int prio);

Description

 The scheduling priority of the process, process group, or user, as
 indicated by which and who is obtained with the getpriority() call
 and set with the setpriority() call.

 The value which is one of PRIO_PROCESS, PRIO_PGRP, or PRIO_USER, and
 who is interpreted relative to which (a process identifier for
 PRIO_PROCESS, process group identifier for PRIO_PGRP, and a user ID
 for PRIO_USER). A zero value for who denotes (respectively) the
 calling process, the process group of the calling process, or the
 real user ID of the calling process. Prio is a value in the range -20
 to 19 (but see the Notes below). The default priority is 0; lower
 priorities cause more favorable scheduling.

In vsyscall area:

   if (which == PRIO_PROCESS && who == 0) {
      /* perform your optimized priority set */
   } else {
      /* perform syscall */
   }

Now, you have to make sure to keep reasonably close semantics, but
that would be a good idea, anyway.
   
> Even more importantly, its use also relies on a thread caching the application's
> current knowledge of the thread priority in the userspace, and if the thread
> priority had been changed from the outside of the application (or even by
> another thread within the same application), this knowledge becomes invalid,
> and then the application is responsible for performing whatever recovery action
> is appropriate.

You mean "we rely on applications handling the situation they can't
and will not handle"?

Actually, it seems to be a security issue to me.

If root renices the application to a high nice value, the application should
not be able to work around it via the DPRIO interface.

IOW, the effective priority should always be the lower of the DPRIO one and
the normal one, at the very least.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-03  0:43     ` Sergey Oboguev
@ 2014-08-03  9:56       ` Mike Galbraith
  2014-08-05 23:28         ` Sergey Oboguev
  2014-08-03 17:30       ` Andi Kleen
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2014-08-03  9:56 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Sat, 2014-08-02 at 17:43 -0700, Sergey Oboguev wrote:

> When reasoning about concurrency management it may be helpful to keep in mind
> the fundamental perspective that the problem space and solution space in this
> area are fragmented -- your message itself exemplifies this, and it applies
> across the board to all other solution techniques as well. There is no
> all-unifying solution that works in all use cases and for all purposes.

I'm just not seeing the beauty in your patch.  Ignoring SCHED_NORMAL
where priority escalation does not work as preemption proofing, for
realtime classes, what I see is a programmer using a mechanism designed
to initiate preemption arming his tasks with countermeasures to the very
thing he initiates.  Deferred preempt seems to be what you want, but you
invented something very different.

As noted though, you don't have to convince me.  The thing to do is chop
your patch up into a nice reviewable series, and submit it.

-Mike


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-03  0:43     ` Sergey Oboguev
  2014-08-03  9:56       ` Mike Galbraith
@ 2014-08-03 17:30       ` Andi Kleen
  2014-08-05 23:13         ` Sergey Oboguev
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2014-08-03 17:30 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: Mike Galbraith, Andi Kleen, linux-kernel, khalid.aziz

> (One example might be a virtual machine running a guest operating system that
> is not paravirtualized or can be paravirtualized only to a limited extent. The
> VM might guess that preemption of a VCPU thread that is processing events such
> as IPI interrupts, clock interrupts or certain device interrupts is likely to
> cause overall performance degradation due to other VCPUs spinning for an IPI
> response or spinning waiting for a spinlock; and though the general kinds of
> these dependencies may be foreseen, actual dependency chains between VCPUs
> cannot be established at run time.)

PAUSE loop exiting (PLE) can handle it in limited fashion on VMs
even without paravirtualization.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-03  8:30     ` Pavel Machek
@ 2014-08-05 23:03       ` Sergey Oboguev
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-05 23:03 UTC (permalink / raw)
  To: Pavel Machek; +Cc: linux-kernel

On Sun, Aug 3, 2014 at 1:30 AM, Pavel Machek <pavel@ucw.cz> wrote:

> it seems to be a security issue to me.
> If root renices the application to high nice value, application should
> not be able to work around it by the DPRIO interface.

There is no such issue.

Since 2.6.12, Linux does allow a task that has been renice'd to increase its
priority back, as long as it is within RLIMIT_NICE. See the man page for
nice(2) and the reference there under EPERM to RLIMIT_NICE, or sys_nice(...)
and can_nice(...) in kernel/sched/core.c.
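
For reference, the check in kernel/sched/core.c is approximately this
(as of 3.15):

    int can_nice(const struct task_struct *p, const int nice)
    {
        /* convert nice value [19,-20] to rlimit style value [1,40] */
        int nice_rlim = 20 - nice;

        return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
                capable(CAP_SYS_NICE));
    }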

If the administrator wants to clamp the task down so that it is unable to come
back, he should change the task's rlimit for RLIMIT_NICE.

DPRIO does honor the change in RLIMIT_NICE (as well as in RLIMIT_RTPRIO) and
won't let a task cross those limits.

> You mean "we rely on applications handling the situation they can't and will
> not handle"?

Not really.

There are two different cases to be considered.

One is when an application's thread priority is changed from inside the
application, in coordinated code, and the code is specifically set up to
handle asynchronous thread priority changes.

(For example, if the application is a VM, and a virtual device sends an
interrupt to a VCPU, the device handler may want to bump up the VCPU thread's
priority so the interrupt gets processed promptly. When the VCPU notices and
dequeues the sent interrupt, it reevaluates the thread priority based on the
totality of synchronous intra-VCPU conditions and currently visible
asynchronous conditions, such as the set of pending interrupts, and sets the
new thread priority accordingly.)
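
A hypothetical sketch of that flow; every name below is invented for
illustration, except dprio_set(), which stands for the userlib call:

    #include <stdatomic.h>
    #include <sys/types.h>

    struct vcpu {
        atomic_uint pending_irqs;
        pid_t tid;
        /* ... */
    };

    /* Device handler thread: a cross-thread boost, so this has to be a
       regular set-priority call (DPRIO acts on the current thread only). */
    void post_irq(struct vcpu *v, unsigned int irq)
    {
        atomic_fetch_or(&v->pending_irqs, 1u << irq);
        boost_vcpu_thread(v);            /* hypothetical: sched_setattr() */
    }

    /* VCPU thread: dequeue the event, then re-derive its own priority
       from the totality of currently visible conditions, cheaply. */
    void vcpu_reevaluate(struct vcpu *v)
    {
        unsigned int pending = atomic_exchange(&v->pending_irqs, 0);
        handle_pending_irqs(v, pending); /* hypothetical */
        dprio_set(vcpu_compute_prio(v)); /* hypothetical prio calculation */
    }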

Another case is when an application's thread priority is changed from outside
the application in an arbitrary way. There is no radical difference in this
case between DPRIO and regular set_priority.

Such an external priority change can indeed be disruptive for an application,
but it is disruptive for an application that uses regular set_priority as
well. Suppose the thread was running some critical tasks and/or holding some
critical locks, had used regular set_priority to that end, and then was
knocked down. This would be disruptive for an application using regular
set_priority just as it would be for one using DPRIO. The exact mechanics of
the disruption would be somewhat different, but the disruption would be
present in both cases.

Likewise, an application has means for recovery in both the regular
set_priority and the DPRIO case. For an application using regular
set_priority, the recovery will automatically happen on the next set_priority
call. In the DPRIO case it may take a bunch of dprio_set calls, but given that
they are meant to be used in high-frequency invocation scenarios, the recovery
is likely to happen pretty fast as well: after a certain number of cycles the
"writeback" priority change cached in userspace is likely to get "written
through" to the kernel, albeit this process is somewhat "stochastic" and
sequences can be constructed where it won't happen for quite a while. If the
application wishes to give it some guaranteed predictability, it could use
dprio_setnow(prio, DPRIO_FORCE) in every N-th invocation instead of
dprio_set(prio).
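
A sketch of that pattern in terms of the userlib calls named above;
FORCE_EVERY is an arbitrary constant:

    enum { FORCE_EVERY = 100 };          /* arbitrary */

    static void set_section_prio(int prio)
    {
        static __thread unsigned int calls;

        if (++calls % FORCE_EVERY == 0)
            dprio_setnow(prio, DPRIO_FORCE);  /* forced write-through */
        else
            dprio_set(prio);                  /* cached in userspace */
    }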

Nevertheless, this indirection is one reason why I do not think making regular
set_priority a wrapper around DPRIO is a good idea. It would strip the
application developer of direct control, unless the DPRIO_FORCE flag were
used, but then it becomes regular set_priority again.

Another reason is that error reporting in the DPRIO case is delayed (e.g. done
via a callback in the DPRIO library implementation), and that's different from
the semantics of the regular set_priority interface.

To summarize: regular set_priority (e.g. sched_setattr) defines an interface
that is immediate (synchronous) and uncached, whereas DPRIO is deferred
(asynchronous) and cached. The semantics are too different to let the former
be wrapped around the latter without distortion.

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-03 17:30       ` Andi Kleen
@ 2014-08-05 23:13         ` Sergey Oboguev
  0 siblings, 0 replies; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-05 23:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mike Galbraith, linux-kernel, khalid.aziz

On Sun, Aug 3, 2014 at 10:30 AM, Andi Kleen <andi@firstfloor.org> wrote:
>> (One example might be a virtual machine running a guest operating system that
>> is not paravirtualized or can be paravirtualized only to a limited extent. The
>> VM might guess that preemption of a VCPU thread that is processing events such
>> as IPI interrupts, clock interrupts or certain device interrupts is likely to
>> cause overall performance degradation due to other VCPUs spinning for an IPI
>> response or spinning waiting for a spinlock; and though the general kinds of
>> these dependencies may be foreseen, actual dependency chains between VCPUs
>> cannot be established at run time.)
>
> PAUSE loop exiting (PLE) can handle it in limited fashion on VMs
> even without paravirtualization.


The key word is "limited". Sometimes the use of PLE can help to a limited
extent (limited indeed, see below), and sometimes it can even be
counter-productive, which in turn can also be mitigated
(http://lwn.net/Articles/424960/), but again in a limited fashion.

This indeed is similar to the spinlock example discussed in the previous
message. Similarly to PI/PE, the use of PLE and other spin-loop detection
techniques is about trying to minimize the cost of actions taken after an
inopportune preemption, to whatever extent it can be minimized (unavoidably
limited, which is the key thing), whereas priority protection is about
avoiding these costs in the first place.

And then, x86 VMs are just one use case, whereas there are others with no PLE
or similar recourse. DPRIO, as a matter of fact, happened to be conceived
within the context of a project running a legacy non-x86 OS that does not have
pause in its loops; the loops are scattered throughout the pre-existing binary
code in huge numbers and are not pragmatically instrumentable. But this just
exemplifies a general pattern of having to deal with concurrency in a
component being driven from, or driving, another component that is for
practical purposes "set in stone" (3rd party, legacy, out of the scope of the
project, etc.).

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-03  9:56       ` Mike Galbraith
@ 2014-08-05 23:28         ` Sergey Oboguev
  2014-08-06  5:41           ` Mike Galbraith
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-05 23:28 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Sun, Aug 3, 2014 at 2:56 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> SCHED_NORMAL where priority escalation does not work as preemption proofing

Remember, DPRIO is not for lock holders only.

Using DPRIO within the SCHED_NORMAL policy would make sense for an application
that has a "soft" time-urgent section where it believes strong protection from
preemption is not really necessary and just a greater claim to CPU time share
would do, in cases where the application does not know beforehand whether the
section will be short or long, and in the majority of cases it is short
(sub-millisecond) but occasionally can take longer.

> what I see is a programmer using a mechanism designed
> to initiate preemption arming his tasks with countermeasures to the very
> thing he initiates.

I disagree. The exact problem is that it is not a developer who initiates the
preemption, but the kernel or another part of application code that is unaware
of the other thread's condition and doing it blindly, lacking the information about
the state of the thread being preempted and the expected cost of its preemption
in this state. DPRIO is a way to communicate this information.

> Deferred preempt seems to be what you want, but you
> invented something very different.

"Simple" deferred preempt is one use case.

A more complex case is "ranked" deferred preemption, where there are multiple
contending contexts, and there is a need to express relative costs of
victimizing one vs. another.

> I'm just not seeing the beauty in your patch.

Perhaps I might have a dissenting sense of beauty.
But then, it is not only about the beauty (which is subjective anyway), but
even more so about the pudding.

Seriously though, it's really simple: the whole range of available remedies is
divided across post-preemption solutions and preemption-avoidance solutions
(and of course designing app structure for minimizing the contention in the
first place, but for the sake of this discussion we can assume this had been
done to the extent possible). Post-preemption solutions unavoidably incur cost
(and a part of this cost is incurred even before the solution can be engaged).
If this cost can be maintained insignificant for the given use case, great.
However what do you propose to do with those use cases where it cannot? To tell
a developer (or IT manager) "we do not care about your 5% or 20% losses, and if
you do not like it, use another OS that would work better for you"? This would
not sound too productive to me.

This then leaves preemption-avoidance solutions, which are about communicating
the cost of preemption to the kernel in one way or another. DPRIO is one of
these solutions, and I believe it is as good as or better than the others (I
explained earlier in which cases it is better than schedctl).

(The only thing seemingly transcending this divide may be the delegation of
scheduling decisions from the kernel to userspace via mechanisms attempted in
a number of experimental OSes, likely to see the light of day again with the
expected rise of library OSes/unikernels; but at the end of the day this is
still a set of preemption-avoidance vs. post-preemption solutions, just boxed
differently, and utilizing the greater application state knowledge bandwidth
available within a single address space.)

I hear your dislike, but I think it is misplaced.

I am unsure about the exact origins of your outlook, but I suspect it might be
arising at least in part from an aspiration to find a "holy grail" solution
that would cover all the cases (after all, kernel developers tend to be
psychological perfectionists, at least within this little hobby of kernel
development), and while such an aspiration is admirable per se, unfortunately
this holy grail just does not exist.

Likewise, an aspiration to make sure that everything "mixes and matches" with
absolutely everything else in all cases may be admirable, but it is futile in
its absolutist form because there are obvious limits to it. Furthermore,
applications that need to rely on preemption avoidance are most often
primary-importance applications for the installation, and as such, mixing and
matching with absolutely everything else is not essential and definitely not a
priority compared to letting the primary application run better. Furthermore,
many if not most deployments of primary-importance applications today are
already single-purposed, and with continued VM sprawl this trend will only
grow, and with it the emphasis on letting a primary application or a practical
well-defined application set run better, over absolutist arbitrary mixing and
matching. This is certainly not to deny the importance of mix-and-match
environments such as desktop systems or personal devices, but it is important
to recognize the difference in requirements/priorities between the
environments.

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-05 23:28         ` Sergey Oboguev
@ 2014-08-06  5:41           ` Mike Galbraith
  2014-08-06  7:42             ` Mike Galbraith
  2014-08-07  1:26             ` Sergey Oboguev
  0 siblings, 2 replies; 34+ messages in thread
From: Mike Galbraith @ 2014-08-06  5:41 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Tue, 2014-08-05 at 16:28 -0700, Sergey Oboguev wrote: 
> On Sun, Aug 3, 2014 at 2:56 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> 
> > SCHED_NORMAL where priority escalation does not work as preemption proofing
> 
> Remember, DPRIO is not for lock holders only.
> 
> Using DPRIO within the SCHED_NORMAL policy would make sense for an application
> that has a "soft" time-urgent section where it believes strong protection from
> preemption is not really necessary and just a greater claim to CPU time share
> would do, in cases where the application does not know beforehand whether the
> section will be short or long, and in the majority of cases it is short
> (sub-millisecond) but occasionally can take longer.

Every single time that a SCHED_NORMAL task boosts its priority (nice)
during a preemption, the math has already been done, vruntime has
already been adjusted.  The boat has already sailed, and your task is
waving from the dock.  All this does is waste cycles and add yet more
math to the done deal.  Sure, when it gets the CPU back, its usage will
be weighed differently, it will become more resistant to preemption, but
in no way immune.  There is nothing remotely deterministic about this,
making it somewhat of an oxymoron when combined with critical section.

Add group scheduling to the mix, and this becomes even more pointless.
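
For reference, the nice-to-weight table in kernel/sched/sched.h (abridged);
each nice step is roughly a 1.25x change in weight, and it only affects the
accounting from here on out:

    static const int prio_to_weight[40] = {
     /* -20 */     88761,     71755,     56483,     46273,     36291,
     /* ... */
     /*   0 */      1024,       820,       655,       526,       423,
     /* ... */
     /*  15 */        36,        29,        23,        18,        15,
    };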

> > what I see is a programmer using a mechanism designed
> > to initiate preemption arming his tasks with countermeasures to the very
> > thing he initiates.
> 
> I disagree. The exact problem is that it is not a developer who initiates the
> preemption, but the kernel or another part of application code that is unaware
> of the other thread's condition and doing it blindly, lacking the information about
> the state of the thread being preempted and the expected cost of its preemption
> in this state. DPRIO is a way to communicate this information.

Nope, I'm still not buying.  It's a bad idea to play with realtime
classes/priorities blindly, if someone does that, tough titty.

What DPRIO clearly does NOT do is to describe critical sections to the
kernel.  If some kthread prioritizes _itself_ and mucks up application
performance, file a bug report, that kthread is busted.  Anything a user
or application does with realtime priorities is on them.

Stuffing a priority card up selected tasks' sleeves does not describe
critical sections to the kernel, it only lets the programmer stymie
himself, some other programmer... or perhaps the user.

> > Deferred preempt seems to be what you want, but you
> > invented something very different.
> 
> "Simple" deferred preempt is one use case.
> 
> A more complex case is "ranked" deferred preemption, where there are multiple
> contending contexts, and there is a need to express relative costs of
> victimizing one vs. another.

You didn't invent either of those.

Hm.  I don't even recall seeing the task pulling a card from its sleeve
checking to see if the other player didn't have a joker up its sleeve.

> > I'm just not seeing the beauty in your patch.
> 
> Perhaps I might have a dissenting sense of beauty.
> But then, it is not only about the beauty (which is subjective anyway), but
> even more so about the pudding.
> 
> Seriously though, it's really simple: the whole range of available remedies is
> divided across post-preemption solutions and preemption-avoidance solutions
> (and of course designing app structure for minimizing the contention in the
> first place, but for the sake of this discussion we can assume this had been
> done to the extent possible). Post-preemption solutions unavoidably incur cost
> (and a part of this cost is incurred even before the solution can be engaged).
> If this cost can be maintained insignificant for the given use case, great.
> However what do you propose to do with those use cases where it cannot? To tell
> a developer (or IT manager) "we do not care about your 5% or 20% losses, and if
> you do not like it, use another OS that would work better for you"? This would
> not sound too productive to me.

I didn't suggest ignoring any problems.  I know full well that folks
doing Oracle/SAP stuff can in fact make prettier numbers using realtime
class.  I also know they can get themselves into trouble, those troubles
having inspired bugzilla to pull my chain.  I didn't say "go away", I
created a hack to _enable_ them to turn their pet piggies loose in god
mode.  I didn't try to disarm or dissuade them from aiming the rt pistol
at their own toes, I gave them toe-seeking bullets.

You're reading me entirely wrong, I'm not trying to discourage you from
inventing a better bullet, I just think this particular bullet is a dud.

-Mike


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-06  5:41           ` Mike Galbraith
@ 2014-08-06  7:42             ` Mike Galbraith
  2014-08-07  1:26             ` Sergey Oboguev
  1 sibling, 0 replies; 34+ messages in thread
From: Mike Galbraith @ 2014-08-06  7:42 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Wed, 2014-08-06 at 07:41 +0200, Mike Galbraith wrote:

> You're reading me entirely wrong, I'm not trying to discourage you from
> inventing a better bullet, I just think this particular bullet is a dud.

Anyway, I'll try to assume you're talking to the _reasonable_ people on
this list in any reply.  The chance of penetrating my thick skull once
I've made up my mind approaches zero ;-)

-Mike


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-06  5:41           ` Mike Galbraith
  2014-08-06  7:42             ` Mike Galbraith
@ 2014-08-07  1:26             ` Sergey Oboguev
  2014-08-07  9:03               ` Mike Galbraith
  1 sibling, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-07  1:26 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Tue, Aug 5, 2014 at 10:41 PM, Mike Galbraith
<umgwanakikbuti@gmail.com> wrote:

>> > SCHED_NORMAL where priority escalation does not work as preemption proofing
>>
>> Remember, DPRIO is not for lock holders only.
>>
>> Using DPRIO within the SCHED_NORMAL policy would make sense for an application
>> that has a "soft" time-urgent section where it believes strong protection from
>> preemption is not really necessary and just a greater claim to CPU time share
>> would do, in cases where the application does not know beforehand whether the
>> section will be short or long, and in the majority of cases it is short
>> (sub-millisecond) but occasionally can take longer.
>
> Every single time that a SCHED_NORMAL task boosts its priority (nice)
> during a preemption, the math has already been done, vruntime has
> already been adjusted.
> Sure, when it gets the CPU back, its usage will
> be weighed differently, it will become more resistant to preemption, but
> in no way immune.  There is nothing remotely deterministic about this,
> making it somewhat of an oxymoron when combined with critical section.

But you overlooked the point I was trying to convey in the paragraph you
are responding to.

Apart from SCHED_NORMAL being a marginal use case, if it is used at all, I do
not see it being used for lock-holding or similar critical sections where an
application wants to avoid preemption.

I can see DPRIO(SCHED_NORMAL) being used in the same cases where an
application would use nice for a temporary section, i.e. when it has a job
that needs to be processed relatively promptly over some time interval but not
really super-urgently and hard guarantees are not needed -- when the
application simply wants an improved claim to CPU resources compared to normal
threads over, let us say, the next half-second or so. It is ok if the
application gets preempted; all it cares about is the longer timeframe ("next
half-second") rather than the shorter and immediate timeframe ("next
millisecond").

The only reason why anyone would want to use DPRIO instead of regular nice in
this case is that it might be unknown beforehand whether the job will be short
or will take longer, with the majority of work items being very short but some
occasionally taking longer. In this case using DPRIO would cut the overhead
for the majority of section instances. To reiterate, this is a marginal and
most likely rare use case, but given the existence of a uniform interface I
just do not see why it should be blocked on purpose.

> If some kthread prioritizes _itself_ and mucks up application
> performance, file a bug report, that kthread is busted.  Anything a user
> or application does with realtime priorities is on them.

kthreads do not need RT, they just use spinlocks ;-)

On a serious note though, I am certainly not saying that injudicious use of RT
(or even nice) cannot disrupt the system, but is it reason enough to summarily
condemn the judicious use as well?

>> I disagree. The exact problem is that it is not a developer who initiates the
>> preemption, but the kernel or another part of application code that is unaware
>> of the other thread's condition and doing it blindly, lacking the information about
>> the state of the thread being preempted and the expected cost of its preemption
>> in this state. DPRIO is a way to communicate this information.

> What DPRIO clearly does NOT do is to describe critical sections to the
> kernel.

First of all let's reflect that your argument is not with DPRIO as such. DPRIO
after all is not a separate scheduling mode, but just a method to reduce the
overhead of regular set_priority calls (i.e. sched_setattr & friends).

Your argument is with the use of elevated priority as such, and you are saying
that using RT priority range (or high nice) does not convey to the kernel the
information about the critical section.

I do not agree with this, not wholly anyway. First of all, it is obvious that
set_priority does convey some information about the section, so perhaps a more
accurate re-formulation of your argument could be that it is imperfect,
insufficient information.

Let's try then to imagine what would constitute more perfect information. It
obviously should be some cost function describing the cost that would be
incurred if the task gets preempted. Something that would say (if we take the
simplest form) "if you preempt me within the next T microseconds (unless I
cancel or modify this mode), this preemption would incur cost X upfront further
accruing at a rate Y".
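
Purely as a thought experiment, such an interface might look like the
following (every name here is invented):

    #include <stdint.h>

    struct preempt_cost {
        uint64_t valid_us;       /* T: window this declaration covers */
        uint64_t cost_upfront;   /* X: one-time cost if preempted within T */
        uint64_t cost_rate;      /* Y: further cost per us spent preempted */
    };

    /* hypothetical syscall: declare the cost of preempting the caller */
    int sched_set_preempt_cost(const struct preempt_cost *pc);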

One issue I see with this approach is that in real life it might be very hard
for a developer to quantify the values for X, Y and T. A developer can easily
know that he wants to avoid preemption in a given section, but actually
quantifying the cost of preemption (X, Y) would take a lot of effort
(benchmarking), and furthermore it really cannot be assigned statically, as
the cost varies depending on the load pattern and site-specific configuration.
Furthermore, when dealing with multiple competing contexts, a developer can
typically tell that task A is more important than task B, but quantifying the
measure of their relative importance might be quite difficult.

Likewise, quantifying "T" is likely to be similarly difficult.

(And then, even supposing the developer knew that the section completes
within, let us say, 5 ms at three sigmas, is that reason good enough to
preempt the task at 6 ms for the sake of a normal timesharing thread? - I am
uncertain.)

Thus it appears to me that even if such an interface existed today, developers
would be daunted by it and would prefer to use RT instead as something more
manageable/usable, controllable and predictable.

But then, suppose such an interface existed and tasks expressing their
critical section information through it were -- within their authorized quotas
for T and X/Y -- given precedence over normal threads but preemptible by RT or
DL tasks. Would it not pretty much amount to the existence of a low-RT range
sitting just below the regular RT range, a low-RT range that tasks could enter
for a time? Just like they can enter the regular RT range now with
set_priority, also for a time. Would it be really different from judicious use
of existing RT, where tasks controlling "chainsaws" run at prio range 50-90,
while database engine threads utilize prio range 1-10 in their critical
sections?

(The only difference being that after the expiration of interval T the task's
priority is knocked down -- which a judiciously written application does
anyway, so the difference is just a protection against bugs and runaways --
and the task then becomes more subject to preemption, after which other
threads are free to use PI/PE to resolve the dependency when they know it; and
if they do not, then in the subset of use cases where spinning of old or
incoming waiters cannot be shut off, it is either back to using plain RT or
sustaining uncontrollable losses.)

I would be most glad to see a usable interface emerge (other than RT) for
providing information to the scheduler about a task's critical sections, but
for the considerations outlined I am doubtful of the possibility.

Apart from this, and coming back to DPRIO: even if a solution more
satisfactory than judicious use of RT existed, how long might it take to be
worked out? If the history of EDF from ReTiS concept to merging into the 3.14
mainline is a guide, it may take quite a while, so a stop-gap solution would
have value if only for timing considerations until something better
emerges... that is, assuming it can and ever does.

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-07  1:26             ` Sergey Oboguev
@ 2014-08-07  9:03               ` Mike Galbraith
  2014-08-08 20:11                 ` Sergey Oboguev
  2014-08-09  8:38                 ` Sergey Oboguev
  0 siblings, 2 replies; 34+ messages in thread
From: Mike Galbraith @ 2014-08-07  9:03 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Wed, 2014-08-06 at 18:26 -0700, Sergey Oboguev wrote: 
> On Tue, Aug 5, 2014 at 10:41 PM, Mike Galbraith
> <umgwanakikbuti@gmail.com> wrote:

(ok, seems you're not addressing the reasonable, rather me;)

> The only reason why anyone would want to use DPRIO instead of regular nice in
> this case is that it might be unknown beforehand whether the job will be short
> or will take longer, with the majority of work items being very short but some
> occasionally taking longer. In this case using DPRIO would cut the overhead
> for the majority of section instances. To reiterate, this is a marginal and
> most likely rare use case, but given the existence of a uniform interface I
> just do not see why it should be blocked on purpose.

Hey, if I had a NAK stamp handy, your patch would be wearing one ;-)

> > If some kthread prioritizes _itself_ and mucks up application
> > performance, file a bug report, that kthread is busted.  Anything a user
> > or application does with realtime priorities is on them.
> 
> kthreads do not need RT, they just use spinlocks ;-)

That's wrong if kthreads feed your load in any way, but your saying that
implies to me that your argument that the kernel is the initiator is
void.  You are not fighting a kernel issue, you have userspace issues.

The prioritization mechanism works fine, but you want to subvert it to
do something other than what it is designed and specified to do.

Whereas you, who wrote this patch, see "enhancement", I see "subversion" when
I read it.  Here lies our disagreement.

> On a serious note though, I am certainly not saying that injudicious use of RT
> (or even nice) cannot disrupt the system, but is it reason enough to summarily
> condemn the judicious use as well?

Bah, if Joe User decides setiathome is worthy of SCHED_FIFO:99, it is
by definition worthy of SCHED_FIFO:99.  I couldn't care less, it's none
of my business what anybody tells their box to do. 

> >> I disagree. The exact problem is that it is not a developer who initiates the
> >> preemption, but the kernel or another part of application code that is unaware
> >> of other thread's condition and doing it blindly, lacking the information about
> >> the state of the thread being preempted and the expected cost of its preemption
> >> in this state. DPRIO is a way to communicate this information.
> 
> > What DPRIO clearly does NOT do is to describe critical sections to the
> > kernel.
> 
> First of all let's reflect that your argument is not with DPRIO as such. DPRIO
> after all is not a separate scheduling mode, but just a method to reduce the
> overhead of regular set_priority calls (i.e. sched_setattr & friends).

I see subversion of a perfectly functional and specified mechanism.

> Your argument is with the use of elevated priority as such, and you are saying
> that using RT priority range (or high nice) does not convey to the kernel the
> information about the critical section.

I maintain that task priority does not contain any critical section
information whatsoever.  Trying to use priority to delineate critical
sections is a FAIL, you must HAVE the CPU to change your priority.

"Time for me to boost my<POW>self.. hey, wtf happened to the time!?!"

That's why you subverted the mechanism to not perform the specified
action.. at all, not merely not at this precise instant, because task
priority cannot be used by any task to describe a critical section.

> I do not agree with this, not wholly anyway. First of all, it is obvious that
> set_priority does convey some information about the section, so perhaps a more
> accurate re-formulation of your argument could be that it is imperfect,
> insufficient information.

I assert that there is _zero_ critical section information
present.  Create a task, you create a generic can, giving it a priority
puts that can on a shelf.  Can content is invisible to the kernel, it
can't see a critical section of the content, or whether the content as a
whole is a nicely wrapped up critical section of a larger whole.  There
is no section anything about this, what you have is a generic can of FOO
on a shelf BAR.

If you need a way to describe userspace critical sections, make a way to
identify userspace critical sections.  IMHO, task priority ain't it,
that's taken, and has specified semantics.

> Let's try then to imagine what would constitute more perfect information. It
> obviously should be some cost function describing the cost that would be
> incurred if the task gets preempted. Something that would say (if we take the
> simplest form) "if you preempt me within the next T microseconds (unless I
> cancel or modify this mode), this preemption would incur cost X upfront further
> accruing at a rate Y".

You can build something more complex, but the basic bit of missing
information appears to me to be plain old enter/exit.

> One issue I see with this approach is that in real life it might be very hard
> for a developer to quantify the values for X, Y and T. A developer can easily
> know that he wants to avoid preemption in a given section, but actually
> quantifying the cost of preemption (X, Y) would take a lot of effort
> (benchmarking), and furthermore it really cannot be assigned statically, as
> the cost varies depending on the load pattern and site-specific configuration.
> Furthermore, when dealing with multiple competing contexts, a developer can
> typically tell that task A is more important than task B, but quantifying the
> measure of their relative importance might be quite difficult.
> 
> Likewise, quantifying "T" is likely to be similarly difficult.
> 
> (And then, even supposing the developer knew that the section completes
> within, let us say, 5 ms at three sigmas, is that reason good enough to
> preempt the task at 6 ms for the sake of a normal timesharing thread? - I am
> uncertain.)
> 
> Thus it appears to me that even if such an interface existed today, developers
> would be daunted by it and would prefer to use RT instead as something more
> manageable/usable, controllable and predictable.
> 
> But then, suppose such an interface existed and tasks expressing their
> critical section information through it were -- within their authorized quotas
> for T and X/Y -- given precedence over normal threads but preemptible by RT or
> DL tasks. Would it not pretty much amount to the existence of a low-RT range
> sitting just below the regular RT range, a low-RT range that tasks could enter
> for a time? Just like they can enter the regular RT range now with
> set_priority, also for a time. Would it be really different from judicious use
> of existing RT, where tasks controlling "chainsaws" run at prio range 50-90,
> while database engine threads utilize prio range 1-10 in their critical
> sections?

I still see the deferred preemption thing as possibly being useful, or,
some completely new scheduling class could do whatever you want it to.

> (The only difference being that after the expiration of interval T the task's
> priority is knocked down -- which a judiciously written application does
> anyway, so the difference is just a protection against bugs and runaways --
> and the task then becomes more subject to preemption, after which other
> threads are free to use PI/PE to resolve the dependency when they know it; and
> if they do not, then in the subset of use cases where spinning of old or
> incoming waiters cannot be shut off, it is either back to using plain RT or
> sustaining uncontrollable losses.)
> 
> I would be most glad to see a usable interface emerge (other than RT) for
> providing information to the scheduler about a task's critical sections, but
> for the considerations outlined I am doubtful of the possibility.
> 
> Apart from this, and coming back to DPRIO: even if a solution more
> satisfactory than judicious use of RT existed, how long might it take to be
> worked out? If the history of EDF from ReTiS concept to merging into the 3.14
> mainline is a guide, it may take quite a while, so a stop-gap solution would
> have value if only for timing considerations until something better
> emerges... that is, assuming it can and ever does.

Lord knows.

-Mike 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-07  9:03               ` Mike Galbraith
@ 2014-08-08 20:11                 ` Sergey Oboguev
  2014-08-09 13:04                   ` Mike Galbraith
  2014-08-09  8:38                 ` Sergey Oboguev
  1 sibling, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-08 20:11 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> task priority cannot be used by any task to describe a critical section.
> I assert that there is _zero_ critical section information present.

This appears to be the crux of our disagreement.

This assertion is incorrect. The use of RT to bracket a critical section
obviously _does_ provide the following information:

1) designation of entry point for the start of critical activity

2) designation of exit point, albeit with timing not known in advance at entry
time

3) priority value which embodies a developer's assessment of the importance of
this critical activity relative to other critical activities within the
application or coherent application set, and thus a statement about the cost of
the activity's preemption for this application or application set

What priority protection _does not_ provide is:

1) advance (at entry time) estimate of the duration of the activity

2) the measure of preemption cost in "objective" (uniform) units that would
span across unrelated applications

3) a measure in units that can be easily stacked against the policy-specified
cost of deferring and penalizing normal tasks for too long (although to an
extent this is done in the RT use case by rt_bandwidth)

The implication of (2) and (3) is that one cannot easily have a system
management slider saying "penalize application A for the benefit of
application B (or of default undesignated applications) by giving weights
so-and-so to their cost factors"... Well, in a way and to an extent one can do
it by remapping the priority ranges of tasks A and B, but since the priority
interface was not designed for this from the ground up, that would be
cumbersome.

The provisioning of this missing information, however, is not realistically
possible outside of the simplest use cases run in a fixed configuration under
a fixed workload, as I elaborated in the previous message. Outside of such
unrealistic "lab" samples, even in the case of the simplest cost function the
estimates of T and X are not pragmatically obtainable. The estimates of Y and
T2 (the cut-off point for Y accrual) are likewise hard or even harder to
obtain. Thus even the simplest cost function cannot be pragmatically provided,
let alone any more complex cost function.

The issue is not with inventing some sophisticated cost function and a system
interface to plug it in. The issue is that the valid input data the cost
function would need to rely on is not pragmatically/economically obtainable.
(Nor is this issue pragmatically solvable by dynamically collecting such data
at run time.)

Therefore there is no pragmatically usable "better" solution waiting to be
discovered in the domain of preemption-deferral solutions (as opposed to
post-preemption solutions).

It stands to reason that the choice in this domain really is, and will always
pragmatically stay, between using the long-existing techniques (i.e. priority
protection or schedctl-style preempt delay, the latter having the limitations
I outlined earlier) or not using anything at all.

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-07  9:03               ` Mike Galbraith
  2014-08-08 20:11                 ` Sergey Oboguev
@ 2014-08-09  8:38                 ` Sergey Oboguev
  2014-08-09 14:13                   ` Mike Galbraith
  1 sibling, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-09  8:38 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> I see subversion of a perfectly functional and specified mechanism

Just wondering if the following line of thinking would sound just as much an
anathema from your perspective or perhaps a bit less terrible...

Proceeding from the observations (e.g. https://lkml.org/lkml/2014/8/8/492) that
representative critical section information is not pragmatically expressible at
development time or dynamically collectable by the application at run time, the
option still remains to put the weight of managing such information on the
shoulders of the final link in the chain, the system administrator, providing
him with application-specific guidelines and also with monitoring tools.

It might look approximately like this.

It might be possible to define a scheduling class, or some other kind of
scheduling data entity, for tasks utilizing preemption control. Tasks
belonging to this class that have a critical section currently active are
preemptible by RT or DL tasks just like normal threads; however, they are
granted a limited and controlled degree of protection against preemption by
normal threads, and also a limited ability to urgently preempt normal threads
on a wakeup.

Tasks inside this class may belong to one of a number of groups somewhat akin
to cgroups (perhaps even implemented as an extension to cgroups).

The properties of a group are:

* Maximum critical section duration (ms). This is not based on the actual
duration of critical sections for the application and may exceed it manyfold;
the purpose is merely to be a safeguard against runaways. If a task stays
inside a critical section longer than the specified time limit, it loses the
protection against preemption and becomes for practical purposes a normal
thread. The group keeps statistics on how often its tasks overstay in a
critical section and exceed the specified limit.

* Percentage of CPU time that members of the group can collectively spend
inside their critical sections over some sampling interval while enjoying the
protection from preemption. This is the critical parameter. If group members
collectively spend a larger share of CPU time in their critical sections,
exceeding the specified limit, they start losing protection from preemption by
normal threads, to keep their protected time within the quota.

For example the administrator may designate that threads in group "MyDB" can
spend no more than 20% of system CPU time combined in the state of being
protected from preemption, while threads in group "MyVideoEncoder" can spend
not more than 10% of system CPU time in preemption-protected state.
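
Again, this is just a crude sketch; all the attribute names below are
invented:

    #include <stdint.h>

    /* Hypothetical per-group attributes, as they might be exposed
       through a cgroup-like interface. */
    struct csect_group_attrs {
        uint64_t max_section_ms;     /* runaway safeguard, not a real bound */
        uint32_t max_protected_pct;  /* aggregate protected-CPU-time quota */
        uint32_t scale_down_weight;  /* share of system-wide clamping */
        int      base_prio;          /* slides the group's range vs. others */
    };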

If the actual aggregate critical-section time spent by threads in all the
groups, and also by RT tasks, starts pushing some system-wide limit (perhaps
rt_bandwidth), the available group percentages are dynamically scaled down, to
reserve some breathing space for normal tasks and to depress the groups in
some proportional way. Scaling down can be either proportional to the group
quotas or controlled by separate scale-down weights.

A monitoring tool can display how often the tasks in a group requesting
protection from preemption are not granted it, or lose it, because of
overdrafting the group quota. The system administrator may then either choose
to enlarge the group's quota, or leave it be and accept the application
running sub-optimally.

An application can also monitor the statistics on rejection of preemption
protection for its threads (and also the actual preemption frequency while
inside a declared critical section) and, if the rate is high, issue an
advisory message to the administrator.

Furthermore:

Threads within a group need to have relative priorities. There should be a
way for a thread processing a highly critical section to be favored over a
thread processing a medium-significance critical section.

There should also be a way to change a thread's group-relative priority both
from inside and from outside the thread. If thread A queues an important
request for processing by thread B, A should be able to bump B's
group-relative priority.

A thread having a non-zero group-relative priority is considered to be within
a critical section.

If a thread having a non-zero group-relative priority is woken up, it
preempts a normal thread, as long as the group's critical-section time usage
is within the group's quota.

The tricky thing is how to correlate the priority ranges of different groups.
E.g. suppose there is a thread T1 belonging to group APP1 with group-relative
priority 10 within APP1, and a thread T2 belonging to group APP2 with
group-relative priority 20 within APP2. Which thread is more important and
should run first? Perhaps this can be left to the system administrator, who
can adjust the "base priority" property of a group, thus sliding groups
upwards or downwards relative to each other (or, generally, using some form
of intergroup priority mapping control), as sketched below.
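
E.g. with a per-group "base priority", the intergroup mapping could be as
simple as (again purely illustrative):

    /* Illustrative: slide a group-relative level onto the system-wide
       scale via the admin-set base; clamping to the range of the class
       is omitted. */
    static int csprot_effective_prio(const struct csprot_group *g,
                                     int grp_prio)
    {
            return g->base_prio + grp_prio;
    }

With base_prio(APP1) = 30 and base_prio(APP2) = 15, T1 maps to level 40 and
T2 to level 35, so T1 would run first despite its lower group-relative number.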

This is not to suggest any particular interface, of course, but just a crude
sketch of a basic approach. I am wondering if you would find it more agreeable
within your perspective than the use of RT priorities, or still fundamentally
disagreeable.

(Personally I am not particularly thrilled by the complexity that would have
to be added and managed.)

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-08 20:11                 ` Sergey Oboguev
@ 2014-08-09 13:04                   ` Mike Galbraith
  2014-08-09 18:04                     ` Andi Kleen
  2014-08-13 23:52                     ` Sergey Oboguev
  0 siblings, 2 replies; 34+ messages in thread
From: Mike Galbraith @ 2014-08-09 13:04 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Fri, 2014-08-08 at 13:11 -0700, Sergey Oboguev wrote: 
> On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> 
> > task priority cannot be used by any task to describe a critical section.
> I assert that there is _zero_ critical section information present.
> 
> This appears to be the crux of our disagreement.
> 
> This assertion is incorrect. The use of RT to bracket a critical section
> obviously _does_ provide the following information:

You sure don't give up easy.

> 1) designation of entry point for the start of critical activity

Yup, a task can elevate its priority upon entering scram_reactor(); iff it
gets there, scram might still be considered a critical activity.

> 2) designation of exit point, albeit with timing not known in advance at entry
> time

Yeah, exit works ok if enter happens.

You are not going to convince me that it is cool to assign an imaginary
priority to a SCHED_FIFO class task, and still call the resulting mutant
a SCHED_FIFO class task.  Those things have defined semantics.  It is
not ok for a SCHED_FIFO task of a lower priority to completely ignore a
SCHED_FIFO task of a higher priority because it's part of an application
which has one or more wild cards, or maybe even a get out of jail free
card it can pull out of its butt.

NAK.  There it is, my imaginary NAK to imaginary realtime priorities :)

-Mike



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-09  8:38                 ` Sergey Oboguev
@ 2014-08-09 14:13                   ` Mike Galbraith
  0 siblings, 0 replies; 34+ messages in thread
From: Mike Galbraith @ 2014-08-09 14:13 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Sat, 2014-08-09 at 01:38 -0700, Sergey Oboguev wrote: 
> On Thu, Aug 7, 2014 at 2:03 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:
> 
> > I see subversion of a perfectly functional and specified mechanism
> 
> Just wondering if the following line of thinking would sound just as much an
> anathema from your perspective or perhaps a bit less terrible...
> 
> Proceeding from the observations (e.g. https://lkml.org/lkml/2014/8/8/492) that
> representative critical section information is not pragmatically expressible at
> development time or dynamically collectable by the application at run time, the
> option still remains to put the weight of managing such information on the
> shoulders of the final link in the chain, the system administrator, providing
> him with application-specific guidelines and also with monitoring tools.
> 
> It might look approximately like this.
> 
> It might be possible to define a scheduling class or some other kind of
> scheduling data entity for tasks utilizing preemption control. Tasks
> belonging to this class that have a critical section currently active are
> preemptible by RT or DL tasks just like normal threads; however, they are
> granted a limited and controlled degree of protection against preemption by
> normal threads, and also a limited ability to urgently preempt normal
> threads on a wakeup.

Sure, a completely different scheduling class can implement whatever
semantics it wants, just has to be useful and not break the world. 

(spins lots of complexity)

> This is not to suggest any particular interface, of course, but just a crude
> sketch of a basic approach. I am wondering if you would find it more agreeable
> within your perspective than the use of RT priorities, or still fundamentally
> disagreeable.

Yes, the crux of my objection is the subversion in my view of RT.

> (Personally I am not particularly thrilled by the complexity that would have
> to be added and managed.)

(yeah, you described lots of that)

-Mike


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-09 13:04                   ` Mike Galbraith
@ 2014-08-09 18:04                     ` Andi Kleen
  2014-08-10  3:13                       ` Mike Galbraith
  2014-08-13 23:52                     ` Sergey Oboguev
  1 sibling, 1 reply; 34+ messages in thread
From: Andi Kleen @ 2014-08-09 18:04 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Sergey Oboguev, Andi Kleen, linux-kernel, khalid.aziz

> NAK.  There it is, my imaginary NAK to imaginary realtime priorities :)

Ok, but do you have any alternative proposal yourself how to solve the
lockholder preemption problem? I assume you agree it's a real problem.

Just being negative is not very constructive.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-09 18:04                     ` Andi Kleen
@ 2014-08-10  3:13                       ` Mike Galbraith
  2014-08-10  3:41                         ` Mike Galbraith
  0 siblings, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2014-08-10  3:13 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Sergey Oboguev, linux-kernel, khalid.aziz

On Sat, 2014-08-09 at 20:04 +0200, Andi Kleen wrote: 
> > NAK.  There it is, my imaginary NAK to imaginary realtime priorities :)
> 
> Ok, but do you have any alternative proposal yourself how to solve the
> lockholder preemption problem? I assume you agree it's a real problem.
> 
> Just being negative is not very constructive.

I both acknowledged the problem, and made alternative
suggestions.

-Mike


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-10  3:13                       ` Mike Galbraith
@ 2014-08-10  3:41                         ` Mike Galbraith
  0 siblings, 0 replies; 34+ messages in thread
From: Mike Galbraith @ 2014-08-10  3:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Sergey Oboguev, linux-kernel, khalid.aziz

On Sun, 2014-08-10 at 05:13 +0200, Mike Galbraith wrote: 
> On Sat, 2014-08-09 at 20:04 +0200, Andi Kleen wrote: 
> > > NAK.  There it is, my imaginary NAK to imaginary realtime priorities :)
> > 
> > Ok, but do you have any alternative proposal yourself how to solve the
> > lockholder preemption problem? I assume you agree it's a real problem.
> > 
> > Just being negative is not very constructive.
> 
> I both acknowledged the problem, and made alternative
> suggestions.

Note who I replied to:

https://lkml.org/lkml/2014/7/28/77


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-08-09 13:04                   ` Mike Galbraith
  2014-08-09 18:04                     ` Andi Kleen
@ 2014-08-13 23:52                     ` Sergey Oboguev
  1 sibling, 0 replies; 34+ messages in thread
From: Sergey Oboguev @ 2014-08-13 23:52 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Andi Kleen, linux-kernel, khalid.aziz

On Sat, Aug 9, 2014 at 6:04 AM, Mike Galbraith <umgwanakikbuti@gmail.com> wrote:

> You are not going to convince me that it is cool to assign an imaginary
> priority to a SCHED_FIFO class task, and still call the resulting mutant
> a SCHED_FIFO class task. Those things have defined semantics. It is
> not ok for a SCHED_FIFO task of a lower priority to completely ignore a
> SCHED_FIFO task of a higher priority because it's part of an application
> which has one or more wild cards

I am not quite sure what you were trying to say by this, but I am grateful to
you for expressing your perspective and arguments in this discussion.

For one thing, it stimulated me to perform a systematic review of the
solution space/tree, square inch by square inch, resulting in exhaustive
logical coverage of the solution space and thus an informal logical proof
that there is no uncovered subspace that might hold some as-yet-undiscovered
solution.

Just so the results do not remain scattered, I will try to summarize and
categorize the findings about available solutions to the general "critical
section (or critical activity) preemption" problem category, and then draw
a conclusion stemming from this summary.

**********************

1) Post-preemption solutions.

(Such as priority inheritance or proxy execution, or even humble yield_to).

By the time a post-preemption solution is engaged, a cost has already been
paid, and a further cost is paid to execute the solution. Sometimes this
aggregate cost can be small and acceptable within the overall context of an
application; sometimes it is too high. Sometimes the dependency chain cannot
be expressed, and thus a post-preemption solution is not possible at all
(other than just waiting for the scheduler to eventually resume the blocking
thread, paying the cost in the meanwhile).
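
Of these, priority inheritance is directly available to applications today
via PI futexes, e.g.:

    #include <pthread.h>

    /* Priority inheritance through a PI futex: while the mutex is held,
       the holder inherits the priority of its highest-priority waiter.
       Error handling elided for brevity. */
    static pthread_mutex_t lock;

    static void init_pi_lock(void)
    {
            pthread_mutexattr_t attr;

            pthread_mutexattr_init(&attr);
            pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
            pthread_mutex_init(&lock, &attr);
            pthread_mutexattr_destroy(&attr);
    }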

I won't be further enumerating this subcategory here and will focus instead on
the space of preemption avoidance/deferral solutions.

**********************

2) Reduction in the incidence of thread preemption.

There are two chief ways to reduce the incidence of thread preemption.

One way is to enlarge the scheduling quantum. Its simplistic use, however,
may well backfire: if a lock holder is preempted while holding a lock (or a
thread executing a non-lock-related critical activity is preempted while
inside that activity), an enlarged scheduling quantum results in a longer
delay before the thread gets resumed. Therefore this measure, if employed,
should really be coupled with one of the post-preemption solutions.

Another way is to reduce the incidence of thread preemption due to another
thread being woken up. Currently existing schemes of this kind (the use of
!WAKEUP_PREEMPTION or SCHED_BATCH) are indiscriminate and victimize the
woken-up thread for the sake of the executing thread regardless of whether
the latter is inside a critical section or not; however, means could be
provided to refine this behavior and victimize the wakee only if the
currently executing thread is actually inside a critical section (and also
to switch over to the postponed wakee ASAP when the section is exited). The
existing indiscriminate form is sketched below.
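
For reference, the existing indiscriminate per-task form is expressible
today: a thread can be placed into SCHED_BATCH so that its wakeups do not
preempt a running SCHED_NORMAL thread (a sketch of the existing interface
only; the critical-section-aware refinement described above would require
new kernel support):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Put the calling thread into SCHED_BATCH: on wakeup it will then
       not preempt a currently running SCHED_NORMAL thread. */
    static int make_batch(void)
    {
            struct sched_param sp = { .sched_priority = 0 };

            if (sched_setscheduler(0, SCHED_BATCH, &sp) != 0) {
                    perror("sched_setscheduler(SCHED_BATCH)");
                    return -1;
            }
            return 0;
    }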

A variation on this theme is an attempt at mitigation of preemption effects via
LAST_BUDDY.

The extent to which this scheme will be effective is obviously determined by
the following factors.

First, the probability of thread preemption while it is holding a lock, which
in turn is determined by:

(a) Percentage of time a worker thread holds a lock (or executes a
non-lock-related critical activity). It does not matter much whether a thread
enters a critical section once, stays in it for a while and then exits, or
whether the thread enters and exits a critical section many times in a row,
staying within it each time only for a very short instant. What matters is
the aggregate percentage of execution time a thread spends inside a critical
section. This yields a rough probability that any random preemption of a
thread (either through natural expiration of the scheduling timeslice or due
to preemption by a wakee) will result in lock-holder or other
critical-activity preemption.

(b) To an extent, request handler execution time. Very short times (well
under a typical scheduling quantum) separated by voluntary yielding (such as
when waiting on the queue to pick up the next processing request) result in
a reduced preemption probability due to end-of-quantum events; however, I
won't be considering this special case further here. (It is also not a
particularly good idea to use FIFO ordering for dequeueing events or requests
from a queue by worker threads, as opposed to LIFO ordering, as it negatively
impacts cache locality and increases context switch probability.)

(c) Reduction in preemption attempt frequency. That's where the impact of the
discussed scheme comes in. With the scheme on, a thread gets preempted chiefly
on quantum expiration, not intra-quantum in favor of wakee threads. The
product of probabilities (a) and (c) determines how likely the thread is to
be preempted while inside a critical section/activity. (For example, if a
thread spends 10% of its execution time inside critical sections, roughly one
preemption in ten will land inside a section.)

The second factor is the contention rate on the lock a preempted thread is
holding (or the dependency occurrence rate for non-lock-related critical
activities). It defines how consequential the preemption is in its aftermath,
i.e. how likely it is to lead to the preemptee actually blocking other
threads' forward progress.

The third factor is the cost of the post-preemption recovery response, and
whether such a response is possible at all (i.e. whether the dependency chain
can be expressed at run time).

Factor (c) depends on the frequency of the wakeups.

If a worker thread performs multiple synchronous IO requests or other
OS-level blocking operations within a timeslice, then the probability of
those completing within a buddy's timeslice is significant. Thus, for
example, a database engine with a small cache pool that does not utilize an
async processing model (akin to the nginx or node.js models) may have a high
incidence of wakeups within a buddy's timeslice due to the completion of
synchronous IO requests or other blocking waits.

However, a worker thread on the same database configured with a larger cache
size (so the IO rate is low), or a worker thread in an in-memory database, or
a worker thread in a conventional database utilizing an async execution
model, will have a low incidence of wakeups within a buddy's timeslice, and
in those cases the use of the discussed scheme won't make much of an
improvement, since all the improvement it could have produced is already "in".

After any possible improvement from the discussed scheme is factored in, what
remains is the residual of preemption at the end of the timeslice, with the
performance impact defined by the probability of the preempted thread holding
a lock or executing another critical activity, i.e. factor (a), the
contention rate (factor 2), and the cost of the post-preemption recovery
response (factor 3), including whether such a response is possible at all.
This situation is essentially exemplified by the classic case of a user-mode
spinlock or hybrid spin-then-block lock discussed earlier
(https://lkml.org/lkml/2014/8/2/130). If the product of the listed factors is
significant, then a solution that attempts to reduce the incidence of
preemption may or may not be effective at that goal, but the residual cost
still remains high and system responsiveness possibly choppy.
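
For reference, the hybrid lock in question, in outline (a simplified sketch;
a production implementation would track waiters instead of waking
unconditionally):

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Spin briefly in userspace, then block on a futex. If the holder is
       preempted mid-hold, every contender burns its whole spin budget
       before blocking -- the residual cost discussed above. */
    #define SPIN_BUDGET 100

    static _Atomic unsigned lockword;   /* 0 = free, 1 = held */

    static void hybrid_lock(void)
    {
            unsigned free_val;
            int i;

            for (i = 0; i < SPIN_BUDGET; i++) {
                    free_val = 0;
                    if (atomic_compare_exchange_weak(&lockword, &free_val, 1))
                            return;
            }
            for (;;) {
                    free_val = 0;
                    if (atomic_compare_exchange_weak(&lockword, &free_val, 1))
                            return;
                    syscall(SYS_futex, &lockword, FUTEX_WAIT, 1, NULL, NULL, 0);
            }
    }

    static void hybrid_unlock(void)
    {
            atomic_store(&lockword, 0);
            syscall(SYS_futex, &lockword, FUTEX_WAKE, 1, NULL, NULL, 0);
    }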

If reliance on a post-preemption solution [*] is impossible because the
structure of the application does not allow the dependency chain to be
expressed at run time (due to the use of 3rd-party components, legacy
components, components out of the scope of the project, etc.), the inflicted
cost may be particularly high.

    [*] Or, in the LAST_BUDDY case, if the wakee thread does not yield
        voluntarily very quickly.

The discussed solution (reduction in the incidence of thread preemption) also
does not address applications that have critical sections of different
importance, some highly critical and others moderately critical or of low
criticality. Within the discussed scheme, there is no provision for multiple
levels of protection of application threads relative to each other, nor
relative to other threads in the system.

This solution also does not support the "urgent handoff" scenario, in which
thread A sends thread B a message requiring urgent processing, and thus a
critical section in thread B is effectively initiated from outside the
thread. (An example is a virtual machine application in which a VCPU thread
sends an IPI interrupt to another VCPU thread, or a virtual device handler
sends an interrupt to a VCPU thread, an interrupt that requires prompt
processing lest a cost be incurred.)

**********************

3) schedctl / preempt_delay / remaining_timeslice

This solution allows a thread to get a temporary reprieve from preemption by
another thread of a similar scheduling policy. The protection interval can be
specified explicitly (in the proposed preempt_delay patch) or defaulted to a
scheduling quantum (in Solaris/AIX schedctl), the purpose of the latter being
to avoid preemption of a critical section of sub-timeslice duration that was
started close to the end of the timeslice.

Another variation on the same theme is to let a thread know how much time it
has before the expiration of the current timeslice. If the remaining time is
insufficient to complete a critical section, the thread can yield and execute
the section when yield(...) returns, at the start of a fresh timeslice. (The
obvious drawbacks of this latter approach are more frequent yields, scheduler
calculations and context switches, heavier reliance on the estimate of the
critical section duration, and more difficult management of nested critical
sections.) A sketch of this idiom follows.
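
A sketch of the remaining-timeslice idiom (note that
get_remaining_timeslice_ns() is hypothetical; no such Linux interface exists
today):

    #include <sched.h>

    extern long get_remaining_timeslice_ns(void);  /* hypothetical */

    /* Yield if the estimated section duration does not fit into what is
       left of the current timeslice, so that the section runs at the
       start of a fresh one. */
    static void enter_urgent_section(long est_duration_ns)
    {
            if (get_remaining_timeslice_ns() < est_duration_ns)
                    sched_yield();
            /* ... execute the critical/time-urgent section ... */
    }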

Three shared drawbacks/limitations of this kind of solution:

a) They do not support applications having critical sections with multiple
levels of criticality, i.e. there is no capability to rank section priorities
relative to the whole application or to other processes in the system.

b) They do not support "urgent handoff" scenarios.

c) An application may need to perform time-urgent processing without knowing
in advance how long it will take. In the majority of cases the processing may
be very short (a fraction of a scheduling timeslice), but occasionally it may
take longer (several timeslices). Preempt_delay provides a way to address
this, but schedctl and remaining_timeslice do not.

**********************

4) Conventional old-fashioned priority protection.

Drawbacks:

Potentially stronger interference with other tasks than in the solutions
discussed earlier but, more importantly, not confined to a "jail"
well-manageable by the system owner/administrator.

Strengths:

- efficient and well-defined intra-application preemption control
- support for multiple levels of criticality for locks or time-urgent sections
- support for urgent handoff
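
I.e. the classic bracketing pattern, expressible today through the standard
POSIX interface (and exactly the per-section syscall pair whose cost DPRIO is
meant to reduce):

    #include <pthread.h>
    #include <sched.h>

    /* Elevate to RT before a short critical section, downgrade after.
       Error handling elided for brevity. */
    static void locked_update(pthread_mutex_t *lock)
    {
            struct sched_param rt = { .sched_priority = 10 };
            struct sched_param ts = { .sched_priority = 0 };

            pthread_setschedparam(pthread_self(), SCHED_FIFO, &rt);
            pthread_mutex_lock(lock);
            /* ... short critical section ... */
            pthread_mutex_unlock(lock);
            pthread_setschedparam(pthread_self(), SCHED_OTHER, &ts);
    }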

**********************

5) Scheduling based on a cost function expressing critical section
information assigned at development time or dynamically collected at run time
and representative of the actual cost of preemption.

https://lkml.org/lkml/2014/8/6/593
https://lkml.org/lkml/2014/8/8/492

As discussed, not pragmatically attainable.

**********************

6) Using a scheduling class with properties and controls matching the demands
of the problem, relying on manual tuning of the system by its
owner/administrator to balance the interference between multiple applications
to the relative degree the administrator desires, while at the same time
keeping the applications' inopportune-preemption frequency (displayed in
system statistics) within bounds acceptable to the administrator. The system
provides control and monitoring tools, but it is left up to the administrator
to strike whatever balance between the applications seems best to him.

The description here
https://lkml.org/lkml/2014/8/9/37
summarizes the requirements and outlines a comprehensive solution, but as
formulated there (based on the assumption of adding a new scheduling class,
which is not really necessary, see below) it requires high added complexity.

**********************

This exhausts the solution space. Everything logically possible has been
covered, and no solution falling outside the described categories can
logically exist.

There are certainly cases that can be satisfactorily addressed with options
(1), (2) or (3), but there are also those that cannot. For those, the
remaining options are (4), (6), or doing nothing.

Which brings us back to option (6) ...

On closer look, option (6) as formulated in the original message is
tantamount to the creation of an RT-like range sitting just below the regular
RT range but "caged" and better controlled, i.e. having additional controls
over its impact on other applications and also additional statistics. The
creation of such a range is where most of the complexity would have to come
from, were it indeed to be created.

The obvious question, however, is why one would create a separate "low
RT-like" range instead of simply using the low part of the existing RT range,
with added controls and statistics.

The most important control -- a limit on the aggregate time an application's
threads can spend in preemption-protected mode as a share (percentage) of
overall system time -- is already implemented by RT, as the rt_bandwidth of a
task group.
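
E.g. with CONFIG_RT_GROUP_SCHED, a group can already be capped at 20% RT
bandwidth through the existing cgroup cpu controller (a sketch; paths assume
a v1-style mount at /sys/fs/cgroup/cpu):

    #include <stdio.h>

    /* Cap group "MyDB" at 200ms of RT execution per 1s period (20%)
       via the per-cgroup rt_bandwidth knobs. Error handling abbreviated. */
    static void cap_rt_bandwidth(void)
    {
            FILE *f;

            if ((f = fopen("/sys/fs/cgroup/cpu/MyDB/cpu.rt_period_us", "w"))) {
                    fprintf(f, "1000000\n");
                    fclose(f);
            }
            if ((f = fopen("/sys/fs/cgroup/cpu/MyDB/cpu.rt_runtime_us", "w"))) {
                    fprintf(f, "200000\n");
                    fclose(f);
            }
    }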

The following pieces are missing, but can be provided:

a) Dynamic proportional scale-down of the bandwidths of multiple groups when
their aggregate demand starts to exceed the system-wide limit.

b) When a group's RT use limit is exceeded, perhaps the group's RT threads
should be allowed to continue as SCHED_NORMAL threads, rather than being
stalled till the next bandwidth period.

c) Per-group statistics on denial of set_priority(RT) to the group's threads
and on throttling of the group's threads because of the group exceeding its
quota of RT use. These statistics should be accessible to administrative
tools and to the application itself.

d) [Non-essential:] Maximum duration a thread within a group can stay at
elevated priority. This is merely a safeguard against runaway threads and is
not essential, as the group's RT bandwidth control allows an administrator to
take corrective action.

e) Finally, the last issue is how the priority range of APP1 can be stacked
against the priority range of APP2, and the latter against the priority range
of APP3, and so on. The decision about their relative positioning is, by the
definition of (6), a judgment call of the system administrator/owner. Making
such a judgment may be hard, but this difficulty is inherent in co-running
multiple applications, each possibly having critical sections of multiple
levels of criticality. Option (5) would have helped to address this
ranging/balancing issue, but unfortunately it is not
pragmatically/economically attainable. Thus the only recourse is to lay the
decision on the shoulders of the system administrator; however, there is the
question of the supporting mechanisms an administrator can employ in making
and implementing his decision.

One instrument is the monitoring facilities outlined in (c).

The other is some way of fitting applications' priority ranges together. One
possible approach could be to define a cgroup-specific priority map
translating task priority levels between a group-relative scale and the
system-wide scale. However, maintaining multiple scales can be confusing and
also causes ambiguity during reverse translation when performing
get_priority(), so perhaps an easier and more straightforward way would be to
require that a compliant application simply define its priority levels in an
editable configuration file, and perhaps also provide a method to reload this
file if an administrator wants to perform a dynamic adjustment without
restarting the application.

**********************

Conclusion:

I disagree with a blanket indictment of the use of RT priority for managing
critical or time-urgent sections as "subversion" of the RT mechanism -- I do
not see a rational argument for such a blanket indictment. The judicious use
of RT matches the primary express purpose of the facilities intended to
support the execution of critical sections: temporarily victimizing
lower-importance threads for the sake of higher-importance threads (with
their relative importance being dynamic).

However, I do agree that a better-controlled (including more
manageable/tunable by the system owner), "caged" use of "tamed" RT would be
desirable. That's what option (6) would provide if implemented, and what can
already be partially provided by option (4) plus a per-cgroup quota on
rt_bandwidth.

The general conclusion of this overview is that the solution space for the
overall issue of executing critical/time-urgent sections is fragmented (and
even the problem space itself is fragmented). There is no single "holy grail"
solution that would address all use cases and use modes.

Given this, the right approach for an operating system is to provide a set of
alternative mechanisms exposing various solutions (some of which may also
happen to be complementary in certain use cases) that in their entirety
address the whole space of use cases, and then let the application developer
and system owner decide between themselves which exact mechanisms are most
suitable for a particular purpose.

As for DPRIO in this context, it is a supplementary cost-reduction mechanism
usable with options (4) and (6) in high-frequency scenarios.

- Sergey

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH RFC] sched: deferred set priority (dprio)
  2014-07-21 12:33 Sergey Oboguev
@ 2014-07-21 18:14 ` Thomas Gleixner
  0 siblings, 0 replies; 34+ messages in thread
From: Thomas Gleixner @ 2014-07-21 18:14 UTC (permalink / raw)
  To: Sergey Oboguev; +Cc: LKML

On Mon, 21 Jul 2014, Sergey Oboguev wrote:

> This patch is intended to improve the support for fine-grain parallel

This patch is no patch at all. Please read Documentation/SubmittingPatches

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH RFC] sched: deferred set priority (dprio)
@ 2014-07-21 12:33 Sergey Oboguev
  2014-07-21 18:14 ` Thomas Gleixner
  0 siblings, 1 reply; 34+ messages in thread
From: Sergey Oboguev @ 2014-07-21 12:33 UTC (permalink / raw)
  To: linux-kernel

This patch is intended to improve the support for fine-grain parallel
applications that may sometimes need to change the priority of their threads at
a very high rate, hundreds or even thousands of times per scheduling timeslice.

These are typically applications that have to execute short or very short
lock-holding critical or otherwise time-urgent sections of code at a very high
frequency and need to protect these sections with "set priority" system calls:
one "set priority" call to elevate the current thread's priority before
entering the critical or time-urgent section, followed by another call to
downgrade the thread's priority at the completion of the section. Due to the
high frequency of entering and leaving critical or time-urgent sections, the
cost of these "set priority" system calls may rise to a noticeable part of an
application's overall expended CPU time. The proposed "deferred set priority"
facility largely eliminates the cost of these system calls.

Instead of executing a system call to elevate its thread priority, an
application simply writes its desired priority level to a designated memory
location in the userspace. When the kernel attempts to preempt the thread, it
first checks the content of this location, and if the application's stated
request to change its priority has been posted in the designated memory area,
the kernel will execute this request and alter the priority of the thread being
preempted before performing a rescheduling, and then make a scheduling decision
based on the new thread priority level thus implementing the priority
protection of the critical or time-urgent section desired by the application.
In the predominant number of cases, however, an application will complete the
critical section before the end of the current timeslice and cancel or alter
the request held in the userspace area. Thus the vast majority of an
application's change-priority requests will be handled and mutually cancelled
or coalesced within the userspace, at a very low overhead and without incurring
the cost of a system call, while maintaining safe preemption control. The cost
of an actual kernel-level "set priority" operation is incurred only if an
application is actually preempted while inside the critical section, i.e.
typically at most once per scheduling timeslice, instead of hundreds or
thousands of "set priority" system calls in the same timeslice.

One of the intended purposes of this facility (though not its sole purpose) is
to provide a lightweight mechanism for priority protection of lock-holding
critical sections that is an adequate match for lightweight locking primitives
such as futex, with both featuring a fast path completing within the userspace.
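
In outline, the userspace fast path looks approximately like this (an
illustrative sketch only; the field names and exact protocol here are
hypothetical, see dprio.txt for the actual design):

    /* Sketch of a dprio-style userspace fast path; layout and protocol
       details are hypothetical (see dprio.txt for the actual design). */
    struct dprio_area {
            volatile int requested_prio; /* posted request, read by kernel */
            volatile int kernel_acted;   /* set by kernel if it acted on it */
    };

    static void cs_enter(struct dprio_area *a, int elevated_prio)
    {
            a->kernel_acted = 0;
            a->requested_prio = elevated_prio;
            __sync_synchronize();  /* publish the request before the CS */
            /* ... critical section ... */
    }

    static void cs_exit(struct dprio_area *a, int normal_prio)
    {
            a->requested_prio = normal_prio;  /* cancel/alter the request */
            __sync_synchronize();
            if (a->kernel_acted) {
                    /* Preempted inside the section, and the kernel really
                       elevated us: undo with an actual system call,
                       e.g. sched_setscheduler(). */
            }
    }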

More detailed description can be found in:
https://raw.githubusercontent.com/oboguev/dprio/master/dprio.txt

The patch is currently based on 3.15.2.

Patch file:
https://github.com/oboguev/dprio/blob/master/patch/linux-3.15.2-dprio.patch
https://raw.githubusercontent.com/oboguev/dprio/master/patch/linux-3.15.2-dprio.patch

Modified source files:
https://github.com/oboguev/dprio/tree/master/src/linux-3.15.2

User-level library implementing userspace-side boilerplate code:
https://github.com/oboguev/dprio/tree/master/src/userlib

Test set:
https://github.com/oboguev/dprio/tree/master/src/test

The patch is enabled with CONFIG_DEFERRED_SETPRIO.

There is also a config setting for the debug code and a setting that controls
the initial value of the authorization list restricting the use of the
facility based on user or group ids. Please see dprio.txt for details.

Comments would be appreciated.

Thanks,
Sergey

Signed-off-by: Sergey Oboguev <oboguev@yahoo.com>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2014-08-13 23:52 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-25 19:45 [PATCH RFC] sched: deferred set priority (dprio) Sergey Oboguev
2014-07-25 20:12 ` Andy Lutomirski
2014-07-26  7:56   ` Sergey Oboguev
2014-07-26  8:58 ` Mike Galbraith
2014-07-26 18:30   ` Sergey Oboguev
2014-07-27  4:02     ` Mike Galbraith
2014-07-27  9:09       ` Sergey Oboguev
2014-07-27 10:29         ` Mike Galbraith
2014-07-28  1:19 ` Andi Kleen
2014-07-28  4:16   ` Sergey Oboguev
2014-07-28  7:24   ` Mike Galbraith
2014-08-03  0:43     ` Sergey Oboguev
2014-08-03  9:56       ` Mike Galbraith
2014-08-05 23:28         ` Sergey Oboguev
2014-08-06  5:41           ` Mike Galbraith
2014-08-06  7:42             ` Mike Galbraith
2014-08-07  1:26             ` Sergey Oboguev
2014-08-07  9:03               ` Mike Galbraith
2014-08-08 20:11                 ` Sergey Oboguev
2014-08-09 13:04                   ` Mike Galbraith
2014-08-09 18:04                     ` Andi Kleen
2014-08-10  3:13                       ` Mike Galbraith
2014-08-10  3:41                         ` Mike Galbraith
2014-08-13 23:52                     ` Sergey Oboguev
2014-08-09  8:38                 ` Sergey Oboguev
2014-08-09 14:13                   ` Mike Galbraith
2014-08-03 17:30       ` Andi Kleen
2014-08-05 23:13         ` Sergey Oboguev
2014-07-30 13:02 ` Pavel Machek
2014-08-03  0:47   ` Sergey Oboguev
2014-08-03  8:30     ` Pavel Machek
2014-08-05 23:03       ` Sergey Oboguev
  -- strict thread matches above, loose matches on Subject: below --
2014-07-21 12:33 Sergey Oboguev
2014-07-21 18:14 ` Thomas Gleixner
