* [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
@ 2020-01-16 12:02 Parth Shah
  2020-01-16 12:02 ` [PATCH v3 1/3] sched: Introduce latency-nice as a per-task attribute Parth Shah
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Parth Shah @ 2020-01-16 12:02 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, mingo, vincent.guittot, patrick.bellasi, valentin.schneider,
    qais.yousef, pavel, dhaval.giani, qperret, David.Laight, pjt, tj,
    dietmar.eggemann

This is the 3rd revision of the patch set to introduce
latency_{nice/tolerance} as a per-task attribute.

The previous versions can be found at:
v1: https://lkml.org/lkml/2019/11/25/151
v2: https://lkml.org/lkml/2019/12/8/10

Changes in this revision:
v2 -> v3:
- This series changes the longer attribute name to "latency_nice", as per
  the comment from Dietmar Eggemann: https://lkml.org/lkml/2019/12/5/394
v1 -> v2:
- Addressed comments from Qais Yousef
- As per suggestion from Dietmar, moved content from the newly created
  include/linux/sched/latency_tolerance.h to kernel/sched/sched.h
- Extend sched_setattr() to support latency_tolerance in tools headers UAPI


Introduction:
==============
This patch series introduces a new per-task attribute, latency_nice, to
provide the scheduler hints about the latency requirements of the task [1].

latency_nice is a ranged attribute of a task, with values in [-20, 19],
both inclusive, which aligns it with the task nice value.

The value gives the scheduler a hint about the relative latency
requirements of tasks: a task with "latency_nice = -20" should have lower
latency requirements than tasks with higher values. Similarly, a task with
"latency_nice = 19" can tolerate higher latency, and hence such a task may
not care much about latency at all. The default value is 0. The usecases
discussed below can each use this range of [-20, 19] for latency_nice for
their specific purpose.
This patch series does not implement any use case for the attribute, so
that any change in its naming or range does not affect the other (future)
patches that will use it. The actual use of latency_nice during task
wakeup and load balancing is yet to be coded for each of those usecases.

In my view, the attribute can be used in the following ways for some of
the usecases:
1 Reduce search scan time for select_idle_cpu():
- Reduce search scans for finding an idle CPU for a waking task with a
  lower latency_nice value.

2 TurboSched:
- Classify a task with a higher latency_nice value as a small background
  task, given that its historic utilization is very low, for which the
  scheduler can search across more cores to do task packing. A task with
  latency_nice >= some_threshold (e.g., == 19) and util <= 12.5% can be
  a background task.

3 Optimize AVX512-based workloads:
- Bias the scheduler to not put a task having latency_nice == -20 on a
  core running an AVX512-based workload.


Series Organization:
====================
- Patch 1: Add the new attribute latency_nice to task_struct.
- Patch 2: Clone the parent task's attribute to the child task on fork.
- Patch 3: Add support for the sched_{set,get}attr syscalls to modify
  latency_nice of the task.


The patch series can be applied on tip/sched/core at
commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware").


References:
============
[1]. Usecases for the per-task latency-nice attribute,
     https://lkml.org/lkml/2019/9/30/215
[2]. Task Latency-nice, "Subhra Mazumdar",
     https://lkml.org/lkml/2019/8/30/829
[3].
Introduce per-task latency_tolerance for scheduler hints,
     https://lkml.org/lkml/2019/12/8/10


Parth Shah (3):
  sched: Introduce latency-nice as a per-task attribute
  sched/core: Propagate parent task's latency requirements to the child
    task
  sched: Allow sched_{get,set}attr to change latency_nice of the task

 include/linux/sched.h            |  1 +
 include/uapi/linux/sched.h       |  4 +++-
 include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
 kernel/sched/core.c              | 21 +++++++++++++++++++++
 kernel/sched/sched.h             | 18 ++++++++++++++++++
 tools/include/uapi/linux/sched.h |  4 +++-
 6 files changed, 65 insertions(+), 2 deletions(-)

-- 
2.17.2

^ permalink raw reply	[flat|nested] 23+ messages in thread
* [PATCH v3 1/3] sched: Introduce latency-nice as a per-task attribute
  2020-01-16 12:02 [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
@ 2020-01-16 12:02 ` Parth Shah
  2020-01-16 12:02 ` [PATCH v3 2/3] sched/core: Propagate parent task's latency requirements to the child task Parth Shah
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 23+ messages in thread
From: Parth Shah @ 2020-01-16 12:02 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, mingo, vincent.guittot, patrick.bellasi, valentin.schneider,
    qais.yousef, pavel, dhaval.giani, qperret, David.Laight, pjt, tj,
    dietmar.eggemann

Latency-nice indicates the latency requirements of a task with respect to
the other tasks in the system. The value of the attribute can be within
the range [-20, 19], both inclusive, to be in line with the task nice
values. latency_nice = -20 indicates that the task should get the least
latency, as compared to tasks having latency_nice = +19.

The latency_nice attribute, which takes latency requirements from
userspace, may affect only the CFS sched class.
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/sched.h  | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 467d26046416..0668948fddcd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -676,6 +676,7 @@ struct task_struct {
 	int				static_prio;
 	int				normal_prio;
 	unsigned int			rt_priority;
+	int				latency_nice;
 
 	const struct sched_class	*sched_class;
 	struct sched_entity		se;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1a88dc8ad11b..edae9277e48d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -101,6 +101,24 @@ extern long calc_load_fold_active(struct rq *this_rq, long adjust);
  */
 #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
+/*
+ * Latency nice is meant to provide scheduler hints about the relative
+ * latency requirements of a task with respect to other tasks.
+ * Thus a task with latency_nice == 19 can be hinted as the task with no
+ * latency requirements, in contrast to the task with latency_nice == -20
+ * which should be given priority in terms of lower latency.
+ */
+#define MAX_LATENCY_NICE	19
+#define MIN_LATENCY_NICE	-20

+#define LATENCY_NICE_WIDTH	\
+	(MAX_LATENCY_NICE - MIN_LATENCY_NICE + 1)
+
+/*
+ * Default tasks should be treated as a task with latency_nice = 0.
+ */
+#define DEFAULT_LATENCY_NICE	0
+
 /*
  * Increase resolution of nice-level calculations for 64-bit architectures.
  * The extra resolution improves shares distribution and load balancing of

-- 
2.17.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* [PATCH v3 2/3] sched/core: Propagate parent task's latency requirements to the child task
  2020-01-16 12:02 [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
  2020-01-16 12:02 ` [PATCH v3 1/3] sched: Introduce latency-nice as a per-task attribute Parth Shah
@ 2020-01-16 12:02 ` Parth Shah
  2020-01-16 12:02 ` [PATCH v3 3/3] sched: Allow sched_{get,set}attr to change latency_nice of the task Parth Shah
  2020-02-17  8:57 ` [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
  3 siblings, 0 replies; 23+ messages in thread
From: Parth Shah @ 2020-01-16 12:02 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, mingo, vincent.guittot, patrick.bellasi, valentin.schneider,
    qais.yousef, pavel, dhaval.giani, qperret, David.Laight, pjt, tj,
    dietmar.eggemann

Clone the parent task's latency_nice attribute to the forked child task.
Reset the latency_nice value to the default when the child task has
sched_reset_on_fork set.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e7b08d52db93..a7359836bf17 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2854,6 +2854,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 	 */
 	p->prio = current->normal_prio;
 
+	/* Propagate the parent's latency requirements to the child as well */
+	p->latency_nice = current->latency_nice;
+
 	uclamp_fork(p);
 
 	/*
@@ -2870,6 +2873,7 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
 		p->prio = p->normal_prio = __normal_prio(p);
 		set_load_weight(p, false);
 
+		p->latency_nice = DEFAULT_LATENCY_NICE;
 		/*
 		 * We don't need the reset flag anymore after the fork. It has
 		 * fulfilled its duty:

-- 
2.17.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* [PATCH v3 3/3] sched: Allow sched_{get,set}attr to change latency_nice of the task
  2020-01-16 12:02 [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
  2020-01-16 12:02 ` [PATCH v3 1/3] sched: Introduce latency-nice as a per-task attribute Parth Shah
  2020-01-16 12:02 ` [PATCH v3 2/3] sched/core: Propagate parent task's latency requirements to the child task Parth Shah
@ 2020-01-16 12:02 ` Parth Shah
  2020-02-17  8:57 ` [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
  3 siblings, 0 replies; 23+ messages in thread
From: Parth Shah @ 2020-01-16 12:02 UTC (permalink / raw)
To: linux-kernel
Cc: peterz, mingo, vincent.guittot, patrick.bellasi, valentin.schneider,
    qais.yousef, pavel, dhaval.giani, qperret, David.Laight, pjt, tj,
    dietmar.eggemann

Introduce the latency_nice attribute to sched_attr and provide a
mechanism to change the value via the sched_setattr/sched_getattr
syscalls. Also add a new flag, SCHED_FLAG_LATENCY_NICE, to indicate a
change to the latency_nice of the task on a sched_setattr syscall.
Signed-off-by: Parth Shah <parth@linux.ibm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
---
 include/uapi/linux/sched.h       |  4 +++-
 include/uapi/linux/sched/types.h | 19 +++++++++++++++++++
 kernel/sched/core.c              | 17 +++++++++++++++++
 tools/include/uapi/linux/sched.h |  4 +++-
 4 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 4a0217832464..88ce1e8d7553 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -121,6 +121,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -132,6 +133,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index c852153ddb0d..626ab61c1145 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -10,6 +10,7 @@ struct sched_param {
 
 #define SCHED_ATTR_SIZE_VER0	48	/* sizeof first published struct */
 #define SCHED_ATTR_SIZE_VER1	56	/* add: util_{min,max} */
+#define SCHED_ATTR_SIZE_VER2	60	/* add: latency_nice */
 
 /*
  * Extended scheduling parameters data structure.
@@ -96,6 +97,22 @@ struct sched_param {
  * on a CPU with a capacity big enough to fit the specified value.
  * A task with a max utilization value smaller than 1024 is more likely
  * scheduled on a CPU with no more capacity than the specified value.
+ *
+ * Latency Tolerance Attributes
+ * ===========================
+ *
+ * A subset of sched_attr attributes allows to specify the relative latency
+ * requirements of a task with respect to the other tasks running/queued in
+ * the system.
+ *
+ * @ sched_latency_nice	task's latency_nice value
+ *
+ * The latency_nice of a task can have any value in the range of
+ * [MIN_LATENCY_NICE..MAX_LATENCY_NICE].
+ *
+ * A task with latency_nice set to MIN_LATENCY_NICE can be taken to have
+ * lower latency requirements, as opposed to a task with a higher
+ * latency_nice value.
  */
 struct sched_attr {
 	__u32 size;
@@ -118,6 +135,8 @@ struct sched_attr {
 	__u32 sched_util_min;
 	__u32 sched_util_max;
 
+	/* latency requirement hints */
+	__s32 sched_latency_nice;
 };
 
 #endif /* _UAPI_LINUX_SCHED_TYPES_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7359836bf17..a9e5d157b1a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4713,6 +4713,8 @@ static void __setscheduler_params(struct task_struct *p,
 	p->rt_priority = attr->sched_priority;
 	p->normal_prio = normal_prio(p);
 	set_load_weight(p, true);
+
+	p->latency_nice = attr->sched_latency_nice;
 }
 
 /* Actually do priority change: must hold pi & rq lock. */
@@ -4870,6 +4872,13 @@ static int __sched_setscheduler(struct task_struct *p,
 			return retval;
 	}
 
+	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
+		if (attr->sched_latency_nice > MAX_LATENCY_NICE)
+			return -EINVAL;
+		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
+			return -EINVAL;
+	}
+
 	if (pi)
 		cpuset_read_lock();
 
@@ -4904,6 +4913,9 @@ static int __sched_setscheduler(struct task_struct *p,
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE &&
+		    attr->sched_latency_nice != p->latency_nice)
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		retval = 0;
@@ -5152,6 +5164,9 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
 	    size < SCHED_ATTR_SIZE_VER1)
 		return -EINVAL;
 
+	if ((attr->sched_flags & SCHED_FLAG_LATENCY_NICE) &&
+	    size < SCHED_ATTR_SIZE_VER2)
+		return -EINVAL;
 	/*
 	 * XXX: Do we want to be lenient like existing syscalls; or do we want
 	 * to be strict and return an error on out-of-bounds values?
@@ -5381,6 +5396,8 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
 	else
 		kattr.sched_nice = task_nice(p);
 
+	kattr.sched_latency_nice = p->latency_nice;
+
 #ifdef CONFIG_UCLAMP_TASK
 	kattr.sched_util_min = p->uclamp_req[UCLAMP_MIN].value;
 	kattr.sched_util_max = p->uclamp_req[UCLAMP_MAX].value;
diff --git a/tools/include/uapi/linux/sched.h b/tools/include/uapi/linux/sched.h
index 4a0217832464..fa67e2527169 100644
--- a/tools/include/uapi/linux/sched.h
+++ b/tools/include/uapi/linux/sched.h
@@ -121,6 +121,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_LATENCY_NICE		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -132,6 +133,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_LATENCY_NICE)
 
 #endif /* _UAPI_LINUX_SCHED_H */

-- 
2.17.2

^ permalink raw reply related	[flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
  2020-01-16 12:02 [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
  ` (2 preceding siblings ...)
  2020-01-16 12:02 ` [PATCH v3 3/3] sched: Allow sched_{get,set}attr to change latency_nice of the task Parth Shah
@ 2020-02-17  8:57 ` Parth Shah
  2020-02-18 23:00   ` chris hyser
  3 siblings, 1 reply; 23+ messages in thread
From: Parth Shah @ 2020-02-17  8:57 UTC (permalink / raw)
To: vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani,
    dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret,
    David.Laight, pjt, tj

On 1/16/20 5:32 PM, Parth Shah wrote:
> This is the 3rd revision of the patch set to introduce
> latency_{nice/tolerance} as a per task attribute.
>
[ ... full cover letter quoted ... ]

It has been a long time and a few revisions since the beginning of the
discussion around latency-nice. Hence I thought of asking if there is any
further work that needs to be done for adding the latency-nice attribute,
or am I missing any piece here?

Thanks,
Parth

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
  2020-02-17  8:57 ` [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
@ 2020-02-18 23:00 ` chris hyser
  2020-02-19 10:09   ` Parth Shah
  2020-02-19 11:18   ` David Laight
  0 siblings, 2 replies; 23+ messages in thread
From: chris hyser @ 2020-02-18 23:00 UTC (permalink / raw)
To: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider,
    dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret,
    David.Laight, pjt, tj

On 2/17/20 3:57 AM, Parth Shah wrote:
> On 1/16/20 5:32 PM, Parth Shah wrote:
>> This is the 3rd revision of the patch set to introduce
>> latency_{nice/tolerance} as a per task attribute.
>>
[ ... full cover letter quoted ... ]
>
> It has been a long time and a few revisions since the beginning of the
> discussion around latency-nice. Hence I thought of asking if there is any
> further work that needs to be done for adding the latency-nice attribute,
> or am I missing any piece here?

All, I was asked to take a look at the original latency_nice patchset.
First, to clarify objectives, Oracle is not interested in trading
throughput for latency. What we found is that the DB has specific tasks
which do very little but need to do it as quickly as possible, i.e.
extreme latency sensitivity. Second, the key to latency reduction in the
task wakeup path seems to be limiting variations of the "idle cpu" search.
The latter particularly interests me as an example of "platform size based
latency", which I believe to be important given all the varying sizes of
VMs and containers.

Parth, I've been using your v3 patchset as the basis of an investigation
into the measurable effects of short-circuiting this search. I'm not quite
ready to put anything out, but the patchset is working well. The only
feedback I have is that currently a non-root user can set the value
negative, which is inconsistent with 'nice' and, I would think, a security
hole.

-chrish

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
  2020-02-18 23:00 ` chris hyser
@ 2020-02-19 10:09 ` Parth Shah
  2020-02-19 14:15   ` chris hyser
  2020-02-21 17:52   ` chris hyser
  1 sibling, 2 replies; 23+ messages in thread
From: Parth Shah @ 2020-02-19 10:09 UTC (permalink / raw)
To: chris hyser, vincent.guittot, patrick.bellasi, valentin.schneider,
    dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret,
    David.Laight, pjt, tj

Hi Chris,

On 2/19/20 4:30 AM, chris hyser wrote:
> On 2/17/20 3:57 AM, Parth Shah wrote:
>>
[ ... full cover letter and earlier discussion quoted ... ]
>
> Parth, I've been using your v3 patchset as the basis of an investigation
> into the measurable effects of short-circuiting this search. I'm not quite
> ready to put anything out, but the patchset is working well. The only

That's good news, as you have found a usecase for this patch-set.

> feedback I have is that currently a non-root user can set the value
> negative, which is inconsistent with 'nice' and, I would think, a security
> hole.

I would assume you mean 'latency_nice' here. From my testing, I was not
able to set the latency_nice value of any root-owned task as a non-root
user. Also, my patch-set just piggybacks on the already existing
sched_setattr syscall, and hence it should not allow a non-root user to
do any such modifications. Can you confirm this by changing the nice
(renice) value of a root task from a non-root user?

I have done the sanity check in the code and am thinking about where it
could possibly have gone wrong. So, can you please specify what values
you were able to set outside the [-20, 19] range?

Thanks,
Parth

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-19 10:09 ` Parth Shah @ 2020-02-19 14:15 ` chris hyser 2020-02-19 18:23 ` chris hyser 2020-02-21 17:52 ` chris hyser 1 sibling, 1 reply; 23+ messages in thread
From: chris hyser @ 2020-02-19 14:15 UTC (permalink / raw)
To: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, David.Laight, pjt, tj

On 2/19/20 5:09 AM, Parth Shah wrote:
> Hi Chris,
>
> On 2/19/20 4:30 AM, chris hyser wrote:
>> [... v3 cover letter and earlier discussion snipped ...]
>>
>> The only feedback I have is that currently non-root can set the value
>> negative which is inconsistent with 'nice' and I would think a security hole.
>
> I would assume you mean 'latency_nice' here.
>
> From my testing, I was not able to set values for any root owned task's
> latency_nice value by the non-root user. Also, my patch-set just piggybacks
> on the already existing sched_setattr syscall and hence it should not allow
> non-root user to do any modifications. Can you confirm this by changing
> nice (renice) value of a root task from non-root user.
>
> I have done the sanity check in the code and thinking where it could
> possibly have gone wrong. So, can you please specify what values were you
> able to set outside the [-20, 19] range?

The checks prevent being outside that range. But negative numbers -20 to -1
did not need root. Let me dig some more. I verified this explicitly before
sending the email so something is up.

-chrish

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-19 14:15 ` chris hyser @ 2020-02-19 18:23 ` chris hyser 2020-02-20 8:34 ` Parth Shah 0 siblings, 1 reply; 23+ messages in thread
From: chris hyser @ 2020-02-19 18:23 UTC (permalink / raw)
To: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, David.Laight, pjt, tj

On 2/19/20 9:15 AM, chris hyser wrote:
> On 2/19/20 5:09 AM, Parth Shah wrote:
>> [... v3 cover letter and earlier discussion snipped ...]
>>
>> I have done the sanity check in the code and thinking where it could
>> possibly have gone wrong. So, can you please specify what values were you
>> able to set outside the [-20, 19] range?
>
> The checks prevent being outside that range. But negative numbers -20 to
> -1 did not need root. Let me dig some more. I verified this explicitly
> before sending the email so something is up.

I went digging. This is absolutely repeatable. I checked that I do not
unknowingly have CAP_SYS_NICE as a user. So first, are we tying
latency_nice to CAP_SYS_NICE? Seems like a reasonable thing, but not sure I
saw this stated anywhere.
Second, the only capability checked in __sched_setscheduler() in the patch I have is CAP_SYS_NICE and those checks will not return a -EPERM for a negative latency_tolerance (in the code, aka latency_nice). Do I have the correct version of the code? Am I missing something? -chrish ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-19 18:23 ` chris hyser @ 2020-02-20 8:34 ` Parth Shah 2020-02-20 8:50 ` Parth Shah 0 siblings, 1 reply; 23+ messages in thread
From: Parth Shah @ 2020-02-20 8:34 UTC (permalink / raw)
To: chris hyser, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, David.Laight, pjt, tj

On 2/19/20 11:53 PM, chris hyser wrote:
> [... v3 cover letter and earlier discussion snipped ...]
>
> I went digging. This is absolutely repeatable. I checked that I do not
> unknowingly have CAP_SYS_NICE as a user. So first, are we tying
> latency_nice to CAP_SYS_NICE? Seems like a reasonable thing, but not sure I
> saw this stated anywhere. Second, the only capability checked in
> __sched_setscheduler() in the patch I have is CAP_SYS_NICE and those checks
> will not return a -EPERM for a negative latency_tolerance (in the code, aka
> latency_nice). Do I have the correct version of the code? Am I missing
> something?

You are right. I have not added permission checks for setting the
latency_nice value. For task_nice, a non-root user has no permission to set
the value lower than the current one, which is not yet the case with
latency_nice.

In order to align with the permission checks of task_nice, I will add a
similar check and send out v4 of the series soon.

Thanks for pointing out.
- Parth

^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-20 8:34 ` Parth Shah @ 2020-02-20 8:50 ` Parth Shah 2020-02-20 14:30 ` chris hyser 0 siblings, 1 reply; 23+ messages in thread
From: Parth Shah @ 2020-02-20 8:50 UTC (permalink / raw)
To: chris hyser, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, David.Laight, pjt, tj

On 2/20/20 2:04 PM, Parth Shah wrote:
> On 2/19/20 11:53 PM, chris hyser wrote:
>> [... v3 cover letter and earlier discussion snipped ...]
>>
>> I went digging. This is absolutely repeatable. I checked that I do not
>> unknowingly have CAP_SYS_NICE as a user.
>> So first, are we tying latency_nice to CAP_SYS_NICE? Seems like a
>> reasonable thing, but not sure I saw this stated anywhere. Second, the
>> only capability checked in __sched_setscheduler() in the patch I have is
>> CAP_SYS_NICE and those checks will not return a -EPERM for a negative
>> latency_tolerance (in the code, aka latency_nice). Do I have the correct
>> version of the code? Am I missing something?
>
> You are right. I have not added permission checks for setting the
> latency_nice value. For task_nice, a non-root user has no permission to
> set the value lower than the current one, which is not yet the case with
> latency_nice.
>
> In order to align with the permission checks of task_nice, I will add a
> similar check and send out v4 of the series soon.
>
> Thanks for pointing out.
> - Parth

The below diff works out well enough in order to align the permission
checks with NICE:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bfcff5623f9..ef4a397c9170 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p,
 			return -EINVAL;
 		if (attr->sched_latency_nice < MIN_LATENCY_NICE)
 			return -EINVAL;
+		/* Use the same security checks as NICE */
+		if (attr->sched_latency_nice < p->latency_nice &&
+		    !can_nice(p, attr->sched_latency_nice))
+			return -EPERM;
 	}

 	if (pi)

With the above in effect, a non-root user can only increase the value, up
to +19, and once increased it cannot be decreased again. E.g., once a user
sets latency_nice = 19 on a task, the same user cannot then set
latency_nice = 18 for it. This is the same effect as with NICE.

Is such a permission check required? Unlike NICE, we are going to use
latency_nice for scheduler hints only, so wouldn't it make more sense to
allow a user to both increase and decrease the values of their own tasks?

- Parth

^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-20 8:50 ` Parth Shah @ 2020-02-20 14:30 ` chris hyser 2020-02-20 15:03 ` Qais Yousef 0 siblings, 1 reply; 23+ messages in thread From: chris hyser @ 2020-02-20 14:30 UTC (permalink / raw) To: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, David.Laight, pjt, tj On 2/20/20 3:50 AM, Parth Shah wrote: > > > On 2/20/20 2:04 PM, Parth Shah wrote: >> >> >> On 2/19/20 11:53 PM, chris hyser wrote: >>> >>> >>> On 2/19/20 9:15 AM, chris hyser wrote: >>>> >>>> >>>> On 2/19/20 5:09 AM, Parth Shah wrote: >>>>> Hi Chris, >>>>> >>>>> On 2/19/20 4:30 AM, chris hyser wrote: >>>>>> On 2/17/20 3:57 AM, Parth Shah wrote: >>>>>>> >>>>>>> >>>>>>> On 1/16/20 5:32 PM, Parth Shah wrote: >>>>>>>> This is the 3rd revision of the patch set to introduce >>>>>>>> latency_{nice/tolerance} as a per task attribute. >>>>>>>> >>>>>>>> The previous version can be found at: >>>>>>>> v1: https://lkml.org/lkml/2019/11/25/151 >>>>>>>> v2: https://lkml.org/lkml/2019/12/8/10 >>>>>>>> >>>>>>>> Changes in this revision are: >>>>>>>> v2 -> v3: >>>>>>>> - This series changes the longer attribute name to "latency_nice" as per >>>>>>>> the comment from Dietmar Eggemann >>>>>>>> https://lkml.org/lkml/2019/12/5/394 >>>>>>>> v1 -> v2: >>>>>>>> - Addressed comments from Qais Yousef >>>>>>>> - As per suggestion from Dietmar, moved content from newly created >>>>>>>> include/linux/sched/latency_tolerance.h to kernel/sched/sched.h >>>>>>>> - Extend sched_setattr() to support latency_tolerance in tools >>>>>>>> headers UAPI >>>>>>>> >>>>>>>> >>>>>>>> Introduction: >>>>>>>> ============== >>>>>>>> This patch series introduces a new per-task attribute latency_nice to >>>>>>>> provide the scheduler hints about the latency requirements of the >>>>>>>> task [1]. 
>>>>>>>> >>>>>>>> Latency_nice is a ranged attribute of a task with the value ranging >>>>>>>> from [-20, 19] both inclusive which makes it align with the task nice >>>>>>>> value. >>>>>>>> >>>>>>>> The value should provide scheduler hints about the relative latency >>>>>>>> requirements of tasks, meaning the task with "latency_nice = -20" >>>>>>>> should have lower latency requirements than compared to those tasks with >>>>>>>> higher values. Similarly a task with "latency_nice = 19" can have higher >>>>>>>> latency and hence such tasks may not care much about latency. >>>>>>>> >>>>>>>> The default value is set to 0. The usecases discussed below can use this >>>>>>>> range of [-20, 19] for latency_nice for the specific purpose. This >>>>>>>> patch does not implement any use cases for such attribute so that any >>>>>>>> change in naming or range does not affect much to the other (future) >>>>>>>> patches using this. The actual use of latency_nice during task wakeup >>>>>>>> and load-balancing is yet to be coded for each of those usecases. >>>>>>>> >>>>>>>> As per my view, this defined attribute can be used in following ways >>>>>>>> for a >>>>>>>> some of the usecases: >>>>>>>> 1 Reduce search scan time for select_idle_cpu(): >>>>>>>> - Reduce search scans for finding idle CPU for a waking task with lower >>>>>>>> latency_nice values. >>>>>>>> >>>>>>>> 2 TurboSched: >>>>>>>> - Classify the tasks with higher latency_nice values as a small >>>>>>>> background task given that its historic utilization is very low, for >>>>>>>> which the scheduler can search for more number of cores to do task >>>>>>>> packing. A task with a latency_nice >= some_threshold (e.g, == 19) >>>>>>>> and util <= 12.5% can be background tasks. >>>>>>>> >>>>>>>> 3 Optimize AVX512 based workload: >>>>>>>> - Bias scheduler to not put a task having (latency_nice == -20) on a >>>>>>>> core occupying AVX512 based workload. 
>>>>>>>> >>>>>>>> >>>>>>>> Series Organization: >>>>>>>> ==================== >>>>>>>> - Patch 1: Add new attribute latency_nice to task_struct. >>>>>>>> - Patch 2: Clone parent task's attribute to the child task on fork >>>>>>>> - Patch 3: Add support for sched_{set,get}attr syscall to modify >>>>>>>> latency_nice of the task >>>>>>>> >>>>>>>> >>>>>>>> The patch series can be applied on tip/sched/core at the >>>>>>>> commit 804d402fb6f6 ("sched/rt: Make RT capacity-aware") >>>>>>>> >>>>>>>> >>>>>>>> References: >>>>>>>> ============ >>>>>>>> [1]. Usecases for the per-task latency-nice attribute, >>>>>>>> https://lkml.org/lkml/2019/9/30/215 >>>>>>>> [2]. Task Latency-nice, "Subhra Mazumdar", >>>>>>>> https://lkml.org/lkml/2019/8/30/829 >>>>>>>> [3]. Introduce per-task latency_tolerance for scheduler hints, >>>>>>>> https://lkml.org/lkml/2019/12/8/10 >>>>>>>> >>>>>>>> >>>>>>>> Parth Shah (3): >>>>>>>> sched: Introduce latency-nice as a per-task attribute >>>>>>>> sched/core: Propagate parent task's latency requirements to the >>>>>>>> child >>>>>>>> task >>>>>>>> sched: Allow sched_{get,set}attr to change latency_nice of the task >>>>>>>> >>>>>>>> include/linux/sched.h | 1 + >>>>>>>> include/uapi/linux/sched.h | 4 +++- >>>>>>>> include/uapi/linux/sched/types.h | 19 +++++++++++++++++++ >>>>>>>> kernel/sched/core.c | 21 +++++++++++++++++++++ >>>>>>>> kernel/sched/sched.h | 18 ++++++++++++++++++ >>>>>>>> tools/include/uapi/linux/sched.h | 4 +++- >>>>>>>> 6 files changed, 65 insertions(+), 2 deletions(-) >>>>>>>> >>>>>>> >>>>>>> Its been a long time and few revisions since the beginning of the >>>>>>> discussion around the latency-nice. Hence thought of asking if there >>>>>>> is/are >>>>>>> any further work that needs to be done for adding latency-nice >>>>>>> attribute or >>>>>>> am I missing any piece in here? >>>>>> >>>>>> All, I was asked to take a look at the original latency_nice patchset. 
>>>>>> First, to clarify objectives, Oracle is not interested in trading >>>>>> throughput for latency. What we found is that the DB has specific tasks >>>>>> which do very little but need to do this as absolutely quickly as >>>>>> possible, >>>>>> ie extreme latency sensitivity. Second, the key to latency reduction in >>>>>> the >>>>>> task wakeup path seems to be limiting variations of "idle cpu" search. The >>>>>> latter particularly interests me as an example of "platform size based >>>>>> latency" which I believe to be important given all the varying size VMs >>>>>> and >>>>>> containers. >>>>>> >>>>>> Parth, I've been using your v3 patchset as the basis of an investigation >>>>>> into the measurable effects of short-circuiting this search. I'm not quite >>>>>> ready to put anything out, but the patchset is working well. The only >>>>> >>>>> That's a good news as you are able to get a usecase of this patch-set. >>>>> >>>>>> feedback I have is that currently non-root can set the value negative >>>>>> which >>>>>> is inconsistent with 'nice' and I would think a security hole. >>>>>> >>>>> >>>>> I would assume you mean 'latency_nice' here. >>>>> >>>>> From my testing, I was not able to set values for any root owned task's >>>>> latency_nice value by the non-root user. Also, my patch-set just piggybacks >>>>> on the already existing sched_setattr syscall and hence it should not allow >>>>> non-root user to do any modifications. Can you confirm this by changing >>>>> nice (renice) value of a root task from non-root user. >>>>> >>>>> I have done the sanity check in the code and thinking where it could >>>>> possibly have gone wrong. So, can you please specify what values were you >>>>> able to set outside the [-20, 19] range? >>>> >>>> The checks prevent being outside that range. But negative numbers -20 to >>>> -1 did not need root. Let me dig some more. I verified this explicitly >>>> before sending the email so something is up. >>> >>> I went digging. 
This is absolutely repeatable. I checked that I do not >>> unknowingly have CAP_SYS_NICE as a user. So first, are we tying >>> latency_nice to CAP_SYS_NICE? Seems like a reasonable thing, but not sure I >>> saw this stated anywhere. Second, the only capability checked in >>> __sched_setscheduler() in the patch I have is CAP_SYS_NICE and those checks >>> will not return a -EPERM for a negative latency_tolerance (in the code, aka >>> latency_nice). Do I have the correct version of the code? Am I missing >>> something? >> >> You are right. I have not added permission checks for setting the >> latency_nice value. For the task_nice, non-root user has no permission to >> set the value lower than the current value which is not the case with the >> latency_nice. >> >> In order to align with the permission checks like task_nice, I will add the >> check similar to task_nice and send out the v4 of the series soon. >> >> >> Thanks for pointing out. >> - Parth >> > > The below diff works out well enough in-order to align permission checks > with NICE. > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index 2bfcff5623f9..ef4a397c9170 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p, > return -EINVAL; > if (attr->sched_latency_nice < MIN_LATENCY_NICE) > return -EINVAL; > + /* Use the same security checks as NICE */ > + if (attr->sched_latency_nice < p->latency_nice && > + !can_nice(p, attr->sched_latency_nice)) > + return -EPERM; > } > > if (pi) > > With the above in effect, > A non-root user can only increase the value upto +19, and once increased > cannot be decreased. e.g., a user once sets the value latency_nice = 19, > the same user cannot set the value latency_nice = 18. This is the same > effect as with NICE. > > Is such permission checks required? 
> > Unlike NICE, we are going to use latency_nice for scheduler hints only, and > so won't it make more sense to allow a user to increase/decrease the values > of their owned tasks? Whether called a hint or not, it is a trade-off to reduce latency of select tasks at the expense of the throughput of the other tasks in the system. If any of the other tasks belong to other users, you would presumably require permission. -chrish ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-20 14:30 ` chris hyser @ 2020-02-20 15:03 ` Qais Yousef 2020-02-20 16:34 ` chris hyser 0 siblings, 1 reply; 23+ messages in thread From: Qais Yousef @ 2020-02-20 15:03 UTC (permalink / raw) To: chris hyser Cc: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann, linux-kernel, peterz, mingo, pavel, qperret, David.Laight, pjt, tj On 02/20/20 09:30, chris hyser wrote: > > The below diff works out well enough in-order to align permission checks > > with NICE. > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index 2bfcff5623f9..ef4a397c9170 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p, > > return -EINVAL; > > if (attr->sched_latency_nice < MIN_LATENCY_NICE) > > return -EINVAL; > > + /* Use the same security checks as NICE */ > > + if (attr->sched_latency_nice < p->latency_nice && > > + !can_nice(p, attr->sched_latency_nice)) > > + return -EPERM; > > } > > > > if (pi) > > > > With the above in effect, > > A non-root user can only increase the value upto +19, and once increased > > cannot be decreased. e.g., a user once sets the value latency_nice = 19, > > the same user cannot set the value latency_nice = 18. This is the same > > effect as with NICE. > > > > Is such permission checks required? > > > > Unlike NICE, we are going to use latency_nice for scheduler hints only, and > > so won't it make more sense to allow a user to increase/decrease the values > > of their owned tasks? > > Whether called a hint or not, it is a trade-off to reduce latency of select > tasks at the expense of the throughput of the other tasks in the the system. Does it actually affect the throughput of the other tasks? I thought this will allow the scheduler to reduce latencies, for instance, when selecting which cpu it should land on. 
I can't see how this could hurt other tasks. Can you expand on the scenario you have in mind, please? > If any of the other tasks belong to other users, you would presumably > require permission. AFAIU security_task_setscheduler() will only allow a change if the task is changing its own attribute value or has CAP_SYS_NICE. If you're able to change the attribute of another task, then it's not only latency_nice that's broken here. Thanks -- Qais Yousef ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-20 15:03 ` Qais Yousef @ 2020-02-20 16:34 ` chris hyser 2020-02-21 9:29 ` Qais Yousef 0 siblings, 1 reply; 23+ messages in thread From: chris hyser @ 2020-02-20 16:34 UTC (permalink / raw) To: Qais Yousef Cc: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann, linux-kernel, peterz, mingo, pavel, qperret, David.Laight, pjt, tj On 2/20/20 10:03 AM, Qais Yousef wrote: > On 02/20/20 09:30, chris hyser wrote: >>> The below diff works out well enough in-order to align permission checks >>> with NICE. >>> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>> index 2bfcff5623f9..ef4a397c9170 100644 >>> --- a/kernel/sched/core.c >>> +++ b/kernel/sched/core.c >>> @@ -4878,6 +4878,10 @@ static int __sched_setscheduler(struct task_struct *p, >>> return -EINVAL; >>> if (attr->sched_latency_nice < MIN_LATENCY_NICE) >>> return -EINVAL; >>> + /* Use the same security checks as NICE */ >>> + if (attr->sched_latency_nice < p->latency_nice && >>> + !can_nice(p, attr->sched_latency_nice)) >>> + return -EPERM; >>> } >>> >>> if (pi) >>> >>> With the above in effect, >>> A non-root user can only increase the value upto +19, and once increased >>> cannot be decreased. e.g., a user once sets the value latency_nice = 19, >>> the same user cannot set the value latency_nice = 18. This is the same >>> effect as with NICE. >>> >>> Is such permission checks required? >>> >>> Unlike NICE, we are going to use latency_nice for scheduler hints only, and >>> so won't it make more sense to allow a user to increase/decrease the values >>> of their owned tasks? >> >> Whether called a hint or not, it is a trade-off to reduce latency of select >> tasks at the expense of the throughput of the other tasks in the the system. > > Does it actually affect the throughput of the other tasks? 
I thought this will > allow the scheduler to reduce latencies, for instance, when selecting which cpu > it should land on. I can't see how this could hurt other tasks. This is why it is hard to argue about pure abstractions. The primary idea mentioned so far for how these latencies are reduced is by short-circuiting the brute-force search for something idle. If you don't find an idle cpu because you didn't spend the time to look, then you preempted a task, possibly one with a nice large warm cache footprint that was cranking away on throughput. It is ultimately going to be the usual latency vs. throughput trade-off. If latency reduction were "free" we wouldn't need a per-task attribute. We would just do the reduction for all tasks, everywhere, all the time. > > Can you expand on the scenario you have in mind please? Hopefully, the above helps. It was my original plan to introduce this with a data-laden RFC on the topic, but I felt the need to respond to Parth immediately. I'm not currently pushing any particular change. -chrish ^ permalink raw reply [flat|nested] 23+ messages in thread
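To make the trade-off chris describes concrete: one way such a short-circuit could look (a purely hypothetical mapping, not something from the posted patches) is to scale the number of CPUs the idle search is allowed to visit by the task's latency_nice, so latency-sensitive tasks give up the search almost immediately while latency-tolerant ones keep scanning:

```c
#include <assert.h>

#define MIN_LATENCY_NICE (-20)
#define MAX_LATENCY_NICE 19

/*
 * Hypothetical mapping from latency_nice to an idle-CPU scan limit:
 * -20 -> visit a single CPU and place the task there (lowest wakeup
 * latency, may preempt a cache-hot task); 19 -> scan the whole LLC
 * domain looking for a truly idle CPU.
 */
int idle_scan_limit(int latency_nice, int nr_llc_cpus)
{
	int span = MAX_LATENCY_NICE - MIN_LATENCY_NICE; /* 39 */

	return 1 + (latency_nice - MIN_LATENCY_NICE) * (nr_llc_cpus - 1) / span;
}
```

The linear scaling and the function name are assumptions for illustration; the point is only that a smaller scan bound trades the throughput of a possibly preempted task for the waker's latency, which is the trade-off under discussion.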
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-20 16:34 ` chris hyser @ 2020-02-21 9:29 ` Qais Yousef 2020-02-21 10:01 ` Parth Shah 2020-02-21 17:08 ` chris hyser 0 siblings, 2 replies; 23+ messages in thread From: Qais Yousef @ 2020-02-21 9:29 UTC (permalink / raw) To: chris hyser Cc: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann, linux-kernel, peterz, mingo, pavel, qperret, David.Laight, pjt, tj On 02/20/20 11:34, chris hyser wrote: > > > Whether called a hint or not, it is a trade-off to reduce latency of select > > > tasks at the expense of the throughput of the other tasks in the the system. > > > > Does it actually affect the throughput of the other tasks? I thought this will > > allow the scheduler to reduce latencies, for instance, when selecting which cpu > > it should land on. I can't see how this could hurt other tasks. > > This is why it is hard to argue about pure abstractions. The primary idea > mentioned so far for how these latencies are reduced is by short cutting the > brute-force search for something idle. If you don't find an idle cpu because > you didn't spend the time to look, then you pre-empted a task, possibly with > a large nice warm cache footprint that was cranking away on throughput. It > is ultimately going to be the usual latency vs throughput trade off. If > latency reduction were "free" we wouldn't need a per task attribute. We > would just do the reduction for all tasks, everywhere, all the time. This could still happen without the latency nice bias. I'm not sure if this falls under DoS; maybe if you end up spawning a lot of task with high latency nice value, then you might end up cramming a lot of tasks on a small subset of CPUs. But then, shouldn't the logic that uses latency_nice try to handle this case anyway since it could be legit? Not sure if this can be used by someone to trigger timing based attacks on another process. 
I can't fully see all the security implications, but regardless, I do agree it is prudent not to allow tasks to set their own latency_nice. Mainly because the meaning of this flag will be system dependent, and I think admins are better placed to decide how to use this flag for the system they're running on. I don't think application writers should be able to tweak their tasks' latency_nice value, at least not without the right privilege. > > > > > Can you expand on the scenario you have in mind please? > > Hopefully, the above helps. It was my original plan to introduce this with a > data laden RFC on the topic, but I felt the need to respond to Parth > immediately. I'm not currently pushing any particular change. Thanks! -- Qais Yousef ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-21 9:29 ` Qais Yousef @ 2020-02-21 10:01 ` Parth Shah 2020-02-21 16:51 ` chris hyser 2020-02-21 17:08 ` chris hyser 1 sibling, 1 reply; 23+ messages in thread From: Parth Shah @ 2020-02-21 10:01 UTC (permalink / raw) To: Qais Yousef, chris hyser Cc: vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann, linux-kernel, peterz, mingo, pavel, qperret, David.Laight, pjt, tj On 2/21/20 2:59 PM, Qais Yousef wrote: > On 02/20/20 11:34, chris hyser wrote: >>>> Whether called a hint or not, it is a trade-off to reduce latency of select >>>> tasks at the expense of the throughput of the other tasks in the the system. >>> >>> Does it actually affect the throughput of the other tasks? I thought this will >>> allow the scheduler to reduce latencies, for instance, when selecting which cpu >>> it should land on. I can't see how this could hurt other tasks. >> >> This is why it is hard to argue about pure abstractions. The primary idea >> mentioned so far for how these latencies are reduced is by short cutting the >> brute-force search for something idle. If you don't find an idle cpu because >> you didn't spend the time to look, then you pre-empted a task, possibly with >> a large nice warm cache footprint that was cranking away on throughput. It >> is ultimately going to be the usual latency vs throughput trade off. If >> latency reduction were "free" we wouldn't need a per task attribute. We >> would just do the reduction for all tasks, everywhere, all the time. > > This could still happen without the latency nice bias. I'm not sure if this > falls under DoS; maybe if you end up spawning a lot of task with high latency > nice value, then you might end up cramming a lot of tasks on a small subset of > CPUs. But then, shouldn't the logic that uses latency_nice try to handle this > case anyway since it could be legit? 
> > Not sure if this can be used by someone to trigger timing based attacks on > another process. > > I can't fully see the whole security implications, but regardless. I do agree > it is prudent to not allow tasks to set their own latency_nice. Mainly because > the meaning of this flag will be system dependent and I think Admins are the > better ones to decide how to use this flag for the system they're running on. > I don't think application writers should be able to tweak their tasks > latency_nice value. Not if they can't get the right privilege at least. > AFAICT, if latency_nice is really used for scheduler hints to provide latency requirements, then the worst that can happen is that a user requests the least latency for all of their created tasks (which might not even be possible). Hence, unless we bias priority or vruntime, I don't see a DoS possibility with latency_nice. Being conservative for the future, and considering that only the admin may make the correct decision, I do think we should not allow a non-root user to give negative values for latency_nice. NICE doesn't allow reducing the task_nice value even if the value has been increased by the same user in the past. From my understanding, this doesn't make much sense for latency_nice, and the user should be free to choose any value in the range [0, 19]. - Parth >> >>> >>> Can you expand on the scenario you have in mind please? >> >> Hopefully, the above helps. It was my original plan to introduce this with a >> data laden RFC on the topic, but I felt the need to respond to Parth >> immediately. I'm not currently pushing any particular change. > > Thanks! > > -- > Qais Yousef > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-21 10:01 ` Parth Shah @ 2020-02-21 16:51 ` chris hyser 0 siblings, 0 replies; 23+ messages in thread From: chris hyser @ 2020-02-21 16:51 UTC (permalink / raw) To: Parth Shah, Qais Yousef Cc: vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann, linux-kernel, peterz, mingo, pavel, qperret, David.Laight, pjt, tj On 2/21/20 5:01 AM, Parth Shah wrote: > > > On 2/21/20 2:59 PM, Qais Yousef wrote: >> On 02/20/20 11:34, chris hyser wrote: >>>>> Whether called a hint or not, it is a trade-off to reduce latency of select >>>>> tasks at the expense of the throughput of the other tasks in the the system. >>>> >>>> Does it actually affect the throughput of the other tasks? I thought this will >>>> allow the scheduler to reduce latencies, for instance, when selecting which cpu >>>> it should land on. I can't see how this could hurt other tasks. >>> >>> This is why it is hard to argue about pure abstractions. The primary idea >>> mentioned so far for how these latencies are reduced is by short cutting the >>> brute-force search for something idle. If you don't find an idle cpu because >>> you didn't spend the time to look, then you pre-empted a task, possibly with >>> a large nice warm cache footprint that was cranking away on throughput. It >>> is ultimately going to be the usual latency vs throughput trade off. If >>> latency reduction were "free" we wouldn't need a per task attribute. We >>> would just do the reduction for all tasks, everywhere, all the time. >> >> This could still happen without the latency nice bias. I'm not sure if this >> falls under DoS; maybe if you end up spawning a lot of task with high latency >> nice value, then you might end up cramming a lot of tasks on a small subset of >> CPUs. But then, shouldn't the logic that uses latency_nice try to handle this >> case anyway since it could be legit? 
>> >> Not sure if this can be used by someone to trigger timing based attacks on >> another process. >> >> I can't fully see the whole security implications, but regardless. I do agree >> it is prudent to not allow tasks to set their own latency_nice. Mainly because >> the meaning of this flag will be system dependent and I think Admins are the >> better ones to decide how to use this flag for the system they're running on. >> I don't think application writers should be able to tweak their tasks >> latency_nice value. Not if they can't get the right privilege at least. >> > > AFAICT if latency_nice is really used for scheduler hints to provide > latency requirements, then the worst that can happen is the user can > request to have all the created tasks get least latency (which might not > even be possible). Hence, unless we bias priority or vruntime, I don't see > the DoS possibility with latency_nice. A latency nice that does not allow reducing the latency of select tasks at the expense of the throughput of other tasks (i.e., the classic trade-off) doesn't seem to be all that useful and would fail to meet the requirements. Latency nice, in this view, is a directive on how to make that trade-off, much like nice is a directive on how to trade off what should get to run next, and we don't accept the system optionally treating +19 and -20 tasks the same. Even when we don't get exactly what we want, we expect "best effort", and that is not the same as optional. -chrish ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-21 9:29 ` Qais Yousef 2020-02-21 10:01 ` Parth Shah @ 2020-02-21 17:08 ` chris hyser 1 sibling, 0 replies; 23+ messages in thread From: chris hyser @ 2020-02-21 17:08 UTC (permalink / raw) To: Qais Yousef Cc: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann, linux-kernel, peterz, mingo, pavel, qperret, David.Laight, pjt, tj On 2/21/20 4:29 AM, Qais Yousef wrote: > On 02/20/20 11:34, chris hyser wrote: >>>> Whether called a hint or not, it is a trade-off to reduce latency of select >>>> tasks at the expense of the throughput of the other tasks in the the system. >>> >>> Does it actually affect the throughput of the other tasks? I thought this will >>> allow the scheduler to reduce latencies, for instance, when selecting which cpu >>> it should land on. I can't see how this could hurt other tasks. >> >> This is why it is hard to argue about pure abstractions. The primary idea >> mentioned so far for how these latencies are reduced is by short cutting the >> brute-force search for something idle. If you don't find an idle cpu because >> you didn't spend the time to look, then you pre-empted a task, possibly with >> a large nice warm cache footprint that was cranking away on throughput. It >> is ultimately going to be the usual latency vs throughput trade off. If >> latency reduction were "free" we wouldn't need a per task attribute. We >> would just do the reduction for all tasks, everywhere, all the time. > > This could still happen without the latency nice bias. I'm not sure if this True, but without a bias towards a user at the users request, it's just normal scheduler policy. :-) > falls under DoS; maybe if you end up spawning a lot of task with high latency > nice value, then you might end up cramming a lot of tasks on a small subset of > CPUs. 
But then, shouldn't the logic that uses latency_nice try to handle this > case anyway since it could be legit? One of the experiments I'm planning is basically that: how badly can this be abused by root? Like pretty much everything else, if root wants to destroy the system, there's not much you can do, but I find it useful to look at the pathological cases as well. > > Not sure if this can be used by someone to trigger timing based attacks on > another process. > > I can't fully see the whole security implications, but regardless. I do agree > it is prudent to not allow tasks to set their own latency_nice. Mainly because > the meaning of this flag will be system dependent and I think Admins are the > better ones to decide how to use this flag for the system they're running on. > I don't think application writers should be able to tweak their tasks > latency_nice value. Not if they can't get the right privilege at least. Agreed. > >> >>> >>> Can you expand on the scenario you have in mind please? >> >> Hopefully, the above helps. It was my original plan to introduce this with a >> data laden RFC on the topic, but I felt the need to respond to Parth >> immediately. I'm not currently pushing any particular change. > > Thanks! No problem. -chrish ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-19 10:09 ` Parth Shah 2020-02-19 14:15 ` chris hyser @ 2020-02-21 17:52 ` chris hyser 1 sibling, 0 replies; 23+ messages in thread From: chris hyser @ 2020-02-21 17:52 UTC (permalink / raw) To: Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, David.Laight, pjt, tj On 2/19/20 5:09 AM, Parth Shah wrote: > Hi Chris, >> Parth, I've been using your v3 patchset as the basis of an investigation >> into the measurable effects of short-circuiting this search. I'm not quite >> ready to put anything out, but the patchset is working well. The only > > That's a good news as you are able to get a usecase of this patch-set. Parth, I wanted to make sure to thank you for all this effort. It has made it really easy to start experimenting with these ideas and concepts and that's been an enormous help. Thanks again. -chrish ^ permalink raw reply [flat|nested] 23+ messages in thread
* RE: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints 2020-02-18 23:00 ` chris hyser 2020-02-19 10:09 ` Parth Shah @ 2020-02-19 11:18 ` David Laight 2020-02-19 17:16 ` chris hyser 1 sibling, 1 reply; 23+ messages in thread From: David Laight @ 2020-02-19 11:18 UTC (permalink / raw) To: 'chris hyser', Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, pjt, tj From: chris hyser > Sent: 18 February 2020 23:00 ... > All, I was asked to take a look at the original latency_nice patchset. > First, to clarify objectives, Oracle is not > interested in trading throughput for latency. > What we found is that the DB has specific tasks which do very little but > need to do this as absolutely quickly as possible, ie extreme latency > sensitivity. Second, the key to latency reduction > in the task wakeup path seems to be limiting variations of "idle cpu" search. > The latter particularly interests me as an example of "platform size > based latency" which I believe to be important given all the varying size > VMs and containers. From my experiments there are a few things that seem to affect latency of waking up real time (sched fifo) tasks on a normal kernel: 1) The time taken for the (intel x86) cpu to wake up from monitor/mwait. If the cpu is allowed to enter deeper sleep states this can take 900us. Any changes to this are system-wide, not process-specific. 2) If the cpu an RT process last ran on (ie the one it is woken on) is running in kernel, the process switch won't happen until cond_resched() is called. On my system the code to flush the display frame buffer takes 3.3ms. Compiling a kernel with CONFIG_PREEMPT=y will reduce this. 3) If a hardware interrupt happens just after the process is woken then you have to wait until it finishes and any 'softint' work that is scheduled on the same cpu finishes.
The ethernet driver transmit completions and receive ring filling can easily take 1ms. Booting with 'threadirqs' might help this. 4) If you need to acquire a lock/futex then you need to allow for the process that holds it being delayed by a hardware interrupt (etc). So even if the lock is only held for a few instructions it can take a long time to acquire. (I need to change some linked lists to arrays indexed by an atomically incremented global index.) FWIW I can't imagine how a database can have anything that is that latency sensitive. We are doing lots of channels of audio processing and have a lot of work to do within 10ms to avoid audible errors. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales) ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
  2020-02-19 11:18 ` David Laight
@ 2020-02-19 17:16 ` chris hyser
  2020-02-20 14:39 ` David Laight
  0 siblings, 1 reply; 23+ messages in thread
From: chris hyser @ 2020-02-19 17:16 UTC (permalink / raw)
To: David Laight, Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, pjt, tj

On 2/19/20 6:18 AM, David Laight wrote:
> From: chris hyser
>> Sent: 18 February 2020 23:00
> ...
>> All, I was asked to take a look at the original latency_nice patchset.
>> First, to clarify objectives, Oracle is not interested in trading
>> throughput for latency. What we found is that the DB has specific tasks
>> which do very little but need to do this as absolutely quickly as
>> possible, ie extreme latency sensitivity. Second, the key to latency
>> reduction in the task wakeup path seems to be limiting variations of
>> "idle cpu" search. The latter particularly interests me as an example
>> of "platform size based latency" which I believe to be important given
>> all the varying size VMs and containers.
>
> From my experiments there are a few things that seem to affect latency
> of waking up real time (sched fifo) tasks on a normal kernel:

Sorry. I was only ever talking about sched_other as per the original
patchset. I realize the term extreme latency sensitivity may have caused
confusion. What that means to DB people is no doubt different than audio
people. :-)

> 1) The time taken for the (intel x86) cpu to wake up from monitor/mwait.
>    If the cpu is allowed to enter deeper sleep states this can take 900us.
>    Any changes to this are system-wide, not process specific.
>
> 2) If the cpu an RT process last ran on (ie the one it is woken on) is
>    running in the kernel, the process switch won't happen until
>    cond_resched() is called. On my system the code to flush the display
>    frame buffer takes 3.3ms. Compiling a kernel with CONFIG_PREEMPT=y
>    will reduce this.
>
> 3) If a hardware interrupt happens just after the process is woken then
>    you have to wait until it finishes and any 'softint' work that is
>    scheduled on the same cpu finishes. The ethernet driver transmit
>    completions and receive ring filling can easily take 1ms. Booting
>    with 'threadirqs' might help this.
>
> 4) If you need to acquire a lock/futex then you need to allow for the
>    process that holds it being delayed by a hardware interrupt (etc).
>    So even if the lock is only held for a few instructions it can take
>    a long time to acquire. (I need to change some linked lists to arrays
>    indexed by an atomically incremented global index.)
>
> FWIW I can't imagine how a database can have anything that is that
> latency sensitive. We are doing lots of channels of audio processing
> and have a lot of work to do within 10ms to avoid audible errors.

There are existing internal numbers that I will ultimately have to
duplicate which show that simply short-cutting these idle cpu searches
has a significant benefit on DB performance on large hardware. However,
that was for a different patchset involving things I don't like, so I'm
still exploring how to achieve similar results within the latency_nice
framework.

-chrish
* RE: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
  2020-02-19 17:16 ` chris hyser
@ 2020-02-20 14:39 ` David Laight
  2020-02-20 15:55 ` chris hyser
  0 siblings, 1 reply; 23+ messages in thread
From: David Laight @ 2020-02-20 14:39 UTC (permalink / raw)
To: 'chris hyser', Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, pjt, tj

From: chris hyser <chris.hyser@oracle.com>
> Sent: 19 February 2020 17:17
>
> On 2/19/20 6:18 AM, David Laight wrote:
> > From: chris hyser
> >> Sent: 18 February 2020 23:00
> > ...
> >> All, I was asked to take a look at the original latency_nice patchset.
> >> [...]
> >
> > From my experiments there are a few things that seem to affect latency
> > of waking up real time (sched fifo) tasks on a normal kernel:
>
> Sorry. I was only ever talking about sched_other as per the original
> patchset. I realize the term extreme latency sensitivity may have
> caused confusion. What that means to DB people is no doubt different
> than audio people. :-)

Shorter lines.....

ISTM you are making some already complicated code even more complex.
Better to make it simpler instead.

If you need a thread to run as soon as possible after it is woken, why
not use the RT scheduler (eg SCHED_FIFO)? That is what it is for.

If there are delays finding an idle cpu to migrate a process to
(especially on systems with large numbers of cpus) then that is a
general problem that can be addressed without extra knobs.

	David
* Re: [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints
  2020-02-20 14:39 ` David Laight
@ 2020-02-20 15:55 ` chris hyser
  0 siblings, 0 replies; 23+ messages in thread
From: chris hyser @ 2020-02-20 15:55 UTC (permalink / raw)
To: David Laight, Parth Shah, vincent.guittot, patrick.bellasi, valentin.schneider, dhaval.giani, dietmar.eggemann
Cc: linux-kernel, peterz, mingo, qais.yousef, pavel, qperret, pjt, tj

On 2/20/20 9:39 AM, David Laight wrote:
> From: chris hyser <chris.hyser@oracle.com>
>> Sent: 19 February 2020 17:17
>>
>> On 2/19/20 6:18 AM, David Laight wrote:
>>> From: chris hyser
>>>> Sent: 18 February 2020 23:00
>>> ...
>>>> All, I was asked to take a look at the original latency_nice patchset.
>>>> [...]
>>>
>>> From my experiments there are a few things that seem to affect latency
>>> of waking up real time (sched fifo) tasks on a normal kernel:
>>
>> Sorry. I was only ever talking about sched_other as per the original
>> patchset. I realize the term extreme latency sensitivity may have
>> caused confusion. What that means to DB people is no doubt different
>> than audio people. :-)
>
> Shorter lines.....
>
> ISTM you are making some already complicated code even more complex.
> Better to make it simpler instead.

The code already exists to set a limit to bail out of what is sometimes
a needlessly excessive search. Setting that based on an integer doesn't
seem particularly complex. Now whether that is actually useful is what
I'm currently looking at.

> If you need a thread to run as soon as possible after it is woken, why
> not use the RT scheduler (eg SCHED_FIFO)? That is what it is for.

Overkill, and it doesn't play well with the cpu cgroup controller.

> If there are delays finding an idle cpu to migrate a process to
> (especially on systems with large numbers of cpus) then that is a
> general problem that can be addressed without extra knobs.

There is no "if". It is a brute-force search, with delays proportional
to the search domain size. You can optimize the hell out of the brute
force, or you use obtained knowledge to bail out early. Getting that
knowledge from the user is a time-honored tradition. :-)

-chrish
end of thread, other threads:[~2020-02-21 17:52 UTC | newest]

Thread overview: 23+ messages -- links below jump to the message on this page --
2020-01-16 12:02 [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
2020-01-16 12:02 ` [PATCH v3 1/3] sched: Introduce latency-nice as a per-task attribute Parth Shah
2020-01-16 12:02 ` [PATCH v3 2/3] sched/core: Propagate parent task's latency requirements to the child task Parth Shah
2020-01-16 12:02 ` [PATCH v3 3/3] sched: Allow sched_{get,set}attr to change latency_nice of the task Parth Shah
2020-02-17  8:57 ` [PATCH v3 0/3] Introduce per-task latency_nice for scheduler hints Parth Shah
2020-02-18 23:00   ` chris hyser
2020-02-19 10:09     ` Parth Shah
2020-02-19 14:15       ` chris hyser
2020-02-19 18:23         ` chris hyser
2020-02-20  8:34           ` Parth Shah
2020-02-20  8:50             ` Parth Shah
2020-02-20 14:30               ` chris hyser
2020-02-20 15:03           ` Qais Yousef
2020-02-20 16:34             ` chris hyser
2020-02-21  9:29               ` Qais Yousef
2020-02-21 10:01                 ` Parth Shah
2020-02-21 16:51                 ` chris hyser
2020-02-21 17:08                   ` chris hyser
2020-02-21 17:52                     ` chris hyser
2020-02-19 11:18     ` David Laight
2020-02-19 17:16       ` chris hyser
2020-02-20 14:39         ` David Laight
2020-02-20 15:55           ` chris hyser