From: "Pratik R. Sampat" <>
Subject: [RFC 0/5] kernel: Introduce CPU Namespace
Date: Sat,  9 Oct 2021 20:42:38 +0530
Message-ID: <>

An early prototype to demonstrate the CPU namespace interface and its
mechanism.

The kernel provides two ways to control CPU resources for tasks:
1. cgroup cpuset:
   A control mechanism to restrict CPUs to a task or a
   set of tasks attached to that group
2. syscall sched_setaffinity:
   A system call that can pin tasks to a set of CPUs

The kernel also provides three ways to view the CPU resources available
to the system:
1. sysfs/procfs:
   CPU system information is exposed through sysfs and procfs. This
   includes the online, offline and present CPUs as well as the load
   characteristics of the CPUs
2. syscall sched_getaffinity:
   A system call interface to get the cpuset affinity of tasks
3. cgroup cpuset:
   While cgroup is more of a control mechanism than a display mechanism,
   it can be viewed to retrieve the CPU restrictions applied on a group
   of tasks

Coherency of information
The control and display interfaces are fairly disjoint from each
other. Restrictions can be set through control interfaces like cgroups,
while many applications, legacy or otherwise, get their view of the
system through sysfs/procfs and size resources such as the number of
threads/processes or memory allocations based on that information.

This can lead to unexpected runtime behavior as well as a significant
impact on performance.
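
As a minimal sketch of the mismatch described above, the following
userspace C reads the sysfs online file the way a naive application
might and counts the CPUs in it. The cpulist format ("0-3,8") and the
path /sys/devices/system/cpu/online are real; the function names
cpulist_count() and sysfs_online_cpus() are hypothetical and used only
for illustration. On a vanilla kernel the count reflects the whole
host, regardless of any cpuset restriction placed on the caller.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Count the CPUs named in a kernel cpulist string such as "0-3,8".
 * Returns -1 if the string is malformed. (Hypothetical helper.)
 */
int cpulist_count(const char *s)
{
    int count = 0;

    while (*s && *s != '\n') {
        char *end;
        long lo, hi;

        lo = strtol(s, &end, 10);
        if (end == s || lo < 0)
            return -1;
        hi = lo;
        s = end;
        if (*s == '-') {                /* a range "lo-hi" */
            s++;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo)
                return -1;
            s = end;
        }
        count += (int)(hi - lo + 1);
        if (*s == ',')
            s++;
        else if (*s && *s != '\n')
            return -1;
    }
    return count;
}

/*
 * Number of CPUs listed in the sysfs online file, or -1 on error.
 * This is the host-wide view that many applications size themselves by.
 */
int sysfs_online_cpus(void)
{
    char buf[4096];
    FILE *f = fopen("/sys/devices/system/cpu/online", "r");
    int n = -1;

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        n = cpulist_count(buf);
    fclose(f);
    return n;
}
```

An application that spawns sysfs_online_cpus() workers inside a
container restricted to 4 CPUs by cpuset would still size itself for
the full host.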

Existing solutions to the problem include userspace tools like LXCFS,
which can fake the sysfs information by mounting over the sysfs online
file so that it is coherent with the limits set through cgroup cpuset.
However, LXCFS is an external solution and needs to be explicitly set up
for applications that require it. Another concern is that tools like
LXCFS don't handle all the other display mechanisms, such as procfs load
stats.

Therefore, a case can be made for a clean interface.

Security and fair use implications
In a multi-tenant system, multiple containers may exist, and exposing
information about the entire system, rather than just the resources
they are restricted to, can have security and fair use implications such
as:
1. An actor who is aware of the CPU node topology can schedule
   workloads and select CPUs such that the bus is flooded, causing a
   Denial of Service attack
2. Identifying the CPU system topology can help identify cores that are
   close to buses and peripherals such as GPUs, giving an undue latency
   advantage over the rest of the workloads

A survey RFD that discusses other potential solutions and their
concerns is listed here:

This prototype patchset introduces a new kernel namespace mechanism --
CPU namespace.

The CPU namespace isolates CPU information by virtualizing logical CPU
IDs and creating a scrambled virtual CPU map of the same.
It latches onto the task_struct, and the CPU translations are designed
to be in a flat hierarchy. This means that every virtual namespace CPU
maps to a physical CPU at the creation of the namespace. The advantage
of a flat hierarchy is that translations are O(1) and children do not
need to traverse up the tree to retrieve a translation.
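
The flat, scrambled map described above can be sketched in userspace C
as follows. This is an illustration only, not the code from the
patches: struct cpuns_map, the cpuns_* function names, the fixed
CPUNS_MAX_CPUS limit and the toy LCG used for shuffling are all
assumptions made for this example. The point it shows is that a child
namespace resolves its CPUs through the parent once, at creation, and
afterwards translates with a single array lookup.

```c
#include <assert.h>

#define CPUNS_MAX_CPUS 256  /* hypothetical limit for this sketch */

struct cpuns_map {
    int nr_cpus;
    int v2p[CPUNS_MAX_CPUS];  /* virtual CPU -> physical CPU */
    int p2v[CPUNS_MAX_CPUS];  /* physical CPU -> virtual CPU, -1 if absent */
};

/* Build a scrambled map over the given physical CPUs. Returns 0 or -1. */
int cpuns_map_init(struct cpuns_map *m, const int *phys, int n,
                   unsigned int seed)
{
    int i;

    if (n <= 0 || n > CPUNS_MAX_CPUS)
        return -1;
    for (i = 0; i < CPUNS_MAX_CPUS; i++)
        m->p2v[i] = -1;
    for (i = 0; i < n; i++) {
        if (phys[i] < 0 || phys[i] >= CPUNS_MAX_CPUS ||
            m->p2v[phys[i]] != -1)
            return -1;  /* out of range or duplicate CPU */
        m->v2p[i] = phys[i];
        m->p2v[phys[i]] = i;
    }
    /* Fisher-Yates shuffle driven by a toy LCG (not a secure RNG). */
    for (i = n - 1; i > 0; i--) {
        int j, tmp;

        seed = seed * 1103515245u + 12345u;
        j = (int)((seed >> 16) % (unsigned int)(i + 1));
        tmp = m->v2p[i];
        m->v2p[i] = m->v2p[j];
        m->v2p[j] = tmp;
    }
    for (i = 0; i < n; i++)
        m->p2v[m->v2p[i]] = i;
    m->nr_cpus = n;
    return 0;
}

/* O(1) translations; -1 means the CPU is not part of the namespace. */
int cpuns_v2p(const struct cpuns_map *m, int vcpu)
{
    return (vcpu >= 0 && vcpu < m->nr_cpus) ? m->v2p[vcpu] : -1;
}

int cpuns_p2v(const struct cpuns_map *m, int pcpu)
{
    return (pcpu >= 0 && pcpu < CPUNS_MAX_CPUS) ? m->p2v[pcpu] : -1;
}

/*
 * Flat hierarchy: the child stores physical CPUs directly. The parent
 * is consulted only here, at creation time.
 */
int cpuns_map_init_child(struct cpuns_map *child,
                         const struct cpuns_map *parent,
                         const int *parent_vcpus, int n, unsigned int seed)
{
    int phys[CPUNS_MAX_CPUS];
    int i;

    if (n <= 0 || n > CPUNS_MAX_CPUS)
        return -1;
    for (i = 0; i < n; i++) {
        phys[i] = cpuns_v2p(parent, parent_vcpus[i]);
        if (phys[i] < 0)
            return -1;
    }
    return cpuns_map_init(child, phys, n, seed);
}
```

The trade-off in this sketch is memory for speed: each namespace keeps
its own full translation tables so that no lookup ever depends on the
depth of nesting.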

This namespace then allows both control and display interfaces to be
CPU namespace context aware, such that a task within a namespace only
gets the view, and therefore control, of the CPU resources available to
it via a virtual CPU map.

We designed an experiment to benchmark nginx configured with
"worker_processes auto" (which ensures that the number of processes to
spawn is derived from the resources viewed on the system) and a
benchmark/driver application, wrk.

Nginx: Nginx is a web server that can also be used as a reverse proxy,
load balancer, mail proxy and HTTP cache
Wrk: wrk is a modern HTTP benchmarking tool capable of generating
significant load when run on a single multi-core CPU

Docker is used as the containerization platform of choice.

The numbers were gathered on an IBM Power 9 CPU @ 2.979GHz with 176 CPUs
and 127GB memory
kernel: 5.14

Case1: vanilla kernel - cpuset 4 cpus, no optimization
Case2: CPU namespace kernel - cpuset 4 cpus

|        Metric         |  Case1   |  Case2   | Case2 vs Case1  |
| PIDs                  |      177 |        5 |        172 PIDs |
| mem usage (init) (MB) |    272.8 |    11.12 |          95.92% |
| mem usage (peak) (MB) |    281.3 |    20.62 |          92.66% |
| Latency (avg ms)      |    70.91 |    25.36 |          64.23% |
| Requests/sec          | 47011.05 | 47080.98 |           0.14% |
| Transfer/sec (MB)     |    38.11 |    38.16 |           0.13% |

With the CPU namespace we see the correct number of PIDs spawning,
corresponding to the cpuset limits set. The memory utilization drops
by 92-95%, the latency reduces by 64% and the throughput, measured as
requests and transfer per second, is unchanged.
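
For readers who want to check the last column of the table, the
relative reduction can be recomputed with a small helper. The function
names below are hypothetical; the inputs are the Case1 and Case2 values
from the table. The computed value may differ from the table in the
last digit depending on whether it is rounded or truncated.

```c
#include <stdio.h>

/* Percentage by which `value` is lower than `base`. */
double pct_reduction(double base, double value)
{
    return (base - value) / base * 100.0;
}

/* Print one comparison row from a Case1 and a Case2 measurement. */
void print_reduction(const char *metric, double case1, double case2)
{
    printf("%-22s %9.2f -> %9.2f  (%.2f%% lower)\n",
           metric, case1, case2, pct_reduction(case1, case2));
}
```

For example, print_reduction("mem usage (init) (MB)", 272.8, 11.12)
reports the drop in initial memory usage between the two cases.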

Note: To utilize this new namespace in a container runtime like Docker,
the clone CPU namespace flag was modified to coincide with the PID
namespace as they are the building blocks of containers and will always
be invoked.

Current shortcomings in the prototype:
1. Containers also frequently use CFS period and quota to restrict CPU
   runtime, also known as millicores in modern container runtimes.
   The RFC interface currently does not account for this.
2. While /proc/stat is now namespace aware and userspace programs like
   top will see the CPU utilization for their view of virtual CPUs;
   if the system or any other application outside the namespace
   bumps up the CPU utilization it will still show up in sys/user time.
   This should ideally be shown as stolen time instead.
   The current implementation plugs into the display of stats rather
   than accounting which causes incorrect reporting of stolen time.
3. The current implementation assumes that no hotplug operations occur
   within a container and hence the online and present CPUs within a CPU
   namespace are always the same and query the same CPU namespace mask.
4. As this is a proof of concept, we currently do not differentiate
   between cgroup cpus_allowed and effective_cpus, and plug them both
   into the same virtual CPU map of the namespace.
5. As described in a fair use implication earlier, knowledge of the
   CPU topology can potentially be misused to flood the bus.
   While scrambling the CPU set in the namespace can help by
   obfuscating information, the topology can still be roughly figured
   out by using IPI latencies to determine siblings or far away cores.
More information about the design and a video demo of the prototype can
be found here:

Pratik R. Sampat (5):
  ns: Introduce CPU Namespace
  ns: Add scrambling functionality to CPU namespace
  cpuset/cpuns: Make cgroup CPUset CPU namespace aware
  cpu/cpuns: Make sysfs CPU namespace aware
  proc/cpuns: Make procfs load stats CPU namespace aware

 drivers/base/cpu.c             |  35 ++++-
 fs/proc/namespaces.c           |   4 +
 fs/proc/stat.c                 |  50 +++++--
 include/linux/cpu_namespace.h  | 159 ++++++++++++++++++++++
 include/linux/nsproxy.h        |   2 +
 include/linux/proc_ns.h        |   2 +
 include/linux/user_namespace.h |   1 +
 include/uapi/linux/sched.h     |   1 +
 init/Kconfig                   |   8 ++
 kernel/Makefile                |   1 +
 kernel/cgroup/cpuset.c         |  57 +++++++-
 kernel/cpu_namespace.c         | 233 +++++++++++++++++++++++++++++++++
 kernel/fork.c                  |   2 +-
 kernel/nsproxy.c               |  30 ++++-
 kernel/sched/core.c            |  16 ++-
 kernel/ucount.c                |   1 +
 16 files changed, 581 insertions(+), 21 deletions(-)
 create mode 100644 include/linux/cpu_namespace.h
 create mode 100644 kernel/cpu_namespace.c


Thread overview: 20+ messages
2021-10-09 15:12 Pratik R. Sampat [this message]
2021-10-09 15:12 ` [RFC 1/5] ns: Introduce CPU Namespace Pratik R. Sampat
2021-10-09 22:37   ` Peter Zijlstra
2021-10-09 15:12 ` [RFC 2/5] ns: Add scrambling functionality to CPU namespace Pratik R. Sampat
2021-10-09 15:12 ` [RFC 3/5] cpuset/cpuns: Make cgroup CPUset CPU namespace aware Pratik R. Sampat
2021-10-09 15:12 ` [RFC 4/5] cpu/cpuns: Make sysfs " Pratik R. Sampat
2021-10-09 15:12 ` [RFC 5/5] proc/cpuns: Make procfs load stats " Pratik R. Sampat
2021-10-09 22:41 ` [RFC 0/5] kernel: Introduce CPU Namespace Peter Zijlstra
2021-10-11 10:11 ` Christian Brauner
2021-10-11 14:17   ` Michal Koutný
2021-10-11 17:42     ` Tejun Heo
2021-10-12  8:42   ` Pratik Sampat
2021-10-14 22:14     ` Tejun Heo
2021-10-18 15:29       ` Pratik Sampat
2021-10-18 16:29         ` Tejun Heo
2021-10-20 10:44           ` Pratik Sampat
2021-10-20 16:35             ` Tejun Heo
2021-10-21  7:44               ` Pratik Sampat
2021-10-21 17:06                 ` Tejun Heo
2021-10-21 17:15             ` Eric W. Biederman
