[RFC PATCH 00/11] ptq: Per Thread Queues

From: Tom Herbert <tom@herbertland.com>
To: netdev@vger.kernel.org
Cc: Tom Herbert <tom@herbertland.com>
Subject: [RFC PATCH 00/11] ptq: Per Thread Queues
Date: Wed, 24 Jun 2020 10:17:39 -0700	[thread overview]
Message-ID: <20200624171749.11927-1-tom@herbertland.com> (raw)

Per Thread Queues allows application threads to be assigned dedicated
hardware network queues for both transmit and receive. This facility
provides a high degree of traffic isolation between applications and
can also help facilitate high performance due to fine grained packet
steering. An overview and design considerations of Per Thread Queues
has been add to Documentation/networking/scaling.rst.

This patch set provides a basic implementation of Per Thread Queues.
The patch set includes:

	- Minor Infrastructure changes to cgroups (just export a
	  couple of functions)
	- netqueue.h to hold generic definitions for network queues
	- Minor infrastructure in aRFS and net-sysfs to accommodate
	  PTQ
	- Introduce the concept of "global queues". These are used
	  in cgroup configuration of PTQ. Global queues can be
	  mapped to real device queues. A per device queue sysfs
	  parameter is added to configure the mapping of device
	  queue to a global queue
	- Creation of a new cgroup controller, "net_queues", that
	  is used to configure Per Thread Queues
	- Hook up the transmit path. This has two parts: 1) In
	  send socket operations record the transmit queue
	  associated with a task in the sock structure, 2) In
	  netdev_pick_tx, check if the sock structure of the skb
	  has a valid transmit global queue set. If so, convert the
	  queue identifier to a device queue identifier based on the per
	  device mapping table. This selection precedes XPS
	- Hook up the receive path. This has two parts: 1) In
	  rps_record_sock_flow check if a receive global queue is
	  assigned to the running task, if so then set it in the
	  sock_flow_table entry for the flow. Note this in lieu of
	  setting the running CPU in the entry. 2) Change get_rps_cpu to
	  query the sock_flow_table to see if a queue index has been
	  stored (as opposed to a CPU number). If a queue index is
	  present, use it for steering including for it to be the
	  target of ndo_rx_flow_steer.

Related features and concepts:

	- netprio and prio_tc_map: Similar to those, PTQ allows control,
	  via cgroups and per device maps, over mapping applications'
	  packets to transmit queues. However, PTQ is intended to
	  perform fine grained per application mapping to queues such
	  that each application thread, possibly thousands of them, can
	  have its own dedicate transmit queue.
	- aRFS: On the transmit side PTQ extends aRFS to steer packets
	  for a flow based on assigned global queue as opposed to only
	  running CPU for the processing thread. In PTQ, the queue
	  "follows" the thread so that when threads are scheduled to
	  run on a different CPU, the packets for flows of the thread
	  continue to be received on the right queue. This addresses
	  a problem in aRFS where when a thread is rescheduled all
	  of its aRFS steered flows may be to moved to a different queue
	  (i.e. ndo_rx_flow_steer needs to be called for each flow).
	- Busy polling: PTQ provides silo'ing of an application packets
	  into queues and busy polling of those queue can then be
	  applied for high performance. This is likely the fist
	  instantiation of PTQ to combine it with busy polling
	  (moving interrupts for those queues as threads are scheduled
	  is most likely prohibitive). Busy polling is only practical
	  with a few queues, like maybe at most one per CPU, and
	  won't scale to thousands of per thread queues in use.
          (to address that sleeping-busy-poll with completion
	  queues is suggested below).
	- Making Networking Queues a First Class Citizen in the Kernel
	  https://linuxplumbersconf.org/event/4/contributions/462/
	  attachments/241/422/LPC_2019_kernel_queue_manager.pdf:
	  The concept of "global queues" should be a good complement
	  to this proposal. Global queue provide an abstract
	  representation of device queues. the abstraction is resolved
	  when the global queue is mapped to a real hardware queue. This
	  layering allows exposing queues to the user and configuration
	  which might be associated with general attributes (like high
	  priority, QoS characteristics, etc.). The mapping to a
	  specific device queue gives the low level queue that satisfies
	  the implied service of the global queue.  Any attributes and
	  associations are configured and in no way hardcoded, so that
	  the use of queues in the manner is fully extensible and can be
	  driven be arbitrary user defined policy. Since global queues
	  are device agnostic they not just can be managed as local
	  system resource, but also across across the distributed
	  tasks for a job in the datacenter like as a property of a
	  container in Kubernetes (similar to how we might manage
	  network priority as a global DC resource, but global queues
	  provide much more granularity and richness in what they can
	  convey).

There are a number of possible extensions to this work

	- Queue selection could be done on a per process basis
	  or a per socket basis as well as a per thread basis. (per
	  packet basis probably makes little sense due to OOO)
	- The mechanism for selecting a queue to assign to a thread
	  could be programmed. For instance, an eBPF hook could be
	  added that would allow very fine grained policies to do
	  queue selection.
	- "Global queue groups" could be created where a global queue
	  identifier maps to some group of device queues and there is
	  a selection algorithm, possibly another eBPF hook, that
	  maps to a specific device queue for use.
	- Another attribute in the cgroup could be added to enable
	  or disable aRFS on a per thread basis.
	- Extend the net_queues cgroup to allow control over
	  busy-polling on a per cgroup basis. This could further
	  be enhanced by eBPF hooks to control busy-polling for
	  individual sockets of the cgroup per some arbitrary policy
	  (similar to eBPF hook for SO_RESUSEPORT).
	- Elasticity in listener sockets. As described in the
	  Documentation we expect that a filter can be installed to
	  direct packets an application to the set of queues for the
	  applications. The problem is that the application may
	  create threads on demand so that we don't know a priori
	  how many queues the application needs. Optimally, we
	  want a mechanism to dynamically enable/disable a
	  queue in the filter set so that at any given time the
	  application is receive packets only on queues it is
	  actively using. This may entail a new ndo_function.
	- The sleeping-busy-poll with completion queue model
	  described in the documentation could be integrated. This
	  would most entail creating a reverse mapping from queue
	  to threads, and then allowing the thread processing a
	  device completion queue to schedule the threads of interest.

Tom Herbert (11):
  cgroup: Export cgroup_{procs,threads}_start and cgroup_procs_next
  net: Create netqueue.h and define NO_QUEUE
  arfs: Create set_arfs_queue
  net-sysfs: Create rps_create_sock_flow_table
  net: Infrastructure for per queue aRFS
  net: Function to check against maximum number for RPS queues
  net: Introduce global queues
  ptq: Per Thread Queues
  ptq: Hook up transmit side of Per Queue Threads
  ptq: Hook up receive side of Per Queue Threads
  doc: Documentation for Per Thread Queues

 Documentation/networking/scaling.rst | 195 +++++++-
 include/linux/cgroup.h               |   3 +
 include/linux/cgroup_subsys.h        |   4 +
 include/linux/netdevice.h            | 204 +++++++-
 include/linux/netqueue.h             |  25 +
 include/linux/sched.h                |   4 +
 include/net/ptq.h                    |  45 ++
 include/net/sock.h                   |  75 ++-
 kernel/cgroup/cgroup.c               |   9 +-
 kernel/fork.c                        |   4 +
 net/Kconfig                          |  18 +
 net/core/Makefile                    |   1 +
 net/core/dev.c                       | 177 +++++--
 net/core/filter.c                    |   4 +-
 net/core/net-sysfs.c                 | 201 +++++++-
 net/core/ptq.c                       | 688 +++++++++++++++++++++++++++
 net/core/sysctl_net_core.c           | 152 ++++--
 net/ipv4/af_inet.c                   |   6 +
 18 files changed, 1693 insertions(+), 122 deletions(-)
 create mode 100644 include/linux/netqueue.h
 create mode 100644 include/net/ptq.h
 create mode 100644 net/core/ptq.c

-- 
2.25.1