* [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Hello,

Although it is far from complete, this series provides the means and
infrastructure to let device drivers share packet meta data and
accelerations with XDP programs, such as hash, stripped vlan, checksum,
flow mark, packet header types, etc.

The series is still a work in progress; I am sending it now as an RFC in
order to get early feedback and to discuss the design, structures and UAPI.

For now the general idea is to help XDP programs accelerate, by providing
them with meta data that is already available from the HW or the device
driver, to save cpu cycles on packet header and data processing and
decision making. Aside from that, we want to avoid having a fixed size
meta data structure that wastes a lot of buffer space on stuff that the
xdp program might not need, like the current "elephant" SKB fields,
kidding :) ..

So my idea here is to provide a dynamic mechanism that allows XDP programs,
at xdp load time, to request only the specific meta data they need; the
device driver or netdevice will then provide it in a predefined order in
the xdp_buff/xdp_md data meta section.

And here is how it is done and what I would like to discuss:

1. The first patch adds the infrastructure to request predefined meta data
flags on xdp load, indicating that the XDP program is going to need them.

1.1) In this patch I am using the current u32 IFLA_XDP_FLAGS field.
TODO: this needs to be improved in order to allow more meta data flags;
maybe use a new dedicated flags field?

1.2) Device drivers that want to support xdp meta data should implement the
new XDP command XDP_QUERY_META_FLAGS and report the meta data flags they
can support.

1.3) The kernel will cross-check the requested flags against the device's
supported flags and will fail the xdp load in case of a discrepancy.

Question: do we want this? Or is it better to return the actual supported
flags to the program and let it decide?


2. This is the interesting part: the 2nd patch adds the meta data set/get
infrastructure that lets device drivers populate meta data into the
xdp_buff data meta area in a well defined, structured format, and lets the
xdp program read it according to that predefined format/structure. The idea
here is that the xdp program and the device driver share a static offsets
array that defines where each meta data item sits inside the xdp_buff data
meta area; this layout is decided once, on xdp load.

Enter struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and are assigned
only for the requested meta data.
Example: for an XDP program that requested the following flags:
flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

the offsets array will be:

xdp_md_info_arr mdi = {
        [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
        [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
        [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
        [XDP_DATA_META_CSUM_COMPLETE] = {.offset = 0, .present = 0},
}

For this example, the hash fields will always appear first, followed by the
vlan, in every xdp_md.

Once requested to provide xdp meta data, the device driver will use a
pre-built xdp_md_info_arr, built via xdp_md_info_build on xdp setup, which
tells the driver the offset of each meta data item.
The user space XDP program will use a similar xdp_md_info_arr to
statically know the offset of each meta data item.
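
To illustrate (this snippet is not part of the series), a minimal sketch of
what the XDP program side could look like for the hash + vlan example above,
using raw offsets instead of the patch 2 helpers:

/* Sketch only: assumes the program was attached with
 * XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN, so the hash sits at
 * offset 0 and the vlan right after it (see the mdi example above).
 * SEC() comes from samples/bpf bpf_helpers.h.
 */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp_md_hash_example")
int xdp_md_hash_example_prog(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data      = (void *)(long)ctx->data;
	struct xdp_md_hash *hash = data_meta;
	struct xdp_md_vlan *vlan = data_meta + sizeof(*hash);

	/* usual data_meta boundary check; vlan is the last meta data item */
	if ((void *)(vlan + 1) > data)
		return XDP_ABORTED;

	/* e.g. steer on the RX hash without touching packet data */
	return (hash->hash & 1) ? XDP_PASS : XDP_DROP;
}

Since the layout is known at compile time here, the offsets collapse to
constants; this is effectively what the work-around in the last sample
patch does.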

* Future meta data will be added to the end of the array with higher flag
values.

This patch also provides helper functions for device drivers to store meta
data into the xdp_buff, and helper functions for XDP programs to fetch
specific xdp meta data from the xdp_md buffer.

Question: currently the XDP program is responsible for building the static
meta data offsets array "xdp_md_info_arr", and the kernel builds it for the
netdevice. If we are going to choose this direction, should we somehow share
the same xdp_md_info_arr built by the kernel with the xdp program?

3. The last mlx5e patch is the actual showcase of how a device driver adds
support for xdp meta data; it is pretty simple and straightforward! The
first two mlx5e patches are just small refactorings that make the
xdp_md_info_arr and packet completion information available in the rx xdp
handlers data path.

4. Just added a small example to demonstrate how an XDP program can request
meta data on xdp load and how it can read it on the critical path.
Of course, more examples and some performance numbers are needed.
Exciting use cases such as:
  - using flow mark to allow fast white/black listing lookups.
  - using flow mark to accelerate DDoS prevention.
  - Generic data path: use the meta data from the xdp_buff to build SKBs! (Jesper's idea)
  - using the hash type to know the packet headers and encapsulation without
    touching the packet data at all.
  - using the packet hash to do RPS- and XPS-like cpu redirecting.
  - and many more.


More ideas:

From Jesper:
=========================
Give XDP richer information about the hash.
By reducing the 'ht' value to the PKT_HASH_TYPE's we are losing information.

I happen to know your hardware well; CQE_RSS_HTYPE_IP tells us if this
is IPv4 or IPv6.  And CQE_RSS_HTYPE_L4 tells us if this is TCP, UDP or
IPSEC. (And you have another bit telling if this is IPv6 with extension
headers).

If we don't want to invent our own xdp_hash_types, we can simply adopt
the RSS Hashing Types defined by Microsoft:
 https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-hashing-types

An interesting part of the RSS standard, is that the hash type can help
identify if this is a fragment. (XDP could use this info to avoid
touching payload and react, e.g. drop fragments, or redirect all
fragments to another CPU, or skip parsing in XDP and defer to network
stack via XDP_PASS).

By using the RSS standard, we do e.g. lose the ability to say this is
IPSEC traffic, even though your HW supports this.

I have tried to implement my own (non-dynamic) XDP RX-types UAPI here:
 https://marc.info/?l=linux-netdev&m=149512213531769
 https://marc.info/?l=linux-netdev&m=149512213631774
=========================

Thanks,
Saeed.

Saeed Mahameed (6):
  net: xdp: Add support for meta data flags requests
  net: xdp: RX meta data infrastructure
  net/mlx5e: Store xdp flags and meta data info
  net/mlx5e: Pass CQE to RX handlers
  net/mlx5e: Add XDP RX meta data support
  samples/bpf: Add meta data hash example to xdp_redirect_cpu

 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
 include/linux/netdevice.h                     |  10 ++
 include/net/xdp.h                             |   6 +
 include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
 include/uapi/linux/if_link.h                  |  20 ++-
 net/core/dev.c                                |  41 +++++++
 samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
 samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
 10 files changed, 354 insertions(+), 35 deletions(-)

-- 
2.17.0


* [RFC bpf-next 1/6] net: xdp: Add support for meta data flags requests
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

A user space application can request to enable a specific set of meta data
to be reported in every xdp buffer provided to the xdp program.

When meta data flags are requested, XDP devices must respond to the
XDP_QUERY_META_FLAGS command with all the meta data flags the device
actually supports, and the kernel will cross-check them against the
requested flags; in case of a discrepancy the xdp install will fail.

If the flags are supported, the device must guarantee to deliver all RX
packets with only the meta data requested in the meta data flags of the
xdp install operation.

The following flags are added, and can be provided via the netlink xdp
flags field.

+#define XDP_FLAGS_META_HASH            (1U << 16)
+#define XDP_FLAGS_META_MARK            (1U << 17)
+#define XDP_FLAGS_META_VLAN            (1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE   (1U << 19)

The format, device delivery methods and XDP program access to such meta
data is discussed in a later patch.
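
For reference, a rough sketch of the user space side, assuming libbpf's
existing bpf_set_link_xdp_fd() (the header location varies by tree):

#include <linux/if_link.h>	/* XDP_FLAGS_* */
#include "bpf/libbpf.h"		/* bpf_set_link_xdp_fd(); path may differ */

/* Attach prog_fd to ifindex and request RX hash + vlan meta data.
 * If the device cannot provide them, the attach is expected to fail.
 */
static int attach_with_meta(int ifindex, int prog_fd)
{
	__u32 xdp_flags = XDP_FLAGS_DRV_MODE |
			  XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

	return bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags);
}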

TODO: use a different flags field for XDP meta data, to make sure we
have more free bits.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |  9 +++++++++
 include/uapi/linux/if_link.h | 16 +++++++++++++++-
 net/core/dev.c               | 20 ++++++++++++++++++++
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..fc8b6ce48a0f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -812,6 +812,10 @@ enum bpf_netdev_command {
 	 * is equivalent to XDP_ATTACHED_DRV.
 	 */
 	XDP_QUERY_PROG,
+	/* Query Supported XDP_FLAGS_META_*, Called before new XDP program setup
+	 * and only if meta_flags were requested by the user to validate if the
+	 * device supports the requested flags, if not program setup will fail */
+	XDP_QUERY_META_FLAGS,
 	/* BPF program for offload callbacks, invoked at program load time. */
 	BPF_OFFLOAD_VERIFIER_PREP,
 	BPF_OFFLOAD_TRANSLATE,
@@ -842,6 +846,11 @@ struct netdev_bpf {
 			/* flags with which program was installed */
 			u32 prog_flags;
 		};
+		/* XDP_QUERY_META_FLAGS */
+		struct {
+			/* TODO u64 */
+			u32 meta_flags;
+		};
 		/* BPF_OFFLOAD_VERIFIER_PREP */
 		struct {
 			struct bpf_prog *prog;
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index cf01b6824244..dfb1e26bacef 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -911,8 +911,22 @@ enum {
 #define XDP_FLAGS_MODES			(XDP_FLAGS_SKB_MODE | \
 					 XDP_FLAGS_DRV_MODE | \
 					 XDP_FLAGS_HW_MODE)
+
+/* TODO : add new netlink xdp u64 meta_flags
+ * for meta data only
+ */
+#define XDP_FLAGS_META_HASH		(1U << 16)
+#define XDP_FLAGS_META_MARK		(1U << 17)
+#define XDP_FLAGS_META_VLAN		(1U << 18)
+#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
+#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+					 XDP_FLAGS_META_MARK      | \
+					 XDP_FLAGS_META_VLAN      | \
+					 XDP_FLAGS_META_CSUM_COMPLETE)
+
 #define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
-					 XDP_FLAGS_MODES)
+					 XDP_FLAGS_MODES             | \
+					 XDP_FLAGS_META_ALL)
 
 /* These are stored into IFLA_XDP_ATTACHED on dump. */
 enum {
diff --git a/net/core/dev.c b/net/core/dev.c
index a5aa1c7444e6..8a5cc2c731ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7295,6 +7295,21 @@ static u8 __dev_xdp_attached(struct net_device *dev, bpf_op_t bpf_op)
 	return xdp.prog_attached;
 }
 
+static bool __dev_xdp_meta_supported(struct net_device *dev,
+				     bpf_op_t bpf_op, u32 meta_flags)
+{
+	struct netdev_bpf xdp = {};
+
+	 /* Backward compatible, all devices support no meta_flags */
+	if (!meta_flags)
+		return true;
+
+	xdp.command = XDP_QUERY_META_FLAGS;
+	bpf_op(dev, &xdp);
+
+	return ((xdp.meta_flags & meta_flags) == meta_flags);
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7362,12 +7377,17 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		bpf_chk = generic_xdp_install;
 
 	if (fd >= 0) {
+		u32 meta_flags = (flags & XDP_FLAGS_META_ALL);
+
 		if (bpf_chk && __dev_xdp_attached(dev, bpf_chk))
 			return -EEXIST;
 		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) &&
 		    __dev_xdp_attached(dev, bpf_op))
 			return -EBUSY;
 
+		if (!__dev_xdp_meta_supported(dev, bpf_op, meta_flags))
+			return -EINVAL;
+
 		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
 					     bpf_op == ops->ndo_bpf);
 		if (IS_ERR(prog))
-- 
2.17.0


* [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

The idea of this patch is to define a well-known structure for the XDP meta
data fields' format and offset placement inside the xdp data meta buffer.

For that, the driver needs some static information to know what to provide
and where. Enter struct xdp_md_info and xdp_md_info_arr:

struct xdp_md_info {
       __u16 present:1;
       __u16 offset:15; /* offset from data_meta in xdp_md buffer */
};

/* XDP meta data offsets info array
 * present bit describes if a meta data is or will be present in xdp_md buff
 * offset describes where a meta data is or should be placed in xdp_md buff
 *
 * Kernel builds this array using xdp_md_info_build helper on demand.
 * User space builds it statically in the xdp program.
 */
typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];

Offsets in xdp_md_info_arr are always in ascending order and are assigned
only for the requested meta data.
Example: flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

xdp_md_info_arr mdi = {
	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
	[XDP_DATA_META_CSUM_COMPLETE] = {.offset = 0, .present = 0},
}

i.e. the hash fields will always appear first, followed by the vlan, in
every xdp_md.

Once requested to provide xdp meta data, the device driver will use a
pre-built xdp_md_info_arr, built via xdp_md_info_build on xdp setup, which
tells the driver the offset of each meta data item.

This patch also provides helper functions for device drivers to store meta
data into the xdp_buff, and helper functions for XDP programs to fetch
specific xdp meta data from the xdp_md buffer.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/linux/netdevice.h    |   1 +
 include/net/xdp.h            |   6 ++
 include/uapi/linux/bpf.h     | 114 +++++++++++++++++++++++++++++++++++
 include/uapi/linux/if_link.h |  16 +++--
 net/core/dev.c               |  21 +++++++
 5 files changed, 152 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fc8b6ce48a0f..172363310ade 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -838,6 +838,7 @@ struct netdev_bpf {
 			u32 flags;
 			struct bpf_prog *prog;
 			struct netlink_ext_ack *extack;
+			xdp_md_info_arr md_info;
 		};
 		/* XDP_QUERY_PROG */
 		struct {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 2deea7166a34..afe302613ae1 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff *xdp)
 	xdp->data_meta = xdp->data + 1;
 }
 
+static __always_inline void
+xdp_reset_data_meta(struct xdp_buff *xdp)
+{
+	xdp->data_meta = xdp->data_hard_start;
+}
+
 static __always_inline bool
 xdp_data_meta_unsupported(const struct xdp_buff *xdp)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59b19b6a40d7..e320e7421565 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2353,6 +2353,120 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 };
 
+enum {
+	XDP_DATA_META_HASH = 0,
+	XDP_DATA_META_MARK,
+	XDP_DATA_META_VLAN,
+	XDP_DATA_META_CSUM_COMPLETE,
+
+	XDP_DATA_META_MAX,
+};
+
+struct xdp_md_hash {
+	__u32 hash;
+	__u8  type;
+} __attribute__((aligned(4)));
+
+struct xdp_md_mark {
+	__u32 mark;
+} __attribute__((aligned(4)));
+
+struct xdp_md_vlan {
+	__u16 vlan;
+} __attribute__((aligned(4)));
+
+struct xdp_md_csum {
+	__u16 csum;
+} __attribute__((aligned(4)));
+
+static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
+	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
+	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
+	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
+	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE */
+};
+
+struct xdp_md_info {
+	__u16 present:1;
+	__u16 offset:15; /* offset from data_meta in xdp_md buffer */
+};
+
+/* XDP meta data offsets info array
+ * present bit describes if a meta data is or will be present in xdp_md buff
+ * offset describes where a meta data is or should be placed in xdp_md buff
+ *
+ * Kernel builds this array using xdp_md_info_build helper on demand.
+ * User space builds it statically in the xdp program.
+ */
+typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
+
+static __always_inline __u8
+xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
+{
+	return mdi[meta_data].present;
+}
+
+static __always_inline void
+xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32 hash, __u8 htype)
+{
+	struct xdp_md_hash *hash_md;
+
+	hash_md = (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+	hash_md->hash = hash;
+	hash_md->type = htype;
+}
+
+static __always_inline struct xdp_md_hash *
+xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_mark(xdp_md_info_arr mdi, void *data_meta, __u32 mark)
+{
+	struct xdp_md_mark *mark_md;
+
+	mark_md = (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+	mark_md->mark = mark;
+}
+
+static __always_inline struct xdp_md_mark *
+xdp_data_meta_get_mark(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_mark *)((char*)data_meta + mdi[XDP_DATA_META_MARK].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_vlan(xdp_md_info_arr mdi, void *data_meta, __u16 vlan_tci)
+{
+	struct xdp_md_vlan *vlan_md;
+
+	vlan_md = (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+	vlan_md->vlan = vlan_tci;
+}
+
+static __always_inline struct xdp_md_vlan *
+xdp_data_meta_get_vlan(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_vlan *)((char*)data_meta + mdi[XDP_DATA_META_VLAN].offset);
+}
+
+static __always_inline void
+xdp_data_meta_set_csum_complete(xdp_md_info_arr mdi, void *data_meta, __u16 csum)
+{
+	struct xdp_md_csum *csum_md;
+
+	csum_md = (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+	csum_md->csum = csum;
+}
+
+static __always_inline struct xdp_md_csum *
+xdp_data_meta_get_csum_complete(xdp_md_info_arr mdi, void *data_meta)
+{
+	return (struct xdp_md_csum *)((char*)data_meta + mdi[XDP_DATA_META_CSUM_COMPLETE].offset);
+}
+
 enum sk_action {
 	SK_DROP = 0,
 	SK_PASS,
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index dfb1e26bacef..5bdc07bfad35 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -4,6 +4,7 @@
 
 #include <linux/types.h>
 #include <linux/netlink.h>
+#include <linux/bpf.h>
 
 /* This struct should be in sync with struct rtnl_link_stats64 */
 struct rtnl_link_stats {
@@ -913,13 +914,16 @@ enum {
 					 XDP_FLAGS_HW_MODE)
 
 /* TODO : add new netlink xdp u64 meta_flags
- * for meta data only
+ * for meta data only & build according XDP_DATA_META_* enums
  */
-#define XDP_FLAGS_META_HASH		(1U << 16)
-#define XDP_FLAGS_META_MARK		(1U << 17)
-#define XDP_FLAGS_META_VLAN		(1U << 18)
-#define XDP_FLAGS_META_CSUM_COMPLETE	(1U << 19)
-#define XDP_FLAGS_META_ALL		(XDP_FLAGS_META_HASH      | \
+
+#define XDP_FLAGS_META_SHIFT            (16)
+#define XDP_FLAG_META(FLAG)             ((1U << XDP_DATA_META_##FLAG) << XDP_FLAGS_META_SHIFT)
+#define XDP_FLAGS_META_HASH             XDP_FLAG_META(HASH)
+#define XDP_FLAGS_META_MARK             XDP_FLAG_META(MARK)
+#define XDP_FLAGS_META_VLAN             XDP_FLAG_META(VLAN)
+#define XDP_FLAGS_META_CSUM_COMPLETE    XDP_FLAG_META(CSUM_COMPLETE)
+#define XDP_FLAGS_META_ALL              (XDP_FLAGS_META_HASH      | \
 					 XDP_FLAGS_META_MARK      | \
 					 XDP_FLAGS_META_VLAN      | \
 					 XDP_FLAGS_META_CSUM_COMPLETE)
diff --git a/net/core/dev.c b/net/core/dev.c
index 8a5cc2c731ec..141dd6632b00 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7310,6 +7310,25 @@ static bool __dev_xdp_meta_supported(struct net_device *dev,
 	return ((xdp.meta_flags & meta_flags) == meta_flags);
 }
 
+static void xdp_md_info_build(xdp_md_info_arr mdi, u32 xdp_flags_meta)
+{
+	u16 offset = 0;
+	int i;
+
+	for (i = 0; i < XDP_DATA_META_MAX; i++) {
+		u32 meta_data_flag = BIT(i) << XDP_FLAGS_META_SHIFT;
+
+		if (!(meta_data_flag & xdp_flags_meta)) {
+			mdi[i].present = 0;
+			continue;
+		}
+
+		mdi[i].present = 1;
+		mdi[i].offset = offset;
+		offset += xdp_md_size[i];
+	}
+}
+
 static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 			   struct netlink_ext_ack *extack, u32 flags,
 			   struct bpf_prog *prog)
@@ -7325,6 +7344,8 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 	xdp.flags = flags;
 	xdp.prog = prog;
 
+	xdp_md_info_build(xdp.md_info, flags & XDP_FLAGS_META_ALL);
+
 	return bpf_op(dev, &xdp);
 }
 
-- 
2.17.0


* [RFC bpf-next 3/6] net/mlx5e: Store xdp flags and meta data info
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Make the xdp flags and meta data info available to the RQ's data path,
to enable xdp meta data offloads in the next two patches.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 10 +++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 52 +++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  2 +-
 3 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index eb9eb7aa953a..5893acfae307 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -236,6 +236,12 @@ enum mlx5e_priv_flag {
 #define MLX5E_MAX_BW_ALLOC 100 /* Max percentage of BW allocation */
 #endif
 
+struct mlx5e_xdp_info {
+	struct bpf_prog      *prog;
+	u32                   flags;
+	xdp_md_info_arr       md_info;
+};
+
 struct mlx5e_params {
 	u8  log_sq_size;
 	u8  rq_wq_type;
@@ -257,7 +263,7 @@ struct mlx5e_params {
 	bool tx_dim_enabled;
 	u32 lro_timeout;
 	u32 pflags;
-	struct bpf_prog *xdp_prog;
+	struct mlx5e_xdp_info xdp;
 	unsigned int sw_mtu;
 	int hard_mtu;
 };
@@ -566,7 +572,7 @@ struct mlx5e_rq {
 	struct net_dim         dim; /* Dynamic Interrupt Moderation */
 
 	/* XDP */
-	struct bpf_prog       *xdp_prog;
+	struct mlx5e_xdp_info  xdp;
 	unsigned int           hw_mtu;
 	struct mlx5e_xdpsq     xdpsq;
 	DECLARE_BITMAP(flags, 8);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 56c1b6f5593e..8debae6b9cab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -96,7 +96,7 @@ bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
 
 static u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params)
 {
-	if (!params->xdp_prog) {
+	if (!params->xdp.prog) {
 		u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 		u16 rq_headroom = MLX5_RX_HEADROOM + NET_IP_ALIGN;
 
@@ -170,7 +170,7 @@ static u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
 static u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
 				 struct mlx5e_params *params)
 {
-	u16 linear_rq_headroom = params->xdp_prog ?
+	u16 linear_rq_headroom = params->xdp.prog ?
 		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
 	bool is_linear_skb;
 
@@ -205,7 +205,7 @@ bool mlx5e_striding_rq_possible(struct mlx5_core_dev *mdev,
 {
 	return mlx5e_check_fragmented_striding_rq_cap(mdev) &&
 		!MLX5_IPSEC_DEV(mdev) &&
-		!(params->xdp_prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
+		!(params->xdp.prog && !mlx5e_rx_mpwqe_is_linear_skb(mdev, params));
 }
 
 void mlx5e_set_rq_type(struct mlx5_core_dev *mdev, struct mlx5e_params *params)
@@ -489,11 +489,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->mdev    = mdev;
 	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 	rq->stats   = &c->priv->channel_stats[c->ix].rq;
+	rq->xdp     = params->xdp;
 
-	rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL;
-	if (IS_ERR(rq->xdp_prog)) {
-		err = PTR_ERR(rq->xdp_prog);
-		rq->xdp_prog = NULL;
+	rq->xdp.prog = params->xdp.prog ? bpf_prog_inc(params->xdp.prog) : NULL;
+	if (IS_ERR(rq->xdp.prog)) {
+		err = PTR_ERR(rq->xdp.prog);
+		rq->xdp.prog = NULL;
 		goto err_rq_wq_destroy;
 	}
 
@@ -501,7 +502,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (err < 0)
 		goto err_rq_wq_destroy;
 
-	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
+	rq->buff.map_dir = rq->xdp.prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
 	pool_size = 1 << params->log_rq_mtu_frames;
 
@@ -679,8 +680,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	}
 
 err_rq_wq_destroy:
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
 		page_pool_destroy(rq->page_pool);
@@ -693,8 +694,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 {
 	int i;
 
-	if (rq->xdp_prog)
-		bpf_prog_put(rq->xdp_prog);
+	if (rq->xdp.prog)
+		bpf_prog_put(rq->xdp.prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
 	if (rq->page_pool)
@@ -1906,7 +1907,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	c->netdev   = priv->netdev;
 	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
 	c->num_tc   = params->num_tc;
-	c->xdp      = !!params->xdp_prog;
+	c->xdp      = !!params->xdp.prog;
 	c->stats    = &priv->channel_stats[ix].ch;
 
 	mlx5_vector2eqn(priv->mdev, ix, &eqn, &irq);
@@ -4090,12 +4091,12 @@ static void mlx5e_tx_timeout(struct net_device *dev)
 	queue_work(priv->wq, &priv->tx_timeout_work);
 }
 
-static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+static int mlx5e_xdp_set(struct net_device *netdev, struct mlx5e_xdp_info *xdp)
 {
 	struct mlx5e_priv *priv = netdev_priv(netdev);
-	struct bpf_prog *old_prog;
-	int err = 0;
+	struct bpf_prog *old_prog, *prog = xdp->prog;
 	bool reset, was_opened;
+	int err = 0;
 	int i;
 
 	mutex_lock(&priv->state_lock);
@@ -4114,7 +4115,7 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 
 	was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
 	/* no need for full reset when exchanging programs */
-	reset = (!priv->channels.params.xdp_prog || !prog);
+	reset = (!priv->channels.params.xdp.prog || !prog);
 
 	if (was_opened && reset)
 		mlx5e_close_locked(netdev);
@@ -4132,10 +4133,12 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	/* exchange programs, extra prog reference we got from caller
 	 * as long as we don't fail from this point onwards.
 	 */
-	old_prog = xchg(&priv->channels.params.xdp_prog, prog);
+	old_prog = xchg(&priv->channels.params.xdp.prog, prog);
 	if (old_prog)
 		bpf_prog_put(old_prog);
 
+	priv->channels.params.xdp = *xdp;
+
 	if (reset) /* change RQ type according to priv->xdp_prog */
 		mlx5e_set_rq_type(priv->mdev, &priv->channels.params);
 
@@ -4155,7 +4158,8 @@ static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 		napi_synchronize(&c->napi);
 		/* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
 
-		old_prog = xchg(&c->rq.xdp_prog, prog);
+		old_prog = xchg(&c->rq.xdp.prog, prog);
+		c->rq.xdp = *xdp;
 
 		set_bit(MLX5E_RQ_STATE_ENABLED, &c->rq.state);
 		/* napi_schedule in case we have missed anything */
@@ -4177,7 +4181,7 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	u32 prog_id = 0;
 
 	mutex_lock(&priv->state_lock);
-	xdp_prog = priv->channels.params.xdp_prog;
+	xdp_prog = priv->channels.params.xdp.prog;
 	if (xdp_prog)
 		prog_id = xdp_prog->aux->id;
 	mutex_unlock(&priv->state_lock);
@@ -4187,9 +4191,15 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
+	struct mlx5e_xdp_info xdp_info;
+
 	switch (xdp->command) {
 	case XDP_SETUP_PROG:
-		return mlx5e_xdp_set(dev, xdp->prog);
+		xdp_info.prog  = xdp->prog;
+		xdp_info.flags = xdp->flags;
+		memcpy(xdp_info.md_info, xdp->md_info, sizeof(xdp->md_info));
+
+		return mlx5e_xdp_set(dev, &xdp_info);
 	case XDP_QUERY_PROG:
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d3a1dd20e41d..d12577c17011 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -925,7 +925,7 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
 				    void *va, u16 *rx_headroom, u32 *len)
 {
-	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
 	u32 act;
 	int err;
-- 
2.17.0


* [RFC bpf-next 4/6] net/mlx5e: Pass CQE to RX handlers
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

The CQE has all the meta data information from the HW.
Make it available to the driver's xdp handlers.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |  9 ++++++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 15 +++++++++------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 5893acfae307..98bb315fc8a8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -505,7 +505,8 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+			       struct mlx5_cqe64 *cqe);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			 struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
@@ -901,10 +902,12 @@ void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe);
 struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 			  struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index d12577c17011..e37f9747a0e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -923,7 +923,8 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
-				    void *va, u16 *rx_headroom, u32 *len)
+				    void *va, u16 *rx_headroom,
+				    u32 *len, struct mlx5_cqe64 *cqe)
 {
 	struct bpf_prog *prog = READ_ONCE(rq->xdp.prog);
 	struct xdp_buff xdp;
@@ -1012,7 +1013,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 	}
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt, cqe);
 	rcu_read_unlock();
 	if (consumed)
 		return NULL; /* page/packet was consumed by XDP */
@@ -1155,7 +1156,8 @@ void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				   struct mlx5_cqe64 *cqe)
 {
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
@@ -1202,7 +1204,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				u16 cqe_bcnt, u32 head_offset, u32 page_idx,
+				struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_dma_info *di = &wi->umr.dma_info[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -1221,7 +1224,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 	prefetch(data);
 
 	rcu_read_lock();
-	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32);
+	consumed = mlx5e_xdp_handle(rq, di, va, &rx_headroom, &cqe_bcnt32, cqe);
 	rcu_read_unlock();
 	if (consumed) {
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
@@ -1268,7 +1271,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	cqe_bcnt = mpwrq_get_cqe_byte_cnt(cqe);
 
 	skb = rq->mpwqe.skb_from_cqe_mpwrq(rq, wi, cqe_bcnt, head_offset,
-					   page_idx);
+					   page_idx, cqe);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
-- 
2.17.0


* [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Implement XDP meta data hash and vlan support.

1. On xdp setup ndo: add support for XDP_QUERY_META_FLAGS and return the
two supported flags.
2. Use the xdp_data_meta_set_hash and xdp_data_meta_set_vlan helpers to
fill in the meta data fields.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 ++++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 30 ++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8debae6b9cab..3d1066a953cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4189,6 +4189,9 @@ static u32 mlx5e_xdp_query(struct net_device *dev)
 	return prog_id;
 }
 
+#define MLX5E_SUPPORTED_XDP_META_FLAGS  \
+             (XDP_FLAGS_META_HASH  | XDP_FLAGS_META_VLAN)
+
 static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct mlx5e_xdp_info xdp_info;
@@ -4204,6 +4207,9 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 		xdp->prog_id = mlx5e_xdp_query(dev);
 		xdp->prog_attached = !!xdp->prog_id;
 		return 0;
+	case XDP_QUERY_META_FLAGS:
+		xdp->meta_flags = MLX5E_SUPPORTED_XDP_META_FLAGS;
+		return 0;
 	default:
 		return -EINVAL;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index e37f9747a0e3..1f3e934d0dd8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -920,6 +920,29 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_rq *rq,
 	return true;
 }
 
+static void
+mlx5e_xdp_fill_data_meta(xdp_md_info_arr mdi,  void *data_meta, struct mlx5_cqe64 *cqe)
+{
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_HASH))
+	{
+		u8 cht = cqe->rss_hash_type;
+		int ht = (cht & CQE_RSS_HTYPE_L4) ? PKT_HASH_TYPE_L4 :
+			 (cht & CQE_RSS_HTYPE_IP) ? PKT_HASH_TYPE_L3 :
+						    PKT_HASH_TYPE_NONE;
+		u32 hash = be32_to_cpu(cqe->rss_hash_result);
+
+		xdp_data_meta_set_hash(mdi, data_meta, hash, ht);
+	}
+
+	if (xdp_data_meta_present(mdi, XDP_DATA_META_VLAN))
+	{
+		u16 vlan = (!!cqe_has_vlan(cqe) * VLAN_TAG_PRESENT) |
+			   be16_to_cpu(cqe->vlan_info);
+
+		xdp_data_meta_set_vlan(mdi, data_meta, vlan);
+	}
+}
+
 /* returns true if packet was consumed by xdp */
 static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				    struct mlx5e_dma_info *di,
@@ -935,11 +958,16 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		return false;
 
 	xdp.data = va + *rx_headroom;
-	xdp_set_data_meta_invalid(&xdp);
 	xdp.data_end = xdp.data + *len;
 	xdp.data_hard_start = va;
 	xdp.rxq = &rq->xdp_rxq;
 
+	if (rq->xdp.flags & XDP_FLAGS_META_ALL) {
+		xdp_reset_data_meta(&xdp);
+		mlx5e_xdp_fill_data_meta(rq->xdp.md_info, xdp.data_meta, cqe);
+	} else
+		xdp_set_data_meta_invalid(&xdp);
+
 	act = bpf_prog_run_xdp(prog, &xdp);
 	switch (act) {
 	case XDP_PASS:
-- 
2.17.0


* [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
From: Saeed Mahameed @ 2018-06-27  2:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov, Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

Add a new program (prog_num = 4) that will not parse packets and will
use the meta data hash to spread/redirect traffic into different cpus.

For the new program we set on bpf_set_link_xdp_fd:
	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;

On mlx5 it will succeed since mlx5 already supports these flags.

The new program will read the value of the hash from the data_meta
pointer from the xdp_md and will use it to compute the destination cpu.

Note: I didn't test this patch to show redirect works with the hash!
I only used it to see that the hash and vlan values are set correctly
by the driver and can be seen by the xdp program.

* I faced some difficulties reading the hash value using the helper
functions defined in the previous patches, but once I used the same logic
without these functions it worked! Will have to figure this out later.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
 samples/bpf/xdp_redirect_cpu_kern.c | 67 +++++++++++++++++++++++++++++
 samples/bpf/xdp_redirect_cpu_user.c |  7 +++
 2 files changed, 74 insertions(+)

diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
index 303e9e7161f3..d6b3f55f342a 100644
--- a/samples/bpf/xdp_redirect_cpu_kern.c
+++ b/samples/bpf/xdp_redirect_cpu_kern.c
@@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md *ctx)
 	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
 }
 
+#if 0
+xdp_md_info_arr mdi = {
+	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
+	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
+};
+#endif
+
+SEC("xdp_cpu_map4_hash_separate")
+int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
+{
+	void *data_meta = (void *)(long)ctx->data_meta;
+	void *data_end  = (void *)(long)ctx->data_end;
+	void *data      = (void *)(long)ctx->data;
+	struct xdp_md_hash *hash;
+	struct xdp_md_vlan *vlan;
+	struct datarec *rec;
+	u32 cpu_dest = 0;
+	u32 cpu_idx = 0;
+	u32 *cpu_lookup;
+	u32 key = 0;
+
+	/* Count RX packet in map */
+	rec = bpf_map_lookup_elem(&rx_cnt, &key);
+	if (!rec)
+		return XDP_ABORTED;
+	rec->processed++;
+
+	/* for some reason this code fails to be verified */
+#if 0
+	hash = xdp_data_meta_get_hash(mdi, data_meta);
+	if (hash + 1 > data)
+		return XDP_ABORTED;
+
+	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
+	if (vlan + 1 > data)
+		return XDP_ABORTED;
+#endif
+
+	/* Work around for the above code */
+	hash = data_meta; /* since we know hash will appear first */
+        if (hash + 1 > data)
+		return XDP_ABORTED;
+
+#if 0
+	// Just for testing
+	/* We know that vlan will appear after the hash */
+	vlan = (void *)((char *)data_meta + sizeof(*hash));
+	if (vlan + 1 > data) {
+		return XDP_ABORTED;
+	}
+#endif
+
+	cpu_idx = reciprocal_scale(hash->hash, MAX_CPUS);
+
+	cpu_lookup = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
+	if (!cpu_lookup)
+		return XDP_ABORTED;
+	cpu_dest = *cpu_lookup;
+
+	if (cpu_dest >= MAX_CPUS) {
+		rec->issue++;
+		return XDP_ABORTED;
+	}
+
+	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
+}
+
 SEC("xdp_cpu_map4_ddos_filter_pktgen")
 int  xdp_prognum4_ddos_filter_pktgen(struct xdp_md *ctx)
 {
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index f6efaefd485b..3429215d5a7b 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -679,6 +679,13 @@ int main(int argc, char **argv)
 		return EXIT_FAIL_OPTION;
 	}
 
+	/*
+	 * prog_num 4 requires xdp meta data hash
+	 * Vlan is not required but added just for testing..
+	 */
+	if (prog_num == 4)
+		xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
+
 	/* Remove XDP program when program is interrupted */
 	signal(SIGINT, int_exit);
 
-- 
2.17.0


* Re: [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
From: Jesper Dangaard Brouer @ 2018-06-27 10:59 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Daniel Borkmann, neerav.parikh, pjwaskiewicz,
	ttoukan.linux, Tariq Toukan, alexander.h.duyck,
	peter.waskiewicz.jr, Opher Reviv, Rony Efraim, netdev,
	Saeed Mahameed, brouer

On Tue, 26 Jun 2018 19:46:15 -0700
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> Add a new program (prog_num = 4) that will not parse packets and will
> use the meta data hash to spread/redirect traffic into different cpus.

You cannot "steal" prognum 4, as it is already used for
"xdp_prognum4_ddos_filter_pktgen".  Please append your new prog as #5.

> For the new program we set on bpf_set_link_xdp_fd:
> 	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> On mlx5 it will succeed since mlx5 already supports these flags.
> 
> The new program will read the value of the hash from the data_meta
> pointer from the xdp_md and will use it to compute the destination cpu.
> 
> Note: I didn't test this patch to show redirect works with the hash!
> I only used it to see that the hash and vlan values are set correctly
> by the driver and can be seen by the xdp program.
> 
> * I faced some difficulties to read the hash value using the helper
> functions defined in the previous patches, but once i used the same logic
> with out these functions it worked ! Will have to figure this out later.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  samples/bpf/xdp_redirect_cpu_kern.c | 67 +++++++++++++++++++++++++++++
>  samples/bpf/xdp_redirect_cpu_user.c |  7 +++
>  2 files changed, 74 insertions(+)
> 
> diff --git a/samples/bpf/xdp_redirect_cpu_kern.c b/samples/bpf/xdp_redirect_cpu_kern.c
> index 303e9e7161f3..d6b3f55f342a 100644
> --- a/samples/bpf/xdp_redirect_cpu_kern.c
> +++ b/samples/bpf/xdp_redirect_cpu_kern.c
> @@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md *ctx)
>  	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
>  }
>  
> +#if 0
> +xdp_md_info_arr mdi = {
> +	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> +	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
> +};
> +#endif

Sorry, no global variables avail in the generated BPF byte-code.

> +SEC("xdp_cpu_map4_hash_separate")
> +int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
> +{
> +	void *data_meta = (void *)(long)ctx->data_meta;
> +	void *data_end  = (void *)(long)ctx->data_end;
> +	void *data      = (void *)(long)ctx->data;
> +	struct xdp_md_hash *hash;
> +	struct xdp_md_vlan *vlan;
> +	struct datarec *rec;
> +	u32 cpu_dest = 0;
> +	u32 cpu_idx = 0;
> +	u32 *cpu_lookup;
> +	u32 key = 0;
> +
> +	/* Count RX packet in map */
> +	rec = bpf_map_lookup_elem(&rx_cnt, &key);
> +	if (!rec)
> +		return XDP_ABORTED;
> +	rec->processed++;
> +
> +	/* for some reason this code fails to be verified */
> +#if 0
> +	hash = xdp_data_meta_get_hash(mdi, data_meta);

This will not work, because it is not implemented as a proper
BPF-helper call.

First, you currently store the xdp_md_info_arr inside the driver, which
makes it hard for a helper to access this.  For helper access, we could
store this in xdp_rxq_info.

Second, in your design it looks like you are introducing a helper per
possible item in xdp_md_info_arr.  I think we can reduce this to a
single helper that takes an XDP_DATA_META_xxx flag and returns an
offset.  (The helper could return a direct pointer, but I don't think
the verifier can handle that, as it needs to "see" this is related to the
data_meta pointer, and that we do the proper boundary checks.)

The BPF prog already has direct memory access to the data_meta area,
and all it really needs is an offset.  Sure, the XDP/bpf programmer
could just calculate these offsets as constants, and remember to load
the XDP prog with the flags that correspond to the calculated offsets.

But I think we can do something even smarter... 

It should be possible to convert/patch the BPF instructions of the
helper call that returns an offset, to avoid the call and instead
either (1) provide the offset as a constant/IMM or (2) emit BPF
instructions that do the lookup in xdp_md_info_arr.
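
Roughly, from the BPF program side such a single helper could be used like
below; the helper name and signature are made up, nothing like this exists
today:

	/* Hypothetical helper:
	 *   long bpf_xdp_meta_offset(struct xdp_md *ctx, __u32 md_type);
	 * returns the offset of @md_type inside the data_meta area, or a
	 * negative value when it was not requested/provided.
	 * data_meta/data are the usual pointers derived from ctx.
	 */
	long off = bpf_xdp_meta_offset(ctx, XDP_DATA_META_HASH);
	struct xdp_md_hash *hash;

	if (off < 0)
		return XDP_PASS;

	hash = data_meta + off;
	if ((void *)(hash + 1) > data)	/* normal data_meta boundary check */
		return XDP_ABORTED;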


> +	if (hash + 1 > data)
> +		return XDP_ABORTED;
> +
> +	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
> +	if (vlan + 1 > data)
> +		return XDP_ABORTED;
> +#endif
> +
> +	/* Work around for the above code */
> +	hash = data_meta; /* since we know hash will appear first */
> +        if (hash + 1 > data)
> +		return XDP_ABORTED;
> +
> +#if 0
> +	// Just for testing
> +	/* We know that vlan will appear after the hash */
> +	vlan = (void *)((char *)data_meta + sizeof(*hash));
> +	if (vlan + 1 > data) {
> +		return XDP_ABORTED;
> +	}
> +#endif

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
From: Jesper Dangaard Brouer @ 2018-06-27 14:15 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexei Starovoitov, Daniel Borkmann, neerav.parikh, pjwaskiewicz,
	ttoukan.linux, Tariq Toukan, alexander.h.duyck,
	peter.waskiewicz.jr, Opher Reviv, Rony Efraim, netdev,
	Saeed Mahameed, brouer

On Tue, 26 Jun 2018 19:46:11 -0700
Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 2deea7166a34..afe302613ae1 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff *xdp)
>  	xdp->data_meta = xdp->data + 1;
>  }
>  
> +static __always_inline void
> +xdp_reset_data_meta(struct xdp_buff *xdp)
> +{
> +	xdp->data_meta = xdp->data_hard_start;
> +}

This is WRONG ... it should be:

 xdp->data_meta = xdp->data;

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* RE: [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
From: Parikh, Neerav @ 2018-06-27 16:42 UTC (permalink / raw)
  To: Saeed Mahameed, Jesper Dangaard Brouer, Alexei Starovoitov,
	Daniel Borkmann
  Cc: pjwaskiewicz, ttoukan.linux, Tariq Toukan, Duyck, Alexander H,
	Waskiewicz Jr, Peter, Opher Reviv, Rony Efraim, netdev,
	Saeed Mahameed

Thanks Saeed for starting this thread  :)
My comments inline. 

> -----Original Message-----
> From: Saeed Mahameed [mailto:saeedm@dev.mellanox.co.il]
> Sent: Tuesday, June 26, 2018 7:46 PM
> To: Jesper Dangaard Brouer <brouer@redhat.com>; Alexei Starovoitov
> <alexei.starovoitov@gmail.com>; Daniel Borkmann
> <borkmann@iogearbox.net>
> Cc: Parikh, Neerav <neerav.parikh@intel.com>; pjwaskiewicz@gmail.com;
> ttoukan.linux@gmail.com; Tariq Toukan <tariqt@mellanox.com>; Duyck,
> Alexander H <alexander.h.duyck@intel.com>; Waskiewicz Jr, Peter
> <peter.waskiewicz.jr@intel.com>; Opher Reviv <opher@mellanox.com>; Rony
> Efraim <ronye@mellanox.com>; netdev@vger.kernel.org; Saeed Mahameed
> <saeedm@mellanox.com>
> Subject: [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP)
> 
> Hello,
> 
> Although it is far from being close to completion, this series provides
> the means and infrastructure to enable device drivers to share packets
> meta data information and accelerations with XDP programs,
> such as hash, stripped vlan, checksum, flow mark, packet header types,
> etc ..
> 
> The series is still work in progress and sending it now as RFC in order
> to get early feedback and to discuss the design, structures and UAPI.
> 
> For now the general idea is to help XDP programs accelerate, and provide
> them with meta data that already available from the HW or device driver
> to save cpu cycles on packet headers and data processing and decision
> making, aside of that we want to avoid having a fixed size meta data
> structure that wastes a lot of buffer space for stuff that the xdp program
> might not need, like the current "elephant" SKB fields, kidding :) ..
> 
> So my idea here is to provide a dynamic mechanism to allow XDP programs
> on xdp load to request only the specific meta data they need, and
> the device driver or netdevice will provide them in a predefined order
> in the xdp_buff/xdp_md data meta section.
>
> And here is how it is done and what i would like to discuss:
> 
> 1. In the first patch: added the infrastructure to request predefined meta data
> flags on xdp load indicating that the XDP program is going to need them.
> 
> 1.1) In this patch I am using the current u32 bit IFLA_XDP_FLAGS,
> TODO: this needs to be improved in order to allow more meta data flags,
> maybe use a new dedicated flags ?
> 
> 1.2) Device driver that want to support xdp meta data should implement
> new XDP command XDP_QUERY_META_FLAGS and report the meta data flags
> they
> can support.
> 
> 1.3) the kernel will cross check the requested flags with the device's
> supported flags and will fail the xdp load in case of discrepancy.
> 
> Question : do we want this ? or is it better to return to the program
> the actual supported flags and let it decide ?
> 
>
The work we are doing in this direction does not assume any specific flags;
instead, the XDP program requests the specific "meta data" it needs, and if
the driver (or HW) can provide it, the program is allowed to load.
If we put the flags and capabilities in the kernel, then it will depend on
the control program that loads the XDP program to pass on that information.
If the XDP program carries that request built-in, say via an ELF "section"
(similar to maps), then the program can be loaded independently and knows
what kind of meta data it wants and receives (a rough sketch of the idea
follows below).
If the meta data is not supported by the device (driver or software mode)
then that would perhaps fail the program load.
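
A minimal sketch of what such an in-program declaration might look like,
purely for illustration: the "xdp_metadata" section name and the request
struct are made up here, only the XDP_FLAGS_META_* names are taken from
this RFC, and nothing below is existing API:

struct xdp_md_request {
	__u32 flags;	/* meta data the program needs */
};

/* Placed in a dedicated ELF section, similar to how maps are declared,
 * so the loader can hand the request to the driver without a separate
 * control program.  SEC() is the usual helper macro from bpf_helpers.h.
 */
struct xdp_md_request md_request SEC("xdp_metadata") = {
	.flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN,
};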
 
> 2. This is the interesting part: in the 2nd patch Added the meta data
> set/get infrastructure to allow device drivers populate them into xdp_buff
> data meta in a well defined and structured format and let the xdp program
> read them according to the predefined format/structure, the idea here is
> that the xdp program and the device driver will share a static information
> offsets array that will define where each meta data will sit inside
> xdp_buff data meta, such structure will be decided once on xdp load.
> 
> Enters struct xdp_md_info and xdp_md_info_arr:
> 
> struct xdp_md_info {
>        __u16 present:1;
>        __u16 offset:15; /* offset from data_meta in xdp_md buffer */
> };
We were trying not to define a generic structure if possible, and were
working towards a model similar to current usage, where the XDP program
produces meta data that is consumed by the eBPF program on the TC classifier
without hard-coding any structure. It is purely a producer-consumer model,
with the kernel helping transfer that data from one end to the other rather
than providing a structure.
So the NIC produces the meta data requested by the XDP program, as opposed
to producing some meta data that then gets translated in the driver into a
generic structure.


> 
> /* XDP meta data offsets info array
>  * present bit describes if a meta data is or will be present in xdp_md buff
>  * offset describes where a meta data is or should be placed in xdp_md buff
>  *
>  * Kernel builds this array using xdp_md_info_build helper on demand.
>  * User space builds it statically in the xdp program.
>  */
> typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> 
> Offsets in xdp_md_info_arr are always in ascending order and only for
> requested meta data:
> example : for XDP prgram that requested the following flags:
> flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> the offsets array will be:
> 
> xdp_md_info_arr mdi = {
>         [XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
>         [XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
>         [XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present
> = 1},
>         [XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> }
> 
> For this example: hash fields will always appear first then the vlan for every
> xdp_md.
> 
> Once requested to provide xdp meta data, device driver will use a pre-built
> xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> xdp_md_info_arr will tell the driver what is the offset of each meta data.
> The user space XDP program will use a similar xdp_md_info_arr to
> statically know what is the offset of each meta data.
> 
> *For future meta data they will be added to the end of the array with
> higher flags value.
> 
> This patch also provides helper functions for the device drivers to store
> meta data into xdp_buff, and helper function for XDP programs to fetch
> specific xdp meta data from xdp_md buffer.
> 
> Question: currently the XDP program is responsible to build the static
> meta data offsets array "xdp_md_info_arr" and the kernel will build it
> for the netdevice, if we are going to choose this direction, should we
> somehow share the same xdp_md_info_arr built by the kernel with the xdp
> program ?
> 
> 3. The last mlx5e patch is the actual show case of how the device driver
> will add the support for xdp meta data, it is pretty simple and straight
> forward ! The first two mlx5e patches are just small refactoring to make
> the xdp_md_info_arr and packet completion information available in the rx
> xdp handlers data path.
> 
> 4. Just added a small example to demonstrate how XDP program can request
> meta data on xdp load and how it can read them on the critical path.
> of course more examples are needed and some performance numbers.
> Exciting use cases such as:
>   - using flow mark to allow fast white/black listing lookups.
>   - flow mark to accelerate DDos prevention.
>   - Generic Data path: Use the meta data from the xdp_buff to build SKBs
> !(Jesper's Idea)
>   - using hash type to know the packet headers and encapsulation without
>     touching the packet data at all.
>   - using packet hash to do RPS and XPS like cpu redirecting.
>   - and many more.
> 
> 
> More ideas:
> 
> From Jesper:
> =========================
> Give XDP more rich information about the hash,
> By reducing the 'ht' value to the PKT_HASH_TYPE's we are loosing information.
> 
> I happen to know your hardware well; CQE_RSS_HTYPE_IP tell us if this
> is IPv4 or IPv6.  And CQE_RSS_HTYPE_L4 tell us if this is TCP, UDP or
> IPSEC. (And you have another bit telling of this is IPv6 with extension
> headers).
> 
Yes, commodity NICs have a lot of rich information about packet parsing and
protocol identification. It would be worth exposing this to XDP programs so
that they can take advantage of the work the NIC has already done rather
than redoing all of it in the XDP program.
Basically, that will allow XDP programs to focus more on the "business
logic" of whatever they decide to do with the packet.


> If we don't want to invent our own xdp_hash_types, we can simply adopt
> the RSS Hashing Types defined by Microsoft:
>  https://docs.microsoft.com/en-us/windows-hardware/drivers/network/rss-
> hashing-types
> 
> An interesting part of the RSS standard, is that the hash type can help
> identify if this is a fragment. (XDP could use this info to avoid
> touching payload and react, e.g. drop fragments, or redirect all
> fragments to another CPU, or skip parsing in XDP and defer to network
> stack via XDP_PASS).
> 
Yes, and it would also be interesting if an XDP program could monitor any
packet errors that were captured by the NIC. Traditionally the driver may
decide to drop/consume such packets, but exposing them would be a good
use case for debugging.

> By using the RSS standard, we do e.g. loose the ability to say this is
> IPSEC traffic, even-though your HW supports this.
> 
> I have tried to implemented my own (non-dynamic) XDP RX-types UAPI here:
>  https://marc.info/?l=linux-netdev&m=149512213531769
>  https://marc.info/?l=linux-netdev&m=149512213631774
> =========================
> 
> Thanks,
> Saeed.
> 
> Saeed Mahameed (6):
>   net: xdp: Add support for meta data flags requests
>   net: xdp: RX meta data infrastructure
>   net/mlx5e: Store xdp flags and meta data info
>   net/mlx5e: Pass CQE to RX handlers
>   net/mlx5e: Add XDP RX meta data support
>   samples/bpf: Add meta data hash example to xdp_redirect_cpu
> 
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  19 ++-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  58 +++++----
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  47 ++++++--
>  include/linux/netdevice.h                     |  10 ++
>  include/net/xdp.h                             |   6 +
>  include/uapi/linux/bpf.h                      | 114 ++++++++++++++++++
>  include/uapi/linux/if_link.h                  |  20 ++-
>  net/core/dev.c                                |  41 +++++++
>  samples/bpf/xdp_redirect_cpu_kern.c           |  67 ++++++++++
>  samples/bpf/xdp_redirect_cpu_user.c           |   7 ++
>  10 files changed, 354 insertions(+), 35 deletions(-)
> 
> --
> 2.17.0

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-06-27 14:15   ` Jesper Dangaard Brouer
@ 2018-06-27 17:55     ` Saeed Mahameed
  2018-07-02  8:01       ` Daniel Borkmann
  0 siblings, 1 reply; 36+ messages in thread
From: Saeed Mahameed @ 2018-06-27 17:55 UTC (permalink / raw)
  To: saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim, borkmann,
	Tariq Toukan, neerav.parikh, Opher Reviv, alexei.starovoitov,
	pjwaskiewicz, netdev, ttoukan.linux

On Wed, 2018-06-27 at 16:15 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 26 Jun 2018 19:46:11 -0700
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> 
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 2deea7166a34..afe302613ae1 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff
> > *xdp)
> >  	xdp->data_meta = xdp->data + 1;
> >  }
> >  
> > +static __always_inline void
> > +xdp_reset_data_meta(struct xdp_buff *xdp)
> > +{
> > +	xdp->data_meta = xdp->data_hard_start;
> > +}
> 
> This is WRONG ... it should be:
> 
>  xdp->data_meta = xdp->data;
> 

Maybe the name of the function is not suitable for the use case.
I need to set xdp->data_meta = xdp->data in the driver to start storing
meta data.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu
  2018-06-27 10:59   ` Jesper Dangaard Brouer
@ 2018-06-27 18:04     ` Saeed Mahameed
  0 siblings, 0 replies; 36+ messages in thread
From: Saeed Mahameed @ 2018-06-27 18:04 UTC (permalink / raw)
  To: saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim, borkmann,
	Tariq Toukan, neerav.parikh, Opher Reviv, alexei.starovoitov,
	pjwaskiewicz, netdev, ttoukan.linux

On Wed, 2018-06-27 at 12:59 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 26 Jun 2018 19:46:15 -0700
> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> 
> > Add a new program (prog_num = 4) that will not parse packets and
> > will
> > use the meta data hash to spread/redirect traffic into different
> > cpus.
> 
> You cannot "steal" prognum 4, as it is already used for
> "xdp_prognum4_ddos_filter_pktgen".  Please append your new prog as
> #5.
> 

Sure.

> > For the new program we set on bpf_set_link_xdp_fd:
> > 	xdp_flags |= XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> > 
> > On mlx5 it will succeed since mlx5 already supports these flags.
> > 
> > The new program will read the value of the hash from the data_meta
> > pointer from the xdp_md and will use it to compute the destination
> > cpu.
> > 
> > Note: I didn't test this patch to show redirect works with the
> > hash!
> > I only used it to see that the hash and vlan values are set
> > correctly
> > by the driver and can be seen by the xdp program.
> > 
> > * I faced some difficulties to read the hash value using the helper
> > functions defined in the previous patches, but once i used the same
> > logic
> > with out these functions it worked ! Will have to figure this out
> > later.
> > 
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  samples/bpf/xdp_redirect_cpu_kern.c | 67
> > +++++++++++++++++++++++++++++
> >  samples/bpf/xdp_redirect_cpu_user.c |  7 +++
> >  2 files changed, 74 insertions(+)
> > 
> > diff --git a/samples/bpf/xdp_redirect_cpu_kern.c
> > b/samples/bpf/xdp_redirect_cpu_kern.c
> > index 303e9e7161f3..d6b3f55f342a 100644
> > --- a/samples/bpf/xdp_redirect_cpu_kern.c
> > +++ b/samples/bpf/xdp_redirect_cpu_kern.c
> > @@ -376,6 +376,73 @@ int  xdp_prognum3_proto_separate(struct xdp_md
> > *ctx)
> >  	return bpf_redirect_map(&cpu_map, cpu_dest, 0);
> >  }
> >  
> > +#if 0
> > +xdp_md_info_arr mdi = {
> > +	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> > +	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct
> > xdp_md_hash), .present = 1},
> > +};
> > +#endif
> 
> Sorry, no global variables avail in the generated BPF byte-code.
> 
Yeah, I found out the hard way :), but for my final solution I would like
to somehow share the same static md info array between the netdev and the
bpf program, so this code was just experimental.

> > +SEC("xdp_cpu_map4_hash_separate")
> > +int  xdp_prognum4_hash_separate(struct xdp_md *ctx)
> > +{
> > +	void *data_meta = (void *)(long)ctx->data_meta;
> > +	void *data_end  = (void *)(long)ctx->data_end;
> > +	void *data      = (void *)(long)ctx->data;
> > +	struct xdp_md_hash *hash;
> > +	struct xdp_md_vlan *vlan;
> > +	struct datarec *rec;
> > +	u32 cpu_dest = 0;
> > +	u32 cpu_idx = 0;
> > +	u32 *cpu_lookup;
> > +	u32 key = 0;
> > +
> > +	/* Count RX packet in map */
> > +	rec = bpf_map_lookup_elem(&rx_cnt, &key);
> > +	if (!rec)
> > +		return XDP_ABORTED;
> > +	rec->processed++;
> > +
> > +	/* for some reason this code fails to be verified */
> > +#if 0
> > +	hash = xdp_data_meta_get_hash(mdi, data_meta);
> 
> This will not work, because it is not implemented as a proper
> BPF-helper call.
> 
> First, you currently store the xdp_md_info_arr inside the driver,
> which
> makes it hard for a helper to access this.  For helper access, we
> could
> store this in xdp_rxq_info.
> 

Good Idea!

> Second, in your design it looks like you are introducing a helper per
> possible item in xdp_md_info_arr.  I think we can reduce this to a
> single helper, that takes a XDP_DATA_META_xxx flag, and returns an
> offset.  (The helper could return a direct pointer, but I don't think
> the verfier can handle that, as it need to "see" this is related to
> data_meta pointer, and that we do the proper boundry checks.).
> 

We could update the verifier to allow access to any offset between
data_meta and data_meta + offset(last meta data) + sizeof(last meta data)?
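
To make the single-helper idea concrete, here is a rough sketch of how an
XDP program might use an offset-returning helper; the helper name
bpf_xdp_meta_offset() is hypothetical, only XDP_DATA_META_HASH and
struct xdp_md_hash come from this RFC, and the verifier would still need
to learn about the bounds involved:

SEC("xdp_meta_hash")
int xdp_use_hash(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data      = (void *)(long)ctx->data;
	struct xdp_md_hash *hash;
	int off;

	/* hypothetical helper: offset of the field inside the data_meta
	 * area, or negative if the driver did not provide it
	 */
	off = bpf_xdp_meta_offset(ctx, XDP_DATA_META_HASH);
	if (off < 0)
		return XDP_PASS;

	hash = data_meta + off;
	if ((void *)(hash + 1) > data)	/* bounds check for the verifier */
		return XDP_ABORTED;

	/* use hash->hash / hash->type for cpu steering, lookups, etc. */
	return XDP_PASS;
}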

> The BPF prog already have direct memory access to the data_meta area,
> and all it really need is an offset.  Sure, the XDP/bpf programmer
> could just calculate these offsets as constants, and remember to load
> the XDP prog with the flags that corresponds to the calculated
> offsets.
> 
> But I think we can do something even smarter... 
> 
> It should be possible to convert/patch the BPF instructions, of the
> helper call that returns an offset, to instead avoid the call and
> either (1) provide the offset as a constant/IMM or (2) make BPF inst
> doing the lookup in xdp_md_info_arr.
> 

I vote (2).

> 
> > +	if (hash + 1 > data)
> > +		return XDP_ABORTED;
> > +
> > +	vlan = xdp_data_meta_get_vlan(mdi, data_meta);
> > +	if (vlan + 1 > data)
> > +		return XDP_ABORTED;
> > +#endif
> > +
> > +	/* Work around for the above code */
> > +	hash = data_meta; /* since we know hash will appear first
> > */
> > +        if (hash + 1 > data)
> > +		return XDP_ABORTED;
> > +
> > +#if 0
> > +	// Just for testing
> > +	/* We know that vlan will appear after the hash */
> > +	vlan = (void *)((char *)data_meta + sizeof(*hash));
> > +	if (vlan + 1 > data) {
> > +		return XDP_ABORTED;
> > +	}
> > +#endif
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-06-27 17:55     ` Saeed Mahameed
@ 2018-07-02  8:01       ` Daniel Borkmann
  2018-07-03 23:52         ` Saeed Mahameed
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2018-07-02  8:01 UTC (permalink / raw)
  To: Saeed Mahameed, saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim,
	Tariq Toukan, neerav.parikh, Opher Reviv, alexei.starovoitov,
	pjwaskiewicz, netdev, ttoukan.linux, john.fastabend

On 06/27/2018 07:55 PM, Saeed Mahameed wrote:
> On Wed, 2018-06-27 at 16:15 +0200, Jesper Dangaard Brouer wrote:
>> On Tue, 26 Jun 2018 19:46:11 -0700
>> Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>>
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index 2deea7166a34..afe302613ae1 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff
>>> *xdp)
>>>  	xdp->data_meta = xdp->data + 1;
>>>  }
>>>  
>>> +static __always_inline void
>>> +xdp_reset_data_meta(struct xdp_buff *xdp)
>>> +{
>>> +	xdp->data_meta = xdp->data_hard_start;
>>> +}
>>
>> This is WRONG ... it should be:
>>
>>  xdp->data_meta = xdp->data;
> 
> maybe the name of the function is not suitable for the use case.
> i need to set xdp->data_meta = xdp->data in the driver to start storing
> meta data.

xdp_set_data_meta_invalid() is a straightforward way for XDP drivers to
signal that they do not support xdp->data_meta: setting it to xdp->data + 1
will fail the checks for direct (meta) packet access in the BPF code, and at
the same time bpf_xdp_adjust_meta() will know to bail out with an error when
a program attempts to make headroom for meta data that the driver cannot
handle later on.

So later setting 'xdp->data_meta = xdp->data' to enable it is just like
setting any other of the initializers in xdp_buff, and is done that way
today in the i40e, ixgbe, ixgbevf and nfp drivers. (Theoretically it
wouldn't have to be set to exactly xdp->data; anything <= xdp->data works,
if the driver prepends info from hw in front of it that the program can then
use or later override.)
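
For illustration, the two patterns side by side (a simplified sketch, not
lifted from any specific driver; va, rx_headroom, len and meta_supported
are placeholders):

	xdp.data_hard_start = va;
	xdp.data            = va + rx_headroom;
	xdp.data_end        = xdp.data + len;

	if (!meta_supported)
		/* data_meta = data + 1: direct meta access and
		 * bpf_xdp_adjust_meta() both fail for this buffer
		 */
		xdp_set_data_meta_invalid(&xdp);
	else
		/* valid but empty; the program can grow it with
		 * bpf_xdp_adjust_meta(), or the driver may place
		 * hw-provided info below xdp.data first
		 */
		xdp.data_meta = xdp.data;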

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-06-27  2:46 ` [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure Saeed Mahameed
  2018-06-27 14:15   ` Jesper Dangaard Brouer
@ 2018-07-03 23:01   ` Alexei Starovoitov
  2018-07-04  0:57     ` Saeed Mahameed
  1 sibling, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-03 23:01 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: brouer, borkmann, tariqt, alexander.h.duyck, peter.waskiewicz.jr,
	netdev, john.fastabend

On Tue, Jun 26, 2018 at 07:46:11PM -0700, Saeed Mahameed wrote:
> The idea from this patch is to define a well known structure for XDP meta
> data fields format and offset placement inside the xdp data meta buffer.
> 
> For that driver will need some static information to know what to
> provide and where, enters struct xdp_md_info and xdp_md_info_arr:
> 
> struct xdp_md_info {
>        __u16 present:1;
>        __u16 offset:15; /* offset from data_meta in xdp_md buffer */
> };
> 
> /* XDP meta data offsets info array
>  * present bit describes if a meta data is or will be present in xdp_md buff
>  * offset describes where a meta data is or should be placed in xdp_md buff
>  *
>  * Kernel builds this array using xdp_md_info_build helper on demand.
>  * User space builds it statically in the xdp program.
>  */
> typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> 
> Offsets in xdp_md_info_arr are always in ascending order and only for
> requested meta data:
> example : flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> 
> xdp_md_info_arr mdi = {
> 	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> 	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
> 	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash), .present = 1},
> 	[XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> }
> 
> i.e: hash fields will always appear first then the vlan for every
> xdp_md.
> 
> Once requested to provide xdp meta data, device driver will use a pre-built
> xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> xdp_md_info_arr will tell the driver what is the offset of each meta data.
> 
> This patch also provides helper functions for the device drivers to store
> meta data into xdp_buff, and helper function for XDP programs to fetch
> specific xdp meta data from xdp_md buffer.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
...
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 59b19b6a40d7..e320e7421565 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2353,6 +2353,120 @@ struct xdp_md {
>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>  };
>  
> +enum {
> +	XDP_DATA_META_HASH = 0,
> +	XDP_DATA_META_MARK,
> +	XDP_DATA_META_VLAN,
> +	XDP_DATA_META_CSUM_COMPLETE,
> +
> +	XDP_DATA_META_MAX,
> +};
> +
> +struct xdp_md_hash {
> +	__u32 hash;
> +	__u8  type;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_mark {
> +	__u32 mark;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_vlan {
> +	__u16 vlan;
> +} __attribute__((aligned(4)));
> +
> +struct xdp_md_csum {
> +	__u16 csum;
> +} __attribute__((aligned(4)));
> +
> +static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
> +	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
> +	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
> +	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
> +	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE */
> +};
> +
> +struct xdp_md_info {
> +	__u16 present:1;
> +	__u16 offset:15; /* offset from data_meta in xdp_md buffer */
> +};
> +
> +/* XDP meta data offsets info array
> + * present bit describes if a meta data is or will be present in xdp_md buff
> + * offset describes where a meta data is or should be placed in xdp_md buff
> + *
> + * Kernel builds this array using xdp_md_info_build helper on demand.
> + * User space builds it statically in the xdp program.
> + */
> +typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> +
> +static __always_inline __u8
> +xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
> +{
> +	return mdi[meta_data].present;
> +}
> +
> +static __always_inline void
> +xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32 hash, __u8 htype)
> +{
> +	struct xdp_md_hash *hash_md;
> +
> +	hash_md = (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
> +	hash_md->hash = hash;
> +	hash_md->type = htype;
> +}
> +
> +static __always_inline struct xdp_md_hash *
> +xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
> +{
> +	return (struct xdp_md_hash *)((char*)data_meta + mdi[XDP_DATA_META_HASH].offset);
> +}

I'm afraid this is not a scalable uapi.
This looks very much mlx specific. Every NIC vendor cannot add its own
struct definitions into uapi/bpf.h. It doesn't scale.
Also, doing 15-bit bitfield extraction using bpf instructions is not that
simple. mlx should consider doing plain u8 or u16 fields in firmware instead.
The metadata that "hw" provides is a joint product of the actual asic and
firmware. I suspect the format can and will change with different firmware.
Baking this stuff into uapi/bpf.h is not an option.
How about we make driver+firmware provide a BTF definition of the metadata
they can provide? There can be multiple definitions of such structs.
Then in userspace we can have a BTF->plain C converter.
(bpftool is practically ready to do that already).
Then the programmer can take such a generated C definition, add it to a .h
and include it in their programs. llvm will compile the whole thing and will
include BTF of maps, progs and this md struct in the target elf file.
During loading the kernel can check that the BTF in the elf matches
one-to-one what driver+firmware say they support.
No ambiguity and no possibility of mistake, since offsets and field names
are verified.
Every driver can have its own BTF for md and its own special features.
We can try to standardize the names (like vlan and csum), so xdp programs
can stay relatively portable across NICs.
Such an api will address exposing asic+firmware metadata to the xdp program.
Once we tackle this problem, we'll think about how to do the backward config
(firmware reconfig for a specific BTF definition of md supplied by the prog).
What do people think?
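
To make the flow a bit more concrete, a hypothetical example of what such a
generated header and its use could look like (the struct, field names and
the policy in the program are invented for illustration; only the general
BTF->C idea is from the proposal above):

/* md_desc.h, generated from the NIC's BTF by a BTF->C converter */
struct xdp_md_desc {
	__u32 flow_mark;
	__u32 hash32;
	__u16 csum;
	__u16 vlan;
};

/* in the XDP program: offsets are known at build time, single load each */
SEC("xdp")
int prog(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data      = (void *)(long)ctx->data;
	struct xdp_md_desc *md = data_meta;

	if ((void *)(md + 1) > data)
		return XDP_PASS;	/* driver provided no meta data */

	/* direct, single-load accesses to hw hints */
	if (md->hash32 == 0)
		return XDP_DROP;

	return XDP_PASS;
}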

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-02  8:01       ` Daniel Borkmann
@ 2018-07-03 23:52         ` Saeed Mahameed
  0 siblings, 0 replies; 36+ messages in thread
From: Saeed Mahameed @ 2018-07-03 23:52 UTC (permalink / raw)
  To: daniel, saeedm, brouer
  Cc: alexander.h.duyck, peter.waskiewicz.jr, Rony Efraim,
	Tariq Toukan, john.fastabend, neerav.parikh, Opher Reviv,
	alexei.starovoitov, pjwaskiewicz, netdev, ttoukan.linux

On Mon, 2018-07-02 at 10:01 +0200, Daniel Borkmann wrote:
> On 06/27/2018 07:55 PM, Saeed Mahameed wrote:
> > On Wed, 2018-06-27 at 16:15 +0200, Jesper Dangaard Brouer wrote:
> > > On Tue, 26 Jun 2018 19:46:11 -0700
> > > Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> > > 
> > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > index 2deea7166a34..afe302613ae1 100644
> > > > --- a/include/net/xdp.h
> > > > +++ b/include/net/xdp.h
> > > > @@ -138,6 +138,12 @@ xdp_set_data_meta_invalid(struct xdp_buff
> > > > *xdp)
> > > >  	xdp->data_meta = xdp->data + 1;
> > > >  }
> > > >  
> > > > +static __always_inline void
> > > > +xdp_reset_data_meta(struct xdp_buff *xdp)
> > > > +{
> > > > +	xdp->data_meta = xdp->data_hard_start;
> > > > +}
> > > 
> > > This is WRONG ... it should be:
> > > 
> > >  xdp->data_meta = xdp->data;
> > 
> > maybe the name of the function is not suitable for the use case.
> > i need to set xdp->data_meta = xdp->data in the driver to start
> > storing
> > meta data.
> 
> The xdp_set_data_meta_invalid() is a straight forward way for XDP
> drivers
> to tell they do not support xdp->data_meta, since setting xdp->data +
> 1 will
> fail the checks for direct (meta) packet access in the BPF code and
> at the
> same time bpf_xdp_adjust_meta() will know to bail out with error when
> program
> attempts to make headroom for meta data that driver cannot handle
> later on.
> 
> So later setting 'xdp->data_meta = xdp->data' to enable it is as
> setting any
> other of the initializers in xdp_buff, and done so today in the i40e,
> ixgbe,
> ixgbevf and nfp drivers. (Theoretically it wouldn't have to be
> exactly set to
> xdp->data, but anything <= xdp->data works, if the driver would
> prepend info
> from hw in front of it that program can then use or later override.)
> 

Right! As Jesper pointed out as well, the driver should set
xdp->data_meta = xdp->data and then adjust it to the offset of the first
meta data hint.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-03 23:01   ` Alexei Starovoitov
@ 2018-07-04  0:57     ` Saeed Mahameed
  2018-07-04  7:51       ` Daniel Borkmann
  0 siblings, 1 reply; 36+ messages in thread
From: Saeed Mahameed @ 2018-07-04  0:57 UTC (permalink / raw)
  To: alexei.starovoitov, saeedm
  Cc: alexander.h.duyck, netdev, Tariq Toukan, john.fastabend, brouer,
	borkmann, peter.waskiewicz.jr

On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:
> On Tue, Jun 26, 2018 at 07:46:11PM -0700, Saeed Mahameed wrote:
> > The idea from this patch is to define a well known structure for
> > XDP meta
> > data fields format and offset placement inside the xdp data meta
> > buffer.
> > 
> > For that driver will need some static information to know what to
> > provide and where, enters struct xdp_md_info and xdp_md_info_arr:
> > 
> > struct xdp_md_info {
> >        __u16 present:1;
> >        __u16 offset:15; /* offset from data_meta in xdp_md buffer
> > */
> > };
> > 
> > /* XDP meta data offsets info array
> >  * present bit describes if a meta data is or will be present in
> > xdp_md buff
> >  * offset describes where a meta data is or should be placed in
> > xdp_md buff
> >  *
> >  * Kernel builds this array using xdp_md_info_build helper on
> > demand.
> >  * User space builds it statically in the xdp program.
> >  */
> > typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> > 
> > Offsets in xdp_md_info_arr are always in ascending order and only
> > for
> > requested meta data:
> > example : flags = XDP_FLAGS_META_HASH | XDP_FLAGS_META_VLAN;
> > 
> > xdp_md_info_arr mdi = {
> > 	[XDP_DATA_META_HASH] = {.offset = 0, .present = 1},
> > 	[XDP_DATA_META_MARK] = {.offset = 0, .present = 0},
> > 	[XDP_DATA_META_VLAN] = {.offset = sizeof(struct xdp_md_hash),
> > .present = 1},
> > 	[XDP_DATA_META_CSUM] = {.offset = 0, .present = 0},
> > }
> > 
> > i.e: hash fields will always appear first then the vlan for every
> > xdp_md.
> > 
> > Once requested to provide xdp meta data, device driver will use a
> > pre-built
> > xdp_md_info_arr which was built via xdp_md_info_build on xdp setup,
> > xdp_md_info_arr will tell the driver what is the offset of each
> > meta data.
> > 
> > This patch also provides helper functions for the device drivers to
> > store
> > meta data into xdp_buff, and helper function for XDP programs to
> > fetch
> > specific xdp meta data from xdp_md buffer.
> > 
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> 
> ...
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 59b19b6a40d7..e320e7421565 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -2353,6 +2353,120 @@ struct xdp_md {
> >  	__u32 rx_queue_index;  /* rxq->queue_index  */
> >  };
> >  
> > +enum {
> > +	XDP_DATA_META_HASH = 0,
> > +	XDP_DATA_META_MARK,
> > +	XDP_DATA_META_VLAN,
> > +	XDP_DATA_META_CSUM_COMPLETE,
> > +
> > +	XDP_DATA_META_MAX,
> > +};
> > +
> > +struct xdp_md_hash {
> > +	__u32 hash;
> > +	__u8  type;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_mark {
> > +	__u32 mark;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_vlan {
> > +	__u16 vlan;
> > +} __attribute__((aligned(4)));
> > +
> > +struct xdp_md_csum {
> > +	__u16 csum;
> > +} __attribute__((aligned(4)));
> > +
> > +static const __u16 xdp_md_size[XDP_DATA_META_MAX] = {
> > +	sizeof(struct xdp_md_hash), /* XDP_DATA_META_HASH */
> > +	sizeof(struct xdp_md_mark), /* XDP_DATA_META_MARK */
> > +	sizeof(struct xdp_md_vlan), /* XDP_DATA_META_VLAN */
> > +	sizeof(struct xdp_md_csum), /* XDP_DATA_META_CSUM_COMPLETE
> > */
> > +};
> > +
> > +struct xdp_md_info {
> > +	__u16 present:1;
> > +	__u16 offset:15; /* offset from data_meta in xdp_md buffer
> > */
> > +};
> > +
> > +/* XDP meta data offsets info array
> > + * present bit describes if a meta data is or will be present in
> > xdp_md buff
> > + * offset describes where a meta data is or should be placed in
> > xdp_md buff
> > + *
> > + * Kernel builds this array using xdp_md_info_build helper on
> > demand.
> > + * User space builds it statically in the xdp program.
> > + */
> > +typedef struct xdp_md_info xdp_md_info_arr[XDP_DATA_META_MAX];
> > +
> > +static __always_inline __u8
> > +xdp_data_meta_present(xdp_md_info_arr mdi, int meta_data)
> > +{
> > +	return mdi[meta_data].present;
> > +}
> > +
> > +static __always_inline void
> > +xdp_data_meta_set_hash(xdp_md_info_arr mdi, void *data_meta, __u32
> > hash, __u8 htype)
> > +{
> > +	struct xdp_md_hash *hash_md;
> > +
> > +	hash_md = (struct xdp_md_hash *)((char*)data_meta +
> > mdi[XDP_DATA_META_HASH].offset);
> > +	hash_md->hash = hash;
> > +	hash_md->type = htype;
> > +}
> > +
> > +static __always_inline struct xdp_md_hash *
> > +xdp_data_meta_get_hash(xdp_md_info_arr mdi, void *data_meta)
> > +{
> > +	return (struct xdp_md_hash *)((char*)data_meta +
> > mdi[XDP_DATA_META_HASH].offset);
> > +}
> 
> I'm afraid this is not scalable uapi.
> This looks very much mlx specific. Every NIC vendor cannot add its
> own
> struct definitions into uapi/bpf.h. It doesn't scale.

No, this is not mlx specific; all you can find here are the standard
hash/csum/vlan/flow mark fields as described in the SKB. We want them to
be well defined so you can construct SKBs with these standard values from
xdp_md with no surprises.

I agree this is not scalable, but this patch is not about arbitrary
vendor-specific hints with unknown sizes and shapes. I understand the
demand, but we do need the notion of standard hints as well in order to
allow a (driver/fw/hw) => (xdp program) => (stack) data flow, since those
three entities must agree on some well defined hints.

Anyway, we discussed this in the last iovisor meeting and agreed that along
with the standard hints suggested by this patch we want to allow arbitrary
hw-specific hints; the question is how to have both.

> Also doing 15 bit bitfield extraction using bpf instructions is not
> that simple.
> mlx should consider doing plain u8 or u16 fields in firmware instead.
> The metadata that "hw" provides is a joint work of actual asic and
> firmware.

Right, this would be a huge flexibility improvement for future hw.
As for current HW, the driver can always translate the current format to
whatever format we decide on.

> I suspect the format can and will change with different firmware.
> Baking this stuff into uapi/bpf.h is not an option.

Yes, but we still need well defined structures for standard hints, so the
hints can flow into the stack as well. Current and future HW/FW/drivers
must respect the well known format of specific hints, such as
hash/csum/vlan, etc.

> How about we make driver+firmware provide a BTF definition of
> metadata that they
> can provide? There can be multiple definitions of such structs.
> Then in userpsace we can have BTF->plain C converter.
> (bpftool practically ready to do that already).
> Then the programmer can take such generated C definition, add it to
> .h and include
> it in their programs. llvm will compile the whole thing and will
> include BTF
> of maps, progs and this md struct in the target elf file.
> During loading the kernel can check that BTF in elf is matching one-
> to-one
> to what driver+firmware are saying they support.

Just thinking out loud, can't we do this at program load? Just run a setup
function in the xdp program to load the NIC md BTF definition into the elf
section?

> No ambiguity and no possibility of mistake, since offsets and field
> names
> are verified.

But what about the dynamic nature of this feature? Sometimes you only want
the HW/driver to provide a subset of whatever the HW can provide and save
the md buffer for other stuff.

Yes, a well defined format is favorable here, but we need to make sure
there is no computational overhead in the data path just to extract each
field! For example, if I want to know what the offset of the hash is, will
I need to parse (for every packet) the whole BTF definition of the metadata
just to find the offset of type=hash?

> Every driver can have their own BTF for md and their own special
> features.
> We can try to standardize the names (like vlan and csum), so xdp
> programs
> can stay relatively portable across NICs.

Yes this is a must.

> Such api will address exposing asic+firmware metadata to the xdp
> program.
> Once we tackle this problem, we'll think how to do the backward
> config
> (to do firmware reconfig for specific BTF definition of md supplied
> by the prog).
> What people think?
> 

For legacy HW, we can do it already in the driver and provide whatever the
prog requested; it is only a matter of translating to the BTF format in the
driver xdp setup and pushing the values accordingly into the md offsets on
the data path.

Question: how can you share the md BTF from the driver/HW with the xdp
program?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-04  0:57     ` Saeed Mahameed
@ 2018-07-04  7:51       ` Daniel Borkmann
  2018-07-05 17:18         ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2018-07-04  7:51 UTC (permalink / raw)
  To: Saeed Mahameed, alexei.starovoitov, saeedm
  Cc: alexander.h.duyck, netdev, Tariq Toukan, john.fastabend, brouer,
	borkmann, peter.waskiewicz.jr

On 07/04/2018 02:57 AM, Saeed Mahameed wrote:
> On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:
[...]
>> How about we make driver+firmware provide a BTF definition of
>> metadata that they
>> can provide? There can be multiple definitions of such structs.
>> Then in userpsace we can have BTF->plain C converter.
>> (bpftool practically ready to do that already).
>> Then the programmer can take such generated C definition, add it to
>> .h and include
>> it in their programs. llvm will compile the whole thing and will
>> include BTF
>> of maps, progs and this md struct in the target elf file.
>> During loading the kernel can check that BTF in elf is matching one-
>> to-one
>> to what driver+firmware are saying they support.

I do like the above idea of utilizing BTF for this; it seems like a good fit.

> Just thinking out loud, can't we do this at program load ? just run a
> setup function in the xdp program to load nic md BTF definition into
> the elf section ?
> 
>> No ambiguity and no possibility of mistake, since offsets and field
>> names
>> are verified.
> 
> But what about the dynamic nature of this feature ? Sometimes you only
> want HW/Driver to provide a subset of whatever the HW can provide and
> save md buffer for other stuff.
> 
> Yes a well defined format is favorable here, but we need to make sure
> there is no computational overhead in data path just to extract each
> field! for example if i want to know what is the offset of the hash
> will i need to go parse (for every packet) the whole BTF definition of
> metadata just to find the offset of type=hash ?

I don't think it would be the case that you'd need to walk BTF in the fast
path here. In the ideal case, the only thing the driver would need to do in
the fast path is set the proper xdp->data_meta offset, and _that_ would be
it. For the rest, the program would know how to access the data since it's
already aware of it from the BTF definition the driver provided. Other
drivers that are less flexible in that regard would internally prep the
buffer based on the prog's needs, more or less similar to
mlx5e_xdp_fill_data_meta(), but it would really be up to the driver how to
handle this internally. The driver would check the BTF at XDP setup time to
do whatever configuration is needed. The verifier would only check the BTF
and pass it along for XDP setup; prog rewrites in the verifier aren't even
needed since LLVM compiled everything already.

>> Every driver can have their own BTF for md and their own special
>> features.
>> We can try to standardize the names (like vlan and csum), so xdp
>> programs
>> can stay relatively portable across NICs.
> 
> Yes this is a must.

Agree, there needs to be a basic common set that would be provided by
every XDP aware driver.

>> Such api will address exposing asic+firmware metadata to the xdp
>> program.
>> Once we tackle this problem, we'll think how to do the backward
>> config
>> (to do firmware reconfig for specific BTF definition of md supplied
>> by the prog).
>> What people think?
> 
> For legacy HW, we can do it already in the driver, provide whatever the
> prog requested, its only a matter of translation to the BTF format in
> the driver xdp setup and pushing the values accordingly into the md
> offsets on data path.
> 
> Question: how can you share the md BTF from the driver/HW with the xdp
> program ?

I think this would likely be a new query, e.g. XDP_QUERY_META_BTF,
implemented in the ndo_bpf callback and then exported e.g. via bpf(2) or
netlink, such that bpftool can generate the BTF -> C from there for the
program to include later in compilation.
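
For illustration only, the driver side of such a query could look roughly
like the sketch below; XDP_QUERY_META_BTF and the md_btf members of
struct netdev_bpf do not exist today, and mlx5e_md_btf_raw()/_size() are
made-up helpers standing in for wherever the driver keeps its BTF blob:

static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
	switch (xdp->command) {
	case XDP_SETUP_PROG:
		return mlx5e_xdp_set(dev, xdp->prog);
	case XDP_QUERY_META_BTF:		/* hypothetical command */
		/* hand back the raw BTF blob describing the md layout */
		xdp->md_btf.data = mlx5e_md_btf_raw(dev);
		xdp->md_btf.size = mlx5e_md_btf_size(dev);
		return 0;
	default:
		return -EINVAL;
	}
}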

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support
  2018-06-27  2:46 ` [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support Saeed Mahameed
@ 2018-07-04  8:28   ` Daniel Borkmann
  0 siblings, 0 replies; 36+ messages in thread
From: Daniel Borkmann @ 2018-07-04  8:28 UTC (permalink / raw)
  To: Saeed Mahameed, Jesper Dangaard Brouer, Alexei Starovoitov,
	Daniel Borkmann
  Cc: neerav.parikh, pjwaskiewicz, ttoukan.linux, Tariq Toukan,
	alexander.h.duyck, peter.waskiewicz.jr, Opher Reviv, Rony Efraim,
	netdev, Saeed Mahameed

On 06/27/2018 04:46 AM, Saeed Mahameed wrote:
[...]
> @@ -935,11 +958,16 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  		return false;
>  
>  	xdp.data = va + *rx_headroom;
> -	xdp_set_data_meta_invalid(&xdp);
>  	xdp.data_end = xdp.data + *len;
>  	xdp.data_hard_start = va;
>  	xdp.rxq = &rq->xdp_rxq;
>  
> +	if (rq->xdp.flags & XDP_FLAGS_META_ALL) {
> +		xdp_reset_data_meta(&xdp);
> +		mlx5e_xdp_fill_data_meta(rq->xdp.md_info, xdp.data_meta, cqe);
> +	} else
> +		xdp_set_data_meta_invalid(&xdp);
> +

Just a quick note on this one: it would actually be great not to call
xdp_set_data_meta_invalid() in the else path, as this meta buffer should
also be usable independently of hw hints. Meaning, in any case it would be
great if mlx5 + mlx4 could implement the xdp->data_meta support we have
today; this would probably be a good first step anyway. So far it is
supported on i40e, ixgbe, ixgbevf and nfp.

>  	act = bpf_prog_run_xdp(prog, &xdp);
>  	switch (act) {
>  	case XDP_PASS:
> 

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-04  7:51       ` Daniel Borkmann
@ 2018-07-05 17:18         ` Jakub Kicinski
  2018-07-06 16:30           ` Alexei Starovoitov
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-05 17:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Saeed Mahameed, alexei.starovoitov, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Wed, 4 Jul 2018 09:51:54 +0200, Daniel Borkmann wrote:
> On 07/04/2018 02:57 AM, Saeed Mahameed wrote:
> > On Tue, 2018-07-03 at 16:01 -0700, Alexei Starovoitov wrote:  
> >> How about we make driver+firmware provide a BTF definition of
> >> metadata that they
> >> can provide? There can be multiple definitions of such structs.
> >> Then in userpsace we can have BTF->plain C converter.
> >> (bpftool practically ready to do that already).
> >> Then the programmer can take such generated C definition, add it to
> >> .h and include
> >> it in their programs. llvm will compile the whole thing and will
> >> include BTF
> >> of maps, progs and this md struct in the target elf file.
> >> During loading the kernel can check that BTF in elf is matching one-
> >> to-one
> >> to what driver+firmware are saying they support.  
> 
> I do like the above idea of utilizing BTF for this, seems like a good fit.
>
> > Just thinking out loud, can't we do this at program load ? just run a
> > setup function in the xdp program to load nic md BTF definition into
> > the elf section ?
> >   
> >> No ambiguity and no possibility of mistake, since offsets and field
> >> names
> >> are verified.  
> > 
> > But what about the dynamic nature of this feature ? Sometimes you only
> > want HW/Driver to provide a subset of whatever the HW can provide and
> > save md buffer for other stuff.
> > 
> > Yes a well defined format is favorable here, but we need to make sure
> > there is no computational overhead in data path just to extract each
> > field! for example if i want to know what is the offset of the hash
> > will i need to go parse (for every packet) the whole BTF definition of
> > metadata just to find the offset of type=hash ?  
> 
> I don't think this would be the case that you'd need to walk BTF in fast
> path here. In the ideal case, the only thing that driver would need to do
> in fast path would be to set proper xdp->data_meta offset and _that_ would
> be it. For the rest, program would know how to access the data since it's
> already aware of it from BTF definition the driver provided. Other drivers
> which would be less flexible on that regard would internally prep the buffer
> based on the progs needs more or less similar as in mlx5e_xdp_fill_data_meta(),
> but it would be really up to the driver how to handle this internally. The
> BTF it would check at XDP setup time to do the configuration needed in the
> driver. Verifier would only check BTF, pass it along for XDP setup, prog
> rewrites in verifier aren't even needed since LLVM compiled everything
> already.

I don't think we should force drivers to place such meta data in the
buffer.  The moment that happens we lose the "zero-touch" abilities Jesper
was trying to achieve.

Besides, what happens to the meta data after XDP is finished?  We really
need the ability to communicate the modified fields further to the stack.
With meta data in the buffer we don't really know whether the information
placed there after XDP finishes is still valid or whether the program
overwrote it with something completely different.

I'm also not 100% on board with the argument that "future" FW can reshuffle
things whatever way it wants to.  Is the assumption that future ASICs/FW
will be designed to always use the "blessed" BTF format?  Or will it be
reconfigurable at runtime?

> >> Every driver can have their own BTF for md and their own special
> >> features.
> >> We can try to standardize the names (like vlan and csum), so xdp
> >> programs
> >> can stay relatively portable across NICs.  
> > 
> > Yes this is a must.  
> 
> Agree, there needs to be a basic common set that would be provided by
> every XDP aware driver.

I'm sorry to bring this up again, but can we really not let drivers
define their own "get_XYZ/set_XYZ" helpers, and link those to the
program at attachment time?  Sure we'd have to create a new copy of the
program for each driver it's used with, but is that really a problem?
That'd have the lowest impact on the performance and complexity of the
driver fast path.  The BTF solution already has all the same problems
WRT tail calls and not being sure which driver the program is attached
to.

> >> Such api will address exposing asic+firmware metadata to the xdp
> >> program.
> >> Once we tackle this problem, we'll think how to do the backward
> >> config
> >> (to do firmware reconfig for specific BTF definition of md supplied
> >> by the prog).
> >> What people think?  
> > 
> > For legacy HW, we can do it already in the driver, provide whatever the
> > prog requested, its only a matter of translation to the BTF format in
> > the driver xdp setup and pushing the values accordingly into the md
> > offsets on data path.
> > 
> > Question: how can you share the md BTF from the driver/HW with the xdp
> > program ?
> 
> I think this would likely be a new query as in XDP_QUERY_META_BTF
> implemented in ndo_bpf callback and then exported e.g. via bpf(2)
> or netlink such that bpftool can generate BTF -> C from there for the
> program to include later in compilation.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-05 17:18         ` Jakub Kicinski
@ 2018-07-06 16:30           ` Alexei Starovoitov
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
                               ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-06 16:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> 
> I'm also not 100% on board with the argument that "future" FW can
> reshuffle things whatever way it wants to.  Is the assumption that
> future ASICs/FW will be designed to always use the "blessed" BTF
> format?  Or will it be reconfigurable at runtime?

let's table configuration of metadata for a second.

Describing the metadata layout in BTF allows NICs to disclose everything
the NIC has to users in a standard and generic way.
Whether firmware is reconfigurable on the fly or has to be reflashed and
the hw powercycled to get a new md layout (and corresponding BTF
description) is a separate discussion.
Saeed's proposal introduces the concept of an 'offset' inside
'struct xdp_md_info' to reach the 'hash' value in the metadata.
Essentially it's a run-time way to access 'hash' instead of a build-time
one, so the bpf program would need two loads to read the csum or hash field
instead of one. With BTF the layout of the metadata is known to the program
at build time.

To reiterate the proposal:
- driver+firmware keep the layout of the metadata in BTF format (either in
  the driver, or the driver can read it from firmware)
- a 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the
  driver and generate a normal C header file based on the BTF in the given NIC
- the user does #include "md_desc.h" and the bpf program can access md->csum
  or md->hash with a direct single load out of the metadata area in front of
  the packet
- llvm compiles the bpf program and records how the program does these
  md->csum accesses in BTF format as well (the compiler will be keeping such
  records for __sk_buff and all other structs too, but that's a separate
  discussion)
- during sys_bpf(prog_load) the kernel checks (via the supplied BTF) that
  the way the program accesses the metadata (and other structs) matches the
  BTF from the driver, so there are no surprises if driver+firmware got
  updated but the program was not recompiled
- every NIC can have its own layout of metadata and its own meaning of the
  fields, but it would be good to standardize at least a few common fields
  like hash

Once this is working we can do more cool things with BTF, like doing offset
rewriting at program load time, similar to what we plan to do for tracing.
Tracing programs will be doing 'task->pid' accesses, and the kernel will
adjust offsetof(struct task_struct, pid) during program load depending on
the BTF for the kernel.
We can do the same trick for networking metadata: the program will contain
a load instruction for md->hash that gets automatically adjusted to the
proper offset depending on the BTF of the 'hash' field in the NIC.
For now I'm proposing _not_ to go that far with offset rewriting and to
start with the simple steps described above.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 16:30           ` Alexei Starovoitov
@ 2018-07-06 20:44             ` Waskiewicz Jr, Peter
  2018-07-06 23:38               ` Alexei Starovoitov
  2018-07-06 21:33             ` Jakub Kicinski
  2018-07-07  1:27             ` David Miller
  2 siblings, 1 reply; 36+ messages in thread
From: Waskiewicz Jr, Peter @ 2018-07-06 20:44 UTC (permalink / raw)
  To: alexei.starovoitov, jakub.kicinski
  Cc: Duyck, Alexander H, daniel, saeedm, brouer, borkmann, tariqt,
	john.fastabend, netdev, saeedm

On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > 
> > I'm also not 100% on board with the argument that "future" FW can
> > reshuffle things whatever way it wants to.  Is the assumption that
> > future ASICs/FW will be designed to always use the "blessed" BTF
> > format?  Or will it be reconfigurable at runtime?
> 
> let's table configuration of metadata aside for a second.

I agree that this should/could be NIC-specific and shouldn't weigh on
the metadata interface between the drivers and XDP layer.

> Describing metedata layout in BTF allows NICs to disclose everything
> NIC has to users in a standard and generic way.
> Whether firmware is reconfigurable on the fly or has to reflashed
> and hw powercycled to have new md layout (and corresponding BTF
> description)
> is a separate discussion.
> Saeed's proposal introduces the concept of 'offset' inside 'struct
> xdp_md_info'
> to reach 'hash' value in metadata.
> Essentially it's a run-time way to access 'hash' instead of build-
> time.
> So bpf program would need two loads to read csum or hash field
> instead of one.
> With BTF the layout of metadata is known to the program at build-
> time.
> 
> To reiterate the proposal:
> - driver+firmware keep layout of the metadata in BTF format (either
> in the driver
>   or driver can read it from firmware)
> - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> the driver and
>   generate normal C header file based on BTF in the given NIC
> - user does #include "md_desc.h" and bpf program can access md->csum
> or md->hash
>   with direct single load out of metadata area in front of the packet

This piece is where I'd like to discuss more.  When we discussed this in
Seoul, the initial proposal was a static struct whose common layout we'd
try to hammer out between the interested parties.  That obviously wasn't
going to scale, and we wanted to pursue something more dynamic.  But I
thought the goal was that the XDP/eBPF program wouldn't need to care what
the underlying device is, and could just ask for the metadata it's
interested in.  With this approach, your eBPF program is now bound/tied to
the NIC/driver, and if you switched to a different NIC/driver combo, you'd
have to rewrite part of your eBPF program to comprehend that.  I thought we
were trying to avoid that.

Our proposed approach (still working on something ready to RFC) is to
provide a method for the eBPF program to send a struct of requested hints
down to the driver on load.  If the driver can provide the hints, then
that'd be how they'd be laid out in the metadata.  If it can't provide
them, we'd probably reject the program load, or discuss providing a
software fallback (I know this is an area of contention).

I suppose we could get there with the rewriting mechanism described below,
but that'd be a tough sell to set a bit of ABI for metadata, then change it
to be potentially dynamic at runtime in the future.

-PJ

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 16:30           ` Alexei Starovoitov
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
@ 2018-07-06 21:33             ` Jakub Kicinski
  2018-07-06 23:42               ` Alexei Starovoitov
  2018-07-07  1:27             ` David Miller
  2 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-06 21:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > 
> > I'm also not 100% on board with the argument that "future" FW can
> > reshuffle things whatever way it wants to.  Is the assumption that
> > future ASICs/FW will be designed to always use the "blessed" BTF
> > format?  Or will it be reconfigurable at runtime?  
> 
> let's table configuration of metadata aside for a second.
> 
> Describing metedata layout in BTF allows NICs to disclose everything
> NIC has to users in a standard and generic way.
> Whether firmware is reconfigurable on the fly or has to reflashed
> and hw powercycled to have new md layout (and corresponding BTF description)
> is a separate discussion.
> Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> to reach 'hash' value in metadata.
> Essentially it's a run-time way to access 'hash' instead of build-time.
> So bpf program would need two loads to read csum or hash field instead of one.
> With BTF the layout of metadata is known to the program at build-time.
> 
> To reiterate the proposal:
> - driver+firmware keep layout of the metadata in BTF format (either in the driver
>   or driver can read it from firmware)
> - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
>   generate normal C header file based on BTF in the given NIC
> - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
>   with direct single load out of metadata area in front of the packet
> - llvm compiles bpf program and records how program is doing this md->csum accesses
>   in BTF format as well (the compiler will be keeping such records
>   for __sk_buff and all other structs too, but that's separate discussion)
> - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
>   accesses metadata (and other structs) matches BTF from the driver,
>   so no surprises if driver+firmware got updated, but program is not recompiled
> - every NIC can have their own layout of metadata and its own meaning of the fields,
>   but would be good to standardize at least a few common fields like hash

Can I expose HW descriptors this way, though, or is the proposal to
copy this data into the packet buffer?

> Once this is working we can do more cool things with BTF.
> Like doing offset rewriting at program load time similar to what we plan
> to do for tracing. Tracing programs will be doing 'task->pid' access
> and the kernel will adjust offsetof(struct task_struct, pid) during program load
> depending on BTF for the kernel.
> The same trick we can do for networking metadata.
> The program will contain load instruction md->hash that will get automatically
> adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> For now I'm proposing _not_ to go that far with offset rewriting and start
> with simple steps described above.

Why? :(  Could we please go with the rewrite/driver helpers instead of
impacting fast paths of the drivers yet again?  This rewrite should be
easier than task->pid, because we have the synthetic user space struct
xdp_md definition.

The flexibility and power of BPF rewrites/JITing is at our disposal to
deliver maximum performance.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
@ 2018-07-06 23:38               ` Alexei Starovoitov
  2018-07-06 23:49                 ` Waskiewicz Jr, Peter
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-06 23:38 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: jakub.kicinski, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > 
> > > I'm also not 100% on board with the argument that "future" FW can
> > > reshuffle things whatever way it wants to.  Is the assumption that
> > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > format?  Or will it be reconfigurable at runtime?
> > 
> > let's table configuration of metadata aside for a second.
> 
> I agree that this should/could be NIC-specific and shouldn't weigh on
> the metadata interface between the drivers and XDP layer.
> 
> > Describing metedata layout in BTF allows NICs to disclose everything
> > NIC has to users in a standard and generic way.
> > Whether firmware is reconfigurable on the fly or has to reflashed
> > and hw powercycled to have new md layout (and corresponding BTF
> > description)
> > is a separate discussion.
> > Saeed's proposal introduces the concept of 'offset' inside 'struct
> > xdp_md_info'
> > to reach 'hash' value in metadata.
> > Essentially it's a run-time way to access 'hash' instead of build-
> > time.
> > So bpf program would need two loads to read csum or hash field
> > instead of one.
> > With BTF the layout of metadata is known to the program at build-
> > time.
> > 
> > To reiterate the proposal:
> > - driver+firmware keep layout of the metadata in BTF format (either
> > in the driver
> >   or driver can read it from firmware)
> > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> > the driver and
> >   generate normal C header file based on BTF in the given NIC
> > - user does #include "md_desc.h" and bpf program can access md->csum
> > or md->hash
> >   with direct single load out of metadata area in front of the packet
> 
> This piece is where I'd like to discuss more.  When we discussed this
> in Seoul, the initial proposal was a static struct that we'd try to
> hammer out a common layout between the interested parties.  That
> obviously wasn't going to scale, and we wanted to pursue something more
> dynamic.  But I thought the goal was the XDP/eBPF program wouldn't want
> to care what the underlying device is, and could just ask for metadata
> that it's interested in.  With this approach, your eBPF program is now
> bound/tied to the NIC/driver, and if you switched to a different
> NIC/driver combo, then you'd have to rewrite part of your eBPF program
> to comprehend that.  I thought we were trying to avoid that.

It looks to me that NICs have a lot more differences than common pieces.
If we focus the architecture on making common things as generic
as possible we miss the bigger picture of covering distinct
and unique features that hw provides.

In the last email when I said "would be good to standardize at least a few common fields"
I meant that it makes sense only at the later phases.
I don't think we should put things into uapi upfront.
Otherwise we will end up with likely useless fields.
Like vlan field. Surely a lot of nics can store it into metadata, but why?
bpf program can easily read it from the packet.
What's the benefit of adding it to md?
This metadata concept we're discussing in the context of program acceleration.
If metadata doesn't give perf benefits it should not be done by
firmware, it should not be supported by the driver and certainly
should not be part of uapi.
Hence I'd like to see pure BTF description without any common/standard
fields across the drivers and figure out what (if anything) is useful
to accelerate xdp programs (and af_xdp in the future).

> Our proposed approach (still working on something ready to RFC) is to
> provide a method for the eBPF program to send a struct of requested
> hints down to the driver on load.  If the driver can provide the hints,
> then that'd be how they'd be laid out in the metadata.  If it can't
> provide them, we'd probably reject the program loading, or discuss
> providing a software fallback (I know this is an area of contention).

I don't think that approach can work.
My .02 is that the only useful acceleration feature out of intel nics is
an ability to parse the packet and report it to the xdp program
as a node id in the parse graph. As far as I know that is pretty unique
to intel. Configuration of that is a different matter.
Other things like hash are interesting, but we will quickly get
into rss spec definition if we go standardization route now instead
of later phases.
Therefore I'd rather let firmware+driver define its own BTF description
that says here is 'hash' of the packet hw can provide and it's fine
that this hash may be over different tuples depending on the nic
and even version of the firmware in that nic.
We need to see performance numbers and benefits for real world
xdp programs before drilling into specific semantics of the hash.
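
Purely for illustration, a hypothetical vendor-specific layout of the kind
such a per-NIC BTF description could expose; every name and value below is
invented, nothing in this thread defines them.

#include <linux/types.h>

struct vendorx_md {
	__u32 parse_node_id;	/* terminal node of the HW parse graph */
	__u32 rx_hash;		/* hash over a vendor-defined tuple */
};

#define VENDORX_NODE_IPV4_FRAG	42	/* made-up example value */

/* A DDoS-style program could branch on the parser result instead of
 * re-walking the headers in BPF. */
static inline int looks_like_frag_flood(const struct vendorx_md *md)
{
	return md->parse_node_id == VENDORX_NODE_IPV4_FRAG;
}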

> I suppose we could get there with the rewriting mechanism described
> below, but that'd be a tough sell to set a bit of ABI for metadata,
> then change it to be potentially dynamic at runtime in the future.

Hmm. I don't see how future offset rewriting will break abi.
It looks to me like a natural extension that wouldn't break any apps
that would be written for a specific nic with a given BTF description.
With rewriting in place some progs would become portable. That's all.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 21:33             ` Jakub Kicinski
@ 2018-07-06 23:42               ` Alexei Starovoitov
  2018-07-07  0:08                 ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-06 23:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > 
> > > I'm also not 100% on board with the argument that "future" FW can
> > > reshuffle things whatever way it wants to.  Is the assumption that
> > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > format?  Or will it be reconfigurable at runtime?  
> > 
> > let's table configuration of metadata aside for a second.
> > 
> > Describing metedata layout in BTF allows NICs to disclose everything
> > NIC has to users in a standard and generic way.
> > Whether firmware is reconfigurable on the fly or has to reflashed
> > and hw powercycled to have new md layout (and corresponding BTF description)
> > is a separate discussion.
> > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > to reach 'hash' value in metadata.
> > Essentially it's a run-time way to access 'hash' instead of build-time.
> > So bpf program would need two loads to read csum or hash field instead of one.
> > With BTF the layout of metadata is known to the program at build-time.
> > 
> > To reiterate the proposal:
> > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> >   or driver can read it from firmware)
> > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> >   generate normal C header file based on BTF in the given NIC
> > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> >   with direct single load out of metadata area in front of the packet
> > - llvm compiles bpf program and records how program is doing this md->csum accesses
> >   in BTF format as well (the compiler will be keeping such records
> >   for __sk_buff and all other structs too, but that's separate discussion)
> > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> >   accesses metadata (and other structs) matches BTF from the driver,
> >   so no surprises if driver+firmware got updated, but program is not recompiled
> > - every NIC can have their own layout of metadata and its own meaning of the fields,
> >   but would be good to standardize at least a few common fields like hash
> 
> Can I expose HW descriptors this way, though, or is the proposal to
> copy this data into the packet buffer?

That crossed my mind too. We can use BTF to describe HW descriptors too,
but I don't think it would buy us anything. AF_XDP approach is better.

> > Once this is working we can do more cool things with BTF.
> > Like doing offset rewriting at program load time similar to what we plan
> > to do for tracing. Tracing programs will be doing 'task->pid' access
> > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > depending on BTF for the kernel.
> > The same trick we can do for networking metadata.
> > The program will contain load instruction md->hash that will get automatically
> > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > For now I'm proposing _not_ to go that far with offset rewriting and start
> > with simple steps described above.
> 
> Why? :(  Could we please go with the rewrite/driver helpers instead of
> impacting fast paths of the drivers yet again?  This rewrite should be
> easier than task->pid, because we have the synthetic user space struct
> xdp_md definition.

I don't understand 'impact fast path yet again' concern.
If the NIC has certain metadata today, just describe what it has in BTF
and supply the actual per-packet md to the xdp program as-is.
No changes for the fast path at all.
Future rewriting will be done by the bpf/xdp core with zero
changes to the driver. All the driver has to do is to provide BTF.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:38               ` Alexei Starovoitov
@ 2018-07-06 23:49                 ` Waskiewicz Jr, Peter
  2018-07-07  0:40                   ` Jakub Kicinski
  2018-07-07  0:45                   ` Alexei Starovoitov
  0 siblings, 2 replies; 36+ messages in thread
From: Waskiewicz Jr, Peter @ 2018-07-06 23:49 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: Duyck, Alexander H, daniel, saeedm, brouer, borkmann, tariqt,
	john.fastabend, jakub.kicinski, netdev, saeedm

On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > 
> > > > I'm also not 100% on board with the argument that "future" FW
> > > > can
> > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > that
> > > > future ASICs/FW will be designed to always use the "blessed"
> > > > BTF
> > > > format?  Or will it be reconfigurable at runtime?
> > > 
> > > let's table configuration of metadata aside for a second.
> > 
> > I agree that this should/could be NIC-specific and shouldn't weigh
> > on
> > the metadata interface between the drivers and XDP layer.
> > 
> > > Describing metedata layout in BTF allows NICs to disclose
> > > everything
> > > NIC has to users in a standard and generic way.
> > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > and hw powercycled to have new md layout (and corresponding BTF
> > > description)
> > > is a separate discussion.
> > > Saeed's proposal introduces the concept of 'offset' inside
> > > 'struct
> > > xdp_md_info'
> > > to reach 'hash' value in metadata.
> > > Essentially it's a run-time way to access 'hash' instead of
> > > build-
> > > time.
> > > So bpf program would need two loads to read csum or hash field
> > > instead of one.
> > > With BTF the layout of metadata is known to the program at build-
> > > time.
> > > 
> > > To reiterate the proposal:
> > > - driver+firmware keep layout of the metadata in BTF format
> > > (either
> > > in the driver
> > >   or driver can read it from firmware)
> > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > query
> > > the driver and
> > >   generate normal C header file based on BTF in the given NIC
> > > - user does #include "md_desc.h" and bpf program can access md-
> > > >csum
> > > or md->hash
> > >   with direct single load out of metadata area in front of the
> > > packet
> > 
> > This piece is where I'd like to discuss more.  When we discussed
> > this
> > in Seoul, the initial proposal was a static struct that we'd try to
> > hammer out a common layout between the interested parties.  That
> > obviously wasn't going to scale, and we wanted to pursue something
> > more
> > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > want
> > to care what the underlying device is, and could just ask for
> > metadata
> > that it's interested in.  With this approach, your eBPF program is
> > now
> > bound/tied to the NIC/driver, and if you switched to a differen
> > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > program
> > to comprehend that.  I thought we were trying to avoid that.
> 
> It looks to me that NICs have a lot more differences instead of
> common pieces.
> If we focus the architecture on making common things as generic
> as possible we miss the bigger picture of covering distinct
> and unique features that hw provides.
> 
> In the last email when I said "would be good to standardize at least
> a few common fields"
> I meant that it make sense only at the later phases.
> I don't think we should put things into uapi upfront.
> Otherwise we will end up with likely useless fields.
> Like vlan field. Surely a lot of nics can store it into metadata, but
> why?
> bpf program can easily read it from the packet.
> What's the benefit of adding it to md?
> This metadata concept we're discussing in the context of program
> acceleration.
> If metadata doesn't give perf benefits it should not be done by
> firmware, it should not be supported by the driver and certainly
> should not be part of uapi.
> Hence I'd like to see pure BTF description without any
> common/standard
> fields across the drivers and figure out what (if anything) is useful
> to accelerate xdp programs (and af_xdp in the future).

I guess I didn't write my response very well before, apologies.  I
actually really like the BTF description and the dynamic-ness of it. 
It solves lots of problems and keeps things out of the UAPI.

What I was getting at was using the BTF description and dumping a
header file to be used in the program to be loaded.  If we have an
application developer include that header, it will be the Intel BTF
description, or Mellanox BTF description, etc.  I'm fine with that, but
it makes your program not as portable, since you'd need to get a
different header and recompile it if you change the NIC.  Unless I'm
missing something.
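
To make the portability concern concrete, a hedged sketch of what two
generated headers might look like; the layouts and names (BUILD_FOR_NIC_A,
struct xdp_hints, parser_flags) are invented, not defined anywhere in this
thread.

#include <linux/types.h>

/* Hypothetical contents of "md_desc.h" as dumped for two different NICs;
 * only one variant would ever be included into a given program build. */
#ifdef BUILD_FOR_NIC_A
struct xdp_hints {			/* layout as reported by NIC A */
	__u32 hash;
	__u16 csum;
	__u16 reserved;
};
#else
struct xdp_hints {			/* layout as reported by NIC B */
	__u16 csum;
	__u16 parser_flags;
	__u32 hash;			/* same field, different offset */
};
#endif

/* The program source is identical either way, but the offset compiled into
 * this load differs, which is what ties the object to one NIC until
 * load-time offset rewriting exists. */
static inline __u32 rx_hash(const void *data_meta)
{
	return ((const struct xdp_hints *)data_meta)->hash;
}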

> 
> > Our proposed approach (still working on something ready to RFC) is
> > to
> > provide a method for the eBPF program to send a struct of requested
> > hints down to the driver on load.  If the driver can provide the
> > hints,
> > then that'd be how they'd be laid out in the metadata.  If it can't
> > provide them, we'd probably reject the program loading, or discuss
> > providing a software fallback (I know this is an area of
> > contention).
> 
> I don't think that approach can work.
> My .02 is the only useful acceleration feature out if intel nics is
> an ability to parse the packet and report it to the xdp program
> as a node id in the parse graph. As far as I know that is pretty
> unique
> to intel. Configuration of that is a different matter.
> Other things like hash are interesting, but we will quickly get
> into rss spec definition if we go standardization route now instead
> of later phases.
> Therefore I'd rather let firmware+driver define its own BTF
> description
> that says here is 'hash' of the packet hw can provide and it's fine
> that this hash may be over different tuples depending on the nic
> and even version of the firmware in that nic.
> We need to see performance numbers and benefits for real world
> xdp programs before drilling into specific semantics of the hash.

I agree that "useful" fields should be targeted.  I can see hash
getting more interesting though if one can specify what parts of the n-
tuples get included in the hash, or if we want hashes for inner packets
or outer headers if things are tunneled.  But that can be revisited
down the road.

I also agree that changing the configuration of the underlying parser
or pipeline to generate the metadata is a completely separate
discussion.

> 
> > I suppose we could get there with the rewriting mechanism described
> > below, but that'd be a tough sell to set a bit of ABI for metadata,
> > then change it to be potentially dynamic at runtime in the future.
> 
> Hmm. I don't see how future offset rewriting will break abi.
> It looks to me as natural extension that wouldn't break any apps
> that would be written for specific nic with given BTF description.
> With rewritting in place some progs would become portable. That's
> all.

What I meant here was if anything was put into the UAPI, then changing
to something more dynamic might make transitioning more challenging. 
And I can see now what the rewriting can do to help my previous
thoughts on the BTF description locking a program to a specific NIC. 
But I'd need to play with it to be convinced about the portability.

-PJ

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:42               ` Alexei Starovoitov
@ 2018-07-07  0:08                 ` Jakub Kicinski
  2018-07-07  0:53                   ` Alexei Starovoitov
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  0:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:  
> > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > 
> > > > I'm also not 100% on board with the argument that "future" FW can
> > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > format?  Or will it be reconfigurable at runtime?    
> > > 
> > > let's table configuration of metadata aside for a second.
> > > 
> > > Describing metedata layout in BTF allows NICs to disclose everything
> > > NIC has to users in a standard and generic way.
> > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > is a separate discussion.
> > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > to reach 'hash' value in metadata.
> > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > So bpf program would need two loads to read csum or hash field instead of one.
> > > With BTF the layout of metadata is known to the program at build-time.
> > > 
> > > To reiterate the proposal:
> > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > >   or driver can read it from firmware)
> > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > >   generate normal C header file based on BTF in the given NIC
> > > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> > >   with direct single load out of metadata area in front of the packet
> > > - llvm compiles bpf program and records how program is doing this md->csum accesses
> > >   in BTF format as well (the compiler will be keeping such records
> > >   for __sk_buff and all other structs too, but that's separate discussion)
> > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > >   accesses metadata (and other structs) matches BTF from the driver,
> > >   so no surprises if driver+firmware got updated, but program is not recompiled
> > > - every NIC can have their own layout of metadata and its own meaning of the fields,
> > >   but would be good to standardize at least a few common fields like hash  
> > 
> > Can I expose HW descriptors this way, though, or is the proposal to
> > copy this data into the packet buffer?  
> 
> That crossed my mind too. We can use BTF to describe HW descriptors too,
> but I don't think it would buy us anything. AF_XDP approach is better.
>
> > > Once this is working we can do more cool things with BTF.
> > > Like doing offset rewriting at program load time similar to what we plan
> > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > depending on BTF for the kernel.
> > > The same trick we can do for networking metadata.
> > > The program will contain load instruction md->hash that will get automatically
> > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > For now I'm proposing _not_ to go that far with offset rewriting and start
> > > with simple steps described above.  
> > 
> > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > impacting fast paths of the drivers yet again?  This rewrite should be
> > easier than task->pid, because we have the synthetic user space struct
> > xdp_md definition.  
> 
> I don't understand 'impact fast path yet again' concern.
> If NIC has certain metadata today, just derscribe what it has in BTF
> and supply the actual per-packet md to xdp program as-is.
> No changes for fast path at all.
> Future rewritting will be done by the bpf/xdp core with zero
> changes to the driver. All driver has to do is to provide BTF.

I'm confused.  AFAIK most *existing* NICs have the metadata in the
"descriptor", i.e. not in the packet buffer.  So if the NIC just
describes what it has, and there is no data shuffling/copying
(performance) then we have to expose the descriptor, no?

In case of NFP half of the metadata is in the packet buffer, half in
the descriptor.

Maybe I'm conflating your proposal with Saeed's.  In your proposal
would we still use the overload of metadata prepend (as in struct
xdp_md::data_meta, bpf_xdp_adjust_meta() etc.) or are we adding a
pointer to a new driver-defined struct inside struct xdp_md?

Thanks for bearing with me!
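
For reference, a minimal sketch of the existing metadata-prepend model
referred to above.  struct my_hints is a placeholder for whatever layout a
driver/BTF description would provide; only xdp_md::data_meta,
bpf_xdp_adjust_meta() and the bounds-check pattern are existing pieces
(assuming libbpf's bpf_helpers.h).

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct my_hints {		/* placeholder layout */
	__u32 hash;
};

SEC("xdp")
int read_hints(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct my_hints *md = data_meta;

	/* Bounds check required by the verifier; also catches "no metadata". */
	if ((void *)(md + 1) > data)
		return XDP_PASS;

	return md->hash & 1 ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";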

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:49                 ` Waskiewicz Jr, Peter
@ 2018-07-07  0:40                   ` Jakub Kicinski
  2018-07-07  1:00                     ` Alexei Starovoitov
  2018-07-07  0:45                   ` Alexei Starovoitov
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  0:40 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: alexei.starovoitov, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, 6 Jul 2018 23:49:55 +0000, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:  
> > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:  
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW
> > > > > can
> > > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > > that
> > > > > future ASICs/FW will be designed to always use the "blessed"
> > > > > BTF
> > > > > format?  Or will it be reconfigurable at runtime?  
> > > > 
> > > > let's table configuration of metadata aside for a second.  
> > > 
> > > I agree that this should/could be NIC-specific and shouldn't weigh
> > > on
> > > the metadata interface between the drivers and XDP layer.
> > >   
> > > > Describing metedata layout in BTF allows NICs to disclose
> > > > everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF
> > > > description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside
> > > > 'struct
> > > > xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of
> > > > build-
> > > > time.
> > > > So bpf program would need two loads to read csum or hash field
> > > > instead of one.
> > > > With BTF the layout of metadata is known to the program at build-
> > > > time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format
> > > > (either
> > > > in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > > query
> > > > the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md-  
> > > > >csum  
> > > > or md->hash
> > > >   with direct single load out of metadata area in front of the
> > > > packet  
> > > 
> > > This piece is where I'd like to discuss more.  When we discussed
> > > this
> > > in Seoul, the initial proposal was a static struct that we'd try to
> > > hammer out a common layout between the interested parties.  That
> > > obviously wasn't going to scale, and we wanted to pursue something
> > > more
> > > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > > want
> > > to care what the underlying device is, and could just ask for
> > > metadata
> > > that it's interested in.  With this approach, your eBPF program is
> > > now
> > > bound/tied to the NIC/driver, and if you switched to a differen
> > > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > > program
> > > to comprehend that.  I thought we were trying to avoid that.  
> > 
> > It looks to me that NICs have a lot more differences instead of
> > common pieces.
> > If we focus the architecture on making common things as generic
> > as possible we miss the bigger picture of covering distinct
> > and unique features that hw provides.
> > 
> > In the last email when I said "would be good to standardize at least
> > a few common fields"
> > I meant that it make sense only at the later phases.
> > I don't think we should put things into uapi upfront.
> > Otherwise we will end up with likely useless fields.
> > Like vlan field. Surely a lot of nics can store it into metadata, but
> > why?

FWIW vlan is extracted from the packet because of cBPF uABI.  tcpdump
breaks badly if VLAN header is in the packet buffer.  It may not be
entirely without merit to expose the field for writing, perhaps that can
simplify some EVPN use cases?  But that's most likely off-topic.

> > bpf program can easily read it from the packet.
> > What's the benefit of adding it to md?
> > This metadata concept we're discussing in the context of program
> > acceleration.
> > If metadata doesn't give perf benefits it should not be done by
> > firmware, it should not be supported by the driver and certainly
> > should not be part of uapi.
> > Hence I'd like to see pure BTF description without any
> > common/standard
> > fields across the drivers and figure out what (if anything) is useful
> > to accelerate xdp programs (and af_xdp in the future).  
> 
> I guess I didn't write my response very well before, apologies.  I
> actually really like the BTF description and the dynamic-ness of it. 
> It solves lots of problems and keeps things out of the UAPI.
> 
> What I was getting at was using the BTF description and dumping a
> header file to be used in the program to be loaded.  If we have an
> application developer include that header, it will be the Intel BTF
> description, or Mellanox BTF description, etc.  I'm fine with that, but
> it makes your program not as portable, since you'd need to get a
> different header and recompile it if you change the NIC.  Unless I'm
> missing something.

Could we just say that at the start we expose only existing SKB fields
(csum, hash, mark) and the meaning of them is the same as in the SKB?
That covers a lot of use cases and provides clear semantics (at least
to the driver developer) while naturally allowing the fields to be
communicated from XDP to the stack with XDP_PASS?  It covers all Saeed did
in his RFC, for example.
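
A minimal sketch of what such an "existing SKB fields only" layout could
look like; the struct name and ordering are assumptions, only the field
meanings follow the SKB semantics referred to above.

#include <linux/types.h>

struct xdp_skb_hints {
	__u32 hash;	/* same meaning as skb->hash */
	__u32 mark;	/* same meaning as skb->mark, e.g. set by an offloaded TC rule */
	__u32 csum;	/* CHECKSUM_COMPLETE-style sum over the packet */
};

On XDP_PASS a driver could then transfer these straight into the
corresponding skb fields, which is the "communicate the fields from XDP to
the stack" point above.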

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 23:49                 ` Waskiewicz Jr, Peter
  2018-07-07  0:40                   ` Jakub Kicinski
@ 2018-07-07  0:45                   ` Alexei Starovoitov
  1 sibling, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  0:45 UTC (permalink / raw)
  To: Waskiewicz Jr, Peter
  Cc: Duyck, Alexander H, daniel, saeedm, brouer, borkmann, tariqt,
	john.fastabend, jakub.kicinski, netdev, saeedm

On Fri, Jul 06, 2018 at 11:49:55PM +0000, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 16:38 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> > > On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW
> > > > > can
> > > > > reshuffle things whatever way it wants to.  Is the assumption
> > > > > that
> > > > > future ASICs/FW will be designed to always use the "blessed"
> > > > > BTF
> > > > > format?  Or will it be reconfigurable at runtime?
> > > > 
> > > > let's table configuration of metadata aside for a second.
> > > 
> > > I agree that this should/could be NIC-specific and shouldn't weigh
> > > on
> > > the metadata interface between the drivers and XDP layer.
> > > 
> > > > Describing metedata layout in BTF allows NICs to disclose
> > > > everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF
> > > > description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside
> > > > 'struct
> > > > xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of
> > > > build-
> > > > time.
> > > > So bpf program would need two loads to read csum or hash field
> > > > instead of one.
> > > > With BTF the layout of metadata is known to the program at build-
> > > > time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format
> > > > (either
> > > > in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will
> > > > query
> > > > the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md-
> > > > >csum
> > > > or md->hash
> > > >   with direct single load out of metadata area in front of the
> > > > packet
> > > 
> > > This piece is where I'd like to discuss more.  When we discussed
> > > this
> > > in Seoul, the initial proposal was a static struct that we'd try to
> > > hammer out a common layout between the interested parties.  That
> > > obviously wasn't going to scale, and we wanted to pursue something
> > > more
> > > dynamic.  But I thought the goal was the XDP/eBPF program wouldn't
> > > want
> > > to care what the underlying device is, and could just ask for
> > > metadata
> > > that it's interested in.  With this approach, your eBPF program is
> > > now
> > > bound/tied to the NIC/driver, and if you switched to a differen
> > > NIC/driver combo, then you'd have to rewrite part of your eBPF
> > > program
> > > to comprehend that.  I thought we were trying to avoid that.
> > 
> > It looks to me that NICs have a lot more differences instead of
> > common pieces.
> > If we focus the architecture on making common things as generic
> > as possible we miss the bigger picture of covering distinct
> > and unique features that hw provides.
> > 
> > In the last email when I said "would be good to standardize at least
> > a few common fields"
> > I meant that it make sense only at the later phases.
> > I don't think we should put things into uapi upfront.
> > Otherwise we will end up with likely useless fields.
> > Like vlan field. Surely a lot of nics can store it into metadata, but
> > why?
> > bpf program can easily read it from the packet.
> > What's the benefit of adding it to md?
> > This metadata concept we're discussing in the context of program
> > acceleration.
> > If metadata doesn't give perf benefits it should not be done by
> > firmware, it should not be supported by the driver and certainly
> > should not be part of uapi.
> > Hence I'd like to see pure BTF description without any
> > common/standard
> > fields across the drivers and figure out what (if anything) is useful
> > to accelerate xdp programs (and af_xdp in the future).
> 
> I guess I didn't write my response very well before, apologies.  I
> actually really like the BTF description and the dynamic-ness of it. 
> It solves lots of problems and keeps things out of the UAPI.
> 
> What I was getting at was using the BTF description and dumping a
> header file to be used in the program to be loaded.  If we have an
> application developer include that header, it will be the Intel BTF
> description, or Mellanox BTF description, etc.  I'm fine with that, but
> it makes your program not as portable, since you'd need to get a
> different header and recompile it if you change the NIC.  Unless I'm
> missing something.

Correct. It makes programs not portable.
That is phase one to figure out what's useful.
Rewriting of offsets and figuring out which fields should be
standardized in uapi is a second phase to make those programs
actually portable.

Once it becomes obvious that feature X of vendor Y is useful
to accelerate a particular well-known production xdp application,
I suspect other vendors will find a way to tweak their firmware
to provide the same feature. Right now vendor <-> customer discussions
are somewhat baseless. Vendors believe that what they have in firmware
is useful, but sw owners maybe don't see it this way and don't have
a way to try it and cannot prove it one way or the other.
This will help us get unstuck: expose features that nics have
while not baking anything into uapi.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:08                 ` Jakub Kicinski
@ 2018-07-07  0:53                   ` Alexei Starovoitov
  2018-07-07  1:37                     ` David Miller
  2018-07-07  1:44                     ` Jakub Kicinski
  0 siblings, 2 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  0:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:
> > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:  
> > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:  
> > > > > 
> > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > format?  Or will it be reconfigurable at runtime?    
> > > > 
> > > > let's table configuration of metadata aside for a second.
> > > > 
> > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > NIC has to users in a standard and generic way.
> > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > > is a separate discussion.
> > > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > > to reach 'hash' value in metadata.
> > > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > > So bpf program would need two loads to read csum or hash field instead of one.
> > > > With BTF the layout of metadata is known to the program at build-time.
> > > > 
> > > > To reiterate the proposal:
> > > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > > >   or driver can read it from firmware)
> > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > > >   generate normal C header file based on BTF in the given NIC
> > > > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> > > >   with direct single load out of metadata area in front of the packet
> > > > - llvm compiles bpf program and records how program is doing this md->csum accesses
> > > >   in BTF format as well (the compiler will be keeping such records
> > > >   for __sk_buff and all other structs too, but that's separate discussion)
> > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > > >   accesses metadata (and other structs) matches BTF from the driver,
> > > >   so no surprises if driver+firmware got updated, but program is not recompiled
> > > > - every NIC can have their own layout of metadata and its own meaning of the fields,
> > > >   but would be good to standardize at least a few common fields like hash  
> > > 
> > > Can I expose HW descriptors this way, though, or is the proposal to
> > > copy this data into the packet buffer?  
> > 
> > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > but I don't think it would buy us anything. AF_XDP approach is better.
> >
> > > > Once this is working we can do more cool things with BTF.
> > > > Like doing offset rewriting at program load time similar to what we plan
> > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > > depending on BTF for the kernel.
> > > > The same trick we can do for networking metadata.
> > > > The program will contain load instruction md->hash that will get automatically
> > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > > For now I'm proposing _not_ to go that far with offset rewriting and start
> > > > with simple steps described above.  
> > > 
> > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > easier than task->pid, because we have the synthetic user space struct
> > > xdp_md definition.  
> > 
> > I don't understand 'impact fast path yet again' concern.
> > If NIC has certain metadata today, just derscribe what it has in BTF
> > and supply the actual per-packet md to xdp program as-is.
> > No changes for fast path at all.
> > Future rewritting will be done by the bpf/xdp core with zero
> > changes to the driver. All driver has to do is to provide BTF.
> 
> I'm confused.  AFAIK most *existing* NICs have the metadata in the
> "descriptor", i.e. not in the packet buffer.  So if the NIC just
> describes what it has, and there is no data shuffling/copying
> (performance) then we have to expose the descriptor, no?

which piece of sw put that data into the descriptor?
I bet it's firmware. It could have stored it into pre-packet data, no?
I'd like to avoid _all_ copies.
Right now xdp program can only see a pointer to packet and pre-packet.
If we need another pointer to a piece of the packet descriptor,
that's also fine. Both pre-packet metadata and pieces of descriptor
can be described in BTF.
I'd like to push back on firmware folks that should be listening
to feedback from driver folks and kernel stack instead of saying
'here is hw spec that firmware provides'. Firmware is software.
It can change and should be open to change by the community
with proper maintainership.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:40                   ` Jakub Kicinski
@ 2018-07-07  1:00                     ` Alexei Starovoitov
  2018-07-07  1:20                       ` Jakub Kicinski
  0 siblings, 1 reply; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  1:00 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Waskiewicz Jr, Peter, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> 
> Could we just say that at the start we expose only existing SKB fields
> (csum, hash, mark) and the meaning of the is the same as in the SKB?

what would be the meaning of 'hash'? Over which fields?
Does it support inner and outer packets? How about udp encap (vxlan and friends)?
Same question for csum... tcp only? how about crc for sctp?
What is 'mark'?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  1:00                     ` Alexei Starovoitov
@ 2018-07-07  1:20                       ` Jakub Kicinski
  2018-07-07  2:38                         ` Alexei Starovoitov
  0 siblings, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  1:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Waskiewicz Jr, Peter, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> > 
> > Could we just say that at the start we expose only existing SKB fields
> > (csum, hash, mark) and the meaning of the is the same as in the SKB?  
> 
> what would be the meaning of 'hash' ? Over which fields?
> Does it support inner and outer packets? How about udp encap (vxlan and friends) ?

We don't seem to need to answer that for the rest of the stack, no?  We
can expose the "hash type" field as well if that's *really* necessary.

> Same question of csum... tcp only? 

Shouldn't we just stick to CHECKSUM_COMPLETE?

> how about crc for sctp?

That's harder to answer.  Can cls_bpf access such info?

> What is 'mark' ?

Same thing it would be on an skb.  Most likely set with an offloaded TC
rule?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-06 16:30           ` Alexei Starovoitov
  2018-07-06 20:44             ` Waskiewicz Jr, Peter
  2018-07-06 21:33             ` Jakub Kicinski
@ 2018-07-07  1:27             ` David Miller
  2 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2018-07-07  1:27 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: jakub.kicinski, daniel, saeedm, saeedm, alexander.h.duyck,
	netdev, tariqt, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Fri, 6 Jul 2018 09:30:42 -0700

> Like doing offset rewriting at program load time similar to what we plan
> to do for tracing. Tracing programs will be doing 'task->pid' access
> and the kernel will adjust offsetof(struct task_struct, pid) during program load
> depending on BTF for the kernel.
> The same trick we can do for networking metadata.
> The program will contain load instruction md->hash that will get automatically
> adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> For now I'm proposing _not_ to go that far with offset rewriting and start
> with simple steps described above.

+1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:53                   ` Alexei Starovoitov
@ 2018-07-07  1:37                     ` David Miller
  2018-07-07  1:44                     ` Jakub Kicinski
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2018-07-07  1:37 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: jakub.kicinski, daniel, saeedm, saeedm, alexander.h.duyck,
	netdev, tariqt, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Fri, 6 Jul 2018 17:53:18 -0700

> I'd like to push back on firmware folks that should be listening
> to feedback from driver folks and kernel stack instead of saying
> 'here is hw spec that firmware provides'. Firmware is software.
> It can change and should be open to change by the community
> with proper maintainership.

+1

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  0:53                   ` Alexei Starovoitov
  2018-07-07  1:37                     ` David Miller
@ 2018-07-07  1:44                     ` Jakub Kicinski
  2018-07-07  2:51                       ` Alexei Starovoitov
  1 sibling, 1 reply; 36+ messages in thread
From: Jakub Kicinski @ 2018-07-07  1:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote:
> On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:  
> > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:  
> > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:    
> > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:    
> > > > > > 
> > > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > > format?  Or will it be reconfigurable at runtime?      
> > > > > 
> > > > > let's table configuration of metadata aside for a second.
> > > > > 
> > > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > > NIC has to users in a standard and generic way.
> > > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > > > is a separate discussion.
> > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > > > to reach 'hash' value in metadata.
> > > > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > > > So bpf program would need two loads to read csum or hash field instead of one.
> > > > > With BTF the layout of metadata is known to the program at build-time.
> > > > > 
> > > > > To reiterate the proposal:
> > > > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > > > >   or driver can read it from firmware)
> > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > > > >   generate normal C header file based on BTF in the given NIC
> > > > > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash
> > > > >   with direct single load out of metadata area in front of the packet
> > > > > - llvm compiles bpf program and records how program is doing this md->csum accesses
> > > > >   in BTF format as well (the compiler will be keeping such records
> > > > >   for __sk_buff and all other structs too, but that's separate discussion)
> > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > > > >   accesses metadata (and other structs) matches BTF from the driver,
> > > > >   so no surprises if driver+firmware got updated, but program is not recompiled
> > > > > - every NIC can have their own layout of metadata and its own meaning of the fields,
> > > > >   but would be good to standardize at least a few common fields like hash    
> > > > 
> > > > Can I expose HW descriptors this way, though, or is the proposal to
> > > > copy this data into the packet buffer?    
> > > 
> > > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > > but I don't think it would buy us anything. AF_XDP approach is better.
> > >  
> > > > > Once this is working we can do more cool things with BTF.
> > > > > Like doing offset rewriting at program load time similar to what we plan
> > > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > > > depending on BTF for the kernel.
> > > > > The same trick we can do for networking metadata.
> > > > > The program will contain load instruction md->hash that will get automatically
> > > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC.
> > > > > For now I'm proposing _not_ to go that far with offset rewriting and start
> > > > > with simple steps described above.    
> > > > 
> > > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > > easier than task->pid, because we have the synthetic user space struct
> > > > xdp_md definition.    
> > > 
> > > I don't understand 'impact fast path yet again' concern.
> > > If NIC has certain metadata today, just derscribe what it has in BTF
> > > and supply the actual per-packet md to xdp program as-is.
> > > No changes for fast path at all.
> > > Future rewritting will be done by the bpf/xdp core with zero
> > > changes to the driver. All driver has to do is to provide BTF.  
> > 
> > I'm confused.  AFAIK most *existing* NICs have the metadata in the
> > "descriptor", i.e. not in the packet buffer.  So if the NIC just
> > describes what it has, and there is no data shuffling/copying
> > (performance) then we have to expose the descriptor, no?  
> 
> which piece of sw put that data into desciptor ?
> I bet it's firmware. It could have stored it into pre-packet data, no?
> I'd like to avoid _all_ copies.
> Right now xdp program can only see a pointer to packet and pre-packet.
> If we need another pointer to a piece of the packet descriptor,
> that's also fine. Both pre-packet metadata and pieces of descriptor
> can be described in BTF.

Okay, if we expose another pointer then it's possible to avoid copies.

But please keep in mind that descriptors are very compact, there are
a lot of interdependencies between fields, and the fields can shift
depending on the type of packet.  HW/FW guys always quote the 64B
packet performance as a reason why things can't be simple.  We can't
consume 50% of PCIe bandwidth to DMA the metadata alone.

> I'd like to push back on firmware folks that should be listening
> to feedback from driver folks and kernel stack instead of saying
> 'here is hw spec that firmware provides'. Firmware is software.
> It can change and should be open to change by the community
> with proper maintainership.

I think the way to influence FW/HW is to provide a strong and well
justified standard, and if we just expose raw HW data we are doing 
just the opposite.  We can claim the BTF a driver provides is not uABI, but
I will personally no longer feel comfortable with modifying descriptor
formats (and I've done that for the NFP in the past - grep for
chained_metadata_format).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  1:20                       ` Jakub Kicinski
@ 2018-07-07  2:38                         ` Alexei Starovoitov
  0 siblings, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  2:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Waskiewicz Jr, Peter, Duyck, Alexander H, daniel, saeedm, brouer,
	borkmann, tariqt, john.fastabend, netdev, saeedm

On Fri, Jul 06, 2018 at 06:20:47PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 18:00:13 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 05:40:43PM -0700, Jakub Kicinski wrote:
> > > 
> > > Could we just say that at the start we expose only existing SKB fields
> > > (csum, hash, mark) and the meaning of the is the same as in the SKB?  
> > 
> > what would be the meaning of 'hash' ? Over which fields?
> > Does it support inner and outer packets? How about udp encap (vxlan and friends) ?
> 
> We don't seem to need to answer that for the rest of the stack, no?  We
> can expose the "hash type" field as well if that's *really* necessary.
> 
> > Same question of csum... tcp only? 
> 
> Shouldn't we just stick to CHECKSUM_COMPLETE?

I don't yet see how checksum_complete and 'hash just like stack'
help to accelerate xdp programs like ddos protection and load balancers.
'hash like stack' may help to accelerate networking stack itself,
but that's very different discussion. There is no xdp in such case.
No programs and no metadata. If some NICs can do
'hash like stack' and demonstrate performance improvements
for regular tcp/ip processing with netperf in user space
that would definitely be interesting and standardizing such
hash from the drivers into upper layers would warrant its own patches,
but seems orthogonal to this xdp hints discussion.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
  2018-07-07  1:44                     ` Jakub Kicinski
@ 2018-07-07  2:51                       ` Alexei Starovoitov
  0 siblings, 0 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2018-07-07  2:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Daniel Borkmann, Saeed Mahameed, saeedm, alexander.h.duyck,
	netdev, Tariq Toukan, john.fastabend, brouer, borkmann,
	peter.waskiewicz.jr

On Fri, Jul 06, 2018 at 06:44:24PM -0700, Jakub Kicinski wrote:
> On Fri, 6 Jul 2018 17:53:18 -0700, Alexei Starovoitov wrote:
> > On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote:
> > > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote:  
> > > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote:  
> > > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote:    
> > > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:    
> > > > > > > 
> > > > > > > I'm also not 100% on board with the argument that "future" FW can
> > > > > > > reshuffle things whatever way it wants to.  Is the assumption that
> > > > > > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > > > > > format?  Or will it be reconfigurable at runtime?      
> > > > > > 
> > > > > > let's table configuration of metadata aside for a second.
> > > > > > 
> > > > > > Describing metedata layout in BTF allows NICs to disclose everything
> > > > > > NIC has to users in a standard and generic way.
> > > > > > Whether firmware is reconfigurable on the fly or has to reflashed
> > > > > > and hw powercycled to have new md layout (and corresponding BTF description)
> > > > > > is a separate discussion.
> > > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info'
> > > > > > to reach 'hash' value in metadata.
> > > > > > Essentially it's a run-time way to access 'hash' instead of build-time.
> > > > > > So bpf program would need two loads to read csum or hash field instead of one.
> > > > > > With BTF the layout of metadata is known to the program at build-time.
> > > > > > 
> > > > > > To reiterate the proposal:
> > > > > > - driver+firmware keep layout of the metadata in BTF format (either in the driver
> > > > > >   or driver can read it from firmware)
> > > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and
> > > > > >   generate normal C header file based on BTF in the given NIC
> > > > > > - user does #include "md_desc.h" and the bpf program can access md->csum or md->hash
> > > > > >   with a direct single load out of the metadata area in front of the packet
> > > > > > - llvm compiles the bpf program and records how the program does these md->csum accesses
> > > > > >   in BTF format as well (the compiler will be keeping such records
> > > > > >   for __sk_buff and all other structs too, but that's a separate discussion)
> > > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program
> > > > > >   accesses metadata (and other structs) matches the BTF from the driver,
> > > > > >   so there are no surprises if driver+firmware got updated but the program was not recompiled
> > > > > > - every NIC can have its own layout of metadata and its own meaning of the fields,
> > > > > >   but it would be good to standardize at least a few common fields like hash    
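
As a rough sketch of what that flow could look like from the program
author's side: the field names, widths and ordering below are invented for
the example; in the proposal they would come from the NIC's BTF via the
bpftool command above.

#include <linux/bpf.h>

/* md_desc.h -- hypothetical output of
 * 'bpftool read-metadata-desc eth0 > md_desc.h' for some NIC.
 */
struct xdp_md_desc {
	__u32 hash;
	__u16 csum;
	__u16 vlan_tci;
};

/* Program side: every field is one direct load at a build-time-known
 * offset from data_meta.
 */
__attribute__((section("xdp"), used))
int use_hints(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	struct xdp_md_desc *md = data_meta;

	if ((void *)(md + 1) > data)	/* hints missing or truncated */
		return XDP_PASS;

	__u32 hash = md->hash;		/* single load */
	__u16 csum = md->csum;		/* single load */
	/* ... use hash/csum ... */
	return XDP_PASS;
}
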
> > > > > 
> > > > > Can I expose HW descriptors this way, though, or is the proposal to
> > > > > copy this data into the packet buffer?    
> > > > 
> > > > That crossed my mind too. We can use BTF to describe HW descriptors too,
> > > > but I don't think it would buy us anything. AF_XDP approach is better.
> > > >  
> > > > > > Once this is working we can do more cool things with BTF.
> > > > > > Like doing offset rewriting at program load time similar to what we plan
> > > > > > to do for tracing. Tracing programs will be doing 'task->pid' access
> > > > > > and the kernel will adjust offsetof(struct task_struct, pid) during program load
> > > > > > depending on BTF for the kernel.
> > > > > > The same trick we can do for networking metadata.
> > > > > > The program will contain a load instruction for md->hash that will get automatically
> > > > > > adjusted to the proper offset depending on the BTF of the 'hash' field in the NIC.
> > > > > > For now I'm proposing _not_ to go that far with offset rewriting and to start
> > > > > > with the simple steps described above.    
> > > > > 
> > > > > Why? :(  Could we please go with the rewrite/driver helpers instead of
> > > > > impacting fast paths of the drivers yet again?  This rewrite should be
> > > > > easier than task->pid, because we have the synthetic user space struct
> > > > > xdp_md definition.    
> > > > 
> > > > I don't understand the 'impact fast path yet again' concern.
> > > > If the NIC has certain metadata today, just describe what it has in BTF
> > > > and supply the actual per-packet md to the xdp program as-is.
> > > > No changes to the fast path at all.
> > > > Future rewriting will be done by the bpf/xdp core with zero
> > > > changes to the driver. All the driver has to do is provide BTF.  
> > > 
> > > I'm confused.  AFAIK most *existing* NICs have the metadata in the
> > > "descriptor", i.e. not in the packet buffer.  So if the NIC just
> > > describes what it has, and there is no data shuffling/copying
> > > (performance) then we have to expose the descriptor, no?  
> > 
> > Which piece of sw put that data into the descriptor?
> > I bet it's firmware. It could have stored it in the pre-packet data, no?
> > I'd like to avoid _all_ copies.
> > Right now the xdp program can only see pointers to the packet and the pre-packet area.
> > If we need another pointer to a piece of the packet descriptor,
> > that's also fine. Both pre-packet metadata and pieces of the descriptor
> > can be described in BTF.
> 
> Okay, if we expose another pointer then it's possible to avoid copies.
> 
> But please keep in mind that descriptors are very compact, there are
> a lot of interdependencies between fields, and the fields can shift
> depending on the type of packet.  HW/FW guys always quote the 64B
> packet performance as a reason why things can't be simple.  We can't
> consume 50% of PCIe bandwidth to DMA the metadata alone.

Certainly. With an l4 load balancer we're after short-packet performance too.

> > I'd like to push back on firmware folks, who should be listening
> > to feedback from driver folks and the kernel stack instead of saying
> > 'here is the hw spec that firmware provides'. Firmware is software.
> > It can change and should be open to change by the community
> > with proper maintainership.
> 
> I think the way to influence FW/HW is to provide a strong and
> well-justified standard, and if we just expose raw HW data we are doing
> just the opposite.  We can claim the BTF a driver provides is not uABI, but
> I will personally no longer feel comfortable with modifying descriptor
> formats (and I've done that for the NFP in the past - grep for
> chained_metadata_format).

I think AF_XDP is doing what you're asking for. It defined a standard
descriptor format. Metadata for the packet is use-case specific.
What is useful for an l4 load balancer could be a waste of pcie bandwidth
for regular tcp/ip. I'm not proposing to have it always on.

What I'm proposing with BTF is that, instead of doing nfp_net_parse_meta()
for this chained format in the nfp driver, we expose it to the xdp program.
Let the program deal with it when it wants to. Would that work?
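
For the sake of discussion, a rough sketch of what "let the program deal
with it" could look like for a chained format. The layout below (a leading
32-bit word of packed 4-bit type tags followed by one 32-bit value per tag)
only approximates what nfp_net_parse_meta() handles today, and all constants
and names are invented for the example; endianness handling is also elided:

#include <linux/bpf.h>

/* invented for the sketch -- not the real NFP definitions */
#define META_FIELD_SIZE	4
#define META_FIELD_MASK	0xf
#define META_TYPE_HASH	1
#define META_TYPE_MARK	2

__attribute__((section("xdp"), used))
int parse_chained_meta(struct xdp_md *ctx)
{
	void *data_meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;
	__u32 *word = data_meta;
	__u32 types, hash = 0, mark = 0;
	int i;

	if ((void *)(word + 1) > data)
		return XDP_PASS;
	types = *word++;		/* leading word: packed type tags */

#pragma unroll				/* keep the loop bounded for the verifier */
	for (i = 0; i < 8; i++) {
		if (!types || (void *)(word + 1) > data)
			break;
		switch (types & META_FIELD_MASK) {
		case META_TYPE_HASH:
			hash = *word;
			break;
		case META_TYPE_MARK:
			mark = *word;
			break;
		default:
			break;		/* unknown field: skip its value */
		}
		word++;
		types >>= META_FIELD_SIZE;
	}
	/* ... use hash/mark ... */
	return XDP_PASS;
}
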


end of thread

Thread overview: 36+ messages
2018-06-27  2:46 [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP) Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 1/6] net: xdp: Add support for meta data flags requests Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure Saeed Mahameed
2018-06-27 14:15   ` Jesper Dangaard Brouer
2018-06-27 17:55     ` Saeed Mahameed
2018-07-02  8:01       ` Daniel Borkmann
2018-07-03 23:52         ` Saeed Mahameed
2018-07-03 23:01   ` Alexei Starovoitov
2018-07-04  0:57     ` Saeed Mahameed
2018-07-04  7:51       ` Daniel Borkmann
2018-07-05 17:18         ` Jakub Kicinski
2018-07-06 16:30           ` Alexei Starovoitov
2018-07-06 20:44             ` Waskiewicz Jr, Peter
2018-07-06 23:38               ` Alexei Starovoitov
2018-07-06 23:49                 ` Waskiewicz Jr, Peter
2018-07-07  0:40                   ` Jakub Kicinski
2018-07-07  1:00                     ` Alexei Starovoitov
2018-07-07  1:20                       ` Jakub Kicinski
2018-07-07  2:38                         ` Alexei Starovoitov
2018-07-07  0:45                   ` Alexei Starovoitov
2018-07-06 21:33             ` Jakub Kicinski
2018-07-06 23:42               ` Alexei Starovoitov
2018-07-07  0:08                 ` Jakub Kicinski
2018-07-07  0:53                   ` Alexei Starovoitov
2018-07-07  1:37                     ` David Miller
2018-07-07  1:44                     ` Jakub Kicinski
2018-07-07  2:51                       ` Alexei Starovoitov
2018-07-07  1:27             ` David Miller
2018-06-27  2:46 ` [RFC bpf-next 3/6] net/mlx5e: Store xdp flags and meta data info Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 4/6] net/mlx5e: Pass CQE to RX handlers Saeed Mahameed
2018-06-27  2:46 ` [RFC bpf-next 5/6] net/mlx5e: Add XDP RX meta data support Saeed Mahameed
2018-07-04  8:28   ` Daniel Borkmann
2018-06-27  2:46 ` [RFC bpf-next 6/6] samples/bpf: Add meta data hash example to xdp_redirect_cpu Saeed Mahameed
2018-06-27 10:59   ` Jesper Dangaard Brouer
2018-06-27 18:04     ` Saeed Mahameed
2018-06-27 16:42 ` [RFC bpf-next 0/6] XDP RX device meta data acceleration (WIP) Parikh, Neerav
