From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jesper Dangaard Brouer Subject: [net-next PATCH 2/4] doc/bpf: document interacting with eBPF maps Date: Tue, 07 Feb 2017 15:30:22 +0100 Message-ID: <148647782206.10567.16235671715611188966.stgit@localhost> References: <148647756092.10567.10947541548678801938.stgit@localhost> Mime-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Cc: Alexei Starovoitov , alexander@alemayhu.com, linux-kernel@vger.kernel.org, quentin.monnet@6wind.com, Jesper Dangaard Brouer , Daniel Borkmann To: netdev@vger.kernel.org, linux-doc@vger.kernel.org, Jonathan Corbet Return-path: In-Reply-To: <148647756092.10567.10947541548678801938.stgit@localhost> Sender: linux-kernel-owner@vger.kernel.org List-Id: netdev.vger.kernel.org Describe what eBPF maps are, how to create them, and how to interact with them. Thanks to Quentin Monnet for improving this document by fixing many typos and early review. Signed-off-by: Jesper Dangaard Brouer --- Documentation/bpf/ebpf_maps.rst | 255 +++++++++++++++++++++++++++++++++++++++ Documentation/bpf/index.rst | 2 2 files changed, 257 insertions(+) create mode 100644 Documentation/bpf/ebpf_maps.rst diff --git a/Documentation/bpf/ebpf_maps.rst b/Documentation/bpf/ebpf_maps.rst new file mode 100644 index 000000000000..b8808f3bc31c --- /dev/null +++ b/Documentation/bpf/ebpf_maps.rst @@ -0,0 +1,255 @@ +========= +eBPF maps +========= + +This document describes what eBPF maps are, how you create them +(`Creating a map`_), and how to interact with them (`Interacting with +maps`_). + +Using eBPF maps is a method to keep state between invocations of the +eBPF program, and allows sharing data between eBPF kernel programs, +and also between kernel and user-space applications. + +Basically a key/value store with arbitrary structure (from man-page +`bpf(2)`_): + + eBPF maps are a generic data structure for storage of different data + types. Data types are generally treated as binary blobs, so a user + just specifies the size of the key and the size of the value at + map-creation time. In other words, a key/value for a given map can + have an arbitrary structure. + +The map handles are file descriptors, and multiple maps can be created +and accessed by multiple programs (from man-page `bpf(2)`_): + + A user process can create multiple maps (with key/value-pairs being + opaque bytes of data) and access them via file descriptors. + Different eBPF programs can access the same maps in parallel. It's + up to the user process and eBPF program to decide what they store + inside maps. + +.. _`Creating a map`: + +Creating a map +============== + +A map is created based on a request from userspace, via the `bpf`_ +syscall (specifically `bpf_cmd`_ BPF_MAP_CREATE), which returns a new +file descriptor that refers to the map. On error, -1 is returned and +errno is set to EINVAL, EPERM, or ENOMEM. These are the struct +``bpf_attr`` setup arguments to use when creating a map via the +syscall: + +.. code-block:: c + + bpf(BPF_MAP_CREATE, &bpf_attr, sizeof(bpf_attr)); + +Notice how this kernel ABI is extensible, as more struct arguments can +easily be added later as the sizeof(bpf_attr) is passed along to the +syscall. This also implies that API users must clear/zero +sizeof(bpf_attr), as compiler can size-align the struct differently, +to avoid garbage data to be interpreted as parameters by future +kernels. + +The following configuration attributes are needed when creating the map: + +.. code-block:: c + + union bpf_attr { + struct { /* anonymous struct used by BPF_MAP_CREATE command */ + __u32 map_type; /* one of enum bpf_map_type */ + __u32 key_size; /* size of key in bytes */ + __u32 value_size; /* size of value in bytes */ + __u32 max_entries; /* max number of entries in a map */ + __u32 map_flags; /* prealloc or not */ + }; + } + +.. _bpf_cmd: http://lxr.free-electrons.com/ident?i=bpf_cmd + + +Kernel sample/bpf ELF convention +-------------------------------- + +For programs under samples/bpf/, defining a map have been integrated +with ELF binary generated by LLVM. This is purely one example of a +userspace convention and not part of the kernel ABI. It still invokes +the bpf syscall. + +Map definitions are done by defining a ``struct bpf_map_def`` with an +elf section __attribute__ ``SEC("maps")``, in the xxx_kern.c file. +The maps file descriptor is available in the userspace xxx_user.c +file, via global array variable ``map_fd[]``, and the array map index +corresponds to the order the maps sections were defined in elf file of +xxx_kern.c file. Behind the scenes it is the ``load_bpf_file()`` call +(from `samples/bpf/bpf_load`_) that takes care of parsing ELF file +compiled by LLVM, pickup 'maps' section and creates maps via the bpf +syscall. + +.. code-block:: c + + struct bpf_map_def { + unsigned int type; + unsigned int key_size; + unsigned int value_size; + unsigned int max_entries; + unsigned int map_flags; + }; + + struct bpf_map_def SEC("maps") my_map = { + .type = BPF_MAP_TYPE_XXX, + .key_size = sizeof(u32), + .value_size = sizeof(u64), + .max_entries = 42, + .map_flags = 0 + }; + +.. section links + +.. _samples/bpf/bpf_load: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/bpf_load.c + +Qdisc Traffic Control convention +-------------------------------- + +It is worth mentioning, that qdisc TC (Traffic Control), also use ELF +files for defining the maps, but it uses another layout. See man-page +`tc-bpf(8)`_ and `tc bpf examples`_ in iproute2.git_ tree. + +.. _iproute2.git: + https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/about/ + +.. _tc bpf examples: + https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/tree/examples/bpf + +.. _tc-bpf(8): http://man7.org/linux/man-pages/man8/tc-bpf.8.html + + +Interacting with maps +===================== + +Interacting with eBPF maps happens through some **lookup/update/delete** +primitives. + +When writing eBFP programs using load helpers and libraries from +samples/bpf/ and tools/lib/bpf/. Common function name API have been +created that hides the details of how kernel vs. userspace access +these primitives (which is quite different). + +The common function names (parameters and return values differs): + +.. code-block:: c + + void bpf_map_lookup_elem(map, void *key. ...); + void bpf_map_update_elem(map, void *key, ..., __u64 flags); + void bpf_map_delete_elem(map, void *key); + +The ``flags`` argument in ``bpf_map_update_elem()`` allows to define +semantics on whether the element exists: + +.. code-block:: c + + /* File: include/uapi/linux/bpf.h */ + /* flags for BPF_MAP_UPDATE_ELEM command */ + #define BPF_ANY 0 /* create new element or update existing */ + #define BPF_NOEXIST 1 /* create new element only if it didn't exist */ + #define BPF_EXIST 2 /* only update existing element */ + +Userspace +--------- +The userspace API map helpers are defined in `tools/lib/bpf/bpf.h`_ +and looks like this: + +.. code-block:: c + + /* Userspace helpers */ + int bpf_map_lookup_elem(int fd, void *key, void *value); + int bpf_map_update_elem(int fd, void *key, void *value, __u64 flags); + int bpf_map_delete_elem(int fd, void *key); + /* Only userspace: */ + int bpf_map_get_next_key(int fd, void *key, void *next_key); + + +Interacting with an eBPF map from **userspace**, happens through the +`bpf`_ syscall and a file descriptor. See how the map handle ``int +fd`` is a file descriptor . On success, zero is returned, on +failures -1 is returned and errno is set. + +Wrappers for the bpf syscall is implemented in `tools/lib/bpf/bpf.c`_, +and ends up calling functions in `kernel/bpf/syscall.c`_, like +`map_lookup_elem`_. + +.. code-block:: c + + /* Corresponding syscall bpf commands from userspace */ + enum bpf_cmd { + [...] + BPF_MAP_LOOKUP_ELEM, + BPF_MAP_UPDATE_ELEM, + BPF_MAP_DELETE_ELEM, + BPF_MAP_GET_NEXT_KEY, + [...] + }; + +Notice how ``void *key`` and ``void *value`` are passed as a void +pointers. Given the memory seperation between kernel and userspace, +this is a copy of the value. Kernel primitives like +``copy_from_user()`` and ``copy_to_user()`` are used, e.g. see +`map_lookup_elem`_, which also kmalloc+kfree memory for a short +period. + +From userspace, there is no function call to atomically increment or +decrement the value 'in-place'. The bpf_map_update_elem() call will +overwrite the existing value, with a copy of the value supplied. +Depending on the map type, the overwrite will happen in an atomic way, +e.g. using locking mechanisms specific to the map type. + +.. section links + +.. _tools/lib/bpf/bpf.h: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf/bpf.h + +.. _tools/lib/bpf/bpf.c: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf/bpf.c + +.. _map_lookup_elem: http://lxr.free-electrons.com/ident?i=map_lookup_elem + +.. _kernel/bpf/syscall.c: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/syscall.c + + +Kernel-side eBPF program +------------------------ + +The API mapping for eBPF programs on the kernel-side is fairly hard to +follow. It related to `samples/bpf/bpf_helpers.h`_ and maps into +`kernel/bpf/helpers.c`_ via macros. + +.. code-block:: c + + /* eBPF program helpers */ + void *bpf_map_lookup_elem(void *map, void *key); + int bpf_map_update_elem(void *map, void *key, void *value, unsigned long long flags); + int bpf_map_delete_elem(void *map, void *key); + +The eBPF-program running kernel-side interacts more directly with the +map data structures. For example the call ``bpf_map_lookup_elem()`` +returns a direct pointer to the 'value' memory-element inside the +kernel (while userspace gets a copy). This allows the eBPF-program to +atomically increment or decrement the value 'in-place', by using +appropiate compiler primitives like ``__sync_fetch_and_add()``, which +is understood by LLVM when generating eBPF instructions. + +.. section links + +.. _samples/bpf/bpf_helpers.h: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/samples/bpf/bpf_helpers.h + +.. _kernel/bpf/helpers.c: + https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/helpers.c + +.. links + +.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html + +.. _bpf: http://man7.org/linux/man-pages/man2/bpf.2.html diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index f262fe8f9f95..738d17aaf6d7 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -42,6 +42,8 @@ covered by this documentation. .. toctree:: :maxdepth: 1 + ebpf_maps + .. links: .. _article 1992: http://www.tcpdump.org/papers/bpf-usenix93.pdf