All of lore.kernel.org
 help / color / mirror / Atom feed
* FW: ebpf-docs: draft of ISA doc updates in progress
       [not found] <CY5PR21MB377000AC95B475C47B702293A3439@CY5PR21MB3770.namprd21.prod.outlook.com>
@ 2022-09-13  8:12 ` Dave Thaler
  2022-09-14  6:22   ` Shung-Hsi Yu
  2022-09-19 16:58   ` Christoph Hellwig
  0 siblings, 2 replies; 16+ messages in thread
From: Dave Thaler @ 2022-09-13  8:12 UTC (permalink / raw)
  To: bpf

Resending since the list apparently never got the emails I sent last Friday...

From: Dave Thaler 
Sent: Friday, September 9, 2022 9:36 PM
To: bpf <bpf@vger.kernel.org>
Subject: ebpf-docs: draft of ISA doc updates in progress

I've been working on ISA docs, with review by Jim Harris, Quentin Monnet, and Alan
Jowett so wanted to post this before LPC when we can discuss progress and open
questions.

A rendered draft can be viewed at:
https://github.com/dthaler/ebpf-docs/blob/update/isa/kernel.org/instruction-set.rst

For people who prefer github format: https://github.com/dthaler/ebpf-docs/pull/4

For people who use format-patch format, see below.  The instruction-set.rst
changes are directly diffs against the kernel.org copy mirrored
to github for rendering via the link above.

Feedback welcome on the direction so far.

Thanks,
Dave


From 9db8d5bbada4f593f52078ca329702cccf22f656 Mon Sep 17 00:00:00 2001
From: Dave Thaler mailto:dthaler@microsoft.com
Date: Mon, 29 Aug 2022 12:04:06 -0700
Subject: [PATCH] Update ISA documentation

* Add section numbers
* Add glossary formatting for fields
* Use consistent field names
* Add appendix with opcode table
* Add ISA version information
* Add text about underflow, overflow, and division by zero
* Add other text based on info in other documents

Signed-off-by: Dave Thaler mailto:dthaler@microsoft.com
---
README.md                          |  27 +
isa/README.md                      |   2 +
isa/kernel.org/instruction-set.rst | 765 +++++++++++++++++++++--------
3 files changed, 587 insertions(+), 207 deletions(-)
create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..715496a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,27 @@
+# eBPF Standard Documentation
+
+This repository is a working draft of standard eBPF documentation
+to be published by the eBPF Foundation in PDF format.
+
+The authoritative source from which it is built is expected to be
+in the Linux kernel.org repository, but not be Linux specific.
+
+A GitHub mirror should be used, as is presently done for libbpf and
+bpftool, so that other platforms and tools can easily use it.
+As such, the documentation uses the subset of RST that GitHub
+renders correctly.
+
+The documentation can be IETF RFC style MUST/SHOULD/MAY language
+if desired.  It does not currently do so.
+
+## Questions
+
+1. Any objections to the github mirror approach?
+
+2. Should we include or exclude Linux implementation notes
+   and Clang implementation notes?
+
+3. How do we handle the legacy packet instructions?
+
+4. How do we handle the wide instructions that reference
+   map fd, map indices, BTF ids, and BPF callbacks?
diff --git a/isa/README.md b/isa/README.md
index 98c49da..60cc347 100644
--- a/isa/README.md
+++ b/isa/README.md
@@ -8,6 +8,8 @@ Some documentation links include:
* https://www.kernel.org/doc/Documentation/networking/filter.txt
* https://pchaigno.github.io/bpf/2021/10/20/ebpf-instruction-sets.html
* https://www.kernel.org/doc/html/latest/bpf/bpf_design_QA.html#instruction-level-questions
+* https://lore.kernel.org/bpf/8DA9E260-AE56-4B21-90BF-CF0049CFD04D@intel.com/
+* https://github.com/iovisor/bpf-docs/pull/26/files

 Some implementation links include:

diff --git a/isa/kernel.org/instruction-set.rst b/isa/kernel.org/instruction-set.rst
index 1b0e671..7f8ee00 100644
--- a/isa/kernel.org/instruction-set.rst
+++ b/isa/kernel.org/instruction-set.rst
@@ -1,8 +1,24 @@
+.. contents::
+.. sectnum::

 ====================
eBPF Instruction Set
====================

+The eBPF instruction set consists of eleven 64 bit registers, a program counter,
+and 512 bytes of stack space.
+
+Versions
+========
+
+The current Instruction Set Architecture (ISA) version, sometimes referred to in other documents
+as a "CPU" version, is 3.  This document also covers older versions of the ISA.
+
+   **Note**
+
+   *Clang implementation*: Clang can select the eBPF ISA version using
+   ``-mcpu=v2`` for example to select version 2.
+
Registers and calling convention
================================

@@ -11,198 +27,331 @@ all of which are 64-bits wide.

 The eBPF calling convention is defined as:

- * R0: return value from function calls, and exit value for eBPF programs
- * R1 - R5: arguments for function calls
- * R6 - R9: callee saved registers that function calls will preserve
- * R10: read-only frame pointer to access stack
+* R0: return value from function calls, and exit value for eBPF programs
+* R1 - R5: arguments for function calls
+* R6 - R9: callee saved registers that function calls will preserve
+* R10: read-only frame pointer to access stack
+
+Registers R0 - R5 are scratch registers, meaning the BPF program needs to either
+spill them to the BPF stack or move them to callee saved registers if these
+arguments are to be reused across multiple function calls. Spilling means
+that the value in the register is moved to the BPF stack. The reverse operation
+of moving the variable from the BPF stack to the register is called filling.
+The reason for spilling/filling is due to the limited number of registers.

-R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if
-necessary across calls.
+   **Note**
+
+   *Linux implementation*: In the Linux kernel, the exit value for eBPF
+   programs is passed as a 32 bit value.
+
+Upon entering execution of an eBPF program, registers R1 - R5 initially can contain
+the input arguments for the program (similar to the argc/argv pair for a typical C program).
+The actual number of registers used, and their meaning, is defined by the program type;
+for example, a networking program might have an argument that includes network packet data
+and/or metadata.
+
+   **Note**
+
+   *Linux implementation*: In the Linux kernel, all program types only use
+   R1 which contains the "context", which is typically a structure containing all
+   the inputs needed.  
 
 Instruction encoding
====================

+An eBPF program is a sequence of instructions.
+
eBPF has two instruction encodings:

- * the basic instruction encoding, which uses 64 bits to encode an instruction
- * the wide instruction encoding, which appends a second 64-bit immediate value
-   (imm64) after the basic instruction for a total of 128 bits.
+* the basic instruction encoding, which uses 64 bits to encode an instruction
+* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
+  constant) value after the basic instruction for a total of 128 bits.
+
+The basic instruction encoding is as follows:

-The basic instruction encoding looks as follows:
+=============  =======  ===============  ====================  ============
+32 bits (MSB)  16 bits  4 bits           4 bits                8 bits (LSB)
+=============  =======  ===============  ====================  ============
+imm            offset   src              dst                   opcode
+=============  =======  ===============  ====================  ============

- =============  =======  ===============  ====================  ============
- 32 bits (MSB)  16 bits  4 bits           4 bits                8 bits (LSB)
- =============  =======  ===============  ====================  ============
- immediate      offset   source register  destination register  opcode
- =============  =======  ===============  ====================  ============
+imm         
+  integer immediate value
+
+offset
+  signed integer offset used with pointer arithmetic
+
+src
+  source register number (0-10)
+
+dst
+  destination register number (0-10)
+
+opcode
+  operation to perform

 Note that most instructions do not use all of the fields.
-Unused fields shall be cleared to zero.
+Unused fields must be set to zero.
+
+As discussed below in `64-bit immediate instructions`_, some basic
+instructions denote that a 64-bit immediate value follows.  Thus
+the wide instruction encoding is as follows:
+
+=================  =============
+64 bits (MSB)      64 bits (LSB)
+=================  =============
+basic instruction  imm64
+=================  =============
+
+where MSB and LSB mean the most significant bits and least significant bits, respectively.
+
+In the remainder of this document 'src' and 'dst' refer to the values of the source
+and destination registers, respectively, rather than the register number.

 Instruction classes
-------------------

-The three LSB bits of the 'opcode' field store the instruction class:
-
-  =========  =====  ===============================
-  class      value  description
-  =========  =====  ===============================
-  BPF_LD     0x00   non-standard load operations
-  BPF_LDX    0x01   load into register operations
-  BPF_ST     0x02   store from immediate operations
-  BPF_STX    0x03   store from register operations
-  BPF_ALU    0x04   32-bit arithmetic operations
-  BPF_JMP    0x05   64-bit jump operations
-  BPF_JMP32  0x06   32-bit jump operations
-  BPF_ALU64  0x07   64-bit arithmetic operations
-  =========  =====  ===============================
+The encoding of the 'opcode' field varies and can be determined from
+the three least significant bits (LSB) of the 'opcode' field which holds
+the "instruction class", as follows:
+
+=========  =====  ===============================  =======  =================
+class      value  description                      version  reference
+=========  =====  ===============================  =======  =================
+BPF_LD     0x00   non-standard load operations     1        `Load and store instructions`_
+BPF_LDX    0x01   load into register operations    1        `Load and store instructions`_
+BPF_ST     0x02   store from immediate operations  1        `Load and store instructions`_
+BPF_STX    0x03   store from register operations   1        `Load and store instructions`_
+BPF_ALU    0x04   32-bit arithmetic operations     3        `Arithmetic and jump instructions`_
+BPF_JMP    0x05   64-bit jump operations           1        `Arithmetic and jump instructions`_
+BPF_JMP32  0x06   32-bit jump operations           3        `Arithmetic and jump instructions`_
+BPF_ALU64  0x07   64-bit arithmetic operations     1        `Arithmetic and jump instructions`_
+=========  =====  ===============================  =======  =================
+
+where 'version' indicates the first ISA version in which support for the value was mandatory.

 Arithmetic and jump instructions
================================

-For arithmetic and jump instructions (BPF_ALU, BPF_ALU64, BPF_JMP and
-BPF_JMP32), the 8-bit 'opcode' field is divided into three parts:
+For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
+``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
+
+==============  ======  =================
+4 bits (MSB)    1 bit   3 bits (LSB)
+==============  ======  =================
+code            source  instruction class
+==============  ======  =================

-  ==============  ======  =================
-  4 bits (MSB)    1 bit   3 bits (LSB)
-  ==============  ======  =================
-  operation code  source  instruction class
-  ==============  ======  =================
+code
+  the operation code, whose meaning varies by instruction class

-The 4th bit encodes the source operand:
+source
+  the source operand location, which unless otherwise specified is one of:

   ======  =====  ========================================
   source  value  description
   ======  =====  ========================================
-  BPF_K   0x00   use 32-bit immediate as source operand
-  BPF_X   0x08   use 'src_reg' register as source operand
+  BPF_K   0x00   use 32-bit 'imm' value as source operand
+  BPF_X   0x08   use 'src' register value as source operand
   ======  =====  ========================================

-The four MSB bits store the operation code.
-
+instruction class
+  the instruction class (see `Instruction classes`_)

 Arithmetic instructions
-----------------------

-BPF_ALU uses 32-bit wide operands while BPF_ALU64 uses 64-bit wide operands for
+Instruction class ``BPF_ALU`` uses 32-bit wide operands (zeroing the upper 32 bits
+of the destination register) while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
-The code field encodes the operation as below:

-  ========  =====  =================================================
-  code      value  description
-  ========  =====  =================================================
-  BPF_ADD   0x00   dst += src
-  BPF_SUB   0x10   dst -= src
-  BPF_MUL   0x20   dst \*= src
-  BPF_DIV   0x30   dst /= src
-  BPF_OR    0x40   dst \|= src
-  BPF_AND   0x50   dst &= src
-  BPF_LSH   0x60   dst <<= src
-  BPF_RSH   0x70   dst >>= src
-  BPF_NEG   0x80   dst = ~src
-  BPF_MOD   0x90   dst %= src
-  BPF_XOR   0xa0   dst ^= src
-  BPF_MOV   0xb0   dst = src
-  BPF_ARSH  0xc0   sign extending shift right
-  BPF_END   0xd0   byte swap operations (see separate section below)
-  ========  =====  =================================================
+Support for ``BPF_ALU`` is required in ISA version 3, and optional in earlier
+versions.
+
+   **Note**
+
+   *Clang implementation*:
+   For ISA versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
+   ``-Xclang -target-feature -Xclang +alu32``.
+
+The 4-bit 'code' field encodes the operation as follows:
+
+========  =====  =================================================
+code      value  description
+========  =====  =================================================
+BPF_ADD   0x00   dst += src
+BPF_SUB   0x10   dst -= src
+BPF_MUL   0x20   dst \*= src
+BPF_DIV   0x30   dst /= src
+BPF_OR    0x40   dst \|= src
+BPF_AND   0x50   dst &= src
+BPF_LSH   0x60   dst <<= src
+BPF_RSH   0x70   dst >>= src
+BPF_NEG   0x80   dst = ~src
+BPF_MOD   0x90   dst %= src
+BPF_XOR   0xa0   dst ^= src
+BPF_MOV   0xb0   dst = src
+BPF_ARSH  0xc0   sign extending shift right
+BPF_END   0xd0   byte swap operations (see `Byte swap instructions`_ below)
+========  =====  =================================================
+
+Underflow and overflow are allowed during arithmetic operations,
+meaning the 64-bit or 32-bit value will wrap.
+
+``BPF_DIV`` has an implicit program exit condition as well. If
+eBPF program execution would result in division by zero,
+program execution must be gracefully aborted.
+
+Examples:

-BPF_ADD | BPF_X | BPF_ALU means::
+``BPF_ADD | BPF_X | BPF_ALU`` (0x0c) means::

-  dst_reg = (u32) dst_reg + (u32) src_reg;
+  dst = (uint32_t) (dst + src);

-BPF_ADD | BPF_X | BPF_ALU64 means::
+where '(uint32_t)' indicates truncation to 32 bits.

-  dst_reg = dst_reg + src_reg
+   **Note**

-BPF_XOR | BPF_K | BPF_ALU means::
+   *Linux implementation*: In the Linux kernel, uint32_t is expressed as u32,
+   uint64_t is expressed as u64, etc.  This document uses the standard C terminology
+   as the cross-platform specification.

-  src_reg = (u32) src_reg ^ (u32) imm32
+``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means::

-BPF_XOR | BPF_K | BPF_ALU64 means::
+  dst = dst + src

-  src_reg = src_reg ^ imm32
+``BPF_XOR | BPF_K | BPF_ALU`` (0xa4) means::
+
+  src = (uint32_t) src ^ (uint32_t) imm
+
+``BPF_XOR | BPF_K | BPF_ALU64`` (0xa7) means::
+
+  src = src ^ imm

 
 Byte swap instructions
-----------------------
+~~~~~~~~~~~~~~~~~~~~~~

 The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
-code field of ``BPF_END``.
+'code' field of ``BPF_END``.

 The byte swap instructions operate on the destination register
only and do not use a separate source register or immediate value.

-The 1-bit source operand field in the opcode is used to to select what byte
-order the operation convert from or to:
+Byte swap instructions use non-default semantics of the 1-bit 'source' field in
+the 'opcode' field.  Instead of indicating the source operator, it is instead
+used to select what byte order the operation converts from or to:

-  =========  =====  =================================================
-  source     value  description
-  =========  =====  =================================================
-  BPF_TO_LE  0x00   convert between host byte order and little endian
-  BPF_TO_BE  0x08   convert between host byte order and big endian
-  =========  =====  =================================================
-
-The imm field encodes the width of the swap operations.  The following widths
-are supported: 16, 32 and 64.
-
-Examples:
+=========  =====  =================================================
+source     value  description
+=========  =====  =================================================
+BPF_TO_LE  0x00   convert between host byte order and little endian
+BPF_TO_BE  0x08   convert between host byte order and big endian
+=========  =====  =================================================

-``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
+   **Note**

-  dst_reg = htole16(dst_reg)
+   *Linux implementation*:
+   ``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and
+   ``BPF_TO_BE`` respectively.

-``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
+The 'imm' field encodes the width of the swap operations.  The following widths
+are supported: 16, 32 and 64. The following table summarizes the resulting
+possibilities:

-  dst_reg = htobe64(dst_reg)
+=============================  =========  ===  ========  ==================
+opcode construction            opcode     imm  mnemonic  pseudocode
+=============================  =========  ===  ========  ==================
+BPF_END | BPF_TO_LE | BPF_ALU  0xd4       16   le16 dst  dst = htole16(dst)
+BPF_END | BPF_TO_LE | BPF_ALU  0xd4       32   le32 dst  dst = htole32(dst)
+BPF_END | BPF_TO_LE | BPF_ALU  0xd4       64   le64 dst  dst = htole64(dst)
+BPF_END | BPF_TO_BE | BPF_ALU  0xdc       16   be16 dst  dst = htobe16(dst)
+BPF_END | BPF_TO_BE | BPF_ALU  0xdc       32   be32 dst  dst = htobe32(dst)
+BPF_END | BPF_TO_BE | BPF_ALU  0xdc       64   be64 dst  dst = htobe64(dst)
+=============================  =========  ===  ========  ==================

-``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and
-``BPF_TO_BE`` respectively.
+where

+* mnenomic indicates a short form that might be displayed by some tools such as disassemblers
+* 'htoleNN()' indicates converting a NN-bit value from host byte order to little-endian byte order
+* 'htobeNN()' indicates converting a NN-bit value from host byte order to big-endian byte order

 Jump instructions
-----------------

-BPF_JMP32 uses 32-bit wide operands while BPF_JMP uses 64-bit wide operands for
+Instruction class ``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations.
-The code field encodes the operation as below:
-
-  ========  =====  =========================  ============
-  code      value  description                notes
-  ========  =====  =========================  ============
-  BPF_JA    0x00   PC += off                  BPF_JMP only
-  BPF_JEQ   0x10   PC += off if dst == src
-  BPF_JGT   0x20   PC += off if dst > src     unsigned
-  BPF_JGE   0x30   PC += off if dst >= src    unsigned
-  BPF_JSET  0x40   PC += off if dst & src
-  BPF_JNE   0x50   PC += off if dst != src
-  BPF_JSGT  0x60   PC += off if dst > src     signed
-  BPF_JSGE  0x70   PC += off if dst >= src    signed
-  BPF_CALL  0x80   function call
-  BPF_EXIT  0x90   function / program return  BPF_JMP only
-  BPF_JLT   0xa0   PC += off if dst < src     unsigned
-  BPF_JLE   0xb0   PC += off if dst <= src    unsigned
-  BPF_JSLT  0xc0   PC += off if dst < src     signed
-  BPF_JSLE  0xd0   PC += off if dst <= src    signed
-  ========  =====  =========================  ============
-
-The eBPF program needs to store the return value into register R0 before doing a
-BPF_EXIT.

+Support for ``BPF_JMP32`` is required in ISA version 3, and optional in earlier
+versions.
+
+The 4-bit 'code' field encodes the operation as below, where PC is the program counter:
+
+========  =====  ============================  =======  ============
+code      value  description                   version  notes
+========  =====  ============================  =======  ============
+BPF_JA    0x00   PC += offset                  1        BPF_JMP only
+BPF_JEQ   0x10   PC += offset if dst == src    1
+BPF_JGT   0x20   PC += offset if dst > src     1        unsigned
+BPF_JGE   0x30   PC += offset if dst >= src    1        unsigned
+BPF_JSET  0x40   PC += offset if dst & src     1
+BPF_JNE   0x50   PC += offset if dst != src    1
+BPF_JSGT  0x60   PC += offset if dst > src     1        signed
+BPF_JSGE  0x70   PC += offset if dst >= src    1        signed
+BPF_CALL  0x80   call function imm             1        see `Helper functions`_
+BPF_EXIT  0x90   function / program return     1        BPF_JMP only
+BPF_JLT   0xa0   PC += offset if dst < src     2        unsigned
+BPF_JLE   0xb0   PC += offset if dst <= src    2        unsigned
+BPF_JSLT  0xc0   PC += offset if dst < src     2        signed
+BPF_JSLE  0xd0   PC += offset if dst <= src    2        signed
+========  =====  ============================  =======  ============
+
+where 'version' indicates the first ISA version in which the value was supported.
+
+Helper functions
+~~~~~~~~~~~~~~~~
+Helper functions are a concept whereby BPF programs can call into
+set of function calls exposed by the eBPF runtime.  Each helper
+function is identified by an integer used in a ``BPF_CALL`` instruction.
+The available helper functions may differ for each eBPF program type.
+
+Conceptually, each helper function is implemented with a commonly shared function
+signature defined as:
+
+  uint64_t function(uint64_t r1, uint64_t r2, uint64_t r3, uint64_t r4, uint64_t r5)
+
+In actuality, each helper function is defined as taking between 0 and 5 arguments,
+with the remaining registers being ignored.  The definition of a helper function
+is responsible for specifying the type (e.g., integer, pointer, etc.) of the value returned,
+the number of arguments, and the type of each argument.

 Load and store instructions
===========================

-For load and store instructions (BPF_LD, BPF_LDX, BPF_ST and BPF_STX), the
+For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
8-bit 'opcode' field is divided as:

-  ============  ======  =================
-  3 bits (MSB)  2 bits  3 bits (LSB)
-  ============  ======  =================
-  mode          size    instruction class
-  ============  ======  =================
+============  ======  =================
+3 bits (MSB)  2 bits  3 bits (LSB)
+============  ======  =================
+mode          size    instruction class
+============  ======  =================

-The size modifier is one of:
+mode
+  one of:
+
+  =============  =====  ====================================  =============
+  mode modifier  value  description                           reference
+  =============  =====  ====================================  =============
+  BPF_IMM        0x00   64-bit immediate instructions         `64-bit immediate instructions`_
+  BPF_ABS        0x20   legacy BPF packet access (absolute)   `Legacy BPF Packet access instructions`_
+  BPF_IND        0x40   legacy BPF packet access (indirect)   `Legacy BPF Packet access instructions`_
+  BPF_MEM        0x60   regular load and store operations     `Regular load and store operations`_
+  BPF_ATOMIC     0xc0   atomic operations                     `Atomic operations`_
+  =============  =====  ====================================  =============
+
+size
+  one of:

   =============  =====  =====================
   size modifier  value  description
@@ -213,18 +362,8 @@ The size modifier is one of:
   BPF_DW         0x18   double word (8 bytes)
   =============  =====  =====================

-The mode modifier is one of:
-
-  =============  =====  ====================================
-  mode modifier  value  description
-  =============  =====  ====================================
-  BPF_IMM        0x00   64-bit immediate instructions
-  BPF_ABS        0x20   legacy BPF packet access (absolute)
-  BPF_IND        0x40   legacy BPF packet access (indirect)
-  BPF_MEM        0x60   regular load and store operations
-  BPF_ATOMIC     0xc0   atomic operations
-  =============  =====  ====================================
-
+instruction class
+  the instruction class (see `Instruction classes`_)

 Regular load and store operations
---------------------------------
@@ -232,19 +371,22 @@ Regular load and store operations
The ``BPF_MEM`` mode modifier is used to encode regular load and store
instructions that transfer data between a register and memory.

-``BPF_MEM | <size> | BPF_STX`` means::
-
-  *(size *) (dst_reg + off) = src_reg
-
-``BPF_MEM | <size> | BPF_ST`` means::
-
-  *(size *) (dst_reg + off) = imm32
-
-``BPF_MEM | <size> | BPF_LDX`` means::
-
-  dst_reg = *(size *) (src_reg + off)
-
-Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
+=============================  =========  ==================================
+opcode construction            opcode     pseudocode
+=============================  =========  ==================================
+BPF_MEM | BPF_B | BPF_LDX      0x71       dst = *(uint8_t *) (src + offset)  
+BPF_MEM | BPF_H | BPF_LDX      0x69       dst = *(uint16_t *) (src + offset)
+BPF_MEM | BPF_W | BPF_LDX      0x61       dst = *(uint32_t *) (src + offset)
+BPF_MEM | BPF_DW | BPF_LDX     0x79       dst = *(uint64_t *) (src + offset)
+BPF_MEM | BPF_B | BPF_ST       0x72       *(uint8_t *) (dst + offset) = imm
+BPF_MEM | BPF_H | BPF_ST       0x6a       *(uint16_t *) (dst + offset) = imm
+BPF_MEM | BPF_W | BPF_ST       0x62       *(uint32_t *) (dst + offset) = imm
+BPF_MEM | BPF_DW | BPF_ST      0x7a       *(uint64_t *) (dst + offset) = imm
+BPF_MEM | BPF_B | BPF_STX      0x73       *(uint8_t *) (dst + offset) = src
+BPF_MEM | BPF_H | BPF_STX      0x6b       *(uint16_t *) (dst + offset) = src
+BPF_MEM | BPF_W | BPF_STX      0x63       *(uint32_t *) (dst + offset) = src
+BPF_MEM | BPF_DW | BPF_STX     0x7b       *(uint64_t *) (dst + offset) = src
+=============================  =========  ==================================

 Atomic operations
-----------------
@@ -256,76 +398,83 @@ by other eBPF programs or means outside of this specification.
All atomic operations supported by eBPF are encoded as store operations
that use the ``BPF_ATOMIC`` mode modifier as follows:

-  * ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
-  * ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
-  * 8-bit and 16-bit wide atomic operations are not supported.
+* ``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) for 32-bit operations
+* ``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) for 64-bit operations
+
+Note that 8-bit (``BPF_B``) and 16-bit (``BPF_H``) wide atomic operations are not supported,
+nor is ``BPF_ATOMIC | <size> | BPF_ST``.

-The imm field is used to encode the actual atomic operation.
+The 'imm' field is used to encode the actual atomic operation.
Simple atomic operation use a subset of the values defined to encode
-arithmetic operations in the imm field to encode the atomic operation:
+arithmetic operations in the 'imm' field to encode the atomic operation:

-  ========  =====  ===========
-  imm       value  description
-  ========  =====  ===========
-  BPF_ADD   0x00   atomic add
-  BPF_OR    0x40   atomic or
-  BPF_AND   0x50   atomic and
-  BPF_XOR   0xa0   atomic xor
-  ========  =====  ===========
+========  =====  ===========  =======
+imm       value  description  version
+========  =====  ===========  =======
+BPF_ADD   0x00   atomic add   1
+BPF_OR    0x40   atomic or    3
+BPF_AND   0x50   atomic and   3
+BPF_XOR   0xa0   atomic xor   3
+========  =====  ===========  =======

+where 'version' indicates the first ISA version in which the value was supported.

-``BPF_ATOMIC | BPF_W  | BPF_STX`` with imm = BPF_ADD means::
+``BPF_ATOMIC | BPF_W  | BPF_STX`` (0xc3) with 'imm' = BPF_ADD means::

-  *(u32 *)(dst_reg + off16) += src_reg
+  *(uint32_t *)(dst + offset) += src

-``BPF_ATOMIC | BPF_DW | BPF_STX`` with imm = BPF ADD means::
+``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) with 'imm' = BPF ADD means::

-  *(u64 *)(dst_reg + off16) += src_reg
+  *(uint64_t *)(dst + offset) += src

-``BPF_XADD`` is a deprecated name for ``BPF_ATOMIC | BPF_ADD``.
+``BPF_XADD`` appeared in version 1, but is now considered to be a deprecated alias
+for ``BPF_ATOMIC | BPF_ADD``.

-In addition to the simple atomic operations, there also is a modifier and
+In addition to the simple atomic operations above, there also is a modifier and
two complex atomic operations:

-  ===========  ================  ===========================
-  imm          value             description
-  ===========  ================  ===========================
-  BPF_FETCH    0x01              modifier: return old value
-  BPF_XCHG     0xe0 | BPF_FETCH  atomic exchange
-  BPF_CMPXCHG  0xf0 | BPF_FETCH  atomic compare and exchange
-  ===========  ================  ===========================
+===========  ================  ===========================  =======
+imm          value             description                  version
+===========  ================  ===========================  =======
+BPF_FETCH    0x01              modifier: return old value   3
+BPF_XCHG     0xe0 | BPF_FETCH  atomic exchange              3
+BPF_CMPXCHG  0xf0 | BPF_FETCH  atomic compare and exchange  3
+===========  ================  ===========================  =======

 The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations.  If the ``BPF_FETCH`` flag
-is set, then the operation also overwrites ``src_reg`` with the value that
+is set, then the operation also overwrites ``src`` with the value that
was in memory before it was modified.

-The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
-addressed by ``dst_reg + off``.
+The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
+addressed by ``dst + offset``.

 The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
-``dst_reg + off`` with ``R0``. If they match, the value addressed by
-``dst_reg + off`` is replaced with ``src_reg``. In either case, the
-value that was at ``dst_reg + off`` before the operation is zero-extended
+``dst + offset`` with ``R0``. If they match, the value addressed by
+``dst + offset`` is replaced with ``src``. In either case, the
+value that was at ``dst + offset`` before the operation is zero-extended
and loaded back to ``R0``.

-Clang can generate atomic instructions by default when ``-mcpu=v3`` is
-enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
-Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
-the atomics features, while keeping a lower ``-mcpu`` version, you can use
-``-Xclang -target-feature -Xclang +alu32``.
+   **Note**
+
+   *Clang implementation*:
+   Clang can generate atomic instructions by default when ``-mcpu=v3`` is
+   enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
+   Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
+   the atomics features, while keeping a lower ``-mcpu`` version, you can use
+   ``-Xclang -target-feature -Xclang +alu32``.

 64-bit immediate instructions
-----------------------------

-Instructions with the ``BPF_IMM`` mode modifier use the wide instruction
+Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
encoding for an extra imm64 value.

 There is currently only one such instruction.

-``BPF_LD | BPF_DW | BPF_IMM`` means::
+``BPF_IMM | BPF_DW | BPF_LD`` (0x18) means::

-  dst_reg = imm64
+  dst = imm64

 
 Legacy BPF Packet access instructions
@@ -333,34 +482,236 @@ Legacy BPF Packet access instructions

 eBPF has special instructions for access to packet data that have been
carried over from classic BPF to retain the performance of legacy socket
-filters running in the eBPF interpreter.
+filters running in an eBPF interpreter.

 The instructions come in two forms: ``BPF_ABS | <size> | BPF_LD`` and
``BPF_IND | <size> | BPF_LD``.

 These instructions are used to access packet data and can only be used when
-the program context is a pointer to networking packet.  ``BPF_ABS``
+the program context contains a pointer to a networking packet.  ``BPF_ABS``
accesses packet data at an absolute offset specified by the immediate data
and ``BPF_IND`` access packet data at an offset that includes the value of
a register in addition to the immediate data.

 These instructions have seven implicit operands:

- * Register R6 is an implicit input that must contain pointer to a
-   struct sk_buff.
- * Register R0 is an implicit output which contains the data fetched from
-   the packet.
- * Registers R1-R5 are scratch registers that are clobbered after a call to
-   ``BPF_ABS | BPF_LD`` or ``BPF_IND | BPF_LD`` instructions.
-
-These instructions have an implicit program exit condition as well. When an
-eBPF program is trying to access the data beyond the packet boundary, the
-program execution will be aborted.
-
-``BPF_ABS | BPF_W | BPF_LD`` means::
-
-  R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + imm32))
-
-``BPF_IND | BPF_W | BPF_LD`` means::
-
-  R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
+* Register R6 is an implicit input that must contain a pointer to a
+  context structure with a packet data pointer.
+* Register R0 is an implicit output which contains the data fetched from
+  the packet.
+* Registers R1-R5 are scratch registers that are clobbered by the
+  instruction.
+
+   **Note**
+
+   *Linux implementation*: In Linux, R6 references a struct sk_buff.
+
+These instructions have an implicit program exit condition as well. If an
+eBPF program attempts access data beyond the packet boundary, the
+program execution must be gracefully aborted.
+
+``BPF_ABS | BPF_W | BPF_LD`` (0x20) means::
+
+  R0 = ntohl(*(uint32_t *) (R6->data + imm))
+
+where ``ntohl()`` converts a 32-bit value from network byte order to host byte order.
+
+``BPF_IND | BPF_W | BPF_LD`` (0x40) means::
+
+  R0 = ntohl(*(uint32_t *) (R6->data + src + imm))
+
+Appendix
+========
+
+For reference, the following table lists opcodes in order by value.
+
+======  ====  ===================================================  =============
+opcode  imm   description                                          reference 
+======  ====  ===================================================  =============
+0x04    any   dst = (uint32_t)(dst + imm)                          `Arithmetic instructions`_
+0x05    0x00  goto +offset                                         `Jump instructions`_
+0x07    any   dst += imm                                           `Arithmetic instructions`_
+0x0c    0x00  dst = (uint32_t)(dst + src)                          `Arithmetic instructions`_
+0x0f    0x00  dst += src                                           `Arithmetic instructions`_
+0x14    any   dst = (uint32_t)(dst - imm)                          `Arithmetic instructions`_
+0x15    any   if dst == imm goto +offset                           `Jump instructions`_
+0x16    any   if (uint32_t)dst == imm goto +offset                 `Jump instructions`_
+0x17    any   dst -= imm                                           `Arithmetic instructions`_
+0x18    any   dst = imm                                            `Load and store instructions`_
+0x1c    0x00  dst = (uint32_t)(dst - src)                          `Arithmetic instructions`_
+0x1d    0x00  if dst == src goto +offset                           `Jump instructions`_
+0x1e    0x00  if (uint32_t)dst == (uint32_t)src goto +offset       `Jump instructions`_
+0x1f    0x00  dst -= src                                           `Arithmetic instructions`_
+0x20    any   dst = ntohl(\*(uint32_t \*)(R6->data + imm))         `Load and store instructions`_
+0x24    any   dst = (uint32_t)(dst \* imm)                         `Arithmetic instructions`_
+0x25    any   if dst > imm goto +offset                            `Jump instructions`_
+0x26    any   if (uint32_t)dst > imm goto +offset                  `Jump instructions`_
+0x27    any   dst \*= imm                                          `Arithmetic instructions`_
+0x28    any   dst = ntohs(\*(uint16_t \*)(R6->data + imm))         `Load and store instructions`_
+0x2c    0x00  dst = (uint32_t)(dst \* src)                         `Arithmetic instructions`_
+0x2d    0x00  if dst > src goto +offset                            `Jump instructions`_
+0x2e    0x00  if (uint32_t)dst > (uint32_t)src goto +offset        `Jump instructions`_
+0x2f    0x00  dst \*= src                                          `Arithmetic instructions`_
+0x30    any   dst = (\*(uint8_t \*)(R6->data + imm))               `Load and store instructions`_
+0x34    any   dst = (uint32_t)(dst / imm)                          `Arithmetic instructions`_
+0x35    any   if dst >= imm goto +offset                           `Jump instructions`_
+0x36    any   if (uint32_t)dst >= imm goto +offset                 `Jump instructions`_
+0x37    any   dst /= imm                                           `Arithmetic instructions`_
+0x38    any   dst = ntohll(\*(uint64_t \*)(R6->data + imm))        `Load and store instructions`_
+0x3c    0x00  dst = (uint32_t)(dst / src)                          `Arithmetic instructions`_
+0x3d    0x00  if dst >= src goto +offset                           `Jump instructions`_
+0x3e    0x00  if (uint32_t)dst >= (uint32_t)src goto +offset       `Jump instructions`_
+0x3f    0x00  dst /= src                                           `Arithmetic instructions`_
+0x40    any   dst = ntohl(\*(uint32_t \*)(R6->data + src + imm))   `Load and store instructions`_
+0x44    any   dst = (uint32_t)(dst \| imm)                         `Arithmetic instructions`_
+0x45    any   if dst & imm goto +offset                            `Jump instructions`_
+0x46    any   if (uint32_t)dst & imm goto +offset                  `Jump instructions`_
+0x47    any   dst \|= imm                                          `Arithmetic instructions`_
+0x48    any   dst = ntohs(\*(uint16_t \*)(R6->data + src + imm))   `Load and store instructions`_
+0x4c    0x00  dst = (uint32_t)(dst \| src)                         `Arithmetic instructions`_
+0x4d    0x00  if dst & src goto +offset                            `Jump instructions`_
+0x4e    0x00  if (uint32_t)dst & (uint32_t)src goto +offset        `Jump instructions`_
+0x4f    0x00  dst \|= src                                          `Arithmetic instructions`_
+0x50    any   dst = \*(uint8_t \*)(R6->data + src + imm))          `Load and store instructions`_
+0x54    any   dst = (uint32_t)(dst & imm)                          `Arithmetic instructions`_
+0x55    any   if dst != imm goto +offset                           `Jump instructions`_
+0x56    any   if (uint32_t)dst != imm goto +offset                 `Jump instructions`_
+0x57    any   dst &= imm                                           `Arithmetic instructions`_
+0x58    any   dst = ntohll(\*(uint64_t \*)(R6->data + src + imm))  `Load and store instructions`_
+0x5c    0x00  dst = (uint32_t)(dst & src)                          `Arithmetic instructions`_
+0x5d    0x00  if dst != src goto +offset                           `Jump instructions`_
+0x5e    0x00  if (uint32_t)dst != (uint32_t)src goto +offset       `Jump instructions`_
+0x5f    0x00  dst &= src                                           `Arithmetic instructions`_
+0x61    0x00  dst = \*(uint32_t \*)(src + offset)                  `Load and store instructions`_
+0x62    any   \*(uint32_t \*)(dst + offset) = imm                  `Load and store instructions`_
+0x63    0x00  \*(uint32_t \*)(dst + offset) = src                  `Load and store instructions`_
+0x64    any   dst = (uint32_t)(dst << imm)                         `Arithmetic instructions`_
+0x65    any   if dst s> imm goto +offset                           `Jump instructions`_
+0x66    any   if (int32_t)dst s> (int32_t)imm goto +offset         `Jump instructions`_
+0x67    any   dst <<= imm                                          `Arithmetic instructions`_
+0x69    0x00  dst = \*(uint16_t \*)(src + offset)                  `Load and store instructions`_
+0x6a    any   \*(uint16_t \*)(dst + offset) = imm                  `Load and store instructions`_
+0x6b    0x00  \*(uint16_t \*)(dst + offset) = src                  `Load and store instructions`_
+0x6c    0x00  dst = (uint32_t)(dst << src)                         `Arithmetic instructions`_
+0x6d    0x00  if dst s> src goto +offset                           `Jump instructions`_
+0x6e    0x00  if (int32_t)dst s> (int32_t)src goto +offset         `Jump instructions`_
+0x6f    0x00  dst <<= src                                          `Arithmetic instructions`_
+0x71    0x00  dst = \*(uint8_t \*)(src + offset)                   `Load and store instructions`_
+0x72    any   \*(uint8_t \*)(dst + offset) = imm                   `Load and store instructions`_
+0x73    0x00  \*(uint8_t \*)(dst + offset) = src                   `Load and store instructions`_
+0x74    any   dst = (uint32_t)(dst >> imm)                         `Arithmetic instructions`_
+0x75    any   if dst s>= imm goto +offset                          `Jump instructions`_
+0x76    any   if (int32_t)dst s>= (int32_t)imm goto +offset        `Jump instructions`_
+0x77    any   dst >>= imm                                          `Arithmetic instructions`_
+0x79    0x00  dst = \*(uint64_t \*)(src + offset)                  `Load and store instructions`_
+0x7a    any   \*(uint64_t \*)(dst + offset) = imm                  `Load and store instructions`_
+0x7b    0x00  \*(uint64_t \*)(dst + offset) = src                  `Load and store instructions`_
+0x7c    0x00  dst = (uint32_t)(dst >> src)                         `Arithmetic instructions`_
+0x7d    0x00  if dst s>= src goto +offset                          `Jump instructions`_
+0x7e    0x00  if (int32_t)dst s>= (int32_t)src goto +offset        `Jump instructions`_
+0x7f    0x00  dst >>= src                                          `Arithmetic instructions`_
+0x84    0x00  dst = (uint32_t)-dst                                 `Arithmetic instructions`_
+0x85    any   call imm                                             `Jump instructions`_
+0x87    0x00  dst = -dst                                           `Arithmetic instructions`_
+0x94    any   dst = (uint32_t)(dst % imm)                          `Arithmetic instructions`_
+0x95    0x00  return                                               `Jump instructions`_
+0x97    any   dst %= imm                                           `Arithmetic instructions`_
+0x9c    0x00  dst = (uint32_t)(dst % src)                          `Arithmetic instructions`_
+0x9f    0x00  dst %= src                                           `Arithmetic instructions`_
+0xa4    any   dst = (uint32_t)(dst ^ imm)                          `Arithmetic instructions`_
+0xa5    any   if dst < imm goto +offset                            `Jump instructions`_
+0xa6    any   if (uint32_t)dst < imm goto +offset                  `Jump instructions`_
+0xa7    any   dst ^= imm                                           `Arithmetic instructions`_
+0xac    0x00  dst = (uint32_t)(dst ^ src)                          `Arithmetic instructions`_
+0xad    0x00  if dst < src goto +offset                            `Jump instructions`_
+0xae    0x00  if (uint32_t)dst < (uint32_t)src goto +offset        `Jump instructions`_
+0xaf    0x00  dst ^= src                                           `Arithmetic instructions`_
+0xb4    any   dst = (uint32_t) imm                                 `Arithmetic instructions`_
+0xb5    any   if dst <= imm goto +offset                           `Jump instructions`_
+0xa6    any   if (uint32_t)dst <= imm goto +offset                 `Jump instructions`_
+0xb7    any   dst = imm                                            `Arithmetic instructions`_
+0xbc    0x00  dst = (uint32_t) src                                 `Arithmetic instructions`_
+0xbd    0x00  if dst <= src goto +offset                           `Jump instructions`_
+0xbe    0x00  if (uint32_t)dst <= (uint32_t)src goto +offset       `Jump instructions`_
+0xbf    0x00  dst = src                                            `Arithmetic instructions`_
+0xc3    0x00  lock \*(uint32_t \*)(dst + offset) += src            `Atomic operations`_
+0xc3    0x01  lock::                                               `Atomic operations`_
+
+                  *(uint32_t *)(dst + offset) += src
+                  src = *(uint32_t *)(dst + offset)
+0xc3    0x40  \*(uint32_t \*)(dst + offset) \|= src                `Atomic operations`_
+0xc3    0x41  lock::                                               `Atomic operations`_
+
+                  *(uint32_t *)(dst + offset) |= src
+                  src = *(uint32_t *)(dst + offset)
+0xc3    0x50  \*(uint32_t \*)(dst + offset) &= src                 `Atomic operations`_
+0xc3    0x51  lock::                                               `Atomic operations`_
+
+                  *(uint32_t *)(dst + offset) &= src
+                  src = *(uint32_t *)(dst + offset)
+0xc3    0xa0  \*(uint32_t \*)(dst + offset) ^= src                 `Atomic operations`_
+0xc3    0xa1  lock::                                               `Atomic operations`_
+
+                  *(uint32_t *)(dst + offset) ^= src
+                  src = *(uint32_t *)(dst + offset)
+0xc3    0xe1  lock::                                               `Atomic operations`_
+
+                  temp = *(uint32_t *)(dst + offset)
+                  *(uint32_t *)(dst + offset) = src
+                  src = temp
+0xc3    0xf1  lock::                                               `Atomic operations`_
+
+                  temp = *(uint32_t *)(dst + offset)
+                  if *(uint32_t)(dst + offset) == R0
+                     *(uint32_t)(dst + offset) = src
+                  R0 = temp
+0xc4    any   dst = (uint32_t)(dst s>> imm)                        `Arithmetic instructions`_
+0xc5    any   if dst s< imm goto +offset                           `Jump instructions`_
+0xc6    any   if (int32_t)dst s< (int32_t)imm goto +offset         `Jump instructions`_
+0xc7    any   dst s>>= imm                                         `Arithmetic instructions`_
+0xcc    0x00  dst = (uint32_t)(dst s>> src)                        `Arithmetic instructions`_
+0xcd    0x00  if dst s< src goto +offset                           `Jump instructions`_
+0xce    0x00  if (int32_t)dst s< (int32_t)src goto +offset         `Jump instructions`_
+0xcf    0x00  dst s>>= src                                         `Arithmetic instructions`_
+0xd4    0x10  dst = htole16(dst)                                   `Byte swap instructions`_
+0xd4    0x20  dst = htole32(dst)                                   `Byte swap instructions`_
+0xd4    0x40  dst = htole64(dst)                                   `Byte swap instructions`_
+0xd5    any   if dst s<= imm goto +offset                          `Jump instructions`_
+0xd6    any   if (int32_t)dst s<= (int32_t)imm goto +offset        `Jump instructions`_
+0xc3    0x00  lock \*(uint64_t \*)(dst + offset) += src            `Atomic operations`_
+0xdb    0x01  lock::                                               `Atomic operations`_
+
+                  *(uint64_t *)(dst + offset) += src
+                  src = *(uint64_t *)(dst + offset)
+0xdb    0x40  \*(uint64_t \*)(dst + offset) \|= src                `Atomic operations`_
+0xdb    0x41  lock::                                               `Atomic operations`_
+
+                  *(uint64_t *)(dst + offset) |= src
+                  lock src = *(uint64_t *)(dst + offset)
+0xdb    0x50  \*(uint64_t \*)(dst + offset) &= src                 `Atomic operations`_
+0xdb    0x51  lock::                                               `Atomic operations`_
+
+                  *(uint64_t *)(dst + offset) &= src
+                  src = *(uint64_t *)(dst + offset)
+0xdb    0xa0  \*(uint64_t \*)(dst + offset) ^= src                 `Atomic operations`_
+0xdb    0xa1  lock::                                               `Atomic operations`_
+
+                  *(uint64_t *)(dst + offset) ^= src
+                  src = *(uint64_t *)(dst + offset)
+0xdb    0xe1  lock::                                               `Atomic operations`_
+
+                  temp = *(uint64_t *)(dst + offset)
+                  *(uint64_t *)(dst + offset) = src
+                  src = temp
+0xdb    0xf1  lock::                                               `Atomic operations`_
+
+                  temp = *(uint64_t *)(dst + offset)
+                  if *(uint64_t)(dst + offset) == R0
+                     *(uint64_t)(dst + offset) = src
+                  R0 = temp
+0xdc    0x10  dst = htobe16(dst)                                   `Byte swap instructions`_
+0xdc    0x20  dst = htobe32(dst)                                   `Byte swap instructions`_
+0xdc    0x40  dst = htobe64(dst)                                   `Byte swap instructions`_
+0xdd    0x00  if dst s<= src goto +offset                          `Jump instructions`_
+0xde    0x00  if (int32_t)dst s<= (int32_t)src goto +offset        `Jump instructions`_
+======  ====  ===================================================  =============
-- 
2.32.0.windows.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-13  8:12 ` FW: ebpf-docs: draft of ISA doc updates in progress Dave Thaler
@ 2022-09-14  6:22   ` Shung-Hsi Yu
  2022-09-14  9:35     ` Dave Thaler
  2022-09-19 17:04     ` Christoph Hellwig
  2022-09-19 16:58   ` Christoph Hellwig
  1 sibling, 2 replies; 16+ messages in thread
From: Shung-Hsi Yu @ 2022-09-14  6:22 UTC (permalink / raw)
  To: Dave Thaler; +Cc: bpf

On Tue, Sep 13, 2022 at 08:12:52AM +0000, Dave Thaler wrote:
> Resending since the list apparently never got the emails I sent last Friday...
> 
> From: Dave Thaler 
> Sent: Friday, September 9, 2022 9:36 PM
> To: bpf <bpf@vger.kernel.org>
> Subject: ebpf-docs: draft of ISA doc updates in progress
> 
> I've been working on ISA docs, with review by Jim Harris, Quentin Monnet, and Alan
> Jowett so wanted to post this before LPC when we can discuss progress and open
> questions.
> 
> A rendered draft can be viewed at:
> https://github.com/dthaler/ebpf-docs/blob/update/isa/kernel.org/instruction-set.rst
> 
> For people who prefer github format: https://github.com/dthaler/ebpf-docs/pull/4
> 
> For people who use format-patch format, see below.  The instruction-set.rst
> changes are directly diffs against the kernel.org copy mirrored
> to github for rendering via the link above.
> 
> Feedback welcome on the direction so far.

Thanks for working on this! A few inlined comments below.

> Thanks,
> Dave
> 
> [...]
> 
> diff --git a/isa/kernel.org/instruction-set.rst b/isa/kernel.org/instruction-set.rst
> index 1b0e671..7f8ee00 100644
> --- a/isa/kernel.org/instruction-set.rst
> +++ b/isa/kernel.org/instruction-set.rst
> @@ -1,8 +1,24 @@
> +.. contents::
> +.. sectnum::
> 
>  ====================
> eBPF Instruction Set
> ====================
> 
> +The eBPF instruction set consists of eleven 64 bit registers, a program counter,
> +and 512 bytes of stack space.
> +
> +Versions
> +========
> +
> +The current Instruction Set Architecture (ISA) version, sometimes referred to in other documents
> +as a "CPU" version, is 3.  This document also covers older versions of the ISA.
> +
> +   **Note**
> +
> +   *Clang implementation*: Clang can select the eBPF ISA version using
> +   ``-mcpu=v2`` for example to select version 2.
> +
> Registers and calling convention
> ================================
> 
> @@ -11,198 +27,331 @@ all of which are 64-bits wide.
> 
>  The eBPF calling convention is defined as:
> 
> - * R0: return value from function calls, and exit value for eBPF programs
> - * R1 - R5: arguments for function calls
> - * R6 - R9: callee saved registers that function calls will preserve
> - * R10: read-only frame pointer to access stack
> +* R0: return value from function calls, and exit value for eBPF programs
> +* R1 - R5: arguments for function calls
> +* R6 - R9: callee saved registers that function calls will preserve
> +* R10: read-only frame pointer to access stack
> +
> +Registers R0 - R5 are scratch registers, meaning the BPF program needs to either
> +spill them to the BPF stack or move them to callee saved registers if these
> +arguments are to be reused across multiple function calls. Spilling means
> +that the value in the register is moved to the BPF stack. The reverse operation
> +of moving the variable from the BPF stack to the register is called filling.
> +The reason for spilling/filling is due to the limited number of registers.
> 
> -R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if
> -necessary across calls.
> +   **Note**
> +
> +   *Linux implementation*: In the Linux kernel, the exit value for eBPF
> +   programs is passed as a 32 bit value.
> +
> +Upon entering execution of an eBPF program, registers R1 - R5 initially can contain
> +the input arguments for the program (similar to the argc/argv pair for a typical C program).
> +The actual number of registers used, and their meaning, is defined by the program type;
> +for example, a networking program might have an argument that includes network packet data
> +and/or metadata.
> +
> +   **Note**
> +
> +   *Linux implementation*: In the Linux kernel, all program types only use
> +   R1 which contains the "context", which is typically a structure containing all
> +   the inputs needed.  
>  
>  Instruction encoding
> ====================
> 
> +An eBPF program is a sequence of instructions.
> +
> eBPF has two instruction encodings:
> 
> - * the basic instruction encoding, which uses 64 bits to encode an instruction
> - * the wide instruction encoding, which appends a second 64-bit immediate value
> -   (imm64) after the basic instruction for a total of 128 bits.
> +* the basic instruction encoding, which uses 64 bits to encode an instruction
> +* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
> +  constant) value after the basic instruction for a total of 128 bits.
> +
> +The basic instruction encoding is as follows:
> 
> -The basic instruction encoding looks as follows:
> +=============  =======  ===============  ====================  ============
> +32 bits (MSB)  16 bits  4 bits           4 bits                8 bits (LSB)
> +=============  =======  ===============  ====================  ============
> +imm            offset   src              dst                   opcode
> +=============  =======  ===============  ====================  ============
> 
> - =============  =======  ===============  ====================  ============
> - 32 bits (MSB)  16 bits  4 bits           4 bits                8 bits (LSB)
> - =============  =======  ===============  ====================  ============
> - immediate      offset   source register  destination register  opcode
> - =============  =======  ===============  ====================  ============
> +imm         
> +  integer immediate value

Perhaps mention that imm is a _signed_ integer just like offset below?

> +
> +offset
> +  signed integer offset used with pointer arithmetic
> +
> +src
> +  source register number (0-10)
> +
> +dst
> +  destination register number (0-10)
> +
> +opcode
> +  operation to perform
> 
>  Note that most instructions do not use all of the fields.
> -Unused fields shall be cleared to zero.
> +Unused fields must be set to zero.
> +
> +As discussed below in `64-bit immediate instructions`_, some basic
> +instructions denote that a 64-bit immediate value follows.  Thus
> +the wide instruction encoding is as follows:
> +
> +=================  =============
> +64 bits (MSB)      64 bits (LSB)
> +=================  =============
> +basic instruction  imm64
> +=================  =============
> +
> +where MSB and LSB mean the most significant bits and least significant bits, respectively.
> +
> +In the remainder of this document 'src' and 'dst' refer to the values of the source
> +and destination registers, respectively, rather than the register number.
> 
>  Instruction classes
> -------------------
> 
> -The three LSB bits of the 'opcode' field store the instruction class:
> -
> -  =========  =====  ===============================
> -  class      value  description
> -  =========  =====  ===============================
> -  BPF_LD     0x00   non-standard load operations
> -  BPF_LDX    0x01   load into register operations
> -  BPF_ST     0x02   store from immediate operations
> -  BPF_STX    0x03   store from register operations
> -  BPF_ALU    0x04   32-bit arithmetic operations
> -  BPF_JMP    0x05   64-bit jump operations
> -  BPF_JMP32  0x06   32-bit jump operations
> -  BPF_ALU64  0x07   64-bit arithmetic operations
> -  =========  =====  ===============================
> +The encoding of the 'opcode' field varies and can be determined from
> +the three least significant bits (LSB) of the 'opcode' field which holds
> +the "instruction class", as follows:
> +
> +=========  =====  ===============================  =======  =================
> +class      value  description                      version  reference
> +=========  =====  ===============================  =======  =================
> +BPF_LD     0x00   non-standard load operations     1        `Load and store instructions`_
> +BPF_LDX    0x01   load into register operations    1        `Load and store instructions`_
> +BPF_ST     0x02   store from immediate operations  1        `Load and store instructions`_
> +BPF_STX    0x03   store from register operations   1        `Load and store instructions`_
> +BPF_ALU    0x04   32-bit arithmetic operations     3        `Arithmetic and jump instructions`_
> +BPF_JMP    0x05   64-bit jump operations           1        `Arithmetic and jump instructions`_
> +BPF_JMP32  0x06   32-bit jump operations           3        `Arithmetic and jump instructions`_
> +BPF_ALU64  0x07   64-bit arithmetic operations     1        `Arithmetic and jump instructions`_
> +=========  =====  ===============================  =======  =================
> +
> +where 'version' indicates the first ISA version in which support for the value was mandatory.
> 
>  Arithmetic and jump instructions
> ================================
> 
> -For arithmetic and jump instructions (BPF_ALU, BPF_ALU64, BPF_JMP and
> -BPF_JMP32), the 8-bit 'opcode' field is divided into three parts:
> +For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
> +``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
> +
> +==============  ======  =================
> +4 bits (MSB)    1 bit   3 bits (LSB)
> +==============  ======  =================
> +code            source  instruction class
> +==============  ======  =================
> 
> -  ==============  ======  =================
> -  4 bits (MSB)    1 bit   3 bits (LSB)
> -  ==============  ======  =================
> -  operation code  source  instruction class
> -  ==============  ======  =================
> +code
> +  the operation code, whose meaning varies by instruction class
> 
> -The 4th bit encodes the source operand:
> +source
> +  the source operand location, which unless otherwise specified is one of:
> 
>    ======  =====  ========================================
>    source  value  description
>    ======  =====  ========================================
> -  BPF_K   0x00   use 32-bit immediate as source operand
> -  BPF_X   0x08   use 'src_reg' register as source operand
> +  BPF_K   0x00   use 32-bit 'imm' value as source operand
> +  BPF_X   0x08   use 'src' register value as source operand
>    ======  =====  ========================================
> 
> -The four MSB bits store the operation code.
> -
> +instruction class
> +  the instruction class (see `Instruction classes`_)
> 
>  Arithmetic instructions
> -----------------------
> 
> -BPF_ALU uses 32-bit wide operands while BPF_ALU64 uses 64-bit wide operands for
> +Instruction class ``BPF_ALU`` uses 32-bit wide operands (zeroing the upper 32 bits
> +of the destination register) while ``BPF_ALU64`` uses 64-bit wide operands for
> otherwise identical operations.
> -The code field encodes the operation as below:
> 
> -  ========  =====  =================================================
> -  code      value  description
> -  ========  =====  =================================================
> -  BPF_ADD   0x00   dst += src
> -  BPF_SUB   0x10   dst -= src
> -  BPF_MUL   0x20   dst \*= src
> -  BPF_DIV   0x30   dst /= src
> -  BPF_OR    0x40   dst \|= src
> -  BPF_AND   0x50   dst &= src
> -  BPF_LSH   0x60   dst <<= src
> -  BPF_RSH   0x70   dst >>= src
> -  BPF_NEG   0x80   dst = ~src
> -  BPF_MOD   0x90   dst %= src
> -  BPF_XOR   0xa0   dst ^= src
> -  BPF_MOV   0xb0   dst = src
> -  BPF_ARSH  0xc0   sign extending shift right
> -  BPF_END   0xd0   byte swap operations (see separate section below)
> -  ========  =====  =================================================
> +Support for ``BPF_ALU`` is required in ISA version 3, and optional in earlier
> +versions.
> +
> +   **Note**
> +
> +   *Clang implementation*:
> +   For ISA versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
> +   ``-Xclang -target-feature -Xclang +alu32``.
> +
> +The 4-bit 'code' field encodes the operation as follows:
> +
> +========  =====  =================================================
> +code      value  description
> +========  =====  =================================================
> +BPF_ADD   0x00   dst += src
> +BPF_SUB   0x10   dst -= src
> +BPF_MUL   0x20   dst \*= src
> +BPF_DIV   0x30   dst /= src
> +BPF_OR    0x40   dst \|= src
> +BPF_AND   0x50   dst &= src
> +BPF_LSH   0x60   dst <<= src
> +BPF_RSH   0x70   dst >>= src
> +BPF_NEG   0x80   dst = ~src
> +BPF_MOD   0x90   dst %= src
> +BPF_XOR   0xa0   dst ^= src
> +BPF_MOV   0xb0   dst = src
> +BPF_ARSH  0xc0   sign extending shift right
> +BPF_END   0xd0   byte swap operations (see `Byte swap instructions`_ below)
> +========  =====  =================================================
> +
> +Underflow and overflow are allowed during arithmetic operations,
> +meaning the 64-bit or 32-bit value will wrap.
> +
> +``BPF_DIV`` has an implicit program exit condition as well. If
> +eBPF program execution would result in division by zero,
> +program execution must be gracefully aborted.

As discussed in yesterday's session, there's no graceful abortion on
division by zero, instead, the BPF verifier in Linux prevents division by
zero from happening. Here a few additional notes:

1. Modulo by zero is also prevented for the same reason

2. If the divisor comes from imm, then the verifier checks that its value is
   not zero. If the divisor comes from register, the verifier patch the
   programs so that division/modulo by zero doesn't happens, somewhat
   equivalent to the following pseudo code:

   /* divisor in src */
   if (src == 0) { 
     if (code == BPF_DIV)
       dst = 0;
     else if (code == BPF_MOD)
       dst = dst;
   } else {
     if (code == BPF_DIV)
       dst = dst / src;
     else if (code == BPF_MOD)
       dst = dst % src;
   }

> +Examples:
> 
> -BPF_ADD | BPF_X | BPF_ALU means::
> +``BPF_ADD | BPF_X | BPF_ALU`` (0x0c) means::
> 
> -  dst_reg = (u32) dst_reg + (u32) src_reg;
> +  dst = (uint32_t) (dst + src);
> 
> -BPF_ADD | BPF_X | BPF_ALU64 means::
> +where '(uint32_t)' indicates truncation to 32 bits.
> 
> -  dst_reg = dst_reg + src_reg
> +   **Note**
> 
> -BPF_XOR | BPF_K | BPF_ALU means::
> +   *Linux implementation*: In the Linux kernel, uint32_t is expressed as u32,
> +   uint64_t is expressed as u64, etc.  This document uses the standard C terminology
> +   as the cross-platform specification.
> 
> -  src_reg = (u32) src_reg ^ (u32) imm32
> +``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) means::
> 
> -BPF_XOR | BPF_K | BPF_ALU64 means::
> +  dst = dst + src
> 
> -  src_reg = src_reg ^ imm32
> +``BPF_XOR | BPF_K | BPF_ALU`` (0xa4) means::
> +
> +  src = (uint32_t) src ^ (uint32_t) imm
> +
> +``BPF_XOR | BPF_K | BPF_ALU64`` (0xa7) means::
> +
> +  src = src ^ imm
> 
>  
>  Byte swap instructions
> -----------------------
> +~~~~~~~~~~~~~~~~~~~~~~
> 
>  The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
> -code field of ``BPF_END``.
> +'code' field of ``BPF_END``.
> 
>  The byte swap instructions operate on the destination register
> only and do not use a separate source register or immediate value.
> 
> -The 1-bit source operand field in the opcode is used to to select what byte
> -order the operation convert from or to:
> +Byte swap instructions use non-default semantics of the 1-bit 'source' field in
> +the 'opcode' field.  Instead of indicating the source operator, it is instead
> +used to select what byte order the operation converts from or to:
> 
> -  =========  =====  =================================================
> -  source     value  description
> -  =========  =====  =================================================
> -  BPF_TO_LE  0x00   convert between host byte order and little endian
> -  BPF_TO_BE  0x08   convert between host byte order and big endian
> -  =========  =====  =================================================
> -
> -The imm field encodes the width of the swap operations.  The following widths
> -are supported: 16, 32 and 64.
> -
> -Examples:
> +=========  =====  =================================================
> +source     value  description
> +=========  =====  =================================================
> +BPF_TO_LE  0x00   convert between host byte order and little endian
> +BPF_TO_BE  0x08   convert between host byte order and big endian
> +=========  =====  =================================================
> 
> -``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
> +   **Note**
> 
> -  dst_reg = htole16(dst_reg)
> +   *Linux implementation*:
> +   ``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and
> +   ``BPF_TO_BE`` respectively.
> 
> -``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
> +The 'imm' field encodes the width of the swap operations.  The following widths
> +are supported: 16, 32 and 64. The following table summarizes the resulting
> +possibilities:
> 
> -  dst_reg = htobe64(dst_reg)
> +=============================  =========  ===  ========  ==================
> +opcode construction            opcode     imm  mnemonic  pseudocode
> +=============================  =========  ===  ========  ==================
> +BPF_END | BPF_TO_LE | BPF_ALU  0xd4       16   le16 dst  dst = htole16(dst)
> +BPF_END | BPF_TO_LE | BPF_ALU  0xd4       32   le32 dst  dst = htole32(dst)
> +BPF_END | BPF_TO_LE | BPF_ALU  0xd4       64   le64 dst  dst = htole64(dst)
> +BPF_END | BPF_TO_BE | BPF_ALU  0xdc       16   be16 dst  dst = htobe16(dst)
> +BPF_END | BPF_TO_BE | BPF_ALU  0xdc       32   be32 dst  dst = htobe32(dst)
> +BPF_END | BPF_TO_BE | BPF_ALU  0xdc       64   be64 dst  dst = htobe64(dst)
> +=============================  =========  ===  ========  ==================
> 
> -``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and
> -``BPF_TO_BE`` respectively.
> +where
> 
> +* mnenomic indicates a short form that might be displayed by some tools such as disassemblers
> +* 'htoleNN()' indicates converting a NN-bit value from host byte order to little-endian byte order
> +* 'htobeNN()' indicates converting a NN-bit value from host byte order to big-endian byte order
> 
>  Jump instructions
> -----------------
> 
> -BPF_JMP32 uses 32-bit wide operands while BPF_JMP uses 64-bit wide operands for
> +Instruction class ``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
> otherwise identical operations.
> -The code field encodes the operation as below:
> -
> -  ========  =====  =========================  ============
> -  code      value  description                notes
> -  ========  =====  =========================  ============
> -  BPF_JA    0x00   PC += off                  BPF_JMP only
> -  BPF_JEQ   0x10   PC += off if dst == src
> -  BPF_JGT   0x20   PC += off if dst > src     unsigned
> -  BPF_JGE   0x30   PC += off if dst >= src    unsigned
> -  BPF_JSET  0x40   PC += off if dst & src
> -  BPF_JNE   0x50   PC += off if dst != src
> -  BPF_JSGT  0x60   PC += off if dst > src     signed
> -  BPF_JSGE  0x70   PC += off if dst >= src    signed
> -  BPF_CALL  0x80   function call
> -  BPF_EXIT  0x90   function / program return  BPF_JMP only
> -  BPF_JLT   0xa0   PC += off if dst < src     unsigned
> -  BPF_JLE   0xb0   PC += off if dst <= src    unsigned
> -  BPF_JSLT  0xc0   PC += off if dst < src     signed
> -  BPF_JSLE  0xd0   PC += off if dst <= src    signed
> -  ========  =====  =========================  ============
> -
> -The eBPF program needs to store the return value into register R0 before doing a
> -BPF_EXIT.
> 
> +Support for ``BPF_JMP32`` is required in ISA version 3, and optional in earlier
> +versions.
> +
> +The 4-bit 'code' field encodes the operation as below, where PC is the program counter:
> +
> +========  =====  ============================  =======  ============
> +code      value  description                   version  notes
> +========  =====  ============================  =======  ============
> +BPF_JA    0x00   PC += offset                  1        BPF_JMP only
> +BPF_JEQ   0x10   PC += offset if dst == src    1
> +BPF_JGT   0x20   PC += offset if dst > src     1        unsigned
> +BPF_JGE   0x30   PC += offset if dst >= src    1        unsigned
> +BPF_JSET  0x40   PC += offset if dst & src     1
> +BPF_JNE   0x50   PC += offset if dst != src    1
> +BPF_JSGT  0x60   PC += offset if dst > src     1        signed
> +BPF_JSGE  0x70   PC += offset if dst >= src    1        signed
> +BPF_CALL  0x80   call function imm             1        see `Helper functions`_
> +BPF_EXIT  0x90   function / program return     1        BPF_JMP only
> +BPF_JLT   0xa0   PC += offset if dst < src     2        unsigned
> +BPF_JLE   0xb0   PC += offset if dst <= src    2        unsigned
> +BPF_JSLT  0xc0   PC += offset if dst < src     2        signed
> +BPF_JSLE  0xd0   PC += offset if dst <= src    2        signed
> +========  =====  ============================  =======  ============
> +
> +where 'version' indicates the first ISA version in which the value was supported.
> +
> +Helper functions
> +~~~~~~~~~~~~~~~~
> +Helper functions are a concept whereby BPF programs can call into
> +set of function calls exposed by the eBPF runtime.  Each helper
> +function is identified by an integer used in a ``BPF_CALL`` instruction.
> +The available helper functions may differ for each eBPF program type.

While BPF ISA only supports direct call BPF_CALL[1], technically there is an
opcode 0x8d (BPF_JUMP | BPF_CALL | BPF_X) that has the indirect call
semantic, and Clang emit such indirect call instruction if user attempt to
compile with -O0.

I think it worth mentioning in this document for better clarity, perhaps
simply saying that indirect call is not part of BPF ISA is enough.

1: https://lore.kernel.org/bpf/20220713011851.4a2tnqhdd5f5iwak@macbook-pro-3.dhcp.thefacebook.com/

> +Conceptually, each helper function is implemented with a commonly shared function
> +signature defined as:
> +
> +  uint64_t function(uint64_t r1, uint64_t r2, uint64_t r3, uint64_t r4, uint64_t r5)
> +
> +In actuality, each helper function is defined as taking between 0 and 5 arguments,
> +with the remaining registers being ignored.  The definition of a helper function
> +is responsible for specifying the type (e.g., integer, pointer, etc.) of the value returned,
> +the number of arguments, and the type of each argument.
> 
>  Load and store instructions
> ===========================
> 
> -For load and store instructions (BPF_LD, BPF_LDX, BPF_ST and BPF_STX), the
> +For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
> 8-bit 'opcode' field is divided as:
> 
> -  ============  ======  =================
> -  3 bits (MSB)  2 bits  3 bits (LSB)
> -  ============  ======  =================
> -  mode          size    instruction class
> -  ============  ======  =================
> +============  ======  =================
> +3 bits (MSB)  2 bits  3 bits (LSB)
> +============  ======  =================
> +mode          size    instruction class
> +============  ======  =================
> 
> -The size modifier is one of:
> +mode
> +  one of:
> +
> +  =============  =====  ====================================  =============
> +  mode modifier  value  description                           reference
> +  =============  =====  ====================================  =============
> +  BPF_IMM        0x00   64-bit immediate instructions         `64-bit immediate instructions`_
> +  BPF_ABS        0x20   legacy BPF packet access (absolute)   `Legacy BPF Packet access instructions`_
> +  BPF_IND        0x40   legacy BPF packet access (indirect)   `Legacy BPF Packet access instructions`_
> +  BPF_MEM        0x60   regular load and store operations     `Regular load and store operations`_
> +  BPF_ATOMIC     0xc0   atomic operations                     `Atomic operations`_
> +  =============  =====  ====================================  =============
> +
> +size
> +  one of:
> 
>    =============  =====  =====================
>    size modifier  value  description
> @@ -213,18 +362,8 @@ The size modifier is one of:
>    BPF_DW         0x18   double word (8 bytes)
>    =============  =====  =====================
> 
> -The mode modifier is one of:
> -
> -  =============  =====  ====================================
> -  mode modifier  value  description
> -  =============  =====  ====================================
> -  BPF_IMM        0x00   64-bit immediate instructions
> -  BPF_ABS        0x20   legacy BPF packet access (absolute)
> -  BPF_IND        0x40   legacy BPF packet access (indirect)
> -  BPF_MEM        0x60   regular load and store operations
> -  BPF_ATOMIC     0xc0   atomic operations
> -  =============  =====  ====================================
> -
> +instruction class
> +  the instruction class (see `Instruction classes`_)
> 
>  Regular load and store operations
> ---------------------------------
> @@ -232,19 +371,22 @@ Regular load and store operations
> The ``BPF_MEM`` mode modifier is used to encode regular load and store
> instructions that transfer data between a register and memory.
> 
> -``BPF_MEM | <size> | BPF_STX`` means::
> -
> -  *(size *) (dst_reg + off) = src_reg
> -
> -``BPF_MEM | <size> | BPF_ST`` means::
> -
> -  *(size *) (dst_reg + off) = imm32
> -
> -``BPF_MEM | <size> | BPF_LDX`` means::
> -
> -  dst_reg = *(size *) (src_reg + off)
> -
> -Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
> +=============================  =========  ==================================
> +opcode construction            opcode     pseudocode
> +=============================  =========  ==================================
> +BPF_MEM | BPF_B | BPF_LDX      0x71       dst = *(uint8_t *) (src + offset)  
> +BPF_MEM | BPF_H | BPF_LDX      0x69       dst = *(uint16_t *) (src + offset)
> +BPF_MEM | BPF_W | BPF_LDX      0x61       dst = *(uint32_t *) (src + offset)
> +BPF_MEM | BPF_DW | BPF_LDX     0x79       dst = *(uint64_t *) (src + offset)
> +BPF_MEM | BPF_B | BPF_ST       0x72       *(uint8_t *) (dst + offset) = imm
> +BPF_MEM | BPF_H | BPF_ST       0x6a       *(uint16_t *) (dst + offset) = imm
> +BPF_MEM | BPF_W | BPF_ST       0x62       *(uint32_t *) (dst + offset) = imm
> +BPF_MEM | BPF_DW | BPF_ST      0x7a       *(uint64_t *) (dst + offset) = imm
> +BPF_MEM | BPF_B | BPF_STX      0x73       *(uint8_t *) (dst + offset) = src
> +BPF_MEM | BPF_H | BPF_STX      0x6b       *(uint16_t *) (dst + offset) = src
> +BPF_MEM | BPF_W | BPF_STX      0x63       *(uint32_t *) (dst + offset) = src
> +BPF_MEM | BPF_DW | BPF_STX     0x7b       *(uint64_t *) (dst + offset) = src
> +=============================  =========  ==================================
> 
>  Atomic operations
> -----------------
> @@ -256,76 +398,83 @@ by other eBPF programs or means outside of this specification.
> All atomic operations supported by eBPF are encoded as store operations
> that use the ``BPF_ATOMIC`` mode modifier as follows:
> 
> -  * ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
> -  * ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
> -  * 8-bit and 16-bit wide atomic operations are not supported.
> +* ``BPF_ATOMIC | BPF_W | BPF_STX`` (0xc3) for 32-bit operations
> +* ``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) for 64-bit operations
> +
> +Note that 8-bit (``BPF_B``) and 16-bit (``BPF_H``) wide atomic operations are not supported,
> +nor is ``BPF_ATOMIC | <size> | BPF_ST``.
> 
> -The imm field is used to encode the actual atomic operation.
> +The 'imm' field is used to encode the actual atomic operation.
> Simple atomic operation use a subset of the values defined to encode
> -arithmetic operations in the imm field to encode the atomic operation:
> +arithmetic operations in the 'imm' field to encode the atomic operation:
> 
> -  ========  =====  ===========
> -  imm       value  description
> -  ========  =====  ===========
> -  BPF_ADD   0x00   atomic add
> -  BPF_OR    0x40   atomic or
> -  BPF_AND   0x50   atomic and
> -  BPF_XOR   0xa0   atomic xor
> -  ========  =====  ===========
> +========  =====  ===========  =======
> +imm       value  description  version
> +========  =====  ===========  =======
> +BPF_ADD   0x00   atomic add   1
> +BPF_OR    0x40   atomic or    3
> +BPF_AND   0x50   atomic and   3
> +BPF_XOR   0xa0   atomic xor   3
> +========  =====  ===========  =======
> 
> +where 'version' indicates the first ISA version in which the value was supported.
> 
> -``BPF_ATOMIC | BPF_W  | BPF_STX`` with imm = BPF_ADD means::
> +``BPF_ATOMIC | BPF_W  | BPF_STX`` (0xc3) with 'imm' = BPF_ADD means::
> 
> -  *(u32 *)(dst_reg + off16) += src_reg
> +  *(uint32_t *)(dst + offset) += src
> 
> -``BPF_ATOMIC | BPF_DW | BPF_STX`` with imm = BPF ADD means::
> +``BPF_ATOMIC | BPF_DW | BPF_STX`` (0xdb) with 'imm' = BPF ADD means::
> 
> -  *(u64 *)(dst_reg + off16) += src_reg
> +  *(uint64_t *)(dst + offset) += src
> 
> -``BPF_XADD`` is a deprecated name for ``BPF_ATOMIC | BPF_ADD``.
> +``BPF_XADD`` appeared in version 1, but is now considered to be a deprecated alias
> +for ``BPF_ATOMIC | BPF_ADD``.
> 
> -In addition to the simple atomic operations, there also is a modifier and
> +In addition to the simple atomic operations above, there also is a modifier and
> two complex atomic operations:
> 
> -  ===========  ================  ===========================
> -  imm          value             description
> -  ===========  ================  ===========================
> -  BPF_FETCH    0x01              modifier: return old value
> -  BPF_XCHG     0xe0 | BPF_FETCH  atomic exchange
> -  BPF_CMPXCHG  0xf0 | BPF_FETCH  atomic compare and exchange
> -  ===========  ================  ===========================
> +===========  ================  ===========================  =======
> +imm          value             description                  version
> +===========  ================  ===========================  =======
> +BPF_FETCH    0x01              modifier: return old value   3
> +BPF_XCHG     0xe0 | BPF_FETCH  atomic exchange              3
> +BPF_CMPXCHG  0xf0 | BPF_FETCH  atomic compare and exchange  3
> +===========  ================  ===========================  =======
> 
>  The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
> always set for the complex atomic operations.  If the ``BPF_FETCH`` flag
> -is set, then the operation also overwrites ``src_reg`` with the value that
> +is set, then the operation also overwrites ``src`` with the value that
> was in memory before it was modified.
> 
> -The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
> -addressed by ``dst_reg + off``.
> +The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
> +addressed by ``dst + offset``.
> 
>  The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
> -``dst_reg + off`` with ``R0``. If they match, the value addressed by
> -``dst_reg + off`` is replaced with ``src_reg``. In either case, the
> -value that was at ``dst_reg + off`` before the operation is zero-extended
> +``dst + offset`` with ``R0``. If they match, the value addressed by
> +``dst + offset`` is replaced with ``src``. In either case, the
> +value that was at ``dst + offset`` before the operation is zero-extended
> and loaded back to ``R0``.
> 
> -Clang can generate atomic instructions by default when ``-mcpu=v3`` is
> -enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
> -Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
> -the atomics features, while keeping a lower ``-mcpu`` version, you can use
> -``-Xclang -target-feature -Xclang +alu32``.
> +   **Note**
> +
> +   *Clang implementation*:
> +   Clang can generate atomic instructions by default when ``-mcpu=v3`` is
> +   enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
> +   Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
> +   the atomics features, while keeping a lower ``-mcpu`` version, you can use
> +   ``-Xclang -target-feature -Xclang +alu32``.
> 
>  64-bit immediate instructions
> -----------------------------
> 
> -Instructions with the ``BPF_IMM`` mode modifier use the wide instruction
> +Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
> encoding for an extra imm64 value.
> 
>  There is currently only one such instruction.
> 
> -``BPF_LD | BPF_DW | BPF_IMM`` means::
> +``BPF_IMM | BPF_DW | BPF_LD`` (0x18) means::
> 
> -  dst_reg = imm64
> +  dst = imm64
> 
>  
>  Legacy BPF Packet access instructions
> @@ -333,34 +482,236 @@ Legacy BPF Packet access instructions
> 
>  eBPF has special instructions for access to packet data that have been
> carried over from classic BPF to retain the performance of legacy socket
> -filters running in the eBPF interpreter.
> +filters running in an eBPF interpreter.
> 
>  The instructions come in two forms: ``BPF_ABS | <size> | BPF_LD`` and
> ``BPF_IND | <size> | BPF_LD``.
> 
>  These instructions are used to access packet data and can only be used when
> -the program context is a pointer to networking packet.  ``BPF_ABS``
> +the program context contains a pointer to a networking packet.  ``BPF_ABS``
> accesses packet data at an absolute offset specified by the immediate data
> and ``BPF_IND`` access packet data at an offset that includes the value of
> a register in addition to the immediate data.
> 
>  These instructions have seven implicit operands:
> 
> - * Register R6 is an implicit input that must contain pointer to a
> -   struct sk_buff.
> - * Register R0 is an implicit output which contains the data fetched from
> -   the packet.
> - * Registers R1-R5 are scratch registers that are clobbered after a call to
> -   ``BPF_ABS | BPF_LD`` or ``BPF_IND | BPF_LD`` instructions.
> -
> -These instructions have an implicit program exit condition as well. When an
> -eBPF program is trying to access the data beyond the packet boundary, the
> -program execution will be aborted.
> -
> -``BPF_ABS | BPF_W | BPF_LD`` means::
> -
> -  R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + imm32))
> -
> -``BPF_IND | BPF_W | BPF_LD`` means::
> -
> -  R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
> +* Register R6 is an implicit input that must contain a pointer to a
> +  context structure with a packet data pointer.
> +* Register R0 is an implicit output which contains the data fetched from
> +  the packet.
> +* Registers R1-R5 are scratch registers that are clobbered by the
> +  instruction.
> +
> +   **Note**
> +
> +   *Linux implementation*: In Linux, R6 references a struct sk_buff.
> +
> +These instructions have an implicit program exit condition as well. If an
> +eBPF program attempts access data beyond the packet boundary, the
> +program execution must be gracefully aborted.
> +
> +``BPF_ABS | BPF_W | BPF_LD`` (0x20) means::
> +
> +  R0 = ntohl(*(uint32_t *) (R6->data + imm))
> +
> +where ``ntohl()`` converts a 32-bit value from network byte order to host byte order.
> +
> +``BPF_IND | BPF_W | BPF_LD`` (0x40) means::
> +
> +  R0 = ntohl(*(uint32_t *) (R6->data + src + imm))
> +
> +Appendix
> +========
> +
> +For reference, the following table lists opcodes in order by value.
> +
> +======  ====  ===================================================  =============
> +opcode  imm   description                                          reference 
> +======  ====  ===================================================  =============
> +0x04    any   dst = (uint32_t)(dst + imm)                          `Arithmetic instructions`_
> +0x05    0x00  goto +offset                                         `Jump instructions`_
> +0x07    any   dst += imm                                           `Arithmetic instructions`_
> +0x0c    0x00  dst = (uint32_t)(dst + src)                          `Arithmetic instructions`_
> +0x0f    0x00  dst += src                                           `Arithmetic instructions`_
> +0x14    any   dst = (uint32_t)(dst - imm)                          `Arithmetic instructions`_
> +0x15    any   if dst == imm goto +offset                           `Jump instructions`_
> +0x16    any   if (uint32_t)dst == imm goto +offset                 `Jump instructions`_
> +0x17    any   dst -= imm                                           `Arithmetic instructions`_
> +0x18    any   dst = imm                                            `Load and store instructions`_
> +0x1c    0x00  dst = (uint32_t)(dst - src)                          `Arithmetic instructions`_
> +0x1d    0x00  if dst == src goto +offset                           `Jump instructions`_
> +0x1e    0x00  if (uint32_t)dst == (uint32_t)src goto +offset       `Jump instructions`_
> +0x1f    0x00  dst -= src                                           `Arithmetic instructions`_
> +0x20    any   dst = ntohl(\*(uint32_t \*)(R6->data + imm))         `Load and store instructions`_
> +0x24    any   dst = (uint32_t)(dst \* imm)                         `Arithmetic instructions`_
> +0x25    any   if dst > imm goto +offset                            `Jump instructions`_
> +0x26    any   if (uint32_t)dst > imm goto +offset                  `Jump instructions`_
> +0x27    any   dst \*= imm                                          `Arithmetic instructions`_
> +0x28    any   dst = ntohs(\*(uint16_t \*)(R6->data + imm))         `Load and store instructions`_
> +0x2c    0x00  dst = (uint32_t)(dst \* src)                         `Arithmetic instructions`_
> +0x2d    0x00  if dst > src goto +offset                            `Jump instructions`_
> +0x2e    0x00  if (uint32_t)dst > (uint32_t)src goto +offset        `Jump instructions`_
> +0x2f    0x00  dst \*= src                                          `Arithmetic instructions`_
> +0x30    any   dst = (\*(uint8_t \*)(R6->data + imm))               `Load and store instructions`_
> +0x34    any   dst = (uint32_t)(dst / imm)                          `Arithmetic instructions`_
> +0x35    any   if dst >= imm goto +offset                           `Jump instructions`_
> +0x36    any   if (uint32_t)dst >= imm goto +offset                 `Jump instructions`_
> +0x37    any   dst /= imm                                           `Arithmetic instructions`_
> +0x38    any   dst = ntohll(\*(uint64_t \*)(R6->data + imm))        `Load and store instructions`_
> +0x3c    0x00  dst = (uint32_t)(dst / src)                          `Arithmetic instructions`_
> +0x3d    0x00  if dst >= src goto +offset                           `Jump instructions`_
> +0x3e    0x00  if (uint32_t)dst >= (uint32_t)src goto +offset       `Jump instructions`_
> +0x3f    0x00  dst /= src                                           `Arithmetic instructions`_
> +0x40    any   dst = ntohl(\*(uint32_t \*)(R6->data + src + imm))   `Load and store instructions`_
> +0x44    any   dst = (uint32_t)(dst \| imm)                         `Arithmetic instructions`_
> +0x45    any   if dst & imm goto +offset                            `Jump instructions`_
> +0x46    any   if (uint32_t)dst & imm goto +offset                  `Jump instructions`_
> +0x47    any   dst \|= imm                                          `Arithmetic instructions`_
> +0x48    any   dst = ntohs(\*(uint16_t \*)(R6->data + src + imm))   `Load and store instructions`_
> +0x4c    0x00  dst = (uint32_t)(dst \| src)                         `Arithmetic instructions`_
> +0x4d    0x00  if dst & src goto +offset                            `Jump instructions`_
> +0x4e    0x00  if (uint32_t)dst & (uint32_t)src goto +offset        `Jump instructions`_
> +0x4f    0x00  dst \|= src                                          `Arithmetic instructions`_
> +0x50    any   dst = \*(uint8_t \*)(R6->data + src + imm))          `Load and store instructions`_
> +0x54    any   dst = (uint32_t)(dst & imm)                          `Arithmetic instructions`_
> +0x55    any   if dst != imm goto +offset                           `Jump instructions`_
> +0x56    any   if (uint32_t)dst != imm goto +offset                 `Jump instructions`_
> +0x57    any   dst &= imm                                           `Arithmetic instructions`_
> +0x58    any   dst = ntohll(\*(uint64_t \*)(R6->data + src + imm))  `Load and store instructions`_
> +0x5c    0x00  dst = (uint32_t)(dst & src)                          `Arithmetic instructions`_
> +0x5d    0x00  if dst != src goto +offset                           `Jump instructions`_
> +0x5e    0x00  if (uint32_t)dst != (uint32_t)src goto +offset       `Jump instructions`_
> +0x5f    0x00  dst &= src                                           `Arithmetic instructions`_
> +0x61    0x00  dst = \*(uint32_t \*)(src + offset)                  `Load and store instructions`_
> +0x62    any   \*(uint32_t \*)(dst + offset) = imm                  `Load and store instructions`_
> +0x63    0x00  \*(uint32_t \*)(dst + offset) = src                  `Load and store instructions`_
> +0x64    any   dst = (uint32_t)(dst << imm)                         `Arithmetic instructions`_
> +0x65    any   if dst s> imm goto +offset                           `Jump instructions`_
> +0x66    any   if (int32_t)dst s> (int32_t)imm goto +offset         `Jump instructions`_
> +0x67    any   dst <<= imm                                          `Arithmetic instructions`_
> +0x69    0x00  dst = \*(uint16_t \*)(src + offset)                  `Load and store instructions`_
> +0x6a    any   \*(uint16_t \*)(dst + offset) = imm                  `Load and store instructions`_
> +0x6b    0x00  \*(uint16_t \*)(dst + offset) = src                  `Load and store instructions`_
> +0x6c    0x00  dst = (uint32_t)(dst << src)                         `Arithmetic instructions`_
> +0x6d    0x00  if dst s> src goto +offset                           `Jump instructions`_
> +0x6e    0x00  if (int32_t)dst s> (int32_t)src goto +offset         `Jump instructions`_
> +0x6f    0x00  dst <<= src                                          `Arithmetic instructions`_
> +0x71    0x00  dst = \*(uint8_t \*)(src + offset)                   `Load and store instructions`_
> +0x72    any   \*(uint8_t \*)(dst + offset) = imm                   `Load and store instructions`_
> +0x73    0x00  \*(uint8_t \*)(dst + offset) = src                   `Load and store instructions`_
> +0x74    any   dst = (uint32_t)(dst >> imm)                         `Arithmetic instructions`_
> +0x75    any   if dst s>= imm goto +offset                          `Jump instructions`_
> +0x76    any   if (int32_t)dst s>= (int32_t)imm goto +offset        `Jump instructions`_
> +0x77    any   dst >>= imm                                          `Arithmetic instructions`_
> +0x79    0x00  dst = \*(uint64_t \*)(src + offset)                  `Load and store instructions`_
> +0x7a    any   \*(uint64_t \*)(dst + offset) = imm                  `Load and store instructions`_
> +0x7b    0x00  \*(uint64_t \*)(dst + offset) = src                  `Load and store instructions`_
> +0x7c    0x00  dst = (uint32_t)(dst >> src)                         `Arithmetic instructions`_
> +0x7d    0x00  if dst s>= src goto +offset                          `Jump instructions`_
> +0x7e    0x00  if (int32_t)dst s>= (int32_t)src goto +offset        `Jump instructions`_
> +0x7f    0x00  dst >>= src                                          `Arithmetic instructions`_
> +0x84    0x00  dst = (uint32_t)-dst                                 `Arithmetic instructions`_
> +0x85    any   call imm                                             `Jump instructions`_
> +0x87    0x00  dst = -dst                                           `Arithmetic instructions`_
> +0x94    any   dst = (uint32_t)(dst % imm)                          `Arithmetic instructions`_
> +0x95    0x00  return                                               `Jump instructions`_
> +0x97    any   dst %= imm                                           `Arithmetic instructions`_
> +0x9c    0x00  dst = (uint32_t)(dst % src)                          `Arithmetic instructions`_
> +0x9f    0x00  dst %= src                                           `Arithmetic instructions`_
> +0xa4    any   dst = (uint32_t)(dst ^ imm)                          `Arithmetic instructions`_
> +0xa5    any   if dst < imm goto +offset                            `Jump instructions`_
> +0xa6    any   if (uint32_t)dst < imm goto +offset                  `Jump instructions`_
> +0xa7    any   dst ^= imm                                           `Arithmetic instructions`_
> +0xac    0x00  dst = (uint32_t)(dst ^ src)                          `Arithmetic instructions`_
> +0xad    0x00  if dst < src goto +offset                            `Jump instructions`_
> +0xae    0x00  if (uint32_t)dst < (uint32_t)src goto +offset        `Jump instructions`_
> +0xaf    0x00  dst ^= src                                           `Arithmetic instructions`_
> +0xb4    any   dst = (uint32_t) imm                                 `Arithmetic instructions`_
> +0xb5    any   if dst <= imm goto +offset                           `Jump instructions`_
> +0xa6    any   if (uint32_t)dst <= imm goto +offset                 `Jump instructions`_
> +0xb7    any   dst = imm                                            `Arithmetic instructions`_
> +0xbc    0x00  dst = (uint32_t) src                                 `Arithmetic instructions`_
> +0xbd    0x00  if dst <= src goto +offset                           `Jump instructions`_
> +0xbe    0x00  if (uint32_t)dst <= (uint32_t)src goto +offset       `Jump instructions`_
> +0xbf    0x00  dst = src                                            `Arithmetic instructions`_
> +0xc3    0x00  lock \*(uint32_t \*)(dst + offset) += src            `Atomic operations`_
> +0xc3    0x01  lock::                                               `Atomic operations`_
> +
> +                  *(uint32_t *)(dst + offset) += src
> +                  src = *(uint32_t *)(dst + offset)
> +0xc3    0x40  \*(uint32_t \*)(dst + offset) \|= src                `Atomic operations`_
> +0xc3    0x41  lock::                                               `Atomic operations`_
> +
> +                  *(uint32_t *)(dst + offset) |= src
> +                  src = *(uint32_t *)(dst + offset)
> +0xc3    0x50  \*(uint32_t \*)(dst + offset) &= src                 `Atomic operations`_
> +0xc3    0x51  lock::                                               `Atomic operations`_
> +
> +                  *(uint32_t *)(dst + offset) &= src
> +                  src = *(uint32_t *)(dst + offset)
> +0xc3    0xa0  \*(uint32_t \*)(dst + offset) ^= src                 `Atomic operations`_
> +0xc3    0xa1  lock::                                               `Atomic operations`_
> +
> +                  *(uint32_t *)(dst + offset) ^= src
> +                  src = *(uint32_t *)(dst + offset)
> +0xc3    0xe1  lock::                                               `Atomic operations`_
> +
> +                  temp = *(uint32_t *)(dst + offset)
> +                  *(uint32_t *)(dst + offset) = src
> +                  src = temp
> +0xc3    0xf1  lock::                                               `Atomic operations`_
> +
> +                  temp = *(uint32_t *)(dst + offset)
> +                  if *(uint32_t)(dst + offset) == R0
> +                     *(uint32_t)(dst + offset) = src
> +                  R0 = temp
> +0xc4    any   dst = (uint32_t)(dst s>> imm)                        `Arithmetic instructions`_
> +0xc5    any   if dst s< imm goto +offset                           `Jump instructions`_
> +0xc6    any   if (int32_t)dst s< (int32_t)imm goto +offset         `Jump instructions`_
> +0xc7    any   dst s>>= imm                                         `Arithmetic instructions`_
> +0xcc    0x00  dst = (uint32_t)(dst s>> src)                        `Arithmetic instructions`_
> +0xcd    0x00  if dst s< src goto +offset                           `Jump instructions`_
> +0xce    0x00  if (int32_t)dst s< (int32_t)src goto +offset         `Jump instructions`_
> +0xcf    0x00  dst s>>= src                                         `Arithmetic instructions`_
> +0xd4    0x10  dst = htole16(dst)                                   `Byte swap instructions`_
> +0xd4    0x20  dst = htole32(dst)                                   `Byte swap instructions`_
> +0xd4    0x40  dst = htole64(dst)                                   `Byte swap instructions`_
> +0xd5    any   if dst s<= imm goto +offset                          `Jump instructions`_
> +0xd6    any   if (int32_t)dst s<= (int32_t)imm goto +offset        `Jump instructions`_
> +0xc3    0x00  lock \*(uint64_t \*)(dst + offset) += src            `Atomic operations`_
   ^^^^
The opcode should be 0xdb as well

Otherwise,
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>

> +0xdb    0x01  lock::                                               `Atomic operations`_
> +
> +                  *(uint64_t *)(dst + offset) += src
> +                  src = *(uint64_t *)(dst + offset)
> +0xdb    0x40  \*(uint64_t \*)(dst + offset) \|= src                `Atomic operations`_
> +0xdb    0x41  lock::                                               `Atomic operations`_
> +
> +                  *(uint64_t *)(dst + offset) |= src
> +                  lock src = *(uint64_t *)(dst + offset)
> +0xdb    0x50  \*(uint64_t \*)(dst + offset) &= src                 `Atomic operations`_
> +0xdb    0x51  lock::                                               `Atomic operations`_
> +
> +                  *(uint64_t *)(dst + offset) &= src
> +                  src = *(uint64_t *)(dst + offset)
> +0xdb    0xa0  \*(uint64_t \*)(dst + offset) ^= src                 `Atomic operations`_
> +0xdb    0xa1  lock::                                               `Atomic operations`_
> +
> +                  *(uint64_t *)(dst + offset) ^= src
> +                  src = *(uint64_t *)(dst + offset)
> +0xdb    0xe1  lock::                                               `Atomic operations`_
> +
> +                  temp = *(uint64_t *)(dst + offset)
> +                  *(uint64_t *)(dst + offset) = src
> +                  src = temp
> +0xdb    0xf1  lock::                                               `Atomic operations`_
> +
> +                  temp = *(uint64_t *)(dst + offset)
> +                  if *(uint64_t)(dst + offset) == R0
> +                     *(uint64_t)(dst + offset) = src
> +                  R0 = temp
> +0xdc    0x10  dst = htobe16(dst)                                   `Byte swap instructions`_
> +0xdc    0x20  dst = htobe32(dst)                                   `Byte swap instructions`_
> +0xdc    0x40  dst = htobe64(dst)                                   `Byte swap instructions`_
> +0xdd    0x00  if dst s<= src goto +offset                          `Jump instructions`_
> +0xde    0x00  if (int32_t)dst s<= (int32_t)src goto +offset        `Jump instructions`_
> +======  ====  ===================================================  =============
> -- 
> 2.32.0.windows.1
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-14  6:22   ` Shung-Hsi Yu
@ 2022-09-14  9:35     ` Dave Thaler
  2022-09-19 17:04     ` Christoph Hellwig
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Thaler @ 2022-09-14  9:35 UTC (permalink / raw)
  To: Shung-Hsi Yu; +Cc: bpf

Shung-Hsi Yu <shung-hsi.yu@suse.com> writes:
[...]
> > +imm
> > +  integer immediate value
> 
> Perhaps mention that imm is a _signed_ integer just like offset below?

Thanks, will add.

[...]
> > +``BPF_DIV`` has an implicit program exit condition as well. If
> > +eBPF program execution would result in division by zero,
> > +program execution must be gracefully aborted.
> 
> As discussed in yesterday's session, there's no graceful abortion on
> division by zero, instead, the BPF verifier in Linux prevents division by
> zero from happening. Here a few additional notes:
> 
> 1. Modulo by zero is also prevented for the same reason
[...]

Thanks, Daniel pointed that out too after the session so I am adding this info.

> > +Helper functions are a concept whereby BPF programs can call into
> > +set of function calls exposed by the eBPF runtime.  Each helper
> > +function is identified by an integer used in a ``BPF_CALL`` instruction.
> > +The available helper functions may differ for each eBPF program type.
> 
> While BPF ISA only supports direct call BPF_CALL[1], technically there is an
> opcode 0x8d (BPF_JUMP | BPF_CALL | BPF_X) that has the indirect call
> semantic, and Clang emit such indirect call instruction if user attempt to
> compile with -O0.
> 
> I think it worth mentioning in this document for better clarity, perhaps
> simply saying that indirect call is not part of BPF ISA is enough.

Noted, will try to add something to that effect.

[...]
> > +0xc3    0x00  lock \*(uint64_t \*)(dst + offset) += src            `Atomic
> operations`_
>    ^^^^
> The opcode should be 0xdb as well

Ack.

> Otherwise,
> Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>

Thanks,
Dave

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-13  8:12 ` FW: ebpf-docs: draft of ISA doc updates in progress Dave Thaler
  2022-09-14  6:22   ` Shung-Hsi Yu
@ 2022-09-19 16:58   ` Christoph Hellwig
  2022-09-20 19:37     ` Dave Thaler
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-19 16:58 UTC (permalink / raw)
  To: Dave Thaler; +Cc: bpf

Hi Dave,

there is a lot of good thing in here, but it is a bit hard to review,
mostly because it is a giant patch instead of a single well-documented
patch per logical thing to change.

A bunch of nitpicks below, mostly style or organizational:


> +The current Instruction Set Architecture (ISA) version, sometimes referred to in other documents
> +as a "CPU" version, is 3.  This document also covers older versions of the ISA.

Hmm, I thought the versioning was a bit more complicated based on
the mailing list interactions and the call.  Especially with things
like the full atomics not even supported by all gits.

> +   **Note**
> +
> +   *Linux implementation*: In the Linux kernel, the exit value for eBPF
> +   programs is passed as a 32 bit value.

Is this Linux, a specific program type, or the ISA?

> +   *Linux implementation*: In the Linux kernel, all program types only use
> +   R1 which contains the "context", which is typically a structure containing all
> +   the inputs needed.  

I also think these Linux notes do not belong into the main instruction
set document, which tries to really just describe the ISA.

> - * the basic instruction encoding, which uses 64 bits to encode an instruction
> - * the wide instruction encoding, which appends a second 64-bit immediate value
> -   (imm64) after the basic instruction for a total of 128 bits.
> +* the basic instruction encoding, which uses 64 bits to encode an instruction
> +* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,

Btw, can you explain why you de-indent these?  I picked the space before
the * because that seems to be what most Linux RST documents do.

> +   For ISA versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
> +   ``-Xclang -target-feature -Xclang +alu32``.

I also suspect the clang notes would be better off in a separate
document from the main ISA.

> -BPF_XOR | BPF_K | BPF_ALU means::
> +   *Linux implementation*: In the Linux kernel, uint32_t is expressed as u32,
> +   uint64_t is expressed as u64, etc.  This document uses the standard C terminology
> +   as the cross-platform specification.

I don't think this makes sense in the document.  Instead we probably
need a "Conventions" section that defines the type and syntax we use.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-14  6:22   ` Shung-Hsi Yu
  2022-09-14  9:35     ` Dave Thaler
@ 2022-09-19 17:04     ` Christoph Hellwig
  2022-09-20 19:12       ` Dave Thaler
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2022-09-19 17:04 UTC (permalink / raw)
  To: Shung-Hsi Yu; +Cc: Dave Thaler, bpf

On Wed, Sep 14, 2022 at 02:22:51PM +0800, Shung-Hsi Yu wrote:
> As discussed in yesterday's session, there's no graceful abortion on
> division by zero, instead, the BPF verifier in Linux prevents division by
> zero from happening. Here a few additional notes:

Hmm, I thought Alexei pointed out a while ago that divide by zero is
now defined to return 0 following.  Ok, reading further along I think
that is what you describe with the pseudo-code below. 

> While BPF ISA only supports direct call BPF_CALL[1], technically there is an
> opcode 0x8d (BPF_JUMP | BPF_CALL | BPF_X) that has the indirect call
> semantic, and Clang emit such indirect call instruction if user attempt to
> compile with -O0.
> 
> I think it worth mentioning in this document for better clarity, perhaps
> simply saying that indirect call is not part of BPF ISA is enough.

Which brings up another question:  Do we need a list of opcodes
that someone else defined somewhere that are not considered valid
eBPF?  Or how do we get clang and gcc to stop producing invalid
eBPF might be the better question.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-19 17:04     ` Christoph Hellwig
@ 2022-09-20 19:12       ` Dave Thaler
  2022-09-20 23:39         ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Thaler @ 2022-09-20 19:12 UTC (permalink / raw)
  To: Christoph Hellwig, Shung-Hsi Yu; +Cc: bpf

> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: Monday, September 19, 2022 10:04 AM
> To: Shung-Hsi Yu <shung-hsi.yu@suse.com>
> Cc: Dave Thaler <dthaler@microsoft.com>; bpf <bpf@vger.kernel.org>
> Subject: Re: FW: ebpf-docs: draft of ISA doc updates in progress
> 
> On Wed, Sep 14, 2022 at 02:22:51PM +0800, Shung-Hsi Yu wrote:
> > As discussed in yesterday's session, there's no graceful abortion on
> > division by zero, instead, the BPF verifier in Linux prevents division
> > by zero from happening. Here a few additional notes:
> 
> Hmm, I thought Alexei pointed out a while ago that divide by zero is now
> defined to return 0 following.  Ok, reading further along I think that is what
> you describe with the pseudo-code below.

Based on the discussion at LPC, and the fact that older implementations,
as well as uBPF and rbpf still terminate the program, I've added this text
to permit both behaviors:

> If eBPF program execution would result in division by zero,
> the destination register SHOULD instead be set to zero, but
> program execution MAY be gracefully aborted instead.
> Similarly, if execution would result in modulo by zero,
> the destination register SHOULD instead be set to the source value,
> but program execution MAY be gracefully aborted instead.

And elsewhere in the doc defined gracefully aborted as:

> After execution of an eBPF program, register R0 contains the exit code
> whose meaning is defined by the program type, except that an exit code
> of -1 means the program was gracefully aborted.  That is, if a program
> is gracefully aborted for any reason, it means that no further instructions
> are executed, and a value of -1 is returned in register R0 to the caller of
> the program.

The problem with that, as Quentin pointed out, is that -1 is a valid return
code from some program types like TC.  Do we suddenly declare
uBPF etc as being non-compliant?  My preference is just to document
the issue, since such runtimes might choose to make -1 be a reserved
value for all program types they support.  After all the ISA does not
define program type details so they can use the ISA without TC etc.

> > While BPF ISA only supports direct call BPF_CALL[1], technically there
> > is an opcode 0x8d (BPF_JUMP | BPF_CALL | BPF_X) that has the indirect
> > call semantic, and Clang emit such indirect call instruction if user
> > attempt to compile with -O0.
> >
> > I think it worth mentioning in this document for better clarity,
> > perhaps simply saying that indirect call is not part of BPF ISA is enough.
> 
> Which brings up another question:  Do we need a list of opcodes that
> someone else defined somewhere that are not considered valid eBPF?

That's what my document currently does, saying:

> Note that ``BPF_CALL | BPF_X | BPF_JMP`` (0x8d), where the helper
> function integer would be read from a specified register, is not currently
> permitted.
>
>   **Note**
>
>   *Clang implementation*:
>   Clang will generate this invalid instruction if ``-O0`` is used.

The latest version can be viewed at 
https://github.com/dthaler/ebpf-docs/blob/update/isa/kernel.org/instruction-set.rst
I can post another patchset after addressing any other comments.

Dave



^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-19 16:58   ` Christoph Hellwig
@ 2022-09-20 19:37     ` Dave Thaler
  2022-09-20 23:45       ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Thaler @ 2022-09-20 19:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bpf

Christoph Hellwig <hch@infradead.org> writes:
[...]
> > +The current Instruction Set Architecture (ISA) version, sometimes
> > +referred to in other documents as a "CPU" version, is 3.  This document
> also covers older versions of the ISA.
> 
> Hmm, I thought the versioning was a bit more complicated based on the
> mailing list interactions and the call.  Especially with things like the full
> atomics not even supported by all gits.

Yeah some features, like BPF_ALU, are optional in some versions so yes it's
more complicated so some things have to be described at the granularity
of a feature.

> > +   **Note**
> > +
> > +   *Linux implementation*: In the Linux kernel, the exit value for
> > +eBPF
> > +   programs is passed as a 32 bit value.
> 
> Is this Linux, a specific program type, or the ISA?

https://docs.cilium.io/en/stable/bpf/ just said
> Register r0 is also the register containing the exit value for the BPF program. The
> semantics of the exit value are defined by the type of program. Furthermore, when
> handing execution back to the kernel, the exit value is passed as a 32 bit value.

But other runtimes like ubpf have no such restriction and it is not mentioned
in the original instruction-set.rst, so I assumed from the above that it is a Linux
implementation specific convention, not part of the ISA.

If you have evidence otherwise, let me know.

> > +   *Linux implementation*: In the Linux kernel, all program types
> > +only use
> > +   R1 which contains the "context", which is typically a structure
> > +containing all
> > +   the inputs needed.
> 
> I also think these Linux notes do not belong into the main instruction set
> document, which tries to really just describe the ISA.

I asked this question at LPC and no one had any comments about the
current approach so assumed the default consensus was to leave them
in.  The original instruction-set.rst (prior to my proposed updates)
combined both Linux specific info and general ISA info but during the LPC
discussion we said we'd move the legacy packet instructions to a separate doc
on "Linux historical notes", which I have a draft of at
https://github.com/dthaler/ebpf-docs/blob/update/isa/kernel.org/linux-historical-notes.rst

If we do move Linux notes out of the ISA spec, then we could rename the
other doc to just be "Linux implementation notes", and contain all the Linux
specific notes.   It will be slightly more complicated though to explain what the notes
refer to in the main spec if they're not inline line at present.

Any other opinions?  I don't feel strongly either way.
 
> > - * the basic instruction encoding, which uses 64 bits to encode an
> > instruction
> > - * the wide instruction encoding, which appends a second 64-bit
> > immediate value
> > -   (imm64) after the basic instruction for a total of 128 bits.
> > +* the basic instruction encoding, which uses 64 bits to encode an
> > +instruction
> > +* the wide instruction encoding, which appends a second 64-bit
> > +immediate (i.e.,
> 
> Btw, can you explain why you de-indent these?  I picked the space before the
> * because that seems to be what most Linux RST documents do.

Quentin had pointed out to me some rendering problems with the space.
You can see the difference in GitHub's renderer in a test file at
https://github.com/dthaler/ebpf-docs/blob/test-formatting/test/test-formatting.rst#lists

> > +   For ISA versions prior to 3, Clang v7.0 and later can enable
> > +``BPF_ALU`` support with
> > +   ``-Xclang -target-feature -Xclang +alu32``.
> 
> I also suspect the clang notes would be better off in a separate document
> from the main ISA.

No one else raised concerns at LPC when I explicitly asked this, but I
have no strong opinion either way other than whatever we do for Linux
notes and for Clang notes, the answer should be the same.

> > -BPF_XOR | BPF_K | BPF_ALU means::
> > +   *Linux implementation*: In the Linux kernel, uint32_t is expressed
> > +as u32,
> > +   uint64_t is expressed as u64, etc.  This document uses the
> > +standard C terminology
> > +   as the cross-platform specification.
> 
> I don't think this makes sense in the document.  Instead we probably need a
> "Conventions" section that defines the type and syntax we use.

I'm interpreting this comment as part of the overall Linux note feedback.
Would love to hear if anyone else has opinion on the matter of Linux
implementation notes.

Thanks for the feedback!

Dave


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-20 19:12       ` Dave Thaler
@ 2022-09-20 23:39         ` Alexei Starovoitov
  2022-09-21  8:34           ` Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress) Shung-Hsi Yu
  2022-09-21 17:53           ` FW: ebpf-docs: draft of ISA doc updates in progress Dave Thaler
  0 siblings, 2 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2022-09-20 23:39 UTC (permalink / raw)
  To: Dave Thaler; +Cc: Christoph Hellwig, Shung-Hsi Yu, bpf

On Tue, Sep 20, 2022 at 12:51 PM Dave Thaler <dthaler@microsoft.com> wrote:
>
> > -----Original Message-----
> > From: Christoph Hellwig <hch@infradead.org>
> > Sent: Monday, September 19, 2022 10:04 AM
> > To: Shung-Hsi Yu <shung-hsi.yu@suse.com>
> > Cc: Dave Thaler <dthaler@microsoft.com>; bpf <bpf@vger.kernel.org>
> > Subject: Re: FW: ebpf-docs: draft of ISA doc updates in progress
> >
> > On Wed, Sep 14, 2022 at 02:22:51PM +0800, Shung-Hsi Yu wrote:
> > > As discussed in yesterday's session, there's no graceful abortion on
> > > division by zero, instead, the BPF verifier in Linux prevents division
> > > by zero from happening. Here a few additional notes:
> >
> > Hmm, I thought Alexei pointed out a while ago that divide by zero is now
> > defined to return 0 following.  Ok, reading further along I think that is what
> > you describe with the pseudo-code below.
>
> Based on the discussion at LPC, and the fact that older implementations,
> as well as uBPF and rbpf still terminate the program, I've added this text
> to permit both behaviors:

That's not right. ubpf and rbpf are broken.
We shouldn't be adding descriptions of broken implementations
to the standard.
There is no way to 'gracefully abort' in eBPF.
There is a way to 'return 0' in cBPF, but that's different. See below.

>
> > If eBPF program execution would result in division by zero,
> > the destination register SHOULD instead be set to zero, but
> > program execution MAY be gracefully aborted instead.
> > Similarly, if execution would result in modulo by zero,
> > the destination register SHOULD instead be set to the source value,
> > but program execution MAY be gracefully aborted instead.
>
> And elsewhere in the doc defined gracefully aborted as:
>
> > After execution of an eBPF program, register R0 contains the exit code
> > whose meaning is defined by the program type, except that an exit code
> > of -1 means the program was gracefully aborted.  That is, if a program
> > is gracefully aborted for any reason, it means that no further instructions
> > are executed, and a value of -1 is returned in register R0 to the caller of
> > the program.
>
> The problem with that, as Quentin pointed out, is that -1 is a valid return
> code from some program types like TC.  Do we suddenly declare
> uBPF etc as being non-compliant?

yes. ubpf is non-compliant.
-1 is just a radom number that ubpf picked.
Classic BPF defines a certain situation for obsolete LD_ABS
as 'return 0'.
We had to keep it this way due to classic baggage,
but, as we agreed, we're not going to define these two classic insns
in the standard doc.
So there should be no 'graceful abort' paragraph anywhere
in the standard.


> My preference is just to document
> the issue, since such runtimes might choose to make -1 be a reserved
> value for all program types they support.  After all the ISA does not
> define program type details so they can use the ISA without TC etc.

We can document with certainty the returns codes of XDP prog type
and stress that they should be the same across all run-times.
There is no reason for windows to be different from linux in XDP progs.
What to do with TC and other returns codes is a different matter.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-20 19:37     ` Dave Thaler
@ 2022-09-20 23:45       ` Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2022-09-20 23:45 UTC (permalink / raw)
  To: Dave Thaler; +Cc: Christoph Hellwig, bpf

On Tue, Sep 20, 2022 at 12:58 PM Dave Thaler <dthaler@microsoft.com> wrote:

> > > +   For ISA versions prior to 3, Clang v7.0 and later can enable
> > > +``BPF_ALU`` support with
> > > +   ``-Xclang -target-feature -Xclang +alu32``.
> >
> > I also suspect the clang notes would be better off in a separate document
> > from the main ISA.
>
> No one else raised concerns at LPC when I explicitly asked this, but I
> have no strong opinion either way other than whatever we do for Linux
> notes and for Clang notes, the answer should be the same.

It feels to me it would be better to document the latest ISA
instead of diverging into -mcpu=v1,v2,v3 differences.
So the standard would include all insns as of writing of the doc
including atomics.
Older compilers and compilers with certain flags may generate
a subset of full ISA, but that can be a footnote or 'clang notes'.
No one needs to read a history of how instructions were added
over the last 8 years. That bit is in git logs anyway.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress)
  2022-09-20 23:39         ` Alexei Starovoitov
@ 2022-09-21  8:34           ` Shung-Hsi Yu
  2022-09-21 13:50             ` Alexei Starovoitov
  2022-09-21 14:41             ` Dave Thaler
  2022-09-21 17:53           ` FW: ebpf-docs: draft of ISA doc updates in progress Dave Thaler
  1 sibling, 2 replies; 16+ messages in thread
From: Shung-Hsi Yu @ 2022-09-21  8:34 UTC (permalink / raw)
  To: bpf; +Cc: Dave Thaler, Christoph Hellwig, bpf, Alexei Starovoitov

Subject changed to reflect this reply is out of scope of the original topic
(ISA doc).

On Tue, Sep 20, 2022 at 04:39:52PM -0700, Alexei Starovoitov wrote:
> On Tue, Sep 20, 2022 at 12:51 PM Dave Thaler <dthaler@microsoft.com> wrote:
> > > -----Original Message-----
> > > From: Christoph Hellwig <hch@infradead.org>
> > > Sent: Monday, September 19, 2022 10:04 AM
> > > To: Shung-Hsi Yu <shung-hsi.yu@suse.com>
> > > Cc: Dave Thaler <dthaler@microsoft.com>; bpf <bpf@vger.kernel.org>
> > > Subject: Re: FW: ebpf-docs: draft of ISA doc updates in progress
> > >
> > > On Wed, Sep 14, 2022 at 02:22:51PM +0800, Shung-Hsi Yu wrote:
> > > > As discussed in yesterday's session, there's no graceful abortion on
> > > > division by zero, instead, the BPF verifier in Linux prevents division
> > > > by zero from happening. Here a few additional notes:
> > >
> > > Hmm, I thought Alexei pointed out a while ago that divide by zero is now
> > > defined to return 0 following.  Ok, reading further along I think that is what
> > > you describe with the pseudo-code below.
> >
> > Based on the discussion at LPC, and the fact that older implementations,
> > as well as uBPF and rbpf still terminate the program, I've added this text
> > to permit both behaviors:
> 
> That's not right. ubpf and rbpf are broken.
> We shouldn't be adding descriptions of broken implementations
> to the standard.
> There is no way to 'gracefully abort' in eBPF.
> There is a way to 'return 0' in cBPF, but that's different. See below.
> 
> > > If eBPF program execution would result in division by zero,
> > > the destination register SHOULD instead be set to zero, but
> > > program execution MAY be gracefully aborted instead.
> > > Similarly, if execution would result in modulo by zero,
> > > the destination register SHOULD instead be set to the source value,
> > > but program execution MAY be gracefully aborted instead.

While this makes implementation a lot easier, come to think of it though,
this behavior actually is more like a hack around having to deal with
division/modulo-on-zero at runtime. User doing statistic calculations with
BPF will get bit by this sooner or later, arriving at the wrong calculation
(fast-math comes to mind).

This seems to go against some general BPF principle[1] of preventing the
users from shooting themselves in the foot.

Just like how BPF verifier prevents a _possible_ out-of-bound memory access,
e.g. arr[i] when `i` is not bound-checked. Ideally I'd expect a coherent
approach toward division/modulo-on-zero as well; the verifier should prevent
program that _might_ do division-on-zero from loading in the first place.
(Maybe possible to achieve if we introduce something like SCALAR_OR_NULL
composed type, but that's definitely not easy)

Admittedly even if achievable, this is a radical approach that is not
backward compatible. If such check is implemented, programs that used to
load may now be rejected. (Usually new BPF verifier feature allows more BPF
program to pass the verifier, rather then the other way around)

So, I don't have a good proposal at the moment. The purpose to this email is
to point what I see as an issue out and hope to start a discussion.

1: Okay, I'm making this up a bit, strictly speaking the BPF foundation is
safe program (one of Alexei's talk) rather than preventing users from
shooting themselves in the foot.

[...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress)
  2022-09-21  8:34           ` Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress) Shung-Hsi Yu
@ 2022-09-21 13:50             ` Alexei Starovoitov
  2022-09-22  5:00               ` Shung-Hsi Yu
  2022-09-21 14:41             ` Dave Thaler
  1 sibling, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2022-09-21 13:50 UTC (permalink / raw)
  To: Shung-Hsi Yu; +Cc: bpf, Dave Thaler, Christoph Hellwig

On Wed, Sep 21, 2022 at 1:34 AM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> Subject changed to reflect this reply is out of scope of the original topic
> (ISA doc).
>
> On Tue, Sep 20, 2022 at 04:39:52PM -0700, Alexei Starovoitov wrote:
> > On Tue, Sep 20, 2022 at 12:51 PM Dave Thaler <dthaler@microsoft.com> wrote:
> > > > -----Original Message-----
> > > > From: Christoph Hellwig <hch@infradead.org>
> > > > Sent: Monday, September 19, 2022 10:04 AM
> > > > To: Shung-Hsi Yu <shung-hsi.yu@suse.com>
> > > > Cc: Dave Thaler <dthaler@microsoft.com>; bpf <bpf@vger.kernel.org>
> > > > Subject: Re: FW: ebpf-docs: draft of ISA doc updates in progress
> > > >
> > > > On Wed, Sep 14, 2022 at 02:22:51PM +0800, Shung-Hsi Yu wrote:
> > > > > As discussed in yesterday's session, there's no graceful abortion on
> > > > > division by zero, instead, the BPF verifier in Linux prevents division
> > > > > by zero from happening. Here a few additional notes:
> > > >
> > > > Hmm, I thought Alexei pointed out a while ago that divide by zero is now
> > > > defined to return 0 following.  Ok, reading further along I think that is what
> > > > you describe with the pseudo-code below.
> > >
> > > Based on the discussion at LPC, and the fact that older implementations,
> > > as well as uBPF and rbpf still terminate the program, I've added this text
> > > to permit both behaviors:
> >
> > That's not right. ubpf and rbpf are broken.
> > We shouldn't be adding descriptions of broken implementations
> > to the standard.
> > There is no way to 'gracefully abort' in eBPF.
> > There is a way to 'return 0' in cBPF, but that's different. See below.
> >
> > > > If eBPF program execution would result in division by zero,
> > > > the destination register SHOULD instead be set to zero, but
> > > > program execution MAY be gracefully aborted instead.
> > > > Similarly, if execution would result in modulo by zero,
> > > > the destination register SHOULD instead be set to the source value,
> > > > but program execution MAY be gracefully aborted instead.
>
> While this makes implementation a lot easier, come to think of it though,
> this behavior actually is more like a hack around having to deal with
> division/modulo-on-zero at runtime. User doing statistic calculations with
> BPF will get bit by this sooner or later, arriving at the wrong calculation
> (fast-math comes to mind).

lol. If that was the case arm64 would be on fire long ago
and users would complain in masses.
Same with risc-v.

> This seems to go against some general BPF principle[1] of preventing the
> users from shooting themselves in the foot.
>
> Just like how BPF verifier prevents a _possible_ out-of-bound memory access,
> e.g. arr[i] when `i` is not bound-checked. Ideally I'd expect a coherent
> approach toward division/modulo-on-zero as well; the verifier should prevent
> program that _might_ do division-on-zero from loading in the first place.
> (Maybe possible to achieve if we introduce something like SCALAR_OR_NULL
> composed type, but that's definitely not easy)
>
> Admittedly even if achievable, this is a radical approach that is not
> backward compatible. If such check is implemented, programs that used to
> load may now be rejected. (Usually new BPF verifier feature allows more BPF
> program to pass the verifier, rather then the other way around)
>
> So, I don't have a good proposal at the moment. The purpose to this email is
> to point what I see as an issue out and hope to start a discussion.
>
> 1: Okay, I'm making this up a bit, strictly speaking the BPF foundation is
> safe program (one of Alexei's talk) rather than preventing users from
> shooting themselves in the foot.

Safe != invalid.
BPF doesn't prevent invalid programs.
BPF ensures safety only, not crashing the kernel, not leaking data, etc.
For example: under speculation arr[i] can go out of bounds
and to prevent data leaks we insert masking operations.
Similar with div_by_0. If the verifier can detect that it will reject
the prog, otherwise it will insert if(==0) xor, because not
all architectures behave like arm64.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress)
  2022-09-21  8:34           ` Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress) Shung-Hsi Yu
  2022-09-21 13:50             ` Alexei Starovoitov
@ 2022-09-21 14:41             ` Dave Thaler
  2022-09-22  5:00               ` Shung-Hsi Yu
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Thaler @ 2022-09-21 14:41 UTC (permalink / raw)
  To: Shung-Hsi Yu, bpf; +Cc: Christoph Hellwig, bpf, Alexei Starovoitov

Shung-Hsi Yu <shung-hsi.yu@suse.com> writes: 
> Just like how BPF verifier prevents a _possible_ out-of-bound memory access,
> e.g. arr[i] when `i` is not bound-checked. Ideally I'd expect a coherent
> approach toward division/modulo-on-zero as well; the verifier should prevent
> program that _might_ do division-on-zero from loading in the first place.
[...]
> Admittedly even if achievable, this is a radical approach that is not backward
> compatible. If such check is implemented, programs that used to load may
> now be rejected.

FWIW, the PREVAIL verifier attempted to do that, although it was incomplete until a patch I just submitted to it yesterday.  However, when running the patched version, it would reject some cilium, falco, suricata, etc. programs that it uses as test cases,
so my patch proposed making it optional in that verifier although maybe there's
some better alternative.

Certainly I think a runtime should implement the 0 check itself regardless of whether it's rejected or allowed by verification, but I wanted to share evidence that your "may now be rejected" is demonstrably true.

Dave


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: FW: ebpf-docs: draft of ISA doc updates in progress
  2022-09-20 23:39         ` Alexei Starovoitov
  2022-09-21  8:34           ` Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress) Shung-Hsi Yu
@ 2022-09-21 17:53           ` Dave Thaler
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Thaler @ 2022-09-21 17:53 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Christoph Hellwig, Shung-Hsi Yu, bpf

Alexei wrote:
> > Based on the discussion at LPC, and the fact that older
> > implementations, as well as uBPF and rbpf still terminate the program,
> > I've added this text to permit both behaviors:
> 
> That's not right. ubpf and rbpf are broken.
> We shouldn't be adding descriptions of broken implementations to the
> standard.
> There is no way to 'gracefully abort' in eBPF.

Just had a discussion with one of the ubpf maintainers (Alan, who was
in the office-hours-slot standardization meeting) who agreed that ubpf
should be fixed, so will update to remove that.

Dave

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress)
  2022-09-21 13:50             ` Alexei Starovoitov
@ 2022-09-22  5:00               ` Shung-Hsi Yu
  2022-09-23 23:15                 ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Shung-Hsi Yu @ 2022-09-22  5:00 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: bpf, Dave Thaler, Christoph Hellwig

On Wed, Sep 21, 2022 at 06:50:54AM -0700, Alexei Starovoitov wrote:
> On Wed, Sep 21, 2022 at 1:34 AM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
> >
> > Subject changed to reflect this reply is out of scope of the original topic
> > (ISA doc).
> >
> > On Tue, Sep 20, 2022 at 04:39:52PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Sep 20, 2022 at 12:51 PM Dave Thaler <dthaler@microsoft.com> wrote:
> > > > > -----Original Message-----
> > > > > From: Christoph Hellwig <hch@infradead.org>
> > > > > Sent: Monday, September 19, 2022 10:04 AM
> > > > > To: Shung-Hsi Yu <shung-hsi.yu@suse.com>
> > > > > Cc: Dave Thaler <dthaler@microsoft.com>; bpf <bpf@vger.kernel.org>
> > > > > Subject: Re: FW: ebpf-docs: draft of ISA doc updates in progress
> > > > >
> > > > > On Wed, Sep 14, 2022 at 02:22:51PM +0800, Shung-Hsi Yu wrote:
> > > > > > As discussed in yesterday's session, there's no graceful abortion on
> > > > > > division by zero, instead, the BPF verifier in Linux prevents division
> > > > > > by zero from happening. Here a few additional notes:
> > > > >
> > > > > Hmm, I thought Alexei pointed out a while ago that divide by zero is now
> > > > > defined to return 0 following.  Ok, reading further along I think that is what
> > > > > you describe with the pseudo-code below.
> > > >
> > > > Based on the discussion at LPC, and the fact that older implementations,
> > > > as well as uBPF and rbpf still terminate the program, I've added this text
> > > > to permit both behaviors:
> > >
> > > That's not right. ubpf and rbpf are broken.
> > > We shouldn't be adding descriptions of broken implementations
> > > to the standard.
> > > There is no way to 'gracefully abort' in eBPF.
> > > There is a way to 'return 0' in cBPF, but that's different. See below.
> > >
> > > > > If eBPF program execution would result in division by zero,
> > > > > the destination register SHOULD instead be set to zero, but
> > > > > program execution MAY be gracefully aborted instead.
> > > > > Similarly, if execution would result in modulo by zero,
> > > > > the destination register SHOULD instead be set to the source value,
> > > > > but program execution MAY be gracefully aborted instead.
> >
> > While this makes implementation a lot easier, come to think of it though,
> > this behavior actually is more like a hack around having to deal with
> > division/modulo-on-zero at runtime. User doing statistic calculations with
> > BPF will get bit by this sooner or later, arriving at the wrong calculation
> > (fast-math comes to mind).
> 
> lol. If that was the case arm64 would be on fire long ago
> and users would complain in masses.
> Same with risc-v.

whoa, I had no idea.

And looking around I don't see complains. Taking what I said back and +1 for
using the current division/modulo-by-zero behavior as standard.

> > This seems to go against some general BPF principle[1] of preventing the
> > users from shooting themselves in the foot.
> >
> > Just like how BPF verifier prevents a _possible_ out-of-bound memory access,
> > e.g. arr[i] when `i` is not bound-checked. Ideally I'd expect a coherent
> > approach toward division/modulo-on-zero as well; the verifier should prevent
> > program that _might_ do division-on-zero from loading in the first place.
> > (Maybe possible to achieve if we introduce something like SCALAR_OR_NULL
> > composed type, but that's definitely not easy)
> >
> > Admittedly even if achievable, this is a radical approach that is not
> > backward compatible. If such check is implemented, programs that used to
> > load may now be rejected. (Usually new BPF verifier feature allows more BPF
> > program to pass the verifier, rather then the other way around)
> >
> > So, I don't have a good proposal at the moment. The purpose to this email is
> > to point what I see as an issue out and hope to start a discussion.
> >
> > 1: Okay, I'm making this up a bit, strictly speaking the BPF foundation is
> > safe program (one of Alexei's talk) rather than preventing users from
> > shooting themselves in the foot.
> 
> Safe != invalid.
> BPF doesn't prevent invalid programs.
> BPF ensures safety only, not crashing the kernel, not leaking data, etc.
> For example: under speculation arr[i] can go out of bounds
> and to prevent data leaks we insert masking operations.

Point taken, thanks for clarifying the difference between safe and invalid.

> Similar with div_by_0. If the verifier can detect that it will reject
> the prog, otherwise it will insert if(==0) xor, because not
> all architectures behave like arm64.

Speaking of which, should the "if(==0) xor" patching in do_misc_fixups() be
moved into JIT implementations and the interpretor? 

Given that the standard now mandates BPF_DIV with divisor of zero to return
zero, it would be rather confusing to see the output of `bpftool prog dump
xlated` contains the extra "if(==0) xor" instruction that's inserted by the
verifier.

Also, maybe there'll be performance benefit for arm64 and riscv where
"if(==0) xor" is not needed.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress)
  2022-09-21 14:41             ` Dave Thaler
@ 2022-09-22  5:00               ` Shung-Hsi Yu
  0 siblings, 0 replies; 16+ messages in thread
From: Shung-Hsi Yu @ 2022-09-22  5:00 UTC (permalink / raw)
  To: Dave Thaler; +Cc: bpf, Christoph Hellwig, Alexei Starovoitov

On Wed, Sep 21, 2022 at 02:41:28PM +0000, Dave Thaler wrote:
> Shung-Hsi Yu <shung-hsi.yu@suse.com> writes: 
> > Just like how BPF verifier prevents a _possible_ out-of-bound memory access,
> > e.g. arr[i] when `i` is not bound-checked. Ideally I'd expect a coherent
> > approach toward division/modulo-on-zero as well; the verifier should prevent
> > program that _might_ do division-on-zero from loading in the first place.
> [...]
> > Admittedly even if achievable, this is a radical approach that is not backward
> > compatible. If such check is implemented, programs that used to load may
> > now be rejected.
> 
> FWIW, the PREVAIL verifier attempted to do that, although it was incomplete until a patch I just submitted to it yesterday.

Cool, and skimming through the PR I'm surprised by how minimal the changes
are. Interesting to know that it's possible.

> However, when running the patched version, it would reject some cilium, falco, suricata, etc. programs that it uses as test cases,
> so my patch proposed making it optional in that verifier although maybe there's
> some better alternative.
> 
> Certainly I think a runtime should implement the 0 check itself regardless of whether it's rejected or allowed by verification, but I wanted to share evidence that your "may now be rejected" is demonstrably true.

Thanks!

> Dave
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress)
  2022-09-22  5:00               ` Shung-Hsi Yu
@ 2022-09-23 23:15                 ` Alexei Starovoitov
  0 siblings, 0 replies; 16+ messages in thread
From: Alexei Starovoitov @ 2022-09-23 23:15 UTC (permalink / raw)
  To: Shung-Hsi Yu; +Cc: bpf, Dave Thaler, Christoph Hellwig

On Wed, Sep 21, 2022 at 10:00 PM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> On Wed, Sep 21, 2022 at 06:50:54AM -0700, Alexei Starovoitov wrote:
> > On Wed, Sep 21, 2022 at 1:34 AM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
> > >
> > > Subject changed to reflect this reply is out of scope of the original topic
> > > (ISA doc).
> > >
> > > On Tue, Sep 20, 2022 at 04:39:52PM -0700, Alexei Starovoitov wrote:
> > > > On Tue, Sep 20, 2022 at 12:51 PM Dave Thaler <dthaler@microsoft.com> wrote:
> > > > > > -----Original Message-----
> > > > > > From: Christoph Hellwig <hch@infradead.org>
> > > > > > Sent: Monday, September 19, 2022 10:04 AM
> > > > > > To: Shung-Hsi Yu <shung-hsi.yu@suse.com>
> > > > > > Cc: Dave Thaler <dthaler@microsoft.com>; bpf <bpf@vger.kernel.org>
> > > > > > Subject: Re: FW: ebpf-docs: draft of ISA doc updates in progress
> > > > > >
> > > > > > On Wed, Sep 14, 2022 at 02:22:51PM +0800, Shung-Hsi Yu wrote:
> > > > > > > As discussed in yesterday's session, there's no graceful abortion on
> > > > > > > division by zero, instead, the BPF verifier in Linux prevents division
> > > > > > > by zero from happening. Here a few additional notes:
> > > > > >
> > > > > > Hmm, I thought Alexei pointed out a while ago that divide by zero is now
> > > > > > defined to return 0 following.  Ok, reading further along I think that is what
> > > > > > you describe with the pseudo-code below.
> > > > >
> > > > > Based on the discussion at LPC, and the fact that older implementations,
> > > > > as well as uBPF and rbpf still terminate the program, I've added this text
> > > > > to permit both behaviors:
> > > >
> > > > That's not right. ubpf and rbpf are broken.
> > > > We shouldn't be adding descriptions of broken implementations
> > > > to the standard.
> > > > There is no way to 'gracefully abort' in eBPF.
> > > > There is a way to 'return 0' in cBPF, but that's different. See below.
> > > >
> > > > > > If eBPF program execution would result in division by zero,
> > > > > > the destination register SHOULD instead be set to zero, but
> > > > > > program execution MAY be gracefully aborted instead.
> > > > > > Similarly, if execution would result in modulo by zero,
> > > > > > the destination register SHOULD instead be set to the source value,
> > > > > > but program execution MAY be gracefully aborted instead.
> > >
> > > While this makes implementation a lot easier, come to think of it though,
> > > this behavior actually is more like a hack around having to deal with
> > > division/modulo-on-zero at runtime. User doing statistic calculations with
> > > BPF will get bit by this sooner or later, arriving at the wrong calculation
> > > (fast-math comes to mind).
> >
> > lol. If that was the case arm64 would be on fire long ago
> > and users would complain in masses.
> > Same with risc-v.
>
> whoa, I had no idea.
>
> And looking around I don't see complains. Taking what I said back and +1 for
> using the current division/modulo-by-zero behavior as standard.
>
> > > This seems to go against some general BPF principle[1] of preventing the
> > > users from shooting themselves in the foot.
> > >
> > > Just like how BPF verifier prevents a _possible_ out-of-bound memory access,
> > > e.g. arr[i] when `i` is not bound-checked. Ideally I'd expect a coherent
> > > approach toward division/modulo-on-zero as well; the verifier should prevent
> > > program that _might_ do division-on-zero from loading in the first place.
> > > (Maybe possible to achieve if we introduce something like SCALAR_OR_NULL
> > > composed type, but that's definitely not easy)
> > >
> > > Admittedly even if achievable, this is a radical approach that is not
> > > backward compatible. If such check is implemented, programs that used to
> > > load may now be rejected. (Usually new BPF verifier feature allows more BPF
> > > program to pass the verifier, rather then the other way around)
> > >
> > > So, I don't have a good proposal at the moment. The purpose to this email is
> > > to point what I see as an issue out and hope to start a discussion.
> > >
> > > 1: Okay, I'm making this up a bit, strictly speaking the BPF foundation is
> > > safe program (one of Alexei's talk) rather than preventing users from
> > > shooting themselves in the foot.
> >
> > Safe != invalid.
> > BPF doesn't prevent invalid programs.
> > BPF ensures safety only, not crashing the kernel, not leaking data, etc.
> > For example: under speculation arr[i] can go out of bounds
> > and to prevent data leaks we insert masking operations.
>
> Point taken, thanks for clarifying the difference between safe and invalid.
>
> > Similar with div_by_0. If the verifier can detect that it will reject
> > the prog, otherwise it will insert if(==0) xor, because not
> > all architectures behave like arm64.
>
> Speaking of which, should the "if(==0) xor" patching in do_misc_fixups() be
> moved into JIT implementations and the interpretor?
>
> Given that the standard now mandates BPF_DIV with divisor of zero to return
> zero, it would be rather confusing to see the output of `bpftool prog dump
> xlated` contains the extra "if(==0) xor" instruction that's inserted by the
> verifier.
>
> Also, maybe there'll be performance benefit for arm64 and riscv where
> "if(==0) xor" is not needed.

Tiny perf gain is not worth extra complexity in JITs.
Integer divide is the slowest operation in a CPU.
We're trying to do everything in the verifier and as little
as possible in JITs.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-09-23 23:15 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CY5PR21MB377000AC95B475C47B702293A3439@CY5PR21MB3770.namprd21.prod.outlook.com>
2022-09-13  8:12 ` FW: ebpf-docs: draft of ISA doc updates in progress Dave Thaler
2022-09-14  6:22   ` Shung-Hsi Yu
2022-09-14  9:35     ` Dave Thaler
2022-09-19 17:04     ` Christoph Hellwig
2022-09-20 19:12       ` Dave Thaler
2022-09-20 23:39         ` Alexei Starovoitov
2022-09-21  8:34           ` Rethink how to deal with division/modulo-on-zero (was Re: FW: ebpf-docs: draft of ISA doc updates in progress) Shung-Hsi Yu
2022-09-21 13:50             ` Alexei Starovoitov
2022-09-22  5:00               ` Shung-Hsi Yu
2022-09-23 23:15                 ` Alexei Starovoitov
2022-09-21 14:41             ` Dave Thaler
2022-09-22  5:00               ` Shung-Hsi Yu
2022-09-21 17:53           ` FW: ebpf-docs: draft of ISA doc updates in progress Dave Thaler
2022-09-19 16:58   ` Christoph Hellwig
2022-09-20 19:37     ` Dave Thaler
2022-09-20 23:45       ` Alexei Starovoitov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.