linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Christian Brauner <christian.brauner@ubuntu.com>
To: linux-kernel@vger.kernel.org
Cc: libc-alpha@sourceware.org, Jonathan Corbet <corbet@lwn.net>,
	Christian Brauner <christian@brauner.io>,
	Federico Vaga <federico.vaga@vaga.pv.it>,
	Alessia Mantegazza <amantegazza@vaga.pv.it>,
	Guillaume Dore <corwin@poussif.eu>,
	linux-doc@vger.kernel.org,
	Christian Brauner <christian.brauner@ubuntu.com>,
	Aleksa Sarai <cyphar@cyphar.com>
Subject: [PATCH] Documentation: update about adding syscalls
Date: Wed,  2 Oct 2019 17:14:37 +0200	[thread overview]
Message-ID: <20191002151437.5367-1-christian.brauner@ubuntu.com> (raw)

Add additional information on how to ensure that syscalls with structure
arguments are extensible and add a section about naming conventions to
follow when adding revised versions of already existing syscalls.

Co-Developed-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
---
 Documentation/process/adding-syscalls.rst | 82 +++++++++++++++++++----
 1 file changed, 70 insertions(+), 12 deletions(-)

diff --git a/Documentation/process/adding-syscalls.rst b/Documentation/process/adding-syscalls.rst
index 1c3a840d06b9..93e0221fbb9a 100644
--- a/Documentation/process/adding-syscalls.rst
+++ b/Documentation/process/adding-syscalls.rst
@@ -79,7 +79,7 @@ flags, and reject the system call (with ``EINVAL``) if it does::
 For more sophisticated system calls that involve a larger number of arguments,
 it's preferred to encapsulate the majority of the arguments into a structure
 that is passed in by pointer.  Such a structure can cope with future extension
-by including a size argument in the structure::
+by either including a size argument in the structure::
 
     struct xyzzy_params {
         u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
@@ -87,20 +87,56 @@ by including a size argument in the structure::
         u64 param_2;
         u64 param_3;
     };
+    int sys_xyzzy(struct xyzzy_params __user *uarg);
+    /* in case of -E2BIG, p->size is set to the in-kernel size and thus all
+       extensions after that offset are unsupported. */
 
-As long as any subsequently added field, say ``param_4``, is designed so that a
-zero value gives the previous behaviour, then this allows both directions of
-version mismatch:
+or by including a separate argument that specifies the size::
 
- - To cope with a later userspace program calling an older kernel, the kernel
-   code should check that any memory beyond the size of the structure that it
-   expects is zero (effectively checking that ``param_4 == 0``).
- - To cope with an older userspace program calling a newer kernel, the kernel
-   code can zero-extend a smaller instance of the structure (effectively
-   setting ``param_4 = 0``).
+    struct xyzzy_params {
+        u32 param_1;
+        u64 param_2;
+        u64 param_3;
+    };
+    /* userspace sets @usize = sizeof(struct xyzzy_params) */
+    int sys_xyzzy(struct xyzzy_params __user *uarg, size_t usize);
+    /* in case of -E2BIG, userspace has to attempt smaller @usize values
+       to figure out which extensions are unsupported. */
+
+Which model you choose to use is down to personal taste. However, please only
+pick one (for a counter-example, see :manpage:`sched_getattr(2)`).
+
+Then, any extensions can be implemented by appending fields to the structure.
+However, all new fields must be designed such that their zero value results in
+identical behaviour to the pre-extension syscall. This allows for compatibility
+between different-vintage kernels and userspace, no matter which is newer:
+
+ - If the userspace is newer, then the older kernel can check whether the
+   fields it doesn't understand are zero-valued. If they are, then it can
+   safely ignore them (since any future extensions will be backwards-compatible
+   as described above). If they aren't, then the kernel doesn't support the
+   feature and can give an error (``-E2BIG`` is traditional).
+
+ - If the userspace is older, then the kernel can treat all extensions that
+   userspace is unaware of as having their zero-value (and effectively
+   zero-extend the userspace structure when copying it for in-kernel usage).
 
-See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
-``kernel/events/core.c``) for an example of this approach.
+ - If they're the same version, just use the structure as-is.
+
+As with the simpler flag-only syscalls, you must always check that any unknown
+values for flag-like parameters in the passed structure are zeroed.
+
+It is also critical to ensure your syscall handles larger-sized arguments from
+the outset, otherwise userspace will have to do additional (fairly pointless)
+fallbacks for some old kernels. An example where this mistake was made is
+:manpage:`rt_sigprocmask(2)` (where any unknown-sized arguments are
+unconditionally rejected).
+
+To help implement this correctly, there is a helper function
+``copy_struct_from_user()`` which handles the compatibility requirements for
+you. For examples using this helper, see :manpage:`perf_event_open(2)` (which
+uses the embedded-size model) and :manpage:`clone3(2)` (which uses the
+separate-argument model).
 
 
 Designing the API: Other Considerations
@@ -173,6 +209,28 @@ registers.  (This concern does not apply if the arguments are part of a
 structure that's passed in by pointer.)
 
 
+Designing the API: Revisions of syscalls
+-----------------------------------------------
+
+Syscalls that were not designed to be extensible or syscalls that use a flag
+argument for extensions running out of bits (e.g. :manpage:`clone(2)`)
+sometimes need to be replaced.
+
+If the revised syscall provides a superset (or a reasonably large subset, such
+as when a feature that turned out to be a design mistake is dropped) of the
+features of the old syscall it is common practice to give it the same name with
+a number appended. Examples for this include ``dup2``/``dup3``,
+``epoll_create``/``epoll_create1`` and others.
+
+For some syscalls the appended number indicates the number of arguments
+(``accept``/``accept4``) for others the number of the revision
+(``clone``/``clone3``, ``epoll_create``/``epoll_create1``). Recent discussions
+around new syscalls (such as ``clone3``) have pointed to a consensus around
+naming syscalls revisions by appending the number of the revision to the
+syscall name. That means, if you were to post a revised version of the
+``openat`` syscall it should be named ``openat2``.
+
+
 Proposing the API
 -----------------
 
-- 
2.23.0


             reply	other threads:[~2019-10-02 15:14 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-02 15:14 Christian Brauner [this message]
2019-10-03 16:39 ` [PATCH] Documentation: update about adding syscalls Jonathan Corbet
2019-10-03 16:53   ` Aleksa Sarai

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191002151437.5367-1-christian.brauner@ubuntu.com \
    --to=christian.brauner@ubuntu.com \
    --cc=amantegazza@vaga.pv.it \
    --cc=christian@brauner.io \
    --cc=corbet@lwn.net \
    --cc=corwin@poussif.eu \
    --cc=cyphar@cyphar.com \
    --cc=federico.vaga@vaga.pv.it \
    --cc=libc-alpha@sourceware.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).