From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.4 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CC427C5517A for ; Thu, 29 Oct 2020 07:26:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 66D54214DB for ; Thu, 29 Oct 2020 07:26:26 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="gZ8H/p46" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727058AbgJ2HYn (ORCPT ); Thu, 29 Oct 2020 03:24:43 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55364 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726732AbgJ2HYe (ORCPT ); Thu, 29 Oct 2020 03:24:34 -0400 Received: from mail-lj1-x243.google.com (mail-lj1-x243.google.com [IPv6:2a00:1450:4864:20::243]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E50C7C0613AB for ; Wed, 28 Oct 2020 21:44:03 -0700 (PDT) Received: by mail-lj1-x243.google.com with SMTP id t13so1626215ljk.12 for ; Wed, 28 Oct 2020 21:44:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=8vJY7X+2qf9mk4yjNjSimkTKUqnRW4h3rIif/0iLOrE=; b=gZ8H/p46CECRx9wDoVffbN/h0TJ0bUslUCW4zEC9UGCLKm/cXqu5uiZC1A8cryr+E8 e0dLqac5CrQ0wZy68hcC5mDMEnQW3ZBUwc7+nrlVGLwjzOWHmxGB84B6A7tq9XDnv19s M7QjYPJ1Rgm+8fE1FFuZrcNgCnJkq5XlKbHAN6QN93QfeuU4o+r0DZj/rSPyFxLyR9HX goLYDldHw4PKN4C16oNQfIjUL+sxQTmKcTIGcQlhOv8tjXF8krfMfUE47zD8S0ABnIzk C8yOWDT4K5Mx4UbUn4ZaqYyEFrICmzbt6zNb1oXbkYLIFdMceLLg+ASw3I9w2T4L15rH 1NBg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=8vJY7X+2qf9mk4yjNjSimkTKUqnRW4h3rIif/0iLOrE=; b=cnm+RxQU8myCDiZHZ4kk32lGZ0Rjq4WkctBlHGYxHdaTSl9qRjWAubCW/Q0a0T93DQ 7dBVU/LumBivJBULgCJLsii2PT1xMfHLufWQNvY0y9tBTg8S7wxgJP1CK5sncBBtQ0On H4WsnjxIZ5LoSvEV4ttuJ2S9UbwQ7w3yJYBdkT+eXDVJ8a0y1ax7pMWAjgkTrYmWp3J6 GN6VOg0E+Lt1iCs0+TIFU6p0WeO8EJq/j5ZNDgB1xypDofWXyqSJcONA4YHWDtwOGiGu Fb/mwMPFf61WEFGDohReamF+TObeIZbpDb/F3Tdr5F5X89H5Knb92mstaMu8fkuem64Y zDCA== X-Gm-Message-State: AOAM5317FGmST7KEB4/p5uD8XfnDVwdXZqwvD2gbgbVp5dp4Nmz6FBqO iYxIXWbor7ZdclORaR1/f6h/ih57glGpYv4ka0o59A== X-Google-Smtp-Source: ABdhPJw/sZxb5qRToKpsO6Tov3q5O2h9QxUZH9e9awkjhLv1ZkCLPaqe/eslUoWDbcpwUPRAcE7vE1CzAPHEQGncXAA= X-Received: by 2002:a2e:9c84:: with SMTP id x4mr981272lji.326.1603946642027; Wed, 28 Oct 2020 21:44:02 -0700 (PDT) MIME-Version: 1.0 References: <63598b4f-6ce3-5a11-4552-cdfe308f68e4@gmail.com> <20201029020438.GA25673@cisco> In-Reply-To: <20201029020438.GA25673@cisco> From: Jann Horn Date: Thu, 29 Oct 2020 05:43:35 +0100 Message-ID: Subject: Re: For review: seccomp_user_notif(2) manual page [v2] To: Tycho Andersen , "Michael Kerrisk (man-pages)" Cc: Kees Cook , Sargun Dhillon , Christian Brauner , Daniel Borkmann , Giuseppe Scrivano , Song Liu , Robert Sesek , Containers , linux-man , lkml , Aleksa Sarai , Alexei Starovoitov , Will Drewry , bpf , Andy Lutomirski Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org On Thu, Oct 29, 2020 at 3:04 AM Tycho Andersen wrote: > On Thu, Oct 29, 2020 at 02:42:58AM +0100, Jann Horn wrote: > > On Mon, Oct 26, 2020 at 10:55 AM Michael Kerrisk (man-pages) > > wrote: > > > static bool > > > getTargetPathname(struct seccomp_notif *req, int notifyFd, > > > char *path, size_t len) > > > { > > > char procMemPath[PATH_MAX]; > > > > > > snprintf(procMemPath, sizeof(procMemPath), "/proc/%d/mem", req->pid); > > > > > > int procMemFd = open(procMemPath, O_RDONLY); > > > if (procMemFd == -1) > > > errExit("\tS: open"); > > > > > > /* Check that the process whose info we are accessing is still alive. > > > If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performed > > > in checkNotificationIdIsValid()) succeeds, we know that the > > > /proc/PID/mem file descriptor that we opened corresponds to the > > > process for which we received a notification. If that process > > > subsequently terminates, then read() on that file descriptor > > > will return 0 (EOF). */ > > > > > > checkNotificationIdIsValid(notifyFd, req->id); > > > > > > /* Read bytes at the location containing the pathname argument > > > (i.e., the first argument) of the mkdir(2) call */ > > > > > > ssize_t nread = pread(procMemFd, path, len, req->data.args[0]); > > > if (nread == -1) > > > errExit("pread"); > > > > As discussed at > > , > > we need to re-check checkNotificationIdIsValid() after reading remote > > memory but before using the read value in any way. Otherwise, the > > syscall could in the meantime get interrupted by a signal handler, the > > signal handler could return, and then the function that performed the > > syscall could free() allocations or return (thereby freeing buffers on > > the stack). > > > > In essence, this pread() is (unavoidably) a potential use-after-free > > read; and to make that not have any security impact, we need to check > > whether UAF read occurred before using the read value. This should > > probably be called out elsewhere in the manpage, too... > > > > Now, of course, **reading** is the easy case. The difficult case is if > > we have to **write** to the remote process... because then we can't > > play games like that. If we write data to a freed pointer, we're > > screwed, that's it. (And for somewhat unrelated bonus fun, consider > > that /proc/$pid/mem is originally intended for process debugging, > > including installing breakpoints, and will therefore happily write > > over "readonly" private mappings, such as typical mappings of > > executable code.) > > > > So, uuuuh... I guess if anyone wants to actually write memory back to > > the target process, we'd better come up with some dedicated API for > > that, using an ioctl on the seccomp fd that magically freezes the > > By freeze here you mean a killable wait instead of an interruptible > wait, right? Nope, nonkillable. Consider the case of vfork(), where a target process does something like this: void spawn_executable(char **argv, char **envv) { pid_t child = vfork(); if (child == 0) { char path[1000]; sprintf(path, ...); execve(path, argv, envv); } } and the seccomp notifier wants to look at the execve() path (as a somewhat silly example). The child process is just borrowing the parent's stack, and as soon as the child either gets far enough into execve() or dies, the parent continues using that stack. So keeping the child in killable sleep would not be enough to prevent reuse of the child's stack. But conceptually that's not really a big problem - we already have a way to force the target task to stay inside the seccomp code no matter if it gets SIGKILLed or whatever, and that is to take the notify_lock. When the target task wakes up and wants to continue executing, it has to first get through mutex_lock(&match->notify_lock) - and that will always block until the lock is free. So we could e.g. do something like: - Grab references to the source pages in the supervisor's address space with get_user_pages_fast(). - Take mmap_sem on the target. - Grab references to pages in the relevant range with pin_user_pages_remote(). - Drop the mmap_sem. - Take the notify_lock. - Recheck whether the notification with the right ID is still there. - Copy data from the pinned source pages to the pinned target pages. - Drop the notify_lock. - Drop the page references. and this way we would still guarantee that the target process would only be blocked in noninterruptible sleep for a small amount of time (and would not be indirectly blocked on sleeping operations through the mutex). It'd be pretty straightforward, I think. But as long as we don't actually need it, it might be easier to just note in the manpage that this is not currently supported.