From: Stéphane Graber
Date: Wed, 8 Apr 2020 12:41:41 -0400
Subject: Re: [PATCH 0/8] loopfs
To: Jann Horn
Cc: Christian Brauner, Jens Axboe, Greg Kroah-Hartman, kernel list,
    linux-block@vger.kernel.org, Linux API, Jonathan Corbet, Serge Hallyn,
    "Rafael J. Wysocki", Tejun Heo, "David S. Miller", Saravana Kannan,
    Jan Kara, David Howells, Seth Forshee, David Rheinsberg, Tom Gundersen,
    Christian Kellner, Dmitry Vyukov, linux-doc@vger.kernel.org,
    Network Development, Matthew Garrett, linux-fsdevel
References: <20200408152151.5780-1-christian.brauner@ubuntu.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 8, 2020 at 12:24 PM Jann Horn wrote:
>
> On Wed, Apr 8, 2020 at 5:23 PM Christian Brauner
> wrote:
> > One of the use-cases for loopfs is to allow to dynamically allocate loop
> > devices in sandboxed workloads without exposing /dev or
> > /dev/loop-control to the workload in question and without having to
> > implement a complex and also racy protocol to send around file
> > descriptors for loop devices. With loopfs each mount is a new instance,
> > i.e. loop devices created in one loopfs instance are independent of any
> > loop devices created in another loopfs instance. This allows
> > sufficiently privileged tools to have their own private stash of loop
> > device instances. Dmitry has expressed his desire to use this for
> > syzkaller in a private discussion. And various parties that want to use
> > it are Cced here too.
> >
> > In addition, the loopfs filesystem can be mounted by user namespace root
> > and is thus suitable for use in containers. Combined with syscall
> > interception this makes it possible to securely delegate mounting of
> > images on loop devices, i.e. when a user calls mount -o loop
> > it will be possible to completely setup the loop device.
> > The final mount syscall to actually perform the mount will be handled
> > through syscall interception and be performed by a sufficiently
> > privileged process.
> > Syscall interception is already supported through a
> > new seccomp feature we implemented in [1] and extended in [2] and is
> > actively used in production workloads. The additional loopfs work will
> > be used there and in various other workloads too. You'll find a short
> > illustration how this works with syscall interception below in [4].
>
> Would that privileged process then allow you to mount your filesystem
> images with things like ext4? As far as I know, the filesystem
> maintainers don't generally consider "untrusted filesystem image" to
> be a strongly enforced security boundary; and worse, if an attacker
> has access to a loop device from which something like ext4 is mounted,
> things like "struct ext4_dir_entry_2" will effectively be in shared
> memory, and an attacker can trivially bypass e.g.
> ext4_check_dir_entry(). At the moment, that's not a huge problem (for
> anything other than kernel lockdown) because only root normally has
> access to loop devices.
>
> Ubuntu carries an out-of-tree patch that afaik blocks the shared
> memory thing:
>
> But even with that patch, I'm not super excited about exposing
> filesystem image parsing attack surface to containers unless you run
> the filesystem in a sandboxed environment (at which point you don't
> need a loop device anymore either).

So in general we certainly agree that you should never expose
syscall-interception mounting of real kernel filesystems to someone
you wouldn't trust with root on the host.

But that's not all that our syscall interception logic can do. We have
support for rewriting a normal filesystem mount attempt to instead use
an available FUSE implementation. As far as the user is concerned,
they ran "mount /dev/sdaX /mnt" and got that ext4 filesystem mounted
on /mnt as requested, except that the container manager intercepted
the mount attempt and instead spawned fuse2fs for that mount. This
requires absolutely no change to the software the user is running.
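
To make that rewrite a bit more concrete, here is a rough sketch of the
decision a container manager could make once it has intercepted a mount
attempt. This is illustration only and not LXD's actual code: MountPolicy,
decide_mount and the mode names are all hypothetical, and the only things
taken from the description above are the "mount /dev/sdaX /mnt" example and
the fuse2fs helper.

```python
# Illustrative sketch only. MountPolicy, decide_mount and the mode names
# are hypothetical; they are not LXD's real API or configuration keys.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

KERNEL = "kernel"  # a privileged manager performs the real mount itself
FUSE = "fuse"      # the manager spawns a FUSE helper instead
DENY = "deny"      # the intercepted mount attempt is refused


@dataclass
class MountPolicy:
    """Policy for one container: filesystem type -> (mode, FUSE helper)."""
    rules: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def decide_mount(self, fstype: str, source: str,
                     target: str) -> Tuple[str, List[str]]:
        """Return the chosen mode and, for FUSE, the helper command line."""
        # Filesystems without an explicit rule are refused by default.
        mode, helper = self.rules.get(fstype, (DENY, ""))
        if mode == FUSE:
            # For example: fuse2fs <device-or-image> <mountpoint>
            return FUSE, [helper, source, target]
        if mode == KERNEL:
            # No helper to spawn; the manager would do the mount itself.
            return KERNEL, []
        return DENY, []


# A container we would not trust with host root gets the FUSE rewrite,
# while a trusted one may use the in-kernel filesystem directly.
untrusted = MountPolicy(rules={"ext4": (FUSE, "fuse2fs")})
trusted = MountPolicy(rules={"ext4": (KERNEL, "")})

print(untrusted.decide_mount("ext4", "/dev/sdaX", "/mnt"))
print(trusted.decide_mount("ext4", "/dev/sdaX", "/mnt"))
```

The default-deny lookup is a choice made for this sketch: a filesystem
type with no rule for a given container is never mounted on its behalf.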

loopfs, with that interception mode, will let us also handle all cases
where a loop device would be used, similarly without needing any change
to the software being run.

If a piece of software runs the command "mount -o loop blah.img /mnt",
the "mount" command will set up a loop device as it normally would
(doing so through loopfs) and then will call the "mount" syscall, which
will get intercepted and redirected to a FUSE implementation if so
configured, resulting in the expected filesystem being mounted for the
user.

LXD with syscall interception offers both straight-up privileged
mounting using the kernel filesystem and mounting through a FUSE-based
implementation. This is configurable on a per-filesystem and
per-container basis.

I hope that clarifies what we're doing here :)

Stéphane