From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-0.9 required=3.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E7419C433DF for ; Wed, 27 May 2020 21:43:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C2C0F2075A for ; Wed, 27 May 2020 21:43:55 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=chromium.org header.i=@chromium.org header.b="NvbMBgkO" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727787AbgE0Vny (ORCPT ); Wed, 27 May 2020 17:43:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55810 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726114AbgE0Vny (ORCPT ); Wed, 27 May 2020 17:43:54 -0400 Received: from mail-pj1-x1044.google.com (mail-pj1-x1044.google.com [IPv6:2607:f8b0:4864:20::1044]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2F1B7C05BD1E for ; Wed, 27 May 2020 14:43:53 -0700 (PDT) Received: by mail-pj1-x1044.google.com with SMTP id 5so2140893pjd.0 for ; Wed, 27 May 2020 14:43:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=8xrSbOhXNcO53Jugnfn3DM3MYFsd0aIhSW+Q702590o=; b=NvbMBgkOqpmFrJjInrBVR2+iybHAphGK16x1eGVEHUUQhd41vCIA0JVrQudLGWiTJ1 UuGF2Iz1G0cReJeK7jJ7X/2ei83C4MIdO8KSDIqXbHLoV9orVRnxbvHAl6yY9awoa98s yYRTBTC2h2EK1VtE7Pu/f99uxmuB1JQ1914r0= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=8xrSbOhXNcO53Jugnfn3DM3MYFsd0aIhSW+Q702590o=; b=WHct9V/exI1BJWIsvdOK7vCU5PM7BcJ1h7hf3DTX092WutGemcEeLFkLp+LRPVEl03 KMxjy7udhKIwnBk2Ufh8jcm6zU7X5MYza24+9YnLy55kyZHprMZou8CYIv2TKBfYcuA2 8VGktD2LW7HPlgiKFSkSZ8qdEtrWOcs6u20eDwT16zyb0cjqqRWD88W+ULoeqOlkZz+t 73NQOvCQ4dY9RhjJyxwBOEaBXaRX35I71tuJg8xuv1Q7aCOExMABD3ugFBCt3jNAbekW dBo4gTWCZ3nHcdpkZVplFWn2syvwm3GJuAZiCd1Mnty7eLQYK7FLgIXSr0Rqfh3a5z1L hhsA== X-Gm-Message-State: AOAM532wPNvDMDw8uxZB8QMXaOIO7mzkkbQK6GInogUbS8q+L+ndLHsJ oZRG30rn3VNzt0xGFccQjr2FoA== X-Google-Smtp-Source: ABdhPJxRUr0b3j6IjHfDpsbCKKpWAeNbp6kND1YduC9VB6pPkSW3FIGlBSz1/vgFhjY79NIkltnvdQ== X-Received: by 2002:a17:902:9d94:: with SMTP id c20mr402368plq.224.1590615831883; Wed, 27 May 2020 14:43:51 -0700 (PDT) Received: from www.outflux.net (smtp.outflux.net. [198.145.64.163]) by smtp.gmail.com with ESMTPSA id u4sm4643626pjf.3.2020.05.27.14.43.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 27 May 2020 14:43:50 -0700 (PDT) Date: Wed, 27 May 2020 14:43:49 -0700 From: Kees Cook To: Christian Brauner Cc: linux-kernel@vger.kernel.org, Andy Lutomirski , Tycho Andersen , Matt Denton , Sargun Dhillon , Jann Horn , Chris Palmer , Aleksa Sarai , Robert Sesek , Jeffrey Vander Stoep , Linux Containers Subject: Re: [PATCH 1/2] seccomp: notify user trap about unused filter Message-ID: <202005271408.58F806514@keescook> References: <20200527111902.163213-1-christian.brauner@ubuntu.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200527111902.163213-1-christian.brauner@ubuntu.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, May 27, 2020 at 01:19:01PM +0200, Christian Brauner wrote: > loop. But this is problematic since we don't get a notification when the > seccomp filter has become unused and so we currently never remove the > seccomp notifier fd from the event loop and just keep accumulating fds > in the event loop. We've had this issue for a while but it has recently > become more pressing as more and larger users make use of this. I had been under the (seemingly very wrong) understanding that when all the tasks associated with a filter cease to exist, the notif fd is effectively closed. But I see now where I got confused: this is only half implemented: if the userspace end of the fd is closed, it'll get cleaned up in the kernel, but we have nothing going the other direction except the general object lifetime management on the filter itself. So, yes, I accept the basic problem statement, "we have fds hanging around that will never be used again, we need to notice that". :) Why is EPOLLHUP needed? Can't the fd just get closed on the kernel end? I would expect that to be visible as EPOLLHUP internally (though I haven't looked through the VFS yet). And I haven't found how to close/detach a anon file from the listener task. It strikes me that this would actually be much cleaner: then we actually don't need the additional __get_seccomp_filter() in init_listener() -- we simply invalidate the file during __put_seccomp_filter(). (While I'm here -- why can there be only one listener per task? The notifications are filter-specific, not task-specific?) > To fix this, we introduce a new "live" reference counter that tracks the > live tasks making use of a given filter and when a notifier is > registered waiting tasks will be notified that the filter is now empty > by receiving a (E)POLLHUP event. > The concept in this patch introduces is the same as for signal_struct, > i.e. reference counting for life-cycle management is decoupled from > reference counting live taks using the object. I will need convincing that life-cycle ref-counting needs to be decoupled from usage ref-counting. I see what you're saying here and in the other reply about where the notification is coming from (release vs put, etc), but I think it'd be better if the EPOLLHUP was handled internally to the VFS due to the kernel end of the file being closed. > There's probably some trickery possible but the second counter is just > the correct way of doing this imho and has precedence. The patch also > lifts the waitqeue from struct notification into into sruct > seccomp_filter. This is cleaner overall and let's us avoid having to > take the notifier mutex since we neither need to read nor modify the > notifier specific aspects of the seccomp filter. In the exit path I'd > very much like to avoid having to take the notifier mutex for each > filter in the task's filter hierarchy. I guess this is a minor size/speed trade-off (every seccomp_filter struct grows by 1 pointer regardless of the presence of USER_NOTIF rules attached...). But I think this is an optimization detail, and I need to understand why we can't just close the file on filter free. -- Kees Cook