linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@amacapital.net>
To: David Ahern <dsahern@gmail.com>
Cc: "Alexei Starovoitov" <alexei.starovoitov@gmail.com>,
	"Andy Lutomirski" <luto@kernel.org>,
	"Daniel Mack" <daniel@zonque.org>,
	"Mickaël Salaün" <mic@digikod.net>,
	"Kees Cook" <keescook@chromium.org>, "Jann Horn" <jann@thejh.net>,
	"Tejun Heo" <tj@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	"Thomas Graf" <tgraf@suug.ch>,
	"Michael Kerrisk" <mtk.manpages@gmail.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Linux API" <linux-api@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Network Development" <netdev@vger.kernel.org>
Subject: Re: Potential issues (security and otherwise) with the current cgroup-bpf API
Date: Mon, 19 Dec 2016 19:12:48 -0800	[thread overview]
Message-ID: <CALCETrVKu63BFVQFAJcLcd6ovPtq-WDdTh-BwyAPSprw8UarNQ@mail.gmail.com> (raw)
In-Reply-To: <80574175-3692-0278-a74e-23b752d44f73@gmail.com>

On Mon, Dec 19, 2016 at 6:52 PM, David Ahern <dsahern@gmail.com> wrote:
> On 12/19/16 6:56 PM, Andy Lutomirski wrote:
>> On Mon, Dec 19, 2016 at 5:44 PM, David Ahern <dsahern@gmail.com> wrote:
>>> On 12/19/16 5:25 PM, Andy Lutomirski wrote:
>>>> net.socket_create_filter = "none": no filter
>>>> net.socket_create_filter = "bpf:baadf00d": bpf filter
>>>> net.socket_create_filter = "disallow": no sockets created period
>>>> net.socket_create_filter = "iptables:foobar": some iptables thingy
>>>> net.socket_create_filter = "nft:blahblahblah": some nft thingy
>>>> net.socket_create_filter = "address_family_list:1,2,3": allow AF 1, 2, and 3
>>>
>>> Such a scheme works for the socket create filter b/c it is a very simple use case. It does not work for the ingress and egress which allow generic bpf filters.
>>
>> Can you elaborate on what goes wrong?  (Obviously the
>> "address_family_list" example makes no sense in that context.)
>
> Being able to dump a filter or see that one exists would be a great add-on, but I don't see how 'net.socket_create_filter = "bpf:baadf00d"' is a viable API for loading generic BPF filters. Simple cases like "disallow" are easy -- just return 0 in the filter, no complicated BPF code needed. The rest are specific cases of the moment which goes against the intent of ebpf and generic programmability.

Oh -- I'm not proposing that at all.  Let me clarify.  For the bpf
case, if you *read* the file, you'd see "bpf:baadf00d".  But writing
"bpf:baadf00d" is nonsense and would give you -EINVAL.  Instead you
install a bpf filter by opening the file for write (O_RDWR or
O_WRONLY) and doing ioctl(cgroup_socket_create_file_fd,
CGROUP_IOCTL_SET_BPF_FILTER, bpf_fd);  It's kind of like
BPF_PROG_ATTACH except that it respects filesystem permissions, it
isn't in the bpf() multiplexer, the filter being set is implied (by
the fd in use), and everything is nicely seccompable.

To remove the filter, you write "none" or "none\n" to the file.

As a future extension, if someone wanted more than one filter to be
able to coexist in the cgroup socket_create_filter slot, you could
plausibly do 'echo disallow >>net.socket_create_filter' or use a new
ioctl CGROUP_IOCTL_APPEND_BPF_FILTER, and then you'd read the file and
see more than one line.  But this would be a future extension and may
never be needed.

>>
>> a) sub-cgroups cannot have a filter at all of the parent has a filter.
>> (This is the "punt" approach -- it lets different semantics be
>> assigned later without breaking userspace.)
>>
>> b) sub-cgroups can have a filter if a parent does, too.  The semantics
>> are that the sub-cgroup filter runs first and all side-effects occur.
>> If that filter says "reject" then ancestor filters are skipped.  If
>> that filter says "accept", then the ancestor filter is run and its
>> side-effects happen as well.  (And so on, all the way up to the root.)
>
> That comes with a big performance hit for skb / data path cases.
>
> I'm riding my use case on Daniel's work, and as I understand it the nesting case has been discussed. I'll defer to Daniel and Alexei on this part.

I'm not sure I buy the performance hit.  If you do it naively, then
performance will indeed suck.  But there's already a bunch of code in
there to pre-populate a filter list for faster use.  Currently, we
have:

struct cgroup_bpf {
        /*
         * Store two sets of bpf_prog pointers, one for programs that are
         * pinned directly to this cgroup, and one for those that are effective
         * when this cgroup is accessed.
         */
        struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
        struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
};

in struct cgroup, there's a 'struct cgroup_bpf bpf;'.

This would change to something like:

struct cgroup_filter_slot {
  struct bpf_prog *effective;
  struct cgroup_filter_slot *next;
  struct bpf_prog *local;
}

local is NULL unless *this* cgroup has a filter.  effective points to
the bpf_prog that's active in this cgroup or the nearest ancestor that
has a filter.  next is NULL if there are no filters higher in the
chain or points to the next slot that has a filter.  struct cgroup
has:

struct cgroup_filter_slot filters[MAX_BPF_ATTACH_TYPE];

To evaluate it, you do:

struct cgroup_filter_slot *slot = &cgroup->slot[the index];

if (!slot->effective)
  return;

do {
  evaluate(slot->effective);
  slot = slot->next;
} while (unlikely(slot));

The old code was one branch per evaluation.  The new code is n+1
branches where n is the number of filters, so it's one extra branch in
the worst case.  But the extra branch is cache-hot (the variable is
right next to slot->effective, which is needed regardless) and highly
predictable (in the case where nesting isn't used, the branch is not
taken, and it's a backward branch which most CPUs will predict as not
taken).

I expect that, on x86, this adds at most a cycle or two and quite
possibly adds no measurable overhead at all.

  reply	other threads:[~2016-12-20  3:13 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-12-17 18:18 Potential issues (security and otherwise) with the current cgroup-bpf API Andy Lutomirski
2016-12-17 19:26 ` Mickaël Salaün
2016-12-17 20:02   ` Andy Lutomirski
2016-12-19 20:56 ` Alexei Starovoitov
2016-12-19 21:23   ` Andy Lutomirski
2016-12-20  0:02     ` Alexei Starovoitov
2016-12-20  0:25       ` Andy Lutomirski
2016-12-20  1:43         ` Andy Lutomirski
2016-12-20  1:44         ` David Ahern
2016-12-20  1:56           ` Andy Lutomirski
2016-12-20  2:52             ` David Ahern
2016-12-20  3:12               ` Andy Lutomirski [this message]
2016-12-20  4:44                 ` Alexei Starovoitov
2016-12-20  5:27                   ` Andy Lutomirski
2016-12-20  5:32                     ` Alexei Starovoitov
2016-12-20  9:11             ` Peter Zijlstra
2017-01-03 10:25               ` Michal Hocko
2017-01-16  1:19                 ` Tejun Heo
2017-01-17 13:03                   ` Michal Hocko
2017-01-17 13:32                     ` Peter Zijlstra
2017-01-17 13:58                       ` Michal Hocko
2017-01-17 20:23                         ` Andy Lutomirski
2017-01-18 22:18                         ` Tejun Heo
2017-01-19  9:00                           ` Michal Hocko
2016-12-20  3:18         ` Alexei Starovoitov
2016-12-20  3:50           ` Andy Lutomirski
2016-12-20  4:41             ` Alexei Starovoitov
2016-12-20 10:21             ` Daniel Mack
2016-12-20 17:23               ` Andy Lutomirski
2016-12-20 18:36                 ` Daniel Mack
2016-12-20 18:49                   ` Andy Lutomirski
2016-12-21  4:01                     ` Alexei Starovoitov
2016-12-20  1:34       ` David Miller
2016-12-20  1:40         ` Andy Lutomirski
2016-12-20  4:51           ` Alexei Starovoitov
2016-12-20  5:26             ` Andy Lutomirski
2017-01-17  5:18 Andy Lutomirski
2017-01-18 22:41 ` Potential issues (security and otherwise) with the current cgroup-bpf API Tejun Heo
2017-01-19  0:18   ` Andy Lutomirski
2017-01-19  0:59     ` Tejun Heo
2017-01-19  2:29       ` Andy Lutomirski
2017-01-20  2:39         ` Alexei Starovoitov
2017-01-20  4:04           ` Andy Lutomirski
2017-01-23  4:31             ` Alexei Starovoitov
2017-01-23 20:20               ` Andy Lutomirski
2017-02-03 21:07                 ` Andy Lutomirski
2017-02-03 23:21                   ` Alexei Starovoitov
2017-02-04 17:10                     ` Andy Lutomirski
2017-01-19  1:01     ` Mickaël Salaün

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALCETrVKu63BFVQFAJcLcd6ovPtq-WDdTh-BwyAPSprw8UarNQ@mail.gmail.com \
    --to=luto@amacapital.net \
    --cc=alexei.starovoitov@gmail.com \
    --cc=daniel@zonque.org \
    --cc=davem@davemloft.net \
    --cc=dsahern@gmail.com \
    --cc=jann@thejh.net \
    --cc=keescook@chromium.org \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=mic@digikod.net \
    --cc=mtk.manpages@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=tgraf@suug.ch \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).