On 02/27/2018 10:48 PM, Mickaël Salaün wrote: > > On 27/02/2018 17:39, Andy Lutomirski wrote: >> On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov >> wrote: >>> On Tue, Feb 27, 2018 at 05:20:55AM +0000, Andy Lutomirski wrote: >>>> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov >>>> wrote: >>>>> On Tue, Feb 27, 2018 at 04:40:34AM +0000, Andy Lutomirski wrote: >>>>>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov >>>>>> wrote: >>>>>>> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote: >>>>>>>> The seccomp(2) syscall can be used by a task to apply a Landlock program >>>>>>>> to itself. As a seccomp filter, a Landlock program is enforced for the >>>>>>>> current task and all its future children. A program is immutable and a >>>>>>>> task can only add new restricting programs to itself, forming a list of >>>>>>>> programss. >>>>>>>> >>>>>>>> A Landlock program is tied to a Landlock hook. If the action on a kernel >>>>>>>> object is allowed by the other Linux security mechanisms (e.g. DAC, >>>>>>>> capabilities, other LSM), then a Landlock hook related to this kind of >>>>>>>> object is triggered. The list of programs for this hook is then >>>>>>>> evaluated. Each program return a 32-bit value which can deny the action >>>>>>>> on a kernel object with a non-zero value. If every programs of the list >>>>>>>> return zero, then the action on the object is allowed. >>>>>>>> >>>>>>>> Multiple Landlock programs can be chained to share a 64-bits value for a >>>>>>>> call chain (e.g. evaluating multiple elements of a file path). This >>>>>>>> chaining is restricted when a process construct this chain by loading a >>>>>>>> program, but additional checks are performed when it requests to apply >>>>>>>> this chain of programs to itself. The restrictions ensure that it is >>>>>>>> not possible to call multiple programs in a way that would imply to >>>>>>>> handle multiple shared values (i.e. cookies) for one chain. For now, >>>>>>>> only a fs_pick program can be chained to the same type of program, >>>>>>>> because it may make sense if they have different triggers (cf. next >>>>>>>> commits). This restrictions still allows to reuse Landlock programs in >>>>>>>> a safe way (e.g. use the same loaded fs_walk program with multiple >>>>>>>> chains of fs_pick programs). >>>>>>>> >>>>>>>> Signed-off-by: Mickaël Salaün >>>>>>> >>>>>>> ... >>>>>>> >>>>>>>> +struct landlock_prog_set *landlock_prepend_prog( >>>>>>>> + struct landlock_prog_set *current_prog_set, >>>>>>>> + struct bpf_prog *prog) >>>>>>>> +{ >>>>>>>> + struct landlock_prog_set *new_prog_set = current_prog_set; >>>>>>>> + unsigned long pages; >>>>>>>> + int err; >>>>>>>> + size_t i; >>>>>>>> + struct landlock_prog_set tmp_prog_set = {}; >>>>>>>> + >>>>>>>> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK) >>>>>>>> + return ERR_PTR(-EINVAL); >>>>>>>> + >>>>>>>> + /* validate memory size allocation */ >>>>>>>> + pages = prog->pages; >>>>>>>> + if (current_prog_set) { >>>>>>>> + size_t i; >>>>>>>> + >>>>>>>> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) { >>>>>>>> + struct landlock_prog_list *walker_p; >>>>>>>> + >>>>>>>> + for (walker_p = current_prog_set->programs[i]; >>>>>>>> + walker_p; walker_p = walker_p->prev) >>>>>>>> + pages += walker_p->prog->pages; >>>>>>>> + } >>>>>>>> + /* count a struct landlock_prog_set if we need to allocate one */ >>>>>>>> + if (refcount_read(¤t_prog_set->usage) != 1) >>>>>>>> + pages += round_up(sizeof(*current_prog_set), PAGE_SIZE) >>>>>>>> + / PAGE_SIZE; >>>>>>>> + } >>>>>>>> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES) >>>>>>>> + return ERR_PTR(-E2BIG); >>>>>>>> + >>>>>>>> + /* ensure early that we can allocate enough memory for the new >>>>>>>> + * prog_lists */ >>>>>>>> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog); >>>>>>>> + if (err) >>>>>>>> + return ERR_PTR(err); >>>>>>>> + >>>>>>>> + /* >>>>>>>> + * Each task_struct points to an array of prog list pointers. These >>>>>>>> + * tables are duplicated when additions are made (which means each >>>>>>>> + * table needs to be refcounted for the processes using it). When a new >>>>>>>> + * table is created, all the refcounters on the prog_list are bumped (to >>>>>>>> + * track each table that references the prog). When a new prog is >>>>>>>> + * added, it's just prepended to the list for the new table to point >>>>>>>> + * at. >>>>>>>> + * >>>>>>>> + * Manage all the possible errors before this step to not uselessly >>>>>>>> + * duplicate current_prog_set and avoid a rollback. >>>>>>>> + */ >>>>>>>> + if (!new_prog_set) { >>>>>>>> + /* >>>>>>>> + * If there is no Landlock program set used by the current task, >>>>>>>> + * then create a new one. >>>>>>>> + */ >>>>>>>> + new_prog_set = new_landlock_prog_set(); >>>>>>>> + if (IS_ERR(new_prog_set)) >>>>>>>> + goto put_tmp_lists; >>>>>>>> + } else if (refcount_read(¤t_prog_set->usage) > 1) { >>>>>>>> + /* >>>>>>>> + * If the current task is not the sole user of its Landlock >>>>>>>> + * program set, then duplicate them. >>>>>>>> + */ >>>>>>>> + new_prog_set = new_landlock_prog_set(); >>>>>>>> + if (IS_ERR(new_prog_set)) >>>>>>>> + goto put_tmp_lists; >>>>>>>> + for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) { >>>>>>>> + new_prog_set->programs[i] = >>>>>>>> + READ_ONCE(current_prog_set->programs[i]); >>>>>>>> + if (new_prog_set->programs[i]) >>>>>>>> + refcount_inc(&new_prog_set->programs[i]->usage); >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* >>>>>>>> + * Landlock program set from the current task will not be freed >>>>>>>> + * here because the usage is strictly greater than 1. It is >>>>>>>> + * only prevented to be freed by another task thanks to the >>>>>>>> + * caller of landlock_prepend_prog() which should be locked if >>>>>>>> + * needed. >>>>>>>> + */ >>>>>>>> + landlock_put_prog_set(current_prog_set); >>>>>>>> + } >>>>>>>> + >>>>>>>> + /* prepend tmp_prog_set to new_prog_set */ >>>>>>>> + for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) { >>>>>>>> + /* get the last new list */ >>>>>>>> + struct landlock_prog_list *last_list = >>>>>>>> + tmp_prog_set.programs[i]; >>>>>>>> + >>>>>>>> + if (last_list) { >>>>>>>> + while (last_list->prev) >>>>>>>> + last_list = last_list->prev; >>>>>>>> + /* no need to increment usage (pointer replacement) */ >>>>>>>> + last_list->prev = new_prog_set->programs[i]; >>>>>>>> + new_prog_set->programs[i] = tmp_prog_set.programs[i]; >>>>>>>> + } >>>>>>>> + } >>>>>>>> + new_prog_set->chain_last = tmp_prog_set.chain_last; >>>>>>>> + return new_prog_set; >>>>>>>> + >>>>>>>> +put_tmp_lists: >>>>>>>> + for (i = 0; i < ARRAY_SIZE(tmp_prog_set.programs); i++) >>>>>>>> + put_landlock_prog_list(tmp_prog_set.programs[i]); >>>>>>>> + return new_prog_set; >>>>>>>> +} >>>>>>> >>>>>>> Nack on the chaining concept. >>>>>>> Please do not reinvent the wheel. >>>>>>> There is an existing mechanism for attaching/detaching/quering multiple >>>>>>> programs attached to cgroup and tracing hooks that are also >>>>>>> efficiently executed via BPF_PROG_RUN_ARRAY. >>>>>>> Please use that instead. >>>>>>> >>>>>> >>>>>> I don't see how that would help. Suppose you add a filter, then >>>>>> fork(), and then the child adds another filter. Do you want to >>>>>> duplicate the entire array? You certainly can't *modify* the array >>>>>> because you'll affect processes that shouldn't be affected. >>>>>> >>>>>> In contrast, doing this through seccomp like the earlier patches >>>>>> seemed just fine to me, and seccomp already had the right logic. >>>>> >>>>> it doesn't look to me that existing seccomp side of managing fork >>>>> situation can be reused. Here there is an attempt to add 'chaining' >>>>> concept which sort of an extension of existing seccomp style, >>>>> but somehow heavily done on bpf side and contradicts cgroup/tracing. >>>>> >>>> >>>> I don't see why the seccomp way can't be used. I agree with you that >>>> the seccomp *style* shouldn't be used in bpf code like this, but I >>>> think that Landlock programs can and should just live in the existing >>>> seccomp chain. If the existing seccomp code needs some modification >>>> to make this work, then so be it. >>> >>> +1 >>> if that was the case... >>> but that's not my reading of the patch set. >> >> An earlier version of the patch set used the seccomp filter chain. >> Mickaël, what exactly was wrong with that approach other than that the >> seccomp() syscall was awkward for you to use? You could add a >> seccomp_add_landlock_rule() syscall if you needed to. > > Nothing was wrong about about that, this part did not changed (see my > next comment). > >> >> As a side comment, why is this an LSM at all, let alone a non-stacking >> LSM? It would make a lot more sense to me to make Landlock depend on >> having LSMs configured in but to call the landlock hooks directly from >> the security_xyz() hooks. > > See Casey's answer and his patch series: https://lwn.net/Articles/741963/ > >> >>> >>>> In other words, the kernel already has two kinds of chaining: >>>> seccomp's and bpf's. bpf's doesn't work right for this type of usage >>>> across fork(), whereas seccomp's already handles that case correctly. >>>> (In contrast, seccomp's is totally wrong for cgroup-attached filters.) >>>> So IMO Landlock should use the seccomp core code and call into bpf >>>> for the actual filtering. >>> >>> +1 >>> in cgroup we had to invent this new BPF_PROG_RUN_ARRAY mechanism, >>> since cgroup hierarchy can be complicated with bpf progs attached >>> at different levels with different override/multiprog properties, >>> so walking link list and checking all flags at run-time would have >>> been too slow. That's why we added compute_effective_progs(). >> >> If we start adding override flags to Landlock, I think we're doing it >> wrong. With cgroup bpf programs, the whole mess is set up by the >> administrator. With seccomp, and with Landlock if done correctly, it >> *won't* be set up by the administrator, so the chance that everyone >> gets all the flags right is about zero. All attached filters should >> run unconditionally. > > > There is a misunderstanding about this chaining mechanism. This should > not be confused with the list of seccomp filters nor the cgroup > hierarchies. Landlock programs can be stacked the same way seccomp's > filters can (cf. struct landlock_prog_set, the "chain_last" field is an > optimization which is not used for this struct handling). This stackable > property did not changed from the previous patch series. The chaining > mechanism is for another use case, which does not make sense for seccomp > filters nor other eBPF program types, at least for now, from what I can > tell. > > You may want to get a look at my talk at FOSDEM > (https://landlock.io/talks/2018-02-04_landlock-fosdem.pdf), especially > slides 11 and 12. > > Let me explain my reasoning about this program chaining thing. > > To check if an action on a file is allowed, we first need to identify > this file and match it to the security policy. In a previous > (non-public) patch series, I tried to use one type of eBPF program to > check every kind of access to a file. To be able to identify a file, I > relied on an eBPF map, similar to the current inode map. This map store > a set of references to file descriptors. I then created a function > bpf_is_file_beneath() to check if the requested file was beneath a file > in the map. This way, no chaining, only one eBPF program type to check > an access to a file... but some issues then emerged. First, this design > create a side-channel which help an attacker using such a program to > infer some information not normally available, for example to get a hint > on where a file descriptor (received from a UNIX socket) come from. > Another issue is that this type of program would be called for each > component of a path. Indeed, when the kernel check if an access to a > file is allowed, it walk through all of the directories in its path > (checking if the current process is allowed to execute them). That first > attempt led me to rethink the way we could filter an access to a file > *path*. > > To minimize the number of called to an eBPF program dedicated to > validate an access to a file path, I decided to create three subtype of > eBPF programs. The FS_WALK type is called when walking through every > directory of a file path (except the last one if it is the target). We > can then restrict this type of program to the minimum set of functions > it is allowed to call and the minimum set of data available from its > context. The first implicit chaining is for this type of program. To be > able to evaluate a path while being called for all its components, this > program need to store a state (to remember what was the parent directory > of this path). There is no "previous" field in the subtype for this > program because it is chained with itself, for each directories. This > enable to create a FS_WALK program to evaluate a file hierarchy, thank > to the inode map which can be used to check if a directory of this > hierarchy is part of an allowed (or denied) list of directories. This > design enables to express a file hierarchy in a programmatic way, > without requiring an eBPF helper to do the job (unlike my first experiment). > > The explicit chaining is used to tied a path evaluation (with a FS_WALK > program) to an access to the actual file being requested (the last > component of a file path), with a FS_PICK program. It is only at this > time that the kernel check for the requested action (e.g. read, write, > chdir, append...). To be able to filter such access request we can have > one call to the same program for every action and let this program check > for which action it was called. However, this design does not allow the > kernel to know if the current action is indeed handled by this program. > Hence, it is not possible to implement a cache mechanism to only call > this program if it knows how to handle this action. > > The approach I took for this FS_PICK type of program is to add to its > subtype which action it can handle (with the "triggers" bitfield, seen > as ORed actions). This way, the kernel knows if a call to a FS_PICK > program is necessary. If the user wants to enforce a different security > policy according to the action requested on a file, then it needs > multiple FS_PICK programs. However, to reduce the number of such > programs, this patch series allow a FS_PICK program to be chained with > another, the same way a FS_WALK is chained with itself. This way, if the > user want to check if the action is a for example an "open" and a "read" > and not a "map" and a "read", then it can chain multiple FS_PICK > programs with different triggers actions. The OR check performed by the > kernel is not a limitation then, only a way to know if a call to an eBPF > program is needed. > > The last type of program is FS_GET. This one is called when a process > get a struct file or change its working directory. This is the only > program type able (and allowed) to tag a file. This restriction is > important to not being subject to resource exhaustion attacks (i.e. > tagging every inode accessible to an attacker, which would allocate too > much kernel memory). > > This design gives room for improvements to create a cache of eBPF > context (input data, including maps if any), with the result of an eBPF > program. This would help limit the number of call to an eBPF program the > same way SELinux or other kernel components do to limit costly checks. > > The eBPF maps of progs are useful to call the same type of eBPF > program. It does not fit with this use case because we may want multiple > eBPF program according to the action requested on a kernel object (e.g. > FS_GET). The other reason is because the eBPF program does not know what > will be the next (type of) access check performed by the kernel. > > To say it another way, this chaining mechanism is a way to split a > kernel object evaluation with multiple specialized programs, each of > them being able to deal with data tied to their type. Using a monolithic > eBPF program to check everything does not scale and does not fit with > unprivileged use either. > > As a side note, the cookie value is only an ephemeral value to keep a > state between multiple programs call. It can be used to create a state > machine for an object evaluation. > > I don't see a way to do an efficient and programmatic path evaluation, > with different access checks, with the current eBPF features. Please let > me know if you know how to do it another way. > Andy, Alexei, Daniel, what do you think about this Landlock program chaining and cookie?