On Thu, Nov 9, 2017 at 12:21 PM, Serge E. Hallyn wrote:
> On Thu, Nov 09, 2017 at 09:55:41AM +0900, Mahesh Bandewar (महेश बंडेवार) wrote:
>> On Thu, Nov 9, 2017 at 4:02 AM, Christian Brauner wrote:
>> > On Wed, Nov 08, 2017 at 03:09:59AM -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
>> >> Sorry folks I was traveling and seems like a lot happened on this thread. :p
>> >>
>> >> I will try to respond to a few of these comments selectively -
>> >>
>> >> > The thing that makes me hesitate with this set is that it is a
>> >> > permanent new feature to address what (I hope) is a temporary
>> >> > problem.
>> >> I agree this is a permanent new feature but it's not solving a temporary
>> >> problem. It's impossible to assess what new vulnerability could show
>> >> up, and when. I think Daniel summed it up appropriately in his
>> >> response
>> >>
>> >> > Seems like there are two naive ways to do it, the first being to just
>> >> > look at all code under ns_capable() plus code called from there. It
>> >> > seems like looking at the result of that could be fruitful.
>> >> This is really hard. The main issue is that there were features designed
>> >> and developed before user-ns days with an assumption that unprivileged
>> >> users will never get certain capabilities which only root user gets.
>> >> Now that is not true anymore with user-ns creation with mapping root
>> >> for any process. Also at the same time blocking user-ns creation for
>> >> everyone is a big-hammer which is not needed too. So it's not that easy
>> >> to just perform a code-walk-through and correct those decisions now.
>> >>
>> >> > It seems to me that the existing control in
>> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
>> >> > in that case.
>> >> This solution is essentially blocking unprivileged users from using
>> >> the user-namespaces entirely. This is not really a solution that can
>> >> work.
>> >> The solution that this patch-set adds allows unprivileged users
>> >> to create user-namespaces. Actually the proposed solution is a more
>> >> fine-grained approach than the unprivileged_userns_clone solution
>> >> since you can selectively block capabilities rather than completely
>> >> blocking the functionality.
>> >
>> > I've been talking to Stéphane today about this and we should also keep in mind
>> > that we have:
>> >
>> > chb@conventiont|~
>> >> ls -al /proc/sys/user/
>> > total 0
>> > dr-xr-xr-x 1 root root 0 Nov 6 23:32 .
>> > dr-xr-xr-x 1 root root 0 Nov 2 22:13 ..
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_cgroup_namespaces
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_instances
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_inotify_watches
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_ipc_namespaces
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_mnt_namespaces
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_net_namespaces
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_pid_namespaces
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_user_namespaces
>> > -rw-r--r-- 1 root root 0 Nov 8 19:48 max_uts_namespaces
>> >
>> > These files allow you to limit the number of namespaces that can be created
>> > *per namespace* type. So let's say your system runs a bunch of user namespaces
>> > you can do:
>> >
>> > chb@conventiont|~
>> >> echo 0 > /proc/sys/user/max_user_namespaces
>> >
>> > So that the next time you try to create a user namespace you'd see:
>> >
>> > chb@conventiont|~
>> >> unshare -U
>> > unshare: unshare failed: No space left on device
>> >
>> > So there's not even a need to upstream a new sysctl since we have ways of
>> > blocking this.
>> >
>> I'm not sure how it's solving the problem that my patch-set is addressing?
>> I agree though that the need for unprivileged_userns_clone sysctl goes
>> away as this is equivalent to setting that sysctl to 0 as you have
>> described above.
>
> oh right that was the reasoning iirc for not needing the other sysctl.
>
>> However as I mentioned earlier, blocking processes from creating
>> user-namespaces is not the solution. Processes should be able to
>> create namespaces as they are designed but at the same time we need to
>> have controls to 'contain' them if a need arises. Setting max_no to 0
>> is not the solution that I'm looking for since it doesn't solve the
>> problem.
>
> well yesterday we were told that was explicitly not the goal, but that was
> not by you ... i just mention it to explain why we seem to be walking in
> circles a bit.
>
> anyway the bounding set doesn't actually make sense so forget that. the
> question then is just whether it makes sense to allow things to continue
> at all in this situation. would you mind indulging me by giving one or two
> concrete examples in the previous known cves of what capabilities you would
> have dropped to allow the rest to continue to be safely used?
>
Of course. Let's take the CVE that I mentioned in my cover-letter,
CVE-2017-7308. It's well documented and even has an exploit C program
that demonstrates how it can be used against an unpatched kernel. There
is a very nice blog post about this vulnerability by Andrey Konovalov.
The bug is in the AF_PACKET socket interface, which is protected behind
the NET_RAW capability. This capability is not available to an
unprivileged user in the init namespace. However, any unprivileged user
can get NET_RAW inside a new user-namespace (as demonstrated in the
cover-letter code that I have attached in this patch series), so the
NET_RAW capability is effectively available to any unprivileged user on
the host if the kernel has user-namespaces available.
With this patch-set applied, all that is needed is to flip a bit with
the sysctl (kernel.controlled_userns_caps_whitelist), as demonstrated
below:

root@lphh6:~# uname -a
Linux lphh6 4.14.0-smp-DEV #97 SMP @1510203579 x86_64 GNU/Linux
root@lphh6:~# sysctl -q kernel.controlled_userns_caps_whitelist
kernel.controlled_userns_caps_whitelist = 1f,ffffffff

Now when I run the program (the demo from the cover-letter) as a normal
unprivileged user, I can't create a RAW socket in the init-ns but I can
in the child-ns.

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
Successfully opened RAW-Sock after unshare().
dumbo@lphh6:~$

Now, as root, take off CAP_NET_RAW (bit 13 of the mask):

root@lphh6:~# sysctl -w kernel.controlled_userns_caps_whitelist=1f,ffffdfff
kernel.controlled_userns_caps_whitelist = 1f,ffffdfff
root@lphh6:~#

Now run the same program as an unprivileged user:

dumbo@lphh6:~$ /tmp/acquire_raw
Attempting to open RAW socket before unshare()...
socket() SOCK_RAW failed: : Operation not permitted
Attempting to open RAW socket after unshare()...
socket() SOCK_RAW failed: : Operation not permitted
dumbo@lphh6:~$

Notice that it has failed to create a raw socket both in the init and
in the child namespace. The patch-set is not blocking creation of
user-namespaces; it allows the admin to turn individual capability bits
on and off.

This is a very simplistic example that just demonstrates how turning
capability bits on and off works. So let's assume a sandboxed
environment, identified as susceptible, in which we are about to run a
binary and don't know what that binary does.
By turning off the NET_RAW bit, the admin gets an assurance that the
system is safe. If the binary fails because it doesn't get this
capability, that is an unfortunate consequence, but it doesn't
compromise host integrity. If the binary doesn't use NET_RAW but some
other combination of the remaining 36 capabilities, it gets whatever it
needs.

This means we can safely allow processes to create user-namespaces by
taking off the capabilities in question for a temporary or extended
period, until a proper fix is applied, without compromising system
integrity. The impact will vary based on which capability is taken off,
and the admin would / should be aware of it for the environment that
he/she is dealing with.

thanks,
--mahesh..

> thanks,
> serge
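
For anyone who wants to double-check the mask values in the sysctl transcript in this mail, here is a minimal Python sketch of the arithmetic. It is not part of the patch-set. It assumes the sysctl prints the mask as comma-separated 32-bit hex words, most significant first (which is what the output in this mail suggests), and that the mask fits in two words. CAP_NET_RAW being bit 13 comes from include/uapi/linux/capability.h; the helper names are made up for illustration.

```python
# Sketch only: helper names are hypothetical, and the "hi,lo" hex format is
# inferred from the sysctl output quoted in this mail.
CAP_NET_RAW = 13  # bit number from include/uapi/linux/capability.h


def parse_mask(text):
    """Parse comma-separated 32-bit hex words, most significant word first."""
    value = 0
    for word in text.strip().split(","):
        value = (value << 32) | int(word, 16)
    return value


def format_mask(value):
    """Format a mask (assumed to fit in 64 bits) back into 'hi,lo' form."""
    return "%x,%08x" % (value >> 32, value & 0xFFFFFFFF)


def drop_cap(mask, cap):
    """Return the mask with a single capability bit cleared."""
    return mask & ~(1 << cap)


if __name__ == "__main__":
    full = parse_mask("1f,ffffffff")
    restricted = drop_cap(full, CAP_NET_RAW)
    print(format_mask(restricted))      # value to hand to sysctl -w
    print(bin(restricted).count("1"))   # number of capabilities still allowed
```

Run as a script, this prints 1f,ffffdfff and 36, which matches the value written with sysctl -w and the count of remaining capabilities mentioned in the mail.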