On May 10, 2017, at 11:10 PM, Eric Biggers wrote: > > On Wed, May 10, 2017 at 01:14:37PM -0700, Darrick J. Wong wrote: >> [cc btrfs, since afaict that's where most of the dedupe tool authors hang out] >> >> On Wed, May 10, 2017 at 02:27:33PM -0500, Eric W. Biederman wrote: >>> Theodore Ts'o writes: >>> >>>> On Tue, May 09, 2017 at 02:17:46PM -0700, Eric Biggers wrote: >>>>> 1.) Privacy implications. Say the filesystem is being shared between multiple >>>>> users, and one user unpacks foo.tar.gz into their home directory, which >>>>> they've set to mode 700 to hide from other users. Because of this new >>>>> ioctl, all users will be able to see every (inode number, size in blocks) >>>>> pair that was added to the filesystem, as well as the exact layout of the >>>>> physical block allocations which might hint at how the files were created. >>>>> If there is a known "fingerprint" for the unpacked foo.tar.gz in this >>>>> regard, its presence on the filesystem will be revealed to all users. And >>>>> if any filesystems happen to prefer allocating blocks near the containing >>>>> directory, the directory the files are in would likely be revealed too. >> >> Frankly, why are container users even allowed to make unrestricted ioctl >> calls? I thought we had a bunch of security infrastructure to constrain >> what userspace can do to a system, so why don't ioctls fall under these >> same protections? If your containers are really that adversarial, you >> ought to be blacklisting as much as you can. >> > > Personally I don't find the presence of sandboxing features to be a very good > excuse for introducing random insecure ioctls. Not everyone has everything > perfectly "sandboxed" all the time, for obvious reasons. It's easy to forget > about the filesystem ioctls, too, since they can be executed on any regular > file, without having to open some device node in /dev. > > (And this actually does happen; the SELinux policy in Android, for example, > still allows apps to call any ioctl on their data files, despite all the effort > that has gone into whitelisting other types of ioctls. Which should be fixed, > of course, but it shows that this kind of mistake is very easy to make.) > >>>> Unix/Linux has historically not been terribly concerned about trying >>>> to protect this kind of privacy between users. So for example, in >>>> order to do this, you would have to call GETFSMAP continously to track >>>> this sort of thing. Someone who wanted to do this could probably get >>>> this information (and much, much more) by continuously running "ps" to >>>> see what processes are running. >>>> >>>> (I will note. wryly, that in the bad old days, when dozens of users >>>> were sharing a one MIPS Vax/780, it was considered a *good* thing >>>> that social pressure could be applied when it was found that someone >>>> was running a CPU or memory hogger on a time sharing system. The >>>> privacy right of someone running "xtrek" to be able to hide this from >>>> other users on the system was never considered important at all. :-) >> >> Not to mention someone running GETFSMAP in a loop will be pretty obvious >> both from the high kernel cpu usage and the huge number of metadata >> operations. > > Well, only if that someone running GETFSMAP actually wants to watch things in > real-time (it's not necessary for all scenarios that have been mentioned), *and* > there is monitoring in place which actually detects it and can do something > about it. > > Yes, PIDs have traditionally been global, but today we have PID namespaces, and > many other isolation features such as mount namespaces. Nothing is perfect, of > course, and containers are a lot worse than VMs, but it seems weird to use that > as an excuse to knowingly make things worse... > >> >>>> Fortunately, the days of timesharing seem to well behind us. For >>>> those people who think that containers are as secure as VM's (hah, >>>> hah, hah), it might be that best way to handle this is to have a mount >>>> option that requires root access to this functionality. For those >>>> people who really care about this, they can disable access. >> >> Or use separate filesystems for each container so that exploitable bugs >> that shut down the filesystem can't be used to kill the other >> containers. You could use a torrent of metadata-heavy operations >> (fallocate a huge file, punch every block, truncate file, repeat) to DoS >> the other containers. >> >>> What would be the reason for not putting this behind >>> capable(CAP_SYS_ADMIN)? >>> >>> What possible legitimate function could this functionality serve to >>> users who don't own your filesystem? >> >> As I've said before, it's to enable dedupe tools to decide, given a set >> of files with shareable blocks, roughly how many other times each of >> those shareable blocks are shared so that they can make better decisions >> about which file keeps its shareable blocks, and which file gets >> remapped. Dedupe is not a privileged operation, nor are any of the >> tools. >> > > So why does the ioctl need to return all extent mappings for the entire > filesystem, instead of just the share count of each block in the file that the > ioctl is called on? One possibility is that the ioctl() can return the mapping for all inodes owned by the calling PID (or others if CAP_SYS_ADMIN, CAP_DAC_OVERRIDE, or CAP_FOWNER is set), and return an "filesystem aggregate inode" (or more than one if there is a reason to do so) with all the other allocated blocks for inodes the user doesn't have permission to access? IMHO, this would allow a non-root user the main benefit of GETFSMAP, which is trying to determine how fragmented their files are and/or how fragmented the free space is, without leaking any information about file sizes, location, or other information the user can't already get today in a less efficient manner. I don't know how hard this is to implement, but seems not impossible. Cheers, Andreas